AskDocs

A bilingual (Ukrainian/English) Retrieval-Augmented Generation system for document analysis: ask questions about any uploaded files and get grounded, cited answers. It combines hybrid search, prompt-injection and PII guardrails, token-budgeted context, and a streaming single-page UI.

This is a personal portfolio project. It is intentionally compact but built with production-minded engineering: retry/back-off, caching, rate limiting, input/output guardrails, and graceful degradation.

What it does

Upload documents, ask questions in natural language, and get answers grounded only in your own corpus, with mandatory source citations. The assistant refuses to answer when the context does not contain the answer (hallucination mitigation). It is domain-agnostic — contracts, reports, manuals, research papers, notes, spreadsheets, anything with text.

Supported formats: PDF, TXT, MD, DOCX, XLSX, PPTX, CSV.

Architecture

flowchart TD
    U["Browser — vanilla-JS SPA"] -->|SSE /api/query/stream| MW["FastAPI<br/>Request-ID · CORS · Rate-limit"]

    subgraph SEC["Input guardrails"]
        G1[check_length]
        G2["check_injection<br/>EN + UK patterns"]
        G3["mask_pii (optional)"]
    end

    subgraph RAG["RAG pipeline"]
        E1["embed_query<br/>OpenAI + LRU cache + retry"]
        H["Hybrid search<br/>0.6·vector + 0.4·BM25"]
        R["Cross-encoder rerank<br/>(optional)"]
        O["Reorder — Lost-in-the-Middle"]
        B["Token-budget guard"]
    end

    MW --> SEC --> RAG
    E1 --> H --> R --> O --> B --> LLM["OpenAI chat<br/>token streaming"]
    H <-->|vectors + metadata| VS[("ChromaDB")]
    LLM --> OG["check_output_safety"] --> UM["unmask_pii"] --> SESS["Session history<br/>in-memory · TTL 1h"]
    UM -->|SSE tokens| U

    subgraph IDX["Indexing"]
        L["Loader<br/>PDF/DOCX/XLSX/PPTX/TXT/MD/CSV"] --> C["Sentence-aware<br/>chunker + overlap"] --> E2[embed_texts] --> VS
    end
    U -->|POST /api/documents/upload| L
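
As an illustration of the `embed_query` box in the diagram, the following is a minimal sketch of a SHA-256-keyed, size-capped LRU cache placed in front of an embedding call. The class name `EmbeddingCache`, the `max_items` default, and the `embed_fn` callback are assumptions made for this example; the actual code in `backend/rag/embeddings.py` may be organized differently.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """Size-capped LRU cache keyed by the SHA-256 digest of the input text."""

    def __init__(self, max_items=1024):  # cap is an assumed default
        self.max_items = max_items
        self._store = OrderedDict()

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, embed_fn):
        """Return the cached vector, or call embed_fn(text) and cache the result."""
        key = self._key(text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        vector = embed_fn(text)
        self._store[key] = vector
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector
```

Hashing the text keeps keys at a fixed length regardless of question size, and the `OrderedDict` gives least-recently-used eviction without extra bookkeeping.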

Request flow

POST /api/query/stream
  → rate-limit (per-IP, sliding window)
  → input guardrails (length, injection EN/UK, optional PII masking)
  → session get/create  → embed(question)  [LRU cache + exponential back-off]
  → ChromaDB vector search → BM25 re-score → hybrid top-K
  → [optional] cross-encoder rerank → reorder (Lost-in-the-Middle) → token-budget trim
  → OpenAI chat (streamed) ──SSE──> tokens to browser
  → output guardrail (prompt-leak / suspicious-URL) → PII unmask → persist to session
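
The hybrid step in the flow above can be sketched as follows. This is a simplified illustration, not the project's implementation: the whitespace tokenizer, the BM25 parameters (`k1`, `b`), the function names, and the scaling of BM25 scores into [0, 1] before fusion are all assumptions. Only the 0.6/0.4 weights and the idea of scoring BM25 over the vector candidates come from this README.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each document, computed over the given set only."""
    n = len(docs_tokens)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n or 1.0
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hybrid_rank(query, candidates, top_k=5, w_vec=0.6, w_bm25=0.4):
    """candidates: (text, cosine_similarity) pairs returned by vector search."""
    def tokenize(s):
        return s.lower().split()  # assumed tokenizer

    raw = bm25_scores(tokenize(query), [tokenize(text) for text, _ in candidates])
    top = max(raw, default=0.0)
    # Assumed normalization: divide by the best BM25 score so both signals share a scale.
    scaled = [r / top if top > 0 else 0.0 for r in raw]
    fused = [(w_vec * cos + w_bm25 * kw, text)
             for (text, cos), kw in zip(candidates, scaled)]
    fused.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, round(score, 4)) for score, text in fused[:top_k]]
```

Because BM25 statistics are computed only over the candidates, a chunk with a slightly lower cosine similarity can still rise to the top when it matches the query's exact keywords.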

Quick start

pip install -r requirements.txt

cp .env.example .env          # then set OPENAI_API_KEY in .env
python main.py                # or: python start.py  (runs pre-flight checks)

Open http://localhost:8000. Sample documents in data/sample_docs/ are auto-indexed on first run.

Optional cross-encoder re-ranker (heavy, pulls in torch):

pip install -r requirements-optional.txt   # then set USE_RERANKER=true in .env

Configuration (`.env`)

Variable	Default	Description
`OPENAI_API_KEY`	—	Required. OpenAI API key
`APP_HOST` / `APP_PORT`	`0.0.0.0` / `8000`	Bind address and port
`DEBUG`	`false`	Enables auto-reload + debug logs
`ALLOWED_ORIGINS`	`http://localhost:8000`	Comma-separated CORS origins
`CHAT_MODEL`	`gpt-4o-mini`	Chat model. Legacy models use `temperature`/`max_tokens`; newer ones use `max_completion_tokens`
`EMBEDDING_MODEL`	`text-embedding-3-small`	Embedding model
`CHUNK_SIZE`	`512`	Chunk size (characters)
`CHUNK_OVERLAP`	`100`	Overlap between chunks, in characters (~20% of `CHUNK_SIZE`)
`TOP_K`	`5`	Chunks retrieved per query
`MAX_CONTEXT_TOKENS`	`6000`	Token budget for RAG context
`USE_RERANKER`	`false`	Enable cross-encoder re-ranking
`MAX_INPUT_CHARS`	`10000`	Max question length (characters)
`RATE_LIMIT_PER_MINUTE`	`30`	Requests per IP per minute
`OPENAI_MAX_RETRIES`	`3`	OpenAI client retry attempts
`OPENAI_TIMEOUT_SECONDS`	`60`	OpenAI request timeout
`CHROMA_PATH`	`./data/chroma_db`	ChromaDB persistence path
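
To make `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a rough sketch of sentence-aware chunking with overlap. The function name `chunk_text` and the regex sentence splitter are assumptions; the project's chunker also respects paragraph boundaries and abbreviations, which this sketch does not attempt.

```python
import re


def chunk_text(text, chunk_size=512, overlap=100):
    """Pack whole sentences into chunks of roughly chunk_size characters.

    Trailing sentences of a finished chunk (up to `overlap` characters) are
    repeated at the start of the next one. A single sentence longer than
    chunk_size is kept whole, so chunks can exceed the target size.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in the overlap budget.
            tail, used = [], 0
            for prev in reversed(current):
                if used + len(prev) > overlap:
                    break
                tail.insert(0, prev)
                used += len(prev) + 1  # +1 for the joining space
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the defaults from the table, each chunk can repeat up to about 100 characters of its predecessor, so a sentence near a chunk boundary stays retrievable from either side.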

API

Method	Path	Description
GET	`/api/health`	Health check
GET	`/api/stats`	DB + session statistics
POST	`/api/query`	Ask a question (non-streaming)
POST	`/api/query/stream`	Ask a question (SSE streaming)
DELETE	`/api/sessions/{id}`	Clear a conversation session
POST	`/api/documents/upload`	Upload & index a document
GET	`/api/documents`	List indexed documents
DELETE	`/api/documents/{name}`	Delete a document and its chunks

Interactive API docs: /api/docs (Swagger UI) and /api/redoc (ReDoc).

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is this document about?", "language": "en", "mask_pii": false}'
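
The same request can be built from Python using only the standard library. The helper name `build_query_request` is hypothetical, and the response schema is not documented in this README, so the sketch stops at constructing the request.

```python
import json
import urllib.request


def build_query_request(question, language="en", mask_pii=False,
                        base_url="http://localhost:8000"):
    """Build a POST request for /api/query with the same JSON body as the curl call."""
    payload = json.dumps(
        {"question": question, "language": language, "mask_pii": mask_pii}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server.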

RAG features

Hybrid search — 0.6 · cosine + 0.4 · BM25, BM25 rescored over the vector candidate set (keeps it O(candidates), not O(corpus)).
Cross-encoder re-ranking (optional) — offloaded to a thread pool so the async event loop is never blocked; degrades gracefully if unavailable.
Lost-in-the-Middle mitigation — top-ranked chunks anchored at the start and end of the context (Liu et al., 2023).
Token-budget guard — context trimmed to MAX_CONTEXT_TOKENS (tiktoken).
Sentence-aware chunking — respects paragraph/section boundaries and common abbreviations; configurable overlap.
Embedding cache — SHA-256-keyed, size-capped LRU.
Resilience — exponential back-off on transient OpenAI errors.
Re-indexing — re-uploading a document replaces its old chunks.
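
One common way to implement the Lost-in-the-Middle reordering listed above is to alternate ranked chunks between the front and the back of the context. The sketch below assumes the input is already sorted best-first; the function name is hypothetical and the project's exact placement strategy may differ.

```python
def reorder_for_long_context(chunks_by_rank):
    """Place the strongest chunks at both edges and the weakest in the middle.

    Even-indexed ranks (1st, 3rd, ...) fill the front in order; odd-indexed
    ranks (2nd, 4th, ...) fill the back in reverse, so the 2nd-best chunk
    ends up last.
    """
    front, back = [], []
    for index, chunk in enumerate(chunks_by_rank):
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1 to 5, this yields the order 1, 3, 5, 4, 2: the two best chunks sit at the start and end of the context, where long-context models tend to attend most reliably.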

Security features

Prompt-injection detection — 40+ regex patterns, English and Ukrainian, risk-scored (BLOCK ≥ 0.9, WARN ≥ 0.6), case-insensitive.
PII masking / unmasking — email, UA phone, tax ID, IBAN, card; masked before the LLM, restored in the response.
Output guardrail — system-prompt-leak and suspicious-URL detection (advisory; logged).
File magic-byte validation — rejects extension-spoofed uploads.
Filename sanitization — Path().name + allow-list regex; prevents path traversal and null-byte/RTL tricks.
Decompression-bomb guard — OOXML (DOCX/XLSX/PPTX) archives are inspected and rejected if the inflated size / compression ratio is implausible.
Streamed-size enforcement — upload capped without loading it all in RAM.
Per-IP rate limiting — sliding window with periodic purge.
No tracebacks to clients — errors return generic localized messages; attacker payloads are never logged verbatim.
Dependency auditing — pip-audit is part of the dev tooling; the pinned runtime tree is CVE-clean (pip-audit -r requirements.txt).
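
As a sketch of how the magic-byte validation above might work: PDF files begin with `%PDF-`, and DOCX/XLSX/PPTX files are ZIP archives that begin with `PK\x03\x04`. The function name, the lookup table, and the NUL-byte heuristic for plain-text formats (which have no signature) are assumptions for illustration, not the contents of `backend/security/guards.py`.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # local file header of a ZIP archive (OOXML container)
MAGIC_BY_EXT = {
    ".pdf": b"%PDF-",
    ".docx": ZIP_MAGIC,
    ".xlsx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
}
TEXT_EXTS = {".txt", ".md", ".csv"}


def looks_like_declared_type(filename, head):
    """Check that the first bytes of an upload match its file extension.

    `head` is the first few bytes of the file. Unknown extensions are rejected.
    """
    ext = Path(filename).suffix.lower()
    if ext in MAGIC_BY_EXT:
        return head.startswith(MAGIC_BY_EXT[ext])
    if ext in TEXT_EXTS:
        # Assumed heuristic: text formats should not contain NUL bytes.
        return b"\x00" not in head
    return False
```

A check like this only confirms the container type; a ZIP signature alone does not prove the archive is a well-formed DOCX, which is why the decompression-bomb guard inspects the archive separately.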

Residual risks (by design, documented — not silently ignored)

No authentication / multi-tenancy. Anyone with network access can query, upload and delete, and drive OpenAI cost. Intended for local / single-user portfolio use; production would add auth + quotas.
Indirect prompt injection. Injection scoring runs on the question, not on retrieved document text. A hostile uploaded document could try to steer the model; the system prompt mitigates but does not eliminate this.
PII in document context still reaches the LLM provider, because only the question is masked. This is expected for a RAG system but remains a privacy consideration.
Rate limiting keys on request.client.host. Behind a reverse proxy, every client appears to come from the proxy's address and shares one rate-limit bucket unless the proxy and server are explicitly configured to pass the real client address through. X-Forwarded-For is intentionally not trusted because it is spoofable. Deploy accordingly.
Regex-based prompt-injection detection is defense-in-depth, not a guarantee — it raises the bar, it is not a security boundary.
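
The per-client keying described above can be illustrated with a small sliding-window limiter. This is a sketch under assumptions: the class name, the injectable `clock`, and the `purge` policy are hypothetical, and the key would be whatever `request.client.host` returns for the connection.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within a rolling time window."""

    def __init__(self, limit=30, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def purge(self):
        """Remove keys with no recent activity so memory does not grow unbounded."""
        now = self.clock()
        stale = [k for k, h in self._hits.items()
                 if not h or now - h[-1] >= self.window]
        for key in stale:
            del self._hits[key]
```

If every request arrives with the proxy's address as its key, all clients draw from the same 30-request budget, which is the failure mode the residual-risk note warns about.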

Tests

pip install -r requirements-dev.txt
pytest -q

Covers chunking, PII mask/unmask round-trip, injection detection (EN/UK), length guard, and file magic-byte validation.
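
For a sense of what a mask/unmask round-trip test checks, here is a self-contained sketch covering email addresses only. The names `mask_pii` and `unmask_pii` appear in the architecture diagram, but the signatures, the placeholder format, and the regex below are assumptions for this example rather than the project's code.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(text):
    """Replace each email with a numbered placeholder; return (masked, mapping)."""
    mapping = {}

    def _substitute(match):
        token = f"[EMAIL_{len(mapping) + 1}]"  # assumed placeholder format
        mapping[token] = match.group(0)
        return token

    return EMAIL_RE.sub(_substitute, text), mapping


def unmask_pii(text, mapping):
    """Restore the original values for every placeholder found in the text."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


def test_pii_round_trip():
    original = "Write to olena@example.com or ivan@example.org"
    masked, mapping = mask_pii(original)
    assert "@" not in masked  # nothing sensitive is left in the masked text
    assert unmask_pii(masked, mapping) == original
```

The mapping stays on the server for the duration of the request, so the model only ever sees the placeholders.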

Audit dependencies for known CVEs:

pip-audit -r requirements.txt

Project structure

askdocs/
├── main.py                 FastAPI app entry point + lifespan
├── start.py                Pre-flight checks + launcher
├── config.py               pydantic-settings configuration
├── backend/
│   ├── rag/
│   │   ├── loader.py       Multi-format document loaders
│   │   ├── chunker.py      Sentence-aware chunking
│   │   ├── embeddings.py   OpenAI embeddings + LRU cache + retry
│   │   ├── vector_store.py ChromaDB + hybrid (vector + BM25) search
│   │   ├── reranker.py     Optional cross-encoder re-ranking
│   │   ├── session.py      In-memory session manager (TTL)
│   │   └── pipeline.py     RAG orchestration (sync + streaming)
│   ├── security/guards.py  Injection / PII / output / file guards
│   └── api/routes.py       FastAPI routes
├── frontend/               Vanilla-JS SPA (SSE streaming, i18n)
├── tests/                  pytest suite
└── data/sample_docs/       Auto-indexed samples

Limitations & future work

These are conscious trade-offs for a single-process portfolio project:

Sessions & rate limiting are in-memory — not shared across workers and reset on restart. Production would use Redis.
No authentication / multi-tenancy — anyone with network access can query.
No OCR — scanned / image-only PDFs have no text layer and are rejected with a clear message; only PDFs with embedded text are supported.
Document identity is the filename stem — two different files with the same stem would collide in the vector store.
Chunk char offsets are approximate (stored for debugging only).
Single-node ChromaDB — fine for thousands of chunks, not millions.

Possible next steps: Redis-backed sessions/rate-limiting, auth, evaluation harness (retrieval precision / answer faithfulness), Dockerfile + CI.

License

MIT — see LICENSE.
