A bilingual (UK/EN) Retrieval-Augmented Generation system for document analysis: ask questions about any uploaded files and get grounded, cited answers. Hybrid search, prompt-injection & PII guardrails, token-budgeted context, and a streaming single-page UI.
This is a personal portfolio project. It is intentionally compact but built with production-minded engineering: retry/back-off, caching, rate limiting, input/output guardrails, and graceful degradation.
Upload documents, ask questions in natural language, and get answers grounded only in your own corpus, with mandatory source citations. The assistant refuses to answer when the context does not contain the answer (hallucination mitigation). It is domain-agnostic — contracts, reports, manuals, research papers, notes, spreadsheets, anything with text.
Supported formats: PDF, TXT, MD, DOCX, XLSX, PPTX, CSV.
flowchart TD
U["Browser — vanilla-JS SPA"] -->|SSE /api/query/stream| MW["FastAPI<br/>Request-ID · CORS · Rate-limit"]
subgraph SEC["Input guardrails"]
G1[check_length]
G2["check_injection<br/>EN + UK patterns"]
G3["mask_pii (optional)"]
end
subgraph RAG["RAG pipeline"]
E1["embed_query<br/>OpenAI + LRU cache + retry"]
H["Hybrid search<br/>0.6·vector + 0.4·BM25"]
R["Cross-encoder rerank<br/>(optional)"]
O["Reorder — Lost-in-the-Middle"]
B["Token-budget guard"]
end
MW --> SEC --> RAG
E1 --> H --> R --> O --> B --> LLM["OpenAI chat<br/>token streaming"]
H <-->|vectors + metadata| VS[("ChromaDB")]
LLM --> OG["check_output_safety"] --> UM["unmask_pii"] --> SESS["Session history<br/>in-memory · TTL 1h"]
UM -->|SSE tokens| U
subgraph IDX["Indexing"]
L["Loader<br/>PDF/DOCX/XLSX/PPTX/TXT/MD/CSV"] --> C["Sentence-aware<br/>chunker + overlap"] --> E2[embed_texts] --> VS
end
U -->|POST /api/documents/upload| L
Request flow
POST /api/query/stream
→ rate-limit (per-IP, sliding window)
→ input guardrails (length, injection EN/UK, optional PII masking)
→ session get/create → embed(question) [LRU cache + exponential back-off]
→ ChromaDB vector search → BM25 re-score → hybrid top-K
→ [optional] cross-encoder rerank → reorder (Lost-in-the-Middle) → token-budget trim
→ OpenAI chat (streamed) ──SSE──> tokens to browser
→ output guardrail (prompt-leak / suspicious-URL) → PII unmask → persist to session
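The retrieval step in the flow above (vector search, BM25 re-score, hybrid top-K) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the whitespace tokenizer, the BM25 parameters (`k1=1.5`, `b=0.75`) and the max-scaling of BM25 scores are all assumptions. Only the 0.6 / 0.4 weights and the idea of re-scoring BM25 over the vector candidates come from this README.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: naive lowercase whitespace tokenizer.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """BM25 computed over the candidate set only, not the whole corpus."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency of each term within the candidate set.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hybrid_rank(query, candidates, top_k=5, w_vec=0.6, w_bm25=0.4):
    """candidates: list of (text, cosine_similarity) pairs from the vector search."""
    texts = [text for text, _ in candidates]
    bm25 = bm25_scores(query, texts)
    top = max(bm25, default=0.0)
    # Scale BM25 into [0, 1] so it is comparable with cosine similarity.
    bm25_norm = [s / top if top > 0 else 0.0 for s in bm25]
    scored = [
        (w_vec * cos + w_bm25 * kw, text)
        for (text, cos), kw in zip(candidates, bm25_norm)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because BM25 only sees the candidates returned by the vector search, its cost grows with the candidate count rather than with the corpus size.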
pip install -r requirements.txt
cp .env.example .env # then set OPENAI_API_KEY in .env
python main.py # or: python start.py (runs pre-flight checks)

Open http://localhost:8000. Sample documents in data/sample_docs/ are auto-indexed on first run.
Optional cross-encoder re-ranker (heavy, pulls in torch):
pip install -r requirements-optional.txt # then set USE_RERANKER=true in .env

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | Required. OpenAI API key |
| `APP_HOST` / `APP_PORT` | `0.0.0.0` / `8000` | Bind address |
| `DEBUG` | `false` | Enables auto-reload + debug logs |
| `ALLOWED_ORIGINS` | `http://localhost:8000` | Comma-separated CORS origins |
| `CHAT_MODEL` | `gpt-4o-mini` | Chat model. Legacy models use temperature/max_tokens; newer ones use max_completion_tokens |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model |
| `CHUNK_SIZE` | `512` | Chunk size (characters) |
| `CHUNK_OVERLAP` | `100` | Overlap between chunks (~20%) |
| `TOP_K` | `5` | Chunks retrieved per query |
| `MAX_CONTEXT_TOKENS` | `6000` | Token budget for RAG context |
| `USE_RERANKER` | `false` | Enable cross-encoder re-ranking |
| `MAX_INPUT_CHARS` | `10000` | Max question length |
| `RATE_LIMIT_PER_MINUTE` | `30` | Requests per IP per minute |
| `OPENAI_MAX_RETRIES` | `3` | OpenAI client retry attempts |
| `OPENAI_TIMEOUT_SECONDS` | `60` | OpenAI request timeout |
| `CHROMA_PATH` | `./data/chroma_db` | ChromaDB persistence path |
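To make the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings concrete, here is a minimal sketch of sentence-aware chunking with character-based overlap. It is an illustration under stated assumptions, not the project's chunker: the function name `chunk_text` is hypothetical, sentences are assumed to end with `.`, `!` or `?` followed by whitespace, and the abbreviation and paragraph handling mentioned elsewhere in this README is left out.

```python
import re


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the overlap is cut by character count, so it can start mid-word, and a single sentence longer than `chunk_size` is kept whole rather than split.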
| Method | Path | Description |
|---|---|---|
| GET | `/api/health` | Health check |
| GET | `/api/stats` | DB + session statistics |
| POST | `/api/query` | Ask a question (non-streaming) |
| POST | `/api/query/stream` | Ask a question (SSE streaming) |
| DELETE | `/api/sessions/{id}` | Clear a conversation session |
| POST | `/api/documents/upload` | Upload & index a document |
| GET | `/api/documents` | List indexed documents |
| DELETE | `/api/documents/{name}` | Delete a document and its chunks |
Interactive API docs: /api/docs (Swagger), /api/redoc.
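The non-streaming endpoint can also be called from Python using only the standard library. The sketch below is hedged: the helper names `build_query_request` and `ask` are hypothetical, the request fields are taken from the curl example in this README, and the response is assumed to be a JSON object whose fields are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default host and port from this README


def build_query_request(question: str, language: str = "en",
                        mask_pii: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/query."""
    payload = {"question": question, "language": language, "mask_pii": mask_pii}
    return urllib.request.Request(
        f"{BASE_URL}/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, **kwargs) -> dict:
    # Assumption: the endpoint answers with a JSON object.
    request = build_query_request(question, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect without a running server.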
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"question": "What is this document about?", "language": "en", "mask_pii": false}'

- Hybrid search — `0.6 · cosine + 0.4 · BM25`, BM25 rescored over the vector candidate set (keeps it `O(candidates)`, not `O(corpus)`).
- Cross-encoder re-ranking (optional) — offloaded to a thread pool so the async event loop is never blocked; degrades gracefully if unavailable.
- Lost-in-the-Middle mitigation — top-ranked chunks anchored at the start and end of the context (Liu et al., 2023).
- Token-budget guard — context trimmed to `MAX_CONTEXT_TOKENS` (tiktoken).
- Sentence-aware chunking — respects paragraph/section boundaries and common abbreviations; configurable overlap.
- Embedding cache — SHA-256-keyed, size-capped LRU.
- Resilience — exponential back-off on transient OpenAI errors.
- Re-indexing — re-uploading a document replaces its old chunks.
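The Lost-in-the-Middle mitigation listed above can be illustrated with a short sketch. It assumes the chunks arrive ranked best-first and uses a hypothetical function name; the project's actual ordering logic may differ in detail.

```python
def reorder_for_long_context(ranked: list[str]) -> list[str]:
    """Place the best chunks at the edges and the weakest in the middle.

    `ranked` is assumed to be sorted best-first.
    """
    front: list[str] = []
    back: list[str] = []
    for index, chunk in enumerate(ranked):
        # Even ranks fill the start, odd ranks fill the end.
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(reorder_for_long_context(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']
```

The top-ranked chunk opens the context, the second-ranked chunk closes it, and the lowest-ranked chunk ends up in the middle, where models tend to pay the least attention.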
- Prompt-injection detection — 40+ regex patterns, English and Ukrainian, risk-scored (BLOCK ≥ 0.9, WARN ≥ 0.6), case-insensitive.
- PII masking / unmasking — email, UA phone, tax ID, IBAN, card; masked before the LLM, restored in the response.
- Output guardrail — system-prompt-leak and suspicious-URL detection (advisory; logged).
- File magic-byte validation — rejects extension-spoofed uploads.
- Filename sanitization — `Path().name` + allow-list regex; prevents path traversal and null-byte/RTL tricks.
- Decompression-bomb guard — OOXML (DOCX/XLSX/PPTX) archives are inspected and rejected if the inflated size / compression ratio is implausible.
- Streamed-size enforcement — upload capped without loading it all in RAM.
- Per-IP rate limiting — sliding window with periodic purge.
- No tracebacks to clients — errors return generic localized messages; attacker payloads are never logged verbatim.
- Dependency auditing — `pip-audit` is part of the dev tooling; the pinned runtime tree is CVE-clean (`pip-audit -r requirements.txt`).
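Two of the upload checks above, filename sanitization and magic-byte validation, can be sketched as follows. This is a hedged illustration rather than the code in `guards.py`: the function names, the exact allow-list pattern and the signature table are assumptions. The PDF (`%PDF-`) and ZIP (`PK\x03\x04`, used by OOXML files) signatures are standard; plain-text formats have no signature and are passed through in this sketch.

```python
import re
from pathlib import PurePosixPath

# Assumption: allow word characters (including Cyrillic), dots, spaces, hyphens.
SAFE_NAME = re.compile(r"[\w][\w. -]{0,199}")

# Leading bytes expected for binary formats.
MAGIC = {
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",
    ".xlsx": b"PK\x03\x04",
    ".pptx": b"PK\x03\x04",
}


def sanitize_filename(raw: str) -> str:
    """Drop any directory part, then enforce the allow-list."""
    name = PurePosixPath(raw.replace("\\", "/")).name
    if not SAFE_NAME.fullmatch(name):
        raise ValueError("unsafe filename")
    return name


def matches_extension(filename: str, head: bytes) -> bool:
    """Check that the first bytes of the upload fit the claimed extension."""
    expected = MAGIC.get(PurePosixPath(filename).suffix.lower())
    # Formats without a known signature (TXT, MD, CSV) are not checked here.
    return expected is None or head.startswith(expected)
```

Taking only the final path component defeats `../` traversal, while the allow-list rejects control characters such as a null byte or a right-to-left override.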
- No authentication / multi-tenancy. Anyone with network access can query, upload and delete, and drive OpenAI cost. Intended for local / single-user portfolio use; production would add auth + quotas.
- Indirect prompt injection. Injection scoring runs on the question, not on retrieved document text. A hostile uploaded document could try to steer the model; the system prompt mitigates but does not eliminate this.
- PII in document context still reaches the LLM provider, because only the question is masked. This is expected for a RAG system, but it is a privacy consideration.
- Rate limiting keys on `request.client.host`. Behind a reverse proxy all clients share an IP unless the proxy is configured; `X-Forwarded-For` is intentionally not trusted (spoofable). Deploy accordingly.
- Regex-based prompt-injection detection is defense-in-depth, not a guarantee — it raises the bar, it is not a security boundary.
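The per-IP sliding window with periodic purge can be sketched as below. The class name, method names and injectable clock are assumptions made for illustration; only the per-client keying, the sliding window and the default of 30 requests per minute come from this README.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int = 30, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_host: str) -> bool:
        """Return True and record the hit if the client is under its limit."""
        now = self.clock()
        hits = self._hits[client_host]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def purge(self) -> None:
        """Forget clients with no hits inside the current window."""
        now = self.clock()
        stale = [host for host, hits in self._hits.items()
                 if not hits or now - hits[-1] >= self.window]
        for host in stale:
            del self._hits[host]
```

Since the key is whatever address the server sees, every client behind an unconfigured reverse proxy would share one bucket in this sketch, which is the limitation described above.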
pip install -r requirements-dev.txt
pytest -q

Covers chunking, PII mask/unmask round-trip, injection detection (EN/UK), length guard, and file magic-byte validation.
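As an illustration of what a mask/unmask round-trip check might look like, here is a self-contained sketch. The patterns cover only email addresses and `+380` phone numbers, and the placeholder format and function bodies are assumptions, not the project's `mask_pii` / `unmask_pii` implementation.

```python
import re

# Assumption: simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
UA_PHONE = re.compile(r"\+380\d{9}")


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with numbered placeholders and return the mapping."""
    mapping: dict[str, str] = {}

    def make_replacer(label: str):
        def replace(match: re.Match) -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        return replace

    for label, pattern in (("EMAIL", EMAIL), ("PHONE", UA_PHONE)):
        text = pattern.sub(make_replacer(label), text)
    return text, mapping


def unmask_pii(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values in a (model-generated) text."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


def test_round_trip():
    original = "Write to olena@example.com or call +380501234567."
    masked, mapping = mask_pii(original)
    assert masked == "Write to [EMAIL_1] or call [PHONE_2]."
    assert unmask_pii(masked, mapping) == original
```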
Audit dependencies for known CVEs:
pip-audit -r requirements.txt

askdocs/
├── main.py FastAPI app entry point + lifespan
├── start.py Pre-flight checks + launcher
├── config.py pydantic-settings configuration
├── backend/
│ ├── rag/
│ │ ├── loader.py Multi-format document loaders
│ │ ├── chunker.py Sentence-aware chunking
│ │ ├── embeddings.py OpenAI embeddings + LRU cache + retry
│ │ ├── vector_store.py ChromaDB + hybrid (vector + BM25) search
│ │ ├── reranker.py Optional cross-encoder re-ranking
│ │ ├── session.py In-memory session manager (TTL)
│ │ └── pipeline.py RAG orchestration (sync + streaming)
│ ├── security/guards.py Injection / PII / output / file guards
│ └── api/routes.py FastAPI routes
├── frontend/ Vanilla-JS SPA (SSE streaming, i18n)
├── tests/ pytest suite
└── data/sample_docs/ Auto-indexed samples
These are conscious trade-offs for a single-process portfolio project:
- Sessions & rate limiting are in-memory — not shared across workers and reset on restart. Production would use Redis.
- No authentication / multi-tenancy — anyone with network access can query.
- No OCR — scanned / image-only PDFs have no text layer and are rejected with a clear message; only PDFs with embedded text are supported.
- Document identity is the filename stem — two different files with the same stem would collide in the vector store.
- Chunk char offsets are approximate (stored for debugging only).
- Single-node ChromaDB — fine for thousands of chunks, not millions.
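The filename-stem collision mentioned above is easy to reproduce. In this sketch, `document_id` assumes the identity is derived exactly as the stem of the filename, and `document_id_with_suffix` is a hypothetical alternative shown only for contrast; neither is taken from the project's code.

```python
from pathlib import PurePosixPath


def document_id(filename: str) -> str:
    # Assumption: identity is the filename without its extension.
    return PurePosixPath(filename).stem


def document_id_with_suffix(filename: str) -> str:
    # Hypothetical alternative: keep the extension as part of the identity.
    path = PurePosixPath(filename)
    return f"{path.stem}{path.suffix.lower().replace('.', '_')}"


# Two different files map to the same identity.
assert document_id("report.pdf") == document_id("report.docx") == "report"

# Keeping the extension separates them.
assert document_id_with_suffix("report.pdf") == "report_pdf"
assert document_id_with_suffix("report.docx") == "report_docx"
```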
Possible next steps: Redis-backed sessions/rate-limiting, auth, evaluation harness (retrieval precision / answer faithfulness), Dockerfile + CI.
MIT — see LICENSE.