Document QA RAG — Policy Assistant with Semantic Search & AI Answers

A production-quality RAG (Retrieval-Augmented Generation) pipeline that answers natural language questions about company policy documents — finding the exact relevant text, citing the source, and refusing to guess when the answer isn't there. Backed by a systematic 18-configuration evaluation across 25 grounded test questions.

Built with Python, LlamaIndex, HuggingFace sentence-transformers, Google Gemini, and Streamlit.


What It Does

Question → Embed → Vector Search → Retrieve Top-K Chunks → Format Context → Gemini → Answer + Citations

Instead of sending all documents to an LLM and hoping it finds the right answer, RAG retrieves only the most relevant chunks first — keeping costs low, answers focused, and citations accurate. If the context doesn't contain the answer, the system says so rather than hallucinating.
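A minimal end-to-end sketch of that flow using LlamaIndex, assuming the default data/policies layout (the real pipeline lives in qa_engine.py):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embed and index the policy documents locally, then retrieve the top 3 chunks for a question.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader("data/policies").load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("How many vacation days do I get?")
# Each result carries its source filename, which travels into the Gemini prompt as a citation.
for r in results:
    print(r.score, r.node.metadata.get("file_name"))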


Evaluation Results

Most RAG projects skip systematic evaluation. This one doesn't.

18 configurations tested across 2 embedding models, 3 chunk sizes (256/512/1024 tokens), and 3 top-K values (2/3/5), evaluated against 25 grounded test questions with known expected source documents.

Metric Result (default config)
Hit Rate @3 100%
MRR 1.000
Questions evaluated 25

Key finding: All configurations at chunk=512 or chunk=1024 achieved 100% hit rate regardless of embedding model or top-K value. The only failure occurred at chunk=256/top_k=2, where the tuition reimbursement question was missed.

Root cause: The $5,250 figure was split away from its surrounding policy context at the 256-token chunk size. With only 2 results retrieved, the chunk containing the fact did not score highly enough to rank in the top 2. This directly validates the 512-token default — policy documents contain specific numerical facts that need enough surrounding context to score well in retrieval.

Conclusion: bge-small-en-v1.5 + chunk=512 + overlap=50 + top_k=3 is the optimal configuration for this dataset. It achieves perfect retrieval at the lowest computational cost.
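For reference, a minimal sketch of how these two metrics can be computed for one configuration. The question format shown here (a question string plus an expected_source field) is illustrative; the actual harness is qa_eval.py:

def score_retrieval(retriever, questions, k=3):
    """Hit rate @k: share of questions whose expected document appears in the top-k results.
    MRR: mean of 1/rank of the first correct result (0 when it never appears)."""
    hits, reciprocal_ranks = 0, []
    for q in questions:
        results = retriever.retrieve(q["question"])[:k]
        sources = [r.node.metadata.get("file_name") for r in results]
        if q["expected_source"] in sources:
            hits += 1
            reciprocal_ranks.append(1.0 / (sources.index(q["expected_source"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(questions), sum(reciprocal_ranks) / len(questions)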

Semantic Robustness Testing

To confirm the pipeline is doing genuine semantic matching rather than keyword overlap, the test set was expanded to 100 questions — the original 25 plus 75 Gemini-generated paraphrased variations (3 per question). A query like "How many vacation days do I get?" was rephrased as "What is the PTO accrual rate?", "How much paid leave am I entitled to?", and so on.

Metric Baseline (25 questions) Robustness (100 questions)
Hit Rate 100% 100%
MRR 1.000 0.973

The slight MRR drop from 1.000 to 0.973 on the expanded set is expected — some rephrased questions naturally retrieve the correct document at rank 2 instead of rank 1, which is still a hit. Zero misses across 100 questions confirms the embeddings are performing genuine meaning-based matching.
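A sketch of how such paraphrases can be generated with the google.genai SDK. The prompt wording and the paraphrase helper are illustrative, not the exact code used here:

from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

def paraphrase(question: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question {n} different ways without changing its meaning. "
        f"Return one rewrite per line with no numbering.\n\nQuestion: {question}"
    )
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()][:n]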


Features

  • Semantic Search — Meaning-based retrieval using local HuggingFace embeddings. Finds "PTO accrual" when you ask about "vacation days" — no keyword matching required
  • Source Citations — Every answer cites the exact policy document it came from using [Source: filename]
  • Similarity Cutoff — Returns "I don't have information about that" rather than a low-confidence hallucination when no relevant context is found
  • Retrieval Inspector — Dedicated tab to see raw search results without LLM generation, showing exactly which chunks would inform an answer and their similarity scores
  • Document Explorer — Visual breakdown of how each policy document was chunked, with per-document chunk counts and document previews
  • Configurable Pipeline — Swap embedding models, chunk sizes, overlap, top-K, and similarity cutoff directly from the sidebar without touching code
  • No-API-Key Mode — Retrieval still works without a Gemini key — returns top chunks with scores instead of a generated answer
  • Systematic Evaluation — qa_eval.py runs hit rate and MRR comparisons across any configuration set

Tech Stack

Layer Technology
RAG Framework LlamaIndex (llama-index-core, llama-index-embeddings-huggingface)
Embeddings HuggingFace bge-small-en-v1.5 — runs locally, no API cost
LLM / Generation Google Gemini (gemini-2.5-flash) via google.genai directly
Web App Streamlit
Evaluation Custom hit rate + MRR framework (qa_eval.py)
Visualization Seaborn + matplotlib
Environment python-dotenv

Note: LlamaIndex's Google integration has migrated from llama-index-llms-gemini to llama-index-llms-google-genai. This project uses google.genai directly instead of either wrapper — for full control, easier debugging, and consistency across the portfolio.


The Documents

6 Nexus Technologies company policy documents (~700-900 words each):

Document Topics Covered
pto_policy.txt Vacation accrual by tenure, sick days, parental leave, bereavement, jury duty
benefits_policy.txt Medical plans, HSA contributions, 401k match, wellness stipend, tuition reimbursement
expense_policy.txt Hotel limits, meal limits, mileage rates, professional development budget, submission deadlines
remote_work_policy.txt Eligibility, home office stipend, internet requirements, core hours, equipment
code_of_conduct.txt Gift thresholds, ethics hotline, conflict of interest, reporting violations
data_security_policy.txt Password requirements, rotation policy, security update timelines, breach response

Project Structure

document-qa-rag/
├── qa_engine.py             ← Full RAG pipeline: load, chunk, embed, index, retrieve, generate
├── qa_eval.py               ← Evaluation framework: hit rate, MRR, configuration comparison
├── qa_app.py                ← Streamlit dashboard
├── Project_5_Document_QA.ipynb  ← Step-by-step build and evaluation notebook
├── requirements.txt
├── .env.example             ← Documents required environment variables
├── .env                     ← API keys (not committed)
├── .gitignore
│
├── data/
│   └── policies/            ← 6 Nexus Technologies policy documents
│
└── .streamlit/
    └── config.toml          ← Custom dark theme configuration

Getting Started

1. Clone the Repository

git clone https://github.com/Drizztovski/document-qa-rag.git
cd document-qa-rag

2. Install Dependencies

pip install -r requirements.txt

The first run downloads the embedding model (~130MB). Subsequent runs load it from cache.

3. Configure API Key (optional)

cp .env.example .env

Open .env and add your key:

GOOGLE_API_KEY=your_gemini_api_key_here

Get a free key at aistudio.google.com/apikey. The app runs fully without it — only answer generation requires a key. Retrieval, document exploration, and the search inspector all work without one.

4. Run the App

streamlit run qa_app.py

Opens at http://localhost:8501

5. Run the Evaluation Framework

python qa_eval.py

Runs the baseline evaluation against 25 test questions and prints hit rate and MRR.


How It Works

1. Document Loading

SimpleDirectoryReader loads all .txt, .md, and .pdf files from data/policies/, filtering out system files like desktop.ini. Each document retains its filename as metadata — this is what powers source citations downstream.
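A sketch of this step, assuming the default data/policies folder:

from llama_index.core import SimpleDirectoryReader

# Filename metadata on each document is what powers [Source: ...] citations later.
documents = SimpleDirectoryReader(
    "data/policies",
    required_exts=[".txt", ".md", ".pdf"],  # ignores system files such as desktop.ini
).load_data()
print(documents[0].metadata["file_name"])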

2. Chunking

SentenceSplitter breaks each document into 512-token chunks with 50-token overlap. The overlap prevents important sentences from disappearing at chunk boundaries. Policy documents average 2-3 chunks each at this size.
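A sketch of the chunking step with the default settings:

from llama_index.core.node_parser import SentenceSplitter

# 512-token chunks with 50 tokens of overlap across chunk boundaries.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} documents -> {len(nodes)} chunks")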

3. Embedding

HuggingFaceEmbedding converts each chunk into a 384-dimension vector using bge-small-en-v1.5, running entirely locally. No API cost for indexing. The same model embeds queries at retrieval time — consistent embedding space is critical for accurate similarity scoring.
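A sketch of the embedding step:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Runs locally; the first call downloads the model, later calls use the cache.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("How many vacation days do I get?")
print(len(vector))  # 384 dimensions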

4. Retrieval

The query is embedded and compared against all chunk vectors by cosine similarity. The top-K most similar chunks are returned, then filtered by a similarity cutoff (default 0.3). Chunks below the threshold are dropped — better to return nothing than a low-confidence result.
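A sketch of retrieval plus the cutoff filter, reusing the nodes and embed_model from the steps above:

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor

index = VectorStoreIndex(nodes, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=3)

results = retriever.retrieve("What is the tuition reimbursement limit?")
# Drop anything scoring below the 0.3 cutoff rather than passing weak matches to the LLM.
results = SimilarityPostprocessor(similarity_cutoff=0.3).postprocess_nodes(results)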

5. Context Formatting

Retrieved chunks are formatted with [Source: filename] labels so the LLM can cite them accurately. This is what makes citations possible — the source information travels with the content into the prompt.
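A sketch of the formatting step; format_context is an illustrative name, not necessarily the function in qa_engine.py:

def format_context(results) -> str:
    # Prefix each chunk with its source filename so the LLM can cite it verbatim.
    blocks = []
    for r in results:
        source = r.node.metadata.get("file_name", "unknown")
        blocks.append(f"[Source: {source}]\n{r.node.get_content()}")
    return "\n\n".join(blocks)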

6. Generation

Gemini receives a structured prompt with the formatted context and a strict instruction: answer using only the provided context, cite every fact, and refuse to answer if the context doesn't contain the information. This prevents hallucination while maintaining a natural conversational tone.
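A sketch of the generation call via the google.genai SDK; the exact prompt wording in qa_engine.py may differ:

from google import genai

client = genai.Client()

def answer(question: str, context: str) -> str:
    prompt = (
        "Answer the question using ONLY the context below. Cite every fact with its "
        "[Source: filename] label. If the context does not contain the answer, reply "
        '"I don\'t have information about that."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    return response.text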


Screenshots

Landing Page — Configuration Guide

Welcome screen showing the styled header, configuration guide cards, and clickable sample questions.

Landing Page


Chat — Answer with Source Citations

Full RAG pipeline in action — natural language question, Gemini-generated answer, and source citations showing exactly which policy document each fact came from.

Chat with Citations


Retrieval Inspector — Raw Semantic Search

Raw retrieval results without LLM generation — similarity score bar chart and ranked chunks. Shows exactly which document chunks would inform an answer and how confidently each one scored.

Retrieval Inspector


Document Explorer — Index Breakdown

Chunk distribution across all 6 policy documents with per-document previews and index statistics.

Document Explorer


Sidebar — Index Ready

Green status card showing the active configuration after a successful index build.

Sidebar Ready


Key Technical Decisions

Local embeddings, API only for generation — Embedding 17 chunks takes milliseconds locally. Routing every embedding through an API would add latency and cost with no quality benefit for a fixed document set.

Similarity cutoff at 0.3 — Without a cutoff, every query returns results even when nothing relevant exists. A threshold of 0.3 on this dataset filters out off-topic chunks while retaining all relevant ones.

Chunk overlap anchored at 10% of chunk size — The compare_configurations() function sets overlap to max(chunk_size // 10, 20). This keeps overlap proportional to chunk size rather than using a fixed value across all configurations.

google.genai directly, not the LlamaIndex wrapper — LlamaIndex's Google integration migrated from llama-index-llms-gemini to llama-index-llms-google-genai. Rather than adopting either wrapper, this project calls google.genai directly for full control, easier debugging, and consistency across the portfolio. The tradeoff is a deliberate architectural choice:

Category google-genai (Direct) LlamaIndex Wrapper
Setup speed Slower Faster
Control Full control Limited by abstraction
Flexibility Very high Medium
Debugging Easier (you see everything) Harder (hidden layers)
Ecosystem Just Google Full LlamaIndex ecosystem
RAG pipelines Manual build Built-in tools
Long-term stability High (official SDK) Medium (depends on LlamaIndex updates)
Vendor lock-in Low Higher
Boilerplate More Less

For a hand-rolled RAG pipeline where control, debuggability, and stability matter more than convenience, google.genai directly is the right call.

Evaluation cutoff set to 0.0 — When running evaluate_retrieval(), the similarity cutoff is set to 0.0 so the postprocessor doesn't filter results before we can measure hit rate. Production and evaluation use different cutoff values intentionally.

Robustness testing with Gemini-generated paraphrases — Rather than manually writing 100 test questions, Gemini was used to generate 3 semantically equivalent variations of each original question. This tests whether the pipeline relies on keyword matching or genuine semantic understanding — a meaningful distinction that most RAG evaluations skip entirely.

Grounded test set from actual document content — Every evaluation question was written against specific verifiable facts in the policy documents (exact dollar amounts, specific day counts, precise thresholds). This prevents the test set from being gamed by surface-level retrieval and ensures hit rate measurements reflect real-world accuracy.


Author

AJ Amatrudo — IT professional transitioning to data science and business intelligence.

  • GitHub: github.com/Drizztovski
  • Certifications: Python 3, SQL, Git & GitHub (Codecademy)
  • Training: Data Scientist: Analytics Specialist (Codecademy) + Data Science with AI Bootcamp
