Document QA RAG — Policy Assistant with Semantic Search & AI Answers

A production-quality RAG (Retrieval-Augmented Generation) pipeline that answers natural language questions about company policy documents — finding the exact relevant text, citing the source, and refusing to guess when the answer isn't there. Backed by a systematic 18-configuration evaluation across 25 grounded test questions.

Built with Python, LlamaIndex, HuggingFace sentence-transformers, Google Gemini, and Streamlit.


What It Does

Question → Embed → Vector Search → Retrieve Top-K Chunks → Format Context → Gemini → Answer + Citations

Instead of sending all documents to an LLM and hoping it finds the right answer, RAG retrieves only the most relevant chunks first — keeping costs low, answers focused, and citations accurate. If the context doesn't contain the answer, the system says so rather than hallucinating.
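A minimal end-to-end sketch of that flow using LlamaIndex, assuming the default data/policies layout (the real pipeline lives in qa_engine.py):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embed and index the policy documents locally, then retrieve the top 3 chunks for a question.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader("data/policies").load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("How many vacation days do I get?")
# Each result carries its source filename, which travels into the Gemini prompt as a citation.
for r in results:
    print(r.score, r.node.metadata.get("file_name"))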


Evaluation Results

Most RAG projects skip systematic evaluation. This one doesn't.

18 configurations tested across 2 embedding models, 3 chunk sizes (256/512/1024 tokens), and 3 top-K values (2/3/5), evaluated against 25 grounded test questions with known expected source documents.

Metric Result (default config)
Hit Rate @3 100%
MRR 1.000
Questions evaluated 25

Key finding: All configurations at chunk=512 or chunk=1024 achieved 100% hit rate regardless of embedding model or top-K value. The only failure occurred at chunk=256/top_k=2, where the tuition reimbursement question was missed.

Root cause: The $5,250 figure was split away from its surrounding policy context at the 256-token chunk size. With only 2 results retrieved, the chunk containing the fact did not score highly enough to rank in the top 2. This directly validates the 512-token default — policy documents contain specific numerical facts that need enough surrounding context to score well in retrieval.

Conclusion: bge-small-en-v1.5 + chunk=512 + overlap=50 + top_k=3 is the optimal configuration for this dataset. It achieves perfect retrieval at the lowest computational cost.
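For reference, a minimal sketch of how these two metrics can be computed for one configuration. The question format shown here (a question string plus an expected_source field) is illustrative; the actual harness is qa_eval.py:

def score_retrieval(retriever, questions, k=3):
    """Hit rate @k: share of questions whose expected document appears in the top-k results.
    MRR: mean of 1/rank of the first correct result (0 when it never appears)."""
    hits, reciprocal_ranks = 0, []
    for q in questions:
        results = retriever.retrieve(q["question"])[:k]
        sources = [r.node.metadata.get("file_name") for r in results]
        if q["expected_source"] in sources:
            hits += 1
            reciprocal_ranks.append(1.0 / (sources.index(q["expected_source"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(questions), sum(reciprocal_ranks) / len(questions)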

Semantic Robustness Testing

To confirm the pipeline is doing genuine semantic matching rather than keyword overlap, the test set was expanded to 100 questions — the original 25 plus 75 Gemini-generated paraphrased variations (3 per question). A query like "How many vacation days do I get?" was rephrased as "What is the PTO accrual rate?", "How much paid leave am I entitled to?", and so on.

Metric Baseline (25 questions) Robustness (100 questions)
Hit Rate 100% 100%
MRR 1.000 0.973

The slight MRR drop from 1.000 to 0.973 on the expanded set is expected — some rephrased questions naturally retrieve the correct document at rank 2 instead of rank 1, which is still a hit. Zero misses across 100 questions confirms the embeddings are performing genuine meaning-based matching.
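A sketch of how such paraphrases can be generated with the google.genai SDK. The prompt wording and the paraphrase helper are illustrative, not the exact code used here:

from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

def paraphrase(question: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question {n} different ways without changing its meaning. "
        f"Return one rewrite per line with no numbering.\n\nQuestion: {question}"
    )
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()][:n]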


Features

  • Semantic Search — Meaning-based retrieval using local HuggingFace embeddings. Finds "PTO accrual" when you ask about "vacation days" — no keyword matching required
  • Source Citations — Every answer cites the exact policy document it came from using [Source: filename]
  • Similarity Cutoff — Returns "I don't have information about that" rather than a low-confidence hallucination when no relevant context is found
  • Retrieval Inspector — Dedicated tab to see raw search results without LLM generation, showing exactly which chunks would inform an answer and their similarity scores
  • Document Explorer — Visual breakdown of how each policy document was chunked, with per-document chunk counts and document previews
  • Configurable Pipeline — Swap embedding models, chunk sizes, overlap, top-K, and similarity cutoff directly from the sidebar without touching code
  • No-API-Key Mode — Retrieval still works without a Gemini key — returns top chunks with scores instead of a generated answer
  • Systematic Evaluation — qa_eval.py runs hit rate and MRR comparisons across any configuration set

Tech Stack

Layer Technology
RAG Framework LlamaIndex (llama-index-core, llama-index-embeddings-huggingface)
Embeddings HuggingFace bge-small-en-v1.5 — runs locally, no API cost
LLM / Generation Google Gemini (gemini-2.5-flash) via google.genai directly
Web App Streamlit
Evaluation Custom hit rate + MRR framework (qa_eval.py)
Visualization Seaborn + matplotlib
Environment python-dotenv

Note: LlamaIndex's Google integration has migrated from llama-index-llms-gemini to llama-index-llms-google-genai. This project uses google.genai directly instead of either wrapper — for full control, easier debugging, and consistency across the portfolio.


The Documents

6 Nexus Technologies company policy documents (~700-900 words each):

Document Topics Covered
pto_policy.txt Vacation accrual by tenure, sick days, parental leave, bereavement, jury duty
benefits_policy.txt Medical plans, HSA contributions, 401k match, wellness stipend, tuition reimbursement
expense_policy.txt Hotel limits, meal limits, mileage rates, professional development budget, submission deadlines
remote_work_policy.txt Eligibility, home office stipend, internet requirements, core hours, equipment
code_of_conduct.txt Gift thresholds, ethics hotline, conflict of interest, reporting violations
data_security_policy.txt Password requirements, rotation policy, security update timelines, breach response

Project Structure

document-qa-rag/
├── qa_engine.py             ← Full RAG pipeline: load, chunk, embed, index, retrieve, generate
├── qa_eval.py               ← Evaluation framework: hit rate, MRR, configuration comparison
├── qa_app.py                ← Streamlit dashboard
├── Project_5_Document_QA.ipynb  ← Step-by-step build and evaluation notebook
├── requirements.txt
├── .env.example             ← Documents required environment variables
├── .env                     ← API keys (not committed)
├── .gitignore
│
├── data/
│   └── policies/            ← 6 Nexus Technologies policy documents
│
└── .streamlit/
    └── config.toml          ← Custom dark theme configuration

Getting Started

1. Clone the Repository

git clone https://github.com/Drizztovski/document-qa-rag.git
cd document-qa-rag

2. Install Dependencies

pip install -r requirements.txt

The first run downloads the embedding model (~130MB). Subsequent runs load it from cache.

3. Configure API Key (optional)

cp .env.example .env

Open .env and add your key:

GOOGLE_API_KEY=your_gemini_api_key_here

Get a free key at aistudio.google.com/apikey. The app runs fully without it — only answer generation requires a key. Retrieval, document exploration, and the search inspector all work without one.

4. Run the App

streamlit run qa_app.py

Opens at http://localhost:8501

5. Run the Evaluation Framework

python qa_eval.py

Runs the baseline evaluation against 25 test questions and prints hit rate and MRR.


How It Works

1. Document Loading

SimpleDirectoryReader loads all .txt, .md, and .pdf files from data/policies/, filtering out system files like desktop.ini. Each document retains its filename as metadata — this is what powers source citations downstream.
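A sketch of this step, assuming the default data/policies folder:

from llama_index.core import SimpleDirectoryReader

# Filename metadata on each document is what powers [Source: ...] citations later.
documents = SimpleDirectoryReader(
    "data/policies",
    required_exts=[".txt", ".md", ".pdf"],  # ignores system files such as desktop.ini
).load_data()
print(documents[0].metadata["file_name"])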

2. Chunking

SentenceSplitter breaks each document into 512-token chunks with 50-token overlap. The overlap prevents important sentences from disappearing at chunk boundaries. Policy documents average 2-3 chunks each at this size.
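A sketch of the chunking step with the default settings:

from llama_index.core.node_parser import SentenceSplitter

# 512-token chunks with 50 tokens of overlap across chunk boundaries.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} documents -> {len(nodes)} chunks")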

3. Embedding

HuggingFaceEmbedding converts each chunk into a 384-dimension vector using bge-small-en-v1.5, running entirely locally. No API cost for indexing. The same model embeds queries at retrieval time — consistent embedding space is critical for accurate similarity scoring.
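A sketch of the embedding step:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Runs locally; the first call downloads the model, later calls use the cache.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("How many vacation days do I get?")
print(len(vector))  # 384 dimensions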

4. Retrieval

The query is embedded and compared against all chunk vectors by cosine similarity. The top-K most similar chunks are returned, then filtered by a similarity cutoff (default 0.3). Chunks below the threshold are dropped — better to return nothing than a low-confidence result.
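A sketch of retrieval plus the cutoff filter, reusing the nodes and embed_model from the steps above:

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor

index = VectorStoreIndex(nodes, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=3)

results = retriever.retrieve("What is the tuition reimbursement limit?")
# Drop anything scoring below the 0.3 cutoff rather than passing weak matches to the LLM.
results = SimilarityPostprocessor(similarity_cutoff=0.3).postprocess_nodes(results)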

5. Context Formatting

Retrieved chunks are formatted with [Source: filename] labels so the LLM can cite them accurately. This is what makes citations possible — the source information travels with the content into the prompt.
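A sketch of the formatting step; format_context is an illustrative name, not necessarily the function in qa_engine.py:

def format_context(results) -> str:
    # Prefix each chunk with its source filename so the LLM can cite it verbatim.
    blocks = []
    for r in results:
        source = r.node.metadata.get("file_name", "unknown")
        blocks.append(f"[Source: {source}]\n{r.node.get_content()}")
    return "\n\n".join(blocks)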

6. Generation

Gemini receives a structured prompt with the formatted context and a strict instruction: answer using only the provided context, cite every fact, and refuse to answer if the context doesn't contain the information. This prevents hallucination while maintaining a natural conversational tone.
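A sketch of the generation call via the google.genai SDK; the exact prompt wording in qa_engine.py may differ:

from google import genai

client = genai.Client()

def answer(question: str, context: str) -> str:
    prompt = (
        "Answer the question using ONLY the context below. Cite every fact with its "
        "[Source: filename] label. If the context does not contain the answer, reply "
        '"I don\'t have information about that."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    return response.text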


Screenshots

Landing Page — Configuration Guide

Welcome screen showing the styled header, configuration guide cards, and clickable sample questions.

Landing Page


Chat — Answer with Source Citations

Full RAG pipeline in action — natural language question, Gemini-generated answer, and source citations showing exactly which policy document each fact came from.

Chat with Citations


Retrieval Inspector — Raw Semantic Search

Raw retrieval results without LLM generation — similarity score bar chart and ranked chunks. Shows exactly which document chunks would inform an answer and how confidently each one scored.

Retrieval Inspector


Document Explorer — Index Breakdown

Chunk distribution across all 6 policy documents with per-document previews and index statistics.

Document Explorer


Sidebar — Index Ready

Green status card showing the active configuration after a successful index build.

Sidebar Ready


Key Technical Decisions

Local embeddings, API only for generation — Embedding 17 chunks takes milliseconds locally. Routing every embedding through an API would add latency and cost with no quality benefit for a fixed document set.

Similarity cutoff at 0.3 — Without a cutoff, every query returns results even when nothing relevant exists. A threshold of 0.3 on this dataset filters out off-topic chunks while retaining all relevant ones.

Chunk overlap anchored at 10% of chunk size — The compare_configurations() function sets overlap to max(chunk_size // 10, 20). This keeps overlap proportional to chunk size rather than using a fixed value across all configurations.

google.genai directly, not the LlamaIndex wrapper — LlamaIndex's Google integration migrated from llama-index-llms-gemini to llama-index-llms-google-genai. Rather than adopting either wrapper, this project calls google.genai directly for full control, easier debugging, and consistency across the portfolio. The tradeoff is a deliberate architectural choice:

Category google-genai (Direct) LlamaIndex Wrapper
Setup speed Slower Faster
Control Full control Limited by abstraction
Flexibility Very high Medium
Debugging Easier (you see everything) Harder (hidden layers)
Ecosystem Just Google Full LlamaIndex ecosystem
RAG pipelines Manual build Built-in tools
Long-term stability High (official SDK) Medium (depends on LlamaIndex updates)
Vendor lock-in Low Higher
Boilerplate More Less

For a hand-rolled RAG pipeline where control, debuggability, and stability matter more than convenience, google.genai directly is the right call.

Evaluation cutoff set to 0.0 — When running evaluate_retrieval(), the similarity cutoff is set to 0.0 so the postprocessor doesn't filter results before we can measure hit rate. Production and evaluation use different cutoff values intentionally.

Robustness testing with Gemini-generated paraphrases — Rather than manually writing 100 test questions, Gemini was used to generate 3 semantically equivalent variations of each original question. This tests whether the pipeline relies on keyword matching or genuine semantic understanding — a meaningful distinction that most RAG evaluations skip entirely.

Grounded test set from actual document content — Every evaluation question was written against specific verifiable facts in the policy documents (exact dollar amounts, specific day counts, precise thresholds). This prevents the test set from being gamed by surface-level retrieval and ensures hit rate measurements reflect real-world accuracy.


Author

AJ Amatrudo — IT professional transitioning to data science and business intelligence.

  • GitHub: github.com/Drizztovski
  • Certifications: Python 3, SQL, Git & GitHub (Codecademy)
  • Training: Data Scientist: Analytics Specialist (Codecademy) + Data Science with AI Bootcamp
