Retrieval Augmented Generation (RAG) is a technique that gives Large Language Models access to your own knowledge base at query time, which reduces hallucinations and keeps answers grounded in real data.
Without RAG: User Question → LLM (limited knowledge) → Possibly wrong answer ❌
With RAG: User Question → Vector Search → Relevant Context → LLM → Accurate answer ✅
Instead of retraining an expensive LLM on your data, RAG retrieves the most relevant document chunks at query time and injects them as context. The LLM then generates answers based on that real, up-to-date information.
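As a rough illustration of the "inject them as context" step, the sketch below assembles a prompt from retrieved chunks. The function name `build_prompt` and the prompt wording are assumptions made for this example; they are not taken from this repository.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so individual passages stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunk, for illustration only.
prompt = build_prompt(
    "Who founded Google?",
    ["Google was founded in 1998 by Larry Page and Sergey Brin."],
)
print(prompt)
```

The resulting string is what would be sent to the LLM in place of the bare question.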
┌─────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ docs/*.txt │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ DirectoryLoader │───▶│ CharacterTextSplitter│ │
│ │ (TextLoader) │ │ chunk_size=1000 │ │
│ └─────────────────┘ │ chunk_overlap=200 │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ HuggingFaceEmbeddings│ │
│ │ all-MiniLM-L6-v2 │ │
│ │ (384-dim vectors) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ ChromaDB │ │
│ │ HNSW cosine index │ │
│ │ db/chroma_db/ │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL PIPELINE │
│ │
│ User Query │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ HuggingFaceEmbeddings│───▶│ ChromaDB │ │
│ │ (same model) │ │ similarity_search │ │
│ └──────────────────────┘ │ top-k chunks │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Retrieved Context │ │
│ │ (ready for LLM) │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
| Layer | Tool | Version | Purpose |
|---|---|---|---|
| 🔗 Orchestration | LangChain | 0.3.25 | Pipeline glue: loaders, splitters, chains |
| 🔗 Community Tools | langchain-community | 0.3.24 | DirectoryLoader, TextLoader |
| ✂️ Text Splitting | langchain-text-splitters | 0.3.8 | CharacterTextSplitter |
| 🤗 Embeddings | langchain-huggingface | 1.2.1 | Bridge to HuggingFace models |
| 💾 Vector Store | ChromaDB | 0.6.3 | Persistent local vector database |
| 🔗 Chroma Bridge | langchain-chroma | 1.1.0 | LangChain ↔ ChromaDB integration |
| 🤖 Sentence Model | sentence-transformers | 3.4.1 | Loads and runs embedding models |
| 🌐 LLM (optional) | Google Gemini | via langchain-google-genai 2.1.4 | Generation layer (Q&A) |
| ⚙️ Env Config | python-dotenv | 1.1.0 | Loads API keys from .env |
| 🐍 Runtime | Python | 3.10+ | Core language |
Model: sentence-transformers/all-MiniLM-L6-v2
Input Text → Tokenizer → MiniLM-L6 (6-layer transformer) → 384-dim vector
| Property | Value |
|---|---|
| Architecture | Sentence-BERT (SBERT) |
| Layers | 6 transformer layers |
| Output Dimensions | 384 |
| Max Input Tokens | 256 tokens |
| Model Size | ~80 MB |
| License | Apache 2.0 |
| Download | Automatic on first run (HuggingFace Hub) |
| Requires API Key | ❌ No — fully local |
Why this model?
- Lightweight and fast — perfect for learning and local dev
- Strong semantic similarity performance (trained on 1B+ sentence pairs)
- No GPU required — runs well on CPU
Vector Similarity Metric used: Cosine Similarity (configured via hnsw:space: cosine in ChromaDB)
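For reference, cosine similarity can be written in a few lines of plain Python. This is a minimal sketch of the metric itself, not of ChromaDB's internals; the function name and the toy vectors are illustrative. Note that vector stores configured for a cosine space commonly report a distance of `1 - similarity`, so smaller values mean closer matches.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because only the angle matters, scaling a vector (for example, a longer document producing a larger-magnitude embedding) does not change the score.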
Retrieval_Augmented_Generation/
│
├── 📂 docs/ # Source documents for ingestion
│ ├── 📄 google.txt # Google company overview
│ ├── 📄 microsoft.txt # Microsoft company overview
│ └── 📄 nvidia.txt # Nvidia company overview
│
├── 💾 db/
│ └── chroma_db/ # Auto-created persistent vector store
│ ├── chroma.sqlite3 # ChromaDB metadata & collections
│ └── <uuid>/ # HNSW index binary files
│ ├── data_level0.bin
│ ├── header.bin
│ ├── length.bin
│ └── link_lists.bin
│
├── 🐍 ingestion_pipeline.py # Stage 1: Load → Chunk → Embed → Store
├── 🔍 retrieval_pipeline.py # Stage 2: Query → Search → Return chunks
├── 📋 requirements.txt # Pinned Python dependencies
├── 🔑 .env # API keys (gitignored)
├── 🚫 .gitignore # Protects secrets & large files
└── 📖 README.md # You are here!
- Python 3.10+
- pip
- ~500 MB free disk space (for embedding model download)
git clone https://github.com/yourusername/Retrieval_Augmented_Generation.git
cd Retrieval_Augmented_Generation
# Windows
python -m venv venv
venv\Scripts\activate
# macOS / Linux
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
💡 On first run, sentence-transformers will auto-download all-MiniLM-L6-v2 (~80MB) from HuggingFace Hub. No account or API key needed.
# .env is already set up; add your Gemini key to enable LLM generation
GOOGLE_API_KEY=your_gemini_api_key_here
python ingestion_pipeline.py
Expected output:
=== RAG Document Ingestion Pipeline ===
Loading documents from 'docs'...
Loaded 3 document(s).
Splitting documents into chunks...
Created 9 chunk(s).
Creating embeddings and storing in ChromaDB...
Building vector store — this may take a moment...
Vector store saved to 'db/chroma_db' (9 vectors).
✅ Ingestion complete! Documents are ready for RAG queries.
python retrieval_pipeline.py
📂 Stage 1 — Document Loading
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
path="docs",
glob="*.txt",
loader_cls=TextLoader,
loader_kwargs={"encoding": "utf-8"}, # Windows-safe
)
documents = loader.load()
- Scans the docs/ folder for all .txt files
- Each file becomes a Document object with page_content + metadata (source path)
- UTF-8 encoding specified explicitly to handle Windows codec issues
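To make the loading step less of a black box, here is a standard-library sketch of roughly what it produces: one record per matching file, holding the text and its source path. The helper `load_text_files` and the plain-dict record shape are assumptions for illustration; the real pipeline uses LangChain's `DirectoryLoader` and `Document` objects.

```python
from pathlib import Path


def load_text_files(folder: str, pattern: str = "*.txt") -> list[dict]:
    """Read every matching file into a record with page_content and metadata."""
    documents = []
    # sorted() gives a stable order regardless of the filesystem.
    for path in sorted(Path(folder).glob(pattern)):
        documents.append(
            {
                # Explicit UTF-8 avoids platform-dependent default encodings.
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return documents
```

Calling `load_text_files("docs")` from the repository root would return one record per `.txt` file in `docs/`.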
✂️ Stage 2 — Text Chunking
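from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,    # target maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
    separator="\n",     # text is split on newlines, then merged into chunks
)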
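chunks = text_splitter.split_documents(documents)
Why overlap?
If a key sentence spans a chunk boundary, chunk_overlap=200 lets up to 200 characters repeat across neighbouring chunks, so the sentence is more likely to appear whole in at least one of them and retrieval is less likely to miss critical context.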
Document: [-------- chunk 1 --------][--- overlap ---][-------- chunk 2 --------]
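The effect of overlap is easiest to see with a simplified fixed-window chunker. This is only a sketch: `chunk_with_overlap` is a hypothetical helper, and it slides a character window rather than splitting on a separator and merging pieces the way `CharacterTextSplitter` does.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by size minus overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as `chunk_overlap=200` at a much smaller scale.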
🔢 Stage 3 — Embedding Generation
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
Each chunk's text is passed through the all-MiniLM-L6-v2 transformer model.
Output: a 384-dimensional float vector that encodes the semantic meaning of the text.
Similar content → similar vectors → close in vector space.
💾 Stage 4 — Vector Store Persistence
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory="db/chroma_db",
collection_metadata={"hnsw:space": "cosine"},
)
- ChromaDB stores vectors in an HNSW (Hierarchical Navigable Small World) index
- HNSW enables fast approximate nearest-neighbour search without comparing the query against every stored vector
- Data is persisted to disk, so it survives restarts without re-embedding
- Cosine similarity metric used (angle between vectors, a common choice for text embeddings)
🔍 Stage 5 — Retrieval (Similarity Search)
# Query is embedded with the SAME model used during ingestion
results = vectorstore.similarity_search(query, k=3)
At query time:
- Your query string is embedded → 384-dim vector
- ChromaDB searches for the k most similar vectors in the HNSW index
- Returns the corresponding document chunks as context
- These chunks can be passed to an LLM (e.g. Gemini) for answer generation
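Conceptually, the search step ranks stored vectors by their similarity to the query vector. The sketch below does this by exhaustive comparison on toy 2-D vectors; an HNSW index approximates the same ranking without scanning every vector. The name `top_k_chunks` and the sample vectors are illustrative assumptions, standing in for real 384-dimensional embeddings.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_chunks(
    query_vec: list[float],
    indexed: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    best = heapq.nlargest(
        k, indexed, key=lambda item: cosine_similarity(query_vec, item[1])
    )
    return [text for text, _ in best]


# Toy index: (chunk text, embedding) pairs with made-up 2-D vectors.
index = [
    ("Google overview", [1.0, 0.0]),
    ("Nvidia overview", [0.0, 1.0]),
    ("Microsoft overview", [0.7, 0.7]),
]
results = top_k_chunks([0.9, 0.1], index, k=2)
print(results)  # ['Google overview', 'Microsoft overview']
```

Exhaustive search like this is exact but scales linearly with the number of chunks, which is the cost an approximate index is designed to avoid.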
The docs/ folder ships with three tech company overviews — perfect for testing retrieval:
| File | Topic | Key Facts |
|---|---|---|
| google.txt | Google LLC | Founded 1998 by Larry Page & Sergey Brin, Stanford PhD students |
| microsoft.txt | Microsoft Corp | Founded 1975 by Bill Gates & Paul Allen, acquired GitHub 2018 |
| nvidia.txt | Nvidia Corp | Founded 1993 by Jensen Huang, GPU leader, CUDA platform |
Sample queries you can test:
"Who founded Google?"
"When did Microsoft acquire GitHub?"
"What is CUDA?"
"Which company focuses on GPU computing?"
Edit in ingestion_pipeline.py → split_documents():
| Parameter | Default | Effect |
|---|---|---|
| chunk_size | 1000 | Larger = more context per chunk, fewer chunks |
| chunk_overlap | 200 | Larger = less chance of missing boundary info |
| separator | "\n" | Character used to prefer split points |
Edit in retrieval_pipeline.py → similarity_search():
| Parameter | Default | Effect |
|---|---|---|
| k | 3 | Number of chunks returned per query |
# Faster, smaller (for prototyping)
HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") # 384-dim, ~80MB
# Higher quality (for production)
HuggingFaceEmbeddings(model_name="all-mpnet-base-v2") # 768-dim, ~420MB
# Multilingual
HuggingFaceEmbeddings(model_name="paraphrase-multilingual-MiniLM-L12-v2")
⚠️ If you change the embedding model, delete db/chroma_db/ and re-run ingestion: dimensions must match!
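The reason for the warning is that vectors of different lengths cannot be compared. A guard like the one below makes the failure explicit; `check_embedding_dimension` is a hypothetical helper written for this example and is not part of the repository or of ChromaDB.

```python
def check_embedding_dimension(stored_dim: int, vector: list[float]) -> None:
    """Raise if a new embedding cannot be compared with the stored ones."""
    if len(vector) != stored_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, but the store holds "
            f"{stored_dim}-dimensional vectors; delete the store and re-ingest"
        )


# 384 matches all-MiniLM-L6-v2; 768 matches all-mpnet-base-v2.
check_embedding_dimension(384, [0.0] * 384)  # same model: passes silently
try:
    check_embedding_dimension(384, [0.0] * 768)  # simulated model switch
except ValueError as err:
    print(err)
```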
- Drop any .txt file into docs/
- Delete the existing vector store: rmdir /s /q db\chroma_db (Windows) or rm -rf db/chroma_db (macOS / Linux)
- Re-run: python ingestion_pipeline.py
- Document ingestion pipeline (load → chunk → embed → store)
- Retrieval pipeline (query → similarity search → top-k chunks)
- Persistent ChromaDB vector store
- Smart cache detection (skip re-ingestion)
- Full Q&A chain with Google Gemini
- Streamlit / Gradio web UI
- PDF & DOCX document support
- Multi-query retrieval with query expansion
- Reranking with Cross-Encoder models
- Evaluation metrics (MRR, NDCG, faithfulness)
Pull requests are welcome! For major changes, please open an issue first to discuss what you'd like to change.
# Fork → Clone → Branch → Commit → Push → PR
git checkout -b feature/your-feature-name
git commit -m "feat: add your feature"
git push origin feature/your-feature-name