GitHub - ABHAY627/RAG_Theory_Implementation: Two-stage RAG pipeline using LangChain, ChromaDB & HuggingFace embeddings — load, chunk, embed, and retrieve docs locally

📖 Table of Contents


🧠 What is RAG?
✨ Features
🏗️ Architecture
🛠️ Tech Stack
📦 Embedding Model
📁 Project Structure
⚡ Quick Start
🔄 Pipeline Walkthrough
📄 Sample Documents
⚙️ Configuration
🗺️ Roadmap
🤝 Contributing

🧠 What is RAG?

Retrieval Augmented Generation (RAG) is a technique that gives Large Language Models access to your own knowledge base at query time, which reduces hallucinations and keeps answers grounded in real data.

Without RAG:  User Question  →  LLM (limited knowledge)  →  Possibly wrong answer ❌
With RAG:     User Question  →  Vector Search  →  Relevant Context  →  LLM  →  Accurate answer ✅

Instead of retraining an expensive LLM on your data, RAG retrieves the most relevant document chunks at query time and injects them as context. The LLM then generates answers based on that real, up-to-date information.
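To make the "injects them as context" step concrete, here is a minimal sketch of how retrieved chunks and a question could be combined into one prompt. The helper name `build_prompt` and the prompt wording are assumptions for illustration, not code from this repository.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string.

    Hypothetical helper; the instruction wording is an assumption.
    """
    # Number each chunk so the separate pieces of context stay distinguishable.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(
        build_prompt(
            "Who founded Google?",
            ["Google was founded in 1998 by Larry Page and Sergey Brin."],
        )
    )
```

The resulting string is what would be sent to the LLM in place of the bare question.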

✨ Features

📂 Auto Document Loading — scans an entire directory of .txt files
🧩 Intelligent Chunking — overlapping windows preserve cross-chunk context
🔢 Local Embeddings — runs on your machine with no API key needed for ingestion; only the one-time model download needs network access
💾 Persistent Vector Store — ChromaDB saves embeddings to disk; skip re-ingestion on restart

🔍 Cosine Similarity Search — HNSW index for fast approximate nearest-neighbour retrieval
♻️ Smart Cache Check — automatically detects existing vector store and skips re-processing
🔌 LLM Ready — plug in Google Gemini (or any LangChain LLM) for full Q&A generation
🪟 Windows Friendly — UTF-8 loader config handles Windows encoding edge cases

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        INGESTION PIPELINE                           │
│                                                                     │
│  docs/*.txt                                                         │
│      │                                                              │
│      ▼                                                              │
│  ┌─────────────────┐    ┌──────────────────────┐                   │
│  │  DirectoryLoader │───▶│  CharacterTextSplitter│                  │
│  │  (TextLoader)    │    │  chunk_size=1000      │                  │
│  └─────────────────┘    │  chunk_overlap=200    │                  │
│                          └──────────┬───────────┘                  │
│                                     │                               │
│                                     ▼                               │
│                          ┌──────────────────────┐                  │
│                          │  HuggingFaceEmbeddings│                  │
│                          │  all-MiniLM-L6-v2     │                  │
│                          │  (384-dim vectors)    │                  │
│                          └──────────┬───────────┘                  │
│                                     │                               │
│                                     ▼                               │
│                          ┌──────────────────────┐                  │
│                          │      ChromaDB         │                  │
│                          │  HNSW cosine index   │                  │
│                          │  db/chroma_db/        │                  │
│                          └──────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                        RETRIEVAL PIPELINE                           │
│                                                                     │
│  User Query                                                         │
│      │                                                              │
│      ▼                                                              │
│  ┌──────────────────────┐    ┌──────────────────────┐              │
│  │  HuggingFaceEmbeddings│───▶│   ChromaDB            │             │
│  │  (same model)         │    │   similarity_search   │             │
│  └──────────────────────┘    │   top-k chunks        │             │
│                               └──────────┬───────────┘             │
│                                          │                          │
│                                          ▼                          │
│                               ┌──────────────────────┐             │
│                               │  Retrieved Context    │             │
│                               │  (ready for LLM)      │             │
│                               └──────────────────────┘             │
└─────────────────────────────────────────────────────────────────────┘

🛠️ Tech Stack

Layer	Tool	Version	Purpose
🔗 Orchestration	LangChain	`0.3.25`	Pipeline glue — loaders, splitters, chains
🔗 Community Tools	langchain-community	`0.3.24`	`DirectoryLoader`, `TextLoader`
✂️ Text Splitting	langchain-text-splitters	`0.3.8`	`CharacterTextSplitter`
🤗 Embeddings	langchain-huggingface	`1.2.1`	Bridge to HuggingFace models
💾 Vector Store	ChromaDB	`0.6.3`	Persistent local vector database
🔗 Chroma Bridge	langchain-chroma	`1.1.0`	LangChain ↔ ChromaDB integration
🤖 Sentence Model	sentence-transformers	`3.4.1`	Loads and runs embedding models
🌐 LLM (optional)	Google Gemini	via `langchain-google-genai 2.1.4`	Generation layer (Q&A)
⚙️ Env Config	python-dotenv	`1.1.0`	Loads API keys from `.env`
🐍 Runtime	Python	`3.10+`	Core language

📦 Embedding Model

Model: sentence-transformers/all-MiniLM-L6-v2

Input Text  →  Tokenizer  →  MiniLM-L6 (6-layer transformer)  →  384-dim vector

Property	Value
Architecture	Sentence-BERT (SBERT)
Layers	6 transformer layers
Output Dimensions	384
Max Input Tokens	256 tokens
Model Size	~80 MB
License	Apache 2.0
Download	Automatic on first run (HuggingFace Hub)
Requires API Key	❌ No — fully local

Why this model?

Lightweight and fast — perfect for learning and local dev
Strong semantic similarity performance (trained on 1B+ sentence pairs)
No GPU required — runs well on CPU

Vector Similarity Metric used: Cosine Similarity (configured via hnsw:space: cosine in ChromaDB)
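For reference, cosine similarity can be written in a few lines of plain Python. This is a sketch of the metric itself, not of ChromaDB internals; as far as the ChromaDB documentation describes it, the cosine space reports a distance (1 minus the similarity), which the second function mirrors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)
```

Because only the angle matters, scaling a vector does not change its similarity to another vector.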

📁 Project Structure

Retrieval_Augmented_Generation/
│
├── 📂 docs/                        # Source documents for ingestion
│   ├── 📄 google.txt               # Google company overview
│   ├── 📄 microsoft.txt            # Microsoft company overview
│   └── 📄 nvidia.txt               # Nvidia company overview
│
├── 💾 db/
│   └── chroma_db/                  # Auto-created persistent vector store
│       ├── chroma.sqlite3          # ChromaDB metadata & collections
│       └── <uuid>/                 # HNSW index binary files
│           ├── data_level0.bin
│           ├── header.bin
│           ├── length.bin
│           └── link_lists.bin
│
├── 🐍 ingestion_pipeline.py        # Stage 1: Load → Chunk → Embed → Store
├── 🔍 retrieval_pipeline.py        # Stage 2: Query → Search → Return chunks
├── 📋 requirements.txt             # Pinned Python dependencies
├── 🔑 .env                         # API keys (gitignored)
├── 🚫 .gitignore                   # Protects secrets & large files
└── 📖 README.md                    # You are here!

⚡ Quick Start

Prerequisites

Python 3.10+
pip
~500 MB free disk space (for embedding model download)

1️⃣ Clone the repository

git clone https://github.com/yourusername/Retrieval_Augmented_Generation.git
cd Retrieval_Augmented_Generation

2️⃣ Create & activate virtual environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS / Linux
python -m venv venv
source venv/bin/activate

3️⃣ Install dependencies

pip install -r requirements.txt

💡 On first run, sentence-transformers will auto-download all-MiniLM-L6-v2 (~80MB) from HuggingFace Hub. No account or API key needed.

4️⃣ Configure environment (optional)

# .env is already set up — add your Gemini key to enable LLM generation
GOOGLE_API_KEY=your_gemini_api_key_here

5️⃣ Run ingestion pipeline

python ingestion_pipeline.py

Expected output:

=== RAG Document Ingestion Pipeline ===

Loading documents from 'docs'...
Loaded 3 document(s).

Splitting documents into chunks...
Created 9 chunk(s).

Creating embeddings and storing in ChromaDB...
Building vector store — this may take a moment...
Vector store saved to 'db/chroma_db' (9 vectors).

✅ Ingestion complete! Documents are ready for RAG queries.

6️⃣ Run retrieval pipeline

python retrieval_pipeline.py

🔄 Pipeline Walkthrough

📂 Stage 1 — Document Loading

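from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in docs/ as one Document per file
loader = DirectoryLoader(
    path="docs",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},  # Windows-safe
)
documents = loader.load()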

Scans the docs/ folder for all .txt files
Each file becomes a Document object with page_content + metadata (source path)
UTF-8 encoding specified explicitly to handle Windows codec issues
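To show what the loading step produces without installing anything, here is a standard-library sketch that mimics the result: one record per .txt file, holding the text plus the source path. The function `load_text_files` and the dict layout are illustrative assumptions, not how DirectoryLoader is implemented.

```python
from pathlib import Path


def load_text_files(directory: str) -> list[dict]:
    """Read every .txt file directly inside `directory` as UTF-8.

    Hypothetical stand-in that mirrors the shape of a loaded document:
    the text plus a metadata dict holding the source path.
    """
    documents = []
    for path in sorted(Path(directory).glob("*.txt")):
        documents.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return documents


if __name__ == "__main__":
    for doc in load_text_files("docs"):
        print(doc["metadata"]["source"], len(doc["page_content"]))
```

Passing the encoding explicitly avoids relying on the platform default, which is the same reason the loader above sets it.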

✂️ Stage 2 — Text Chunking

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,     # target maximum characters per chunk
    chunk_overlap=200,   # characters shared between consecutive chunks
    separator="\n",      # the only split point: newlines
)
chunks = text_splitter.split_documents(documents)

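Why overlap? If a key sentence spans a chunk boundary, chunk_overlap=200 repeats the end of one chunk at the start of the next, so the sentence is more likely to appear whole in at least one chunk. Note that CharacterTextSplitter splits only on the separator, so a single line longer than chunk_size stays as one oversized chunk.

chunk 1:  [----------------------|== overlap ==]
chunk 2:                         [== overlap ==|----------------------]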
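The overlap idea can be sketched as a plain sliding window over characters. This is a simplified illustration under the assumption of fixed-size windows; CharacterTextSplitter itself splits on the separator and then merges pieces, so its chunk boundaries will differ. The name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows sharing `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, which is what keeps boundary text available to both.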

🔢 Stage 3 — Embedding Generation

from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Each chunk's text is passed through the all-MiniLM-L6-v2 transformer model. Output: a 384-dimensional float vector that encodes the semantic meaning of the text.

Similar content → similar vectors → close in vector space.

💾 Stage 4 — Vector Store Persistence

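from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="db/chroma_db",              # written to disk, reused on restart
    collection_metadata={"hnsw:space": "cosine"},  # use cosine distance in the HNSW index
)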

ChromaDB stores vectors in an HNSW (Hierarchical Navigable Small World) index
HNSW enables fast approximate nearest-neighbour search without comparing the query against every stored vector
Data is persisted to disk, so it survives restarts without re-embedding
Cosine similarity metric used (compares the angle between vectors and ignores their length, a common choice for text embeddings)
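Since the store survives restarts, ingestion can be skipped when it already exists. The repository's own check is not shown in this README; one plausible sketch, using a hypothetical helper `vector_store_exists`, is to test whether the persist directory is present and non-empty.

```python
from pathlib import Path


def vector_store_exists(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(persist_directory)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if vector_store_exists("db/chroma_db"):
        print("Existing vector store found; skipping ingestion.")
    else:
        print("No vector store found; running ingestion.")
```

A directory check is cheap but coarse: it does not notice new or edited files in docs/, which is why the store has to be deleted before re-ingesting changed documents.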

🔍 Stage 5 — Retrieval (Similarity Search)

query = "Who founded Google?"  # any of the sample queries works
# Query is embedded with the SAME model used during ingestion
results = vectorstore.similarity_search(query, k=3)  # top-3 chunks as Document objects

At query time:

Your query string is embedded → 384-dim vector
ChromaDB searches for the k most similar vectors in the HNSW index
Returns the corresponding document chunks as context
These chunks can be passed to an LLM (e.g. Gemini) for answer generation
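The ranking behind the steps above can be illustrated with an exact linear scan over a handful of vectors. This is a sketch only: the 2-dimensional vectors and texts are made up, vectors are assumed to be non-zero, and ChromaDB uses an approximate HNSW index rather than scanning every vector. The name `top_k` is hypothetical.

```python
import math


def top_k(query_vec, stored, k=3):
    """Return the k (score, text) pairs most similar to query_vec.

    `stored` is a list of (vector, text) pairs. Exact linear scan, used here
    only to illustrate the ranking that an HNSW index approximates.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(cosine(query_vec, vec), text) for vec, text in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:k]


if __name__ == "__main__":
    # Toy data: invented 2-d vectors standing in for 384-dim embeddings.
    stored = [
        ([1.0, 0.0], "about GPUs"),
        ([0.0, 1.0], "about search"),
        ([0.9, 0.1], "about CUDA"),
    ]
    for score, text in top_k([1.0, 0.0], stored, k=2):
        print(f"{score:.3f}  {text}")
```

With the toy data, the two GPU-related texts rank above the unrelated one because their vectors point in nearly the same direction as the query.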

📄 Sample Documents

The docs/ folder ships with three tech company overviews — perfect for testing retrieval:

File	Topic	Key Facts
`google.txt`	Google LLC	Founded 1998 by Larry Page & Sergey Brin, Stanford PhD students
`microsoft.txt`	Microsoft Corp	Founded 1975 by Bill Gates & Paul Allen, acquired GitHub 2018
`nvidia.txt`	Nvidia Corp	Co-founded 1993 by Jensen Huang, GPU leader, CUDA platform

Sample queries you can test:

"Who founded Google?"
"When did Microsoft acquire GitHub?"
"What is CUDA?"
"Which company focuses on GPU computing?"

⚙️ Configuration

Chunking Parameters

Edit in ingestion_pipeline.py → split_documents():

Parameter	Default	Effect
`chunk_size`	`1000`	Larger = more context per chunk, fewer chunks
`chunk_overlap`	`200`	Larger = less chance of missing boundary info, but more chunks and more duplicated text
`separator`	`"\n"`	String the text is split on; CharacterTextSplitter has no fallback separators
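To get a feel for how these parameters trade off, the number of chunks can be estimated with a little arithmetic. This sketch assumes fixed-size sliding windows; separator-based splitting will produce somewhat different counts, so treat the result as a rough guide. The name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Estimate how many fixed-size windows cover `text_length` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((text_length - chunk_size) / step)


if __name__ == "__main__":
    print(estimate_chunk_count(2500))                     # defaults: 3
    print(estimate_chunk_count(2500, chunk_overlap=500))  # more overlap: 4
    print(estimate_chunk_count(2500, chunk_size=500, chunk_overlap=100))  # smaller chunks: 6
```

Raising the overlap or lowering the chunk size both increase the number of vectors to embed and store.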

Retrieval Parameters

Edit in retrieval_pipeline.py → similarity_search():

Parameter	Default	Effect
`k`	`3`	Number of chunks returned per query

Switching Embedding Models

# Faster, smaller (for prototyping)
HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")      # 384-dim, ~80MB

# Higher quality (for production)
HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")      # 768-dim, ~420MB

# Multilingual
HuggingFaceEmbeddings(model_name="paraphrase-multilingual-MiniLM-L12-v2")

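⚠️ If you change the embedding model, delete db/chroma_db/ and re-run ingestion. Vectors from different models are not comparable, even when the dimensions happen to match (both MiniLM models above produce 384-dim vectors).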
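One way to catch a model mismatch early is to record the model name at ingestion time and compare it before querying. This is a hypothetical guard, not part of this repository: the manifest file, its location, and the function names are all assumptions.

```python
import json
from pathlib import Path


def record_model(manifest_path: str, model_name: str) -> None:
    """Write the embedding model name to a small JSON manifest (hypothetical file)."""
    path = Path(manifest_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"model_name": model_name}), encoding="utf-8")


def check_model(manifest_path: str, model_name: str) -> None:
    """Raise ValueError if the store was built with a different embedding model."""
    path = Path(manifest_path)
    if not path.exists():
        return  # nothing recorded yet, so nothing to compare against
    stored = json.loads(path.read_text(encoding="utf-8"))["model_name"]
    if stored != model_name:
        raise ValueError(
            f"Vector store was built with {stored!r}, not {model_name!r}; "
            "delete the store and re-run ingestion."
        )
```

Calling `record_model` after ingestion and `check_model` before retrieval turns a silent quality problem into an explicit error.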

Adding Your Own Documents

Drop any .txt file into docs/
Delete the existing vector store: rmdir /s /q db\chroma_db (Windows) or rm -rf db/chroma_db (macOS / Linux)
Re-run: python ingestion_pipeline.py
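If you would rather not remember a per-platform delete command, the same reset can be done from Python. This is a small standard-library sketch; the helper name `reset_vector_store` is hypothetical, and the default path is the persist directory used above.

```python
import shutil
from pathlib import Path


def reset_vector_store(persist_directory: str = "db/chroma_db") -> bool:
    """Delete the persisted vector store directory if it exists.

    Returns True if something was removed, False if there was nothing to delete.
    """
    path = Path(persist_directory)
    if not path.is_dir():
        return False
    shutil.rmtree(path)  # removes the directory and everything inside it
    return True


if __name__ == "__main__":
    removed = reset_vector_store()
    print("Vector store removed." if removed else "No vector store to remove.")
```

Run it before `python ingestion_pipeline.py` so the store is rebuilt from the current contents of docs/.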

🗺️ Roadmap

🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you'd like to change.

# Fork → Clone → Branch → Commit → Push → PR
git checkout -b feature/your-feature-name
git commit -m "feat: add your feature"
git push origin feature/your-feature-name

Built with ❤️ to learn RAG from the ground up
