Retrieval-Augmented Generation (RAG) is the most practical way to build AI systems that work with your own data. But the gap between a basic RAG demo and a production system is enormous. This repo documents the patterns that actually work.
Large language models are powerful, but they hallucinate, have knowledge cutoffs, and cannot access your private data. RAG mitigates all three by retrieving relevant context before generating responses.
The challenge is that naive RAG — chunk documents, embed them, retrieve top-k, generate — works great in demos but fails in production. Real-world documents are messy, queries are ambiguous, and users expect accurate answers.
| Pattern | Complexity | Best For |
|---|---|---|
| Naive RAG | Low | Simple Q&A, documentation |
| Sentence Window | Medium | Precise answers from long docs |
| Parent-Child | Medium | Hierarchical documents |
| Hybrid Search | Medium | Mixed query types |
| Agentic RAG | High | Complex multi-step queries |
| Graph RAG | High | Connected knowledge bases |
The starting point. Chunk documents → embed → store in vector DB → retrieve top-k → generate.
When it works: Simple documentation search, FAQ systems, single-topic knowledge bases.
When it fails: Complex queries, documents with tables/images, queries requiring reasoning across multiple documents.
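The whole naive pipeline fits in a few lines. A minimal sketch, with a toy bag-of-characters `embed` standing in for a real embedding model and a plain list standing in for the vector DB (both are illustrative stand-ins, not part of any specific library):

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a normalized bag-of-letters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Embed the query, score every chunk, keep the k most similar.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = [
    "Python is a programming language.",
    "The Eiffel Tower is in Paris.",
    "Snakes are reptiles.",
]
top = retrieve_top_k("What language is Python?", docs, k=1)
```

In production the `embed` call goes to a model API and the linear scan becomes an ANN index, but the shape of the pipeline stays exactly this.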
Instead of retrieving entire chunks, retrieve the most relevant sentence and expand the context window around it. This gives the LLM precise context without noise.
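A minimal sketch of the idea, using query-term overlap as a stand-in for embedding similarity (the names and the naive sentence splitter are illustrative, not from any particular framework):

```python
import re

def sentence_window(doc: str, query_terms: set[str], window: int = 1) -> str:
    # Naive sentence split on end-of-sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    # Score each sentence by query-term overlap (stand-in for vector similarity).
    def score(s: str) -> int:
        return len(query_terms & set(s.lower().split()))
    best = max(range(len(sentences)), key=lambda i: score(sentences[i]))
    # Expand `window` sentences on each side before handing context to the LLM.
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])

doc = "Cats purr. Dogs bark loudly. Fish swim. Birds fly."
ctx = sentence_window(doc, {"dogs", "bark"}, window=1)
```

Retrieval stays precise (one sentence is matched), but the LLM sees the neighboring sentences too.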
Embed small chunks for precise retrieval, but pass the parent chunk (larger context) to the LLM. Best of both worlds: precise retrieval + sufficient context.
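The key data structure is a child-to-parent mapping built at index time. A minimal sketch, again using word overlap as a stand-in for embedding similarity (function names and the fixed-size character chunker are hypothetical):

```python
def build_index(parents: list[str], child_size: int = 40) -> list[tuple[str, int]]:
    # Split each parent into small child chunks; every child remembers
    # which parent it came from.
    index = []
    for pid, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            index.append((parent[i:i + child_size], pid))
    return index

def retrieve_parent(query: str, parents: list[str], index) -> str:
    # Match on the small child (precise), return the parent (full context).
    def overlap(child: str) -> int:
        return len(set(query.lower().split()) & set(child.lower().split()))
    _best_child, pid = max(index, key=lambda entry: overlap(entry[0]))
    return parents[pid]

parents = [
    "The billing API uses OAuth2 tokens that expire hourly.",
    "Our shipping policy covers international orders only.",
]
hit = retrieve_parent("oauth2 tokens", parents, build_index(parents))
```

Real implementations chunk on document structure (sections, paragraphs) rather than a fixed character count, but the mapping is the same.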
Combine vector similarity search with keyword search (BM25). Vector search handles semantic queries; keyword search handles exact matches, names, and codes.
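A common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch (the document IDs are made up for illustration):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc.
    # k=60 is the constant from the original RRF paper.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_c", "doc_b"]   # semantic ranking
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # BM25 ranking
fused = rrf([vector_hits, keyword_hits])
```

Documents that rank well in both lists (here `doc_a`) float to the top, while documents found by only one retriever still survive.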
Use an AI agent to plan the retrieval strategy. The agent decides: which sources to search, how to reformulate the query, whether to do multiple searches, and how to synthesize results.
This is the pattern we use most at ShiftAI for enterprise deployments because real-world queries rarely map cleanly to a single retrieval step.
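At its core, agentic RAG is a loop: plan a sub-query, search, inspect the evidence, repeat or stop. A minimal sketch of that control flow, where `plan` stands in for an LLM call (the rule-based `plan` and toy corpus below are purely illustrative, not how a real planner works):

```python
def agentic_retrieve(query: str, search, plan, max_steps: int = 3) -> list[str]:
    # `plan` stands in for an LLM: given the query and evidence gathered so
    # far, it returns the next sub-query to run, or None when it has enough.
    evidence: list[str] = []
    for _ in range(max_steps):
        sub_query = plan(query, evidence)
        if sub_query is None:
            break
        evidence.extend(search(sub_query))
    return evidence

# Toy two-hop example: the answer requires chaining two searches.
corpus = {
    "alice team": ["Alice works on the billing team."],
    "billing team lead": ["Bob leads the billing team."],
}

def search(q: str) -> list[str]:
    return corpus.get(q, [])

def plan(query: str, evidence: list[str]):
    # Rule-based stand-in for the LLM planner.
    if not evidence:
        return "alice team"
    if len(evidence) == 1:
        return "billing team lead"
    return None

evidence = agentic_retrieve("Who leads Alice's team?", search, plan)
```

The `max_steps` cap matters in production: without it, a confused planner can loop indefinitely and burn tokens.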
Building RAG for production requires attention to:
- Chunking strategy — one size does not fit all. Tables, code, and prose need different chunking
- Embedding model selection — balance quality vs latency vs cost
- Reranking — retrieve more, rerank to top-k. Dramatically improves relevance
- Evaluation — you need automated eval pipelines, not just vibes
- Monitoring — track retrieval quality, hallucination rates, user satisfaction
- Caching — cache embeddings and common queries to reduce latency and cost
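As one concrete illustration of the caching point above, a minimal embedding cache keyed by a hash of the normalized text, so identical chunks and repeated queries skip the model call (the class and its normalization rule are a sketch, not a prescribed design):

```python
import hashlib

class EmbeddingCache:
    """Memoize an embedding function; repeated text skips the model call."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}
        self.hits = 0

    def embed(self, text: str) -> list[float]:
        # Normalize before hashing so trivial variants share a cache entry.
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.embed_fn(text)
        return self.store[key]

calls: list[str] = []

def fake_model(text: str) -> list[float]:
    calls.append(text)          # track how often the "model" is hit
    return [float(len(text))]

cache = EmbeddingCache(fake_model)
cache.embed("Hello world")
cache.embed("  hello world ")   # normalizes to the same key: cache hit
```

In production the `store` dict would typically be Redis or a database table, and you would version the key by embedding model so a model upgrade invalidates stale vectors.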
For a deeper dive into production RAG architecture, see our RAG implementation guide.
```
patterns/
  naive-rag.md        # Basic RAG walkthrough
  sentence-window.md  # Sentence window retrieval
  parent-child.md     # Hierarchical chunking
  hybrid-search.md    # Vector + keyword search
  agentic-rag.md      # Agent-driven retrieval
examples/
  README.md           # Code examples index
```
PRs welcome. Share your production RAG patterns.
MIT — see LICENSE.
Built by ShiftAI — we build production RAG systems that actually work.