An experimental comparison of three Retrieval-Augmented Generation (RAG) approaches to understand the effectiveness of HyDE (Hypothetical Document Embeddings) versus traditional RAG techniques.
This project evaluates three different RAG implementations:
- Vanilla RAG - Direct embedding of user queries for semantic search
- Two-Step RAG - Query rewriting followed by semantic search
- HyDE RAG - Hypothetical answer generation followed by semantic search
The experiment uses the Ragas Airlines FAQ dataset with a synthetic ground truth dataset generated using RAGAS to measure and compare performance across multiple dimensions.
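The three variants differ mainly in which text is embedded for the search. A minimal sketch of that difference, assuming a generic `llm` callable: the prompt wording, `build_search_text`, and `fake_llm` are hypothetical names for illustration and are not taken from the project's `prompts.yaml` or source files.

```python
from typing import Callable

# Hypothetical prompt wording; the project's real templates live in prompts.yaml.
REWRITE_PROMPT = "Rewrite this question as a concise search query: {question}"
HYDE_PROMPT = "Write a short, plausible answer to this question: {question}"


def build_search_text(variant: str, question: str, llm: Callable[[str], str]) -> str:
    """Return the text that would be embedded for semantic search."""
    if variant == "vanilla":
        # Vanilla RAG: embed the user query as-is.
        return question
    if variant == "two_step":
        # Two-Step RAG: embed an LLM-rewritten version of the query.
        return llm(REWRITE_PROMPT.format(question=question))
    if variant == "hyde":
        # HyDE RAG: embed an LLM-generated hypothetical answer.
        return llm(HYDE_PROMPT.format(question=question))
    raise ValueError(f"unknown variant: {variant}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"<model output for: {prompt}>"


if __name__ == "__main__":
    for variant in ("vanilla", "two_step", "hyde"):
        print(variant, "->", build_search_text(variant, "Can I change my flight?", fake_llm))
```

Everything downstream of this step (embedding, vector search, answer generation) can stay the same across variants, which is what makes a side-by-side comparison possible.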
HyDE (Hypothetical Document Embeddings) is a retrieval technique that generates a hypothetical answer to the user's question before performing semantic search. Instead of embedding the question directly, the system:
- Generates a plausible answer using an LLM
- Embeds this hypothetical answer
- Uses these embeddings to search for similar documents
The hypothesis is that answers are semantically closer to other answers in the vector space, potentially improving retrieval accuracy compared to searching with questions.
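The steps above can be sketched in plain Python. This is an illustration of the control flow only, under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, `fake_answer` stands in for an LLM call, and the two FAQ snippets are invented. Word overlap is a crude proxy for semantic similarity, so the sketch shows where the hypothetical answer enters the pipeline and says nothing about real retrieval quality.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(search_text: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by similarity to the search text and keep the top k."""
    query_vec = embed(search_text)
    return sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)[:k]


def hyde_retrieve(question: str, docs: list[str],
                  generate_answer: Callable[[str], str], k: int = 1) -> list[str]:
    hypothetical = generate_answer(question)  # 1. generate a plausible answer
    return retrieve(hypothetical, docs, k)    # 2-3. embed it and search with it


# Invented FAQ snippets, for illustration only.
DOCS = [
    "You can change your flight online up to two hours before departure.",
    "Checked baggage allowance is one bag of up to 23 kg per passenger.",
]
QUESTION = "How much luggage may I bring?"


def fake_answer(question: str) -> str:
    """Stand-in for an LLM-generated hypothetical answer."""
    return "Each passenger may bring one checked bag weighing up to 23 kg."


if __name__ == "__main__":
    print("question search:", retrieve(QUESTION, DOCS))
    print("HyDE search:    ", hyde_retrieve(QUESTION, DOCS, fake_answer))
```

With these toy inputs the question shares no words with either snippet, so the direct search falls back to list order, while the hypothetical answer overlaps heavily with the baggage entry and retrieves it.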
- Framework: LangGraph for orchestrating RAG workflows
- Vector Store: Qdrant for document storage and retrieval
- Embeddings: Ollama (mxbai-embed-large) for local embedding generation
- LLM: Ollama Qwen3 for query rewriting, hypothetical answer generation, and final response generation; OpenAI GPT-4o-mini for evaluation
- Evaluation: RAGAS framework with metrics for accuracy, recall, precision, and groundedness
- Tracking: LangSmith for observability and tracing
- Python 3.13+
- Docker (for Qdrant)
- Ollama with the qwen3 and mxbai-embed-large models
- Clone the repository:
git clone "https://github.com/thapar25/HyDE"
cd HyDE
- Install dependencies using uv:
uv sync
- Start Qdrant using Docker:
docker-compose up -d
- Configure environment variables:
Create a .env file with the following:
QDRANT_HOST=localhost
QDRANT_PORT=6333
COLLECTION_NAME=faqs
OPENAI_API_KEY=your_openai_api_key
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_PROJECT=hyde-rag
LANGSMITH_API_KEY=your_langsmith_api_key
- Pull the Ollama models:
ollama pull qwen3
ollama pull mxbai-embed-large
- Generate synthetic ground truth dataset:
jupyter notebook notebooks/1-generate_ground_truth.ipynb
- Ingest data into Qdrant:
jupyter notebook notebooks/2-ingest_data.ipynb
Execute the evaluation notebook to compare all three RAG variants:
jupyter notebook notebooks/3-evals.ipynb
The evaluation measures:
- Answer Accuracy: How correct the generated answer is compared to the ground truth
- Context Recall: How much of the relevant context was retrieved
- Context Precision: How much of the retrieved context is relevant
- Response Groundedness: How well the response is grounded in retrieved context
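To build intuition for the two context metrics, here is a simplified set-based sketch. It is not how RAGAS computes them (RAGAS relies on LLM or reference-based judgments); it assumes each chunk has an ID and that the relevant IDs are known in advance. The function names and the `faq-*` IDs are hypothetical.

```python
def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that appear among the retrieved ones."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


if __name__ == "__main__":
    retrieved = ["faq-1", "faq-7", "faq-3"]            # hypothetical chunk IDs
    relevant = {"faq-1", "faq-3", "faq-4", "faq-9"}    # hypothetical ground truth
    print("recall:   ", context_recall(retrieved, relevant))
    print("precision:", context_precision(retrieved, relevant))
```

In this example two of the four relevant chunks were found, and two of the three retrieved chunks are relevant, so a retriever can score well on one metric while doing poorly on the other.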
Analyze results:
jupyter notebook notebooks/4-analyze.ipynb
The experiment revealed mixed results across different RAG approaches:
- Different variants excelled at different metrics
- Performance varied based on query complexity and context requirements
- Trade-offs exist between retrieval quality and computational overhead
Detailed analysis and visualizations can be found in the artifacts/ directory after running the evaluation notebooks.
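One way to check which variant leads on which metric is to average per-question scores and pick the top variant for each metric. The sketch below assumes scores are available as simple rows. The layout of the files in artifacts/ is not described here, so the row format, the variant names `A` and `B`, and every number are made-up placeholders used only to exercise the code; they are not results from this experiment.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows with invented scores; NOT actual experiment results.
ROWS = [
    {"variant": "A", "metric": "context_recall", "score": 0.5},
    {"variant": "A", "metric": "context_recall", "score": 1.0},
    {"variant": "B", "metric": "context_recall", "score": 0.5},
    {"variant": "B", "metric": "context_recall", "score": 0.5},
    {"variant": "A", "metric": "groundedness", "score": 0.25},
    {"variant": "B", "metric": "groundedness", "score": 0.75},
]


def mean_scores(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Return metric -> variant -> mean score."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for row in rows:
        grouped[(row["metric"], row["variant"])].append(row["score"])
    means: dict[str, dict[str, float]] = defaultdict(dict)
    for (metric, variant), scores in grouped.items():
        means[metric][variant] = mean(scores)
    return dict(means)


def best_per_metric(means: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return metric -> name of the variant with the highest mean score."""
    return {metric: max(by_variant, key=by_variant.get)
            for metric, by_variant in means.items()}


if __name__ == "__main__":
    means = mean_scores(ROWS)
    for metric, winner in best_per_metric(means).items():
        print(f"{metric}: best variant = {winner} ({means[metric][winner]:.2f})")
```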
Because the graphs are built with LangGraph, the workflows can be tested in the LangSmith UI:
langgraph dev
HyDE/
├── src/
│   ├── simple_rag.py       # Vanilla RAG implementation
│   ├── two_step.py         # Query rewriting RAG implementation
│   ├── hyde_rag.py         # HyDE RAG implementation
│   ├── utils.py            # Shared utilities and state definitions
│   └── prompts.yaml        # Prompt templates
├── notebooks/
│   ├── 1-generate_ground_truth.ipynb
│   ├── 2-ingest_data.ipynb
│   ├── 3-evals.ipynb
│   └── 4-analyze.ipynb
├── artifacts/              # Generated datasets and results
├── docker-compose.yaml     # Qdrant configuration
└── pyproject.toml          # Project dependencies

