
Agentic RAG Benchmark

Benchmarking agentic RAG systems and their viability against long-context-optimized recursive methods such as Recursive RAG and Recursive LMs.

Target: ACL/NAACL 2026 publication

Current Results

Vanilla RAG Baseline (HotpotQA Full Validation - 7,405 questions)

| Retriever | Exact Match | F1 Score | Cost  |
|-----------|-------------|----------|-------|
| Dense     | 45.0%       | 59.5%    | $0.79 |
| Hybrid    | 44.1%       | 58.6%    | $0.76 |
| BM25      | 38.2%       | 51.5%    | $0.77 |

Model: gpt-4o-mini | Dense retrieval uses text-embedding-3-small
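The Hybrid row combines the dense and BM25 rankings. The repo's actual fusion strategy is not shown here, but a common way to merge two ranked lists is reciprocal rank fusion (RRF); a minimal sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids via RRF.

    rankings: list of ranked lists (best doc first).
    k: smoothing constant from the standard RRF formula 1 / (k + rank).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d1", "d2", "d3"]   # hypothetical dense-retriever ranking
bm25_hits = ["d2", "d4", "d1"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF needs no score calibration between retrievers, which is why it is a popular default for hybrid setups; `dense_hits` and `bm25_hits` above are illustrative, not outputs of this repo.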

ReAct RAG (HotpotQA Full Validation)

Results pending. Run the ReAct configs in configs/react_*_full.yaml to populate this table.

By Question Type (Dense Retriever)

| Type       | Count | Exact Match | F1    |
|------------|-------|-------------|-------|
| Bridge     | 5,918 | 39.6%       | 55.4% |
| Comparison | 1,487 | 66.3%       | 75.9% |
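Exact Match and F1 on HotpotQA are conventionally computed over normalized answers (lowercased, punctuation and articles stripped). A minimal sketch of that convention, assuming the standard SQuAD-style normalization rather than this repo's exact evaluation code:

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(pred) == normalize_answer(gold))

def f1_score(pred, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(pred).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, `exact_match("The Eiffel Tower", "Eiffel Tower")` is 1.0 because article removal makes the normalized forms identical.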

Quick Start

1. Setup Environment

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Unix/Mac)
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

2. Configure API Keys

# Copy example env file
cp .env.example .env

# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-your-key-here

3. Validate Pipeline

python scripts/test_pipeline.py

4. Run Baseline Experiment

# Run with Dense retriever (recommended - best performance)
python scripts/run_experiment.py --config configs/vanilla_dense_full.yaml

# Or run a quick test with 100 questions
python scripts/run_experiment.py --config configs/vanilla_dense.yaml

5. Analyze Results

# View summary of a specific run
python scripts/analyze_results.py --results results/<run_id> --breakdown

# Compare multiple runs
python scripts/analyze_results.py --results results --compare

Project Structure

agentic_rag_benchmark/
├── src/
│   ├── core/           # Base abstractions
│   ├── architectures/  # RAG implementations
│   ├── retrieval/      # BM25, Dense, Hybrid
│   ├── data/           # Dataset loaders
│   ├── evaluation/     # Metrics
│   └── utils/          # Cache, logging
├── configs/            # YAML configurations
├── prompts/            # Prompt templates
├── scripts/            # Experiment runners
└── tests/              # Unit tests

Implemented Architectures

| Architecture | Type      | Status                          |
|--------------|-----------|---------------------------------|
| Vanilla RAG  | Baseline  | ✅ Complete                     |
| ReAct RAG    | Agentic   | ✅ Implemented (results pending) |
| Self-RAG     | Agentic   | 🔲 Planned                      |
| Planner RAG  | Agentic   | 🔲 Planned                      |
| IRCoT        | Recursive | 🔲 Planned                      |
| REAP         | Recursive | 🔲 Planned                      |
| Recursive LM | RLM       | 🔲 Planned                      |
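The agentic architectures above interleave model reasoning with retrieval. For orientation, here is a minimal ReAct-style loop, where the model alternates Thought/Action steps and retrieval observations are fed back into the context. The `llm` and `retriever` callables and the `Search[...]`/`Finish[...]` action syntax are illustrative assumptions, not this repo's actual interfaces:

```python
def react_answer(question, llm, retriever, max_steps=4):
    """Minimal ReAct-style loop (sketch, not the repo's implementation).

    llm: callable taking the transcript so far, returning the next
         "Thought: ...\nAction: ..." step as a string.
    retriever: callable taking a query string, returning a list of passages.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Finish[" in step:
            # Final answer, e.g. "Action: Finish[Paris]"
            return step.split("Finish[", 1)[1].rstrip("]")
        if "Search[" in step:
            # Retrieval action: run the query, append the observation
            query = step.split("Search[", 1)[1].rstrip("]")
            docs = retriever(query)
            transcript += "Observation: " + " ".join(docs) + "\n"
    return ""  # budget exhausted without a final answer
```

The loop terminates either on a `Finish[...]` action or after `max_steps` iterations, which bounds both latency and API cost per question.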

Datasets

  • HotpotQA (implemented) - Multi-hop QA with bridge/comparison questions
  • MuSiQue (planned) - Multi-hop with explicit decomposition
  • 2WikiMultiHopQA (planned) - Wikipedia-based reasoning

License

MIT License - see LICENSE for details.
