Benchmarking agentic RAG systems and their viability against long-context-optimized recursive methods such as Recursive RAG and Recursive LMs.
Target: ACL/NAACL 2026 publication
| Retriever | Exact Match | F1 Score | Cost |
|---|---|---|---|
| Dense | 45.0% | 59.5% | $0.79 |
| Hybrid | 44.1% | 58.6% | $0.76 |
| BM25 | 38.2% | 51.5% | $0.77 |
Model: gpt-4o-mini | Dense retrieval uses text-embedding-3-small
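
For readers unfamiliar with the two metrics in the table above, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names are illustrative, and the code in `src/evaluation/` may differ in its details.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("Barack Obama", "Obama"))
```

F1 gives partial credit when a prediction overlaps the gold answer without matching it exactly, which is why the F1 column is consistently higher than the Exact Match column.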
Results pending. Run the ReAct configs in configs/react_*_full.yaml to populate this table.
| Type | Count | Exact Match | F1 |
|---|---|---|---|
| Bridge | 5,918 | 39.6% | 55.4% |
| Comparison | 1,487 | 66.3% | 75.9% |
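
As a sanity check, the per-type rows above can be combined into a count-weighted average. The sketch below uses only the numbers from the table. The result matches the Dense row of the retriever table, which suggests (but does not prove) that the breakdown comes from the Dense run.

```python
# Per-type results copied from the table above.
breakdown = {
    "Bridge": {"count": 5918, "em": 39.6, "f1": 55.4},
    "Comparison": {"count": 1487, "em": 66.3, "f1": 75.9},
}


def weighted_average(rows: dict, metric: str) -> float:
    """Average a metric over question types, weighted by question count."""
    total = sum(row["count"] for row in rows.values())
    return sum(row["count"] * row[metric] for row in rows.values()) / total


total_questions = sum(row["count"] for row in breakdown.values())
overall_em = round(weighted_average(breakdown, "em"), 1)
overall_f1 = round(weighted_average(breakdown, "f1"), 1)

print(total_questions, overall_em, overall_f1)  # 7405 45.0 59.5
```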
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Activate (Unix/Mac)
source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"

# Copy example env file
cp .env.example .env
# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-your-key-here

python scripts/test_pipeline.py

# Run with Dense retriever (recommended - best performance)
python scripts/run_experiment.py --config configs/vanilla_dense_full.yaml
# Or run a quick test with 100 questions
python scripts/run_experiment.py --config configs/vanilla_dense.yaml

# View summary of a specific run
python scripts/analyze_results.py --results results/<run_id> --breakdown
# Compare multiple runs
python scripts/analyze_results.py --results results --compare

agentic_rag_benchmark/
├── src/
│ ├── core/ # Base abstractions
│ ├── architectures/ # RAG implementations
│ ├── retrieval/ # BM25, Dense, Hybrid
│ ├── data/ # Dataset loaders
│ ├── evaluation/ # Metrics
│ └── utils/ # Cache, logging
├── configs/ # YAML configurations
├── prompts/ # Prompt templates
├── scripts/ # Experiment runners
└── tests/ # Unit tests
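
The `retrieval/` directory listed above covers BM25, Dense, and Hybrid retrievers. This README does not say how the hybrid retriever combines the two rankings. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the document IDs, and the constant `k=60` are assumptions and are not taken from this repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd4', 'd7']
```

RRF uses only rank positions, so it avoids having to calibrate BM25 scores against embedding similarities. A weighted sum of normalized scores is an equally plausible design.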
| Architecture | Type | Status |
|---|---|---|
| Vanilla RAG | Baseline | ✅ Complete |
| ReAct RAG | Agentic | ✅ Implemented (results pending) |
| Self-RAG | Agentic | 🔲 Planned |
| Planner RAG | Agentic | 🔲 Planned |
| IRCoT | Recursive | 🔲 Planned |
| REAP | Recursive | 🔲 Planned |
| Recursive LM | RLM | 🔲 Planned |
- HotpotQA (implemented) - Multi-hop QA with bridge/comparison questions
- MuSiQue (planned) - Multi-hop with explicit decomposition
- 2WikiMultiHopQA (planned) - Wikipedia-based reasoning
MIT License - see LICENSE for details.