This project implements a quality-gated multi-hop retrieval pipeline designed to improve retrieval performance on difficult BM25 failure cases using:
- episodic memory
- multi-agent reasoning
- verification-aware aggregation
- distilled reranking
The system is trained on HotpotQA distractor samples and focuses specifically on hard retrieval failures where BM25 retrieves incorrect top-1 results.
The architecture combines:
- Gemini Flash agents
- Gemini Pro aggregation
- MongoDB Atlas episodic memory
- cross-encoder reranking
- verification-based filtering
| Metric | Value |
|---|---|
| BM25 Failure Rate | 26.9% |
| Hard Failures Extracted | 26,353 |
| Baseline Hits@1 | 0.7307 |
| Target Ground Rate | 93.4% |
The pipeline is divided into two stages:
- Offline Training Pipeline
- Online Inference Pipeline
The system starts with the HotpotQA distractor dataset:
- 97,852 records
- bridge + comparison questions
Dataset characteristics:
- multi-hop reasoning heavy
- retrieval-intensive
- suitable for hard negative mining
A BM25 retrieval pass is executed over the dataset.
Purpose:
- identify retrieval misses
- extract top-1 wrong predictions
- build episodic failure memory
Runtime:
- ~30 seconds
- 4 CPU cores
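
The BM25 pass above can be illustrated with a minimal, dependency-free scorer. This is a sketch only: the whitespace tokenizer, the `k1` and `b` values, and the `BM25` class name are assumptions, and the project may well use an existing BM25 library instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are common defaults, not values taken from the project.
        self.k1, self.b = k1, b
        doc_tokens = [tokenize(d) for d in docs]
        self.doc_lens = [len(t) for t in doc_tokens]
        self.avg_len = sum(self.doc_lens) / len(docs)
        self.tfs = [Counter(t) for t in doc_tokens]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for toks in doc_tokens:
            df.update(set(toks))
        n = len(docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], self.doc_lens[i]
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Document indices sorted by descending score (ties keep corpus order)."""
        scores = [self.score(query, i) for i in range(len(self.tfs))]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Calling `BM25(passages).rank(question)[0]` gives the top-1 passage index, which is the prediction compared against the gold documents when mining failures.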
The pipeline extracts:
- queries where BM25 top-1 prediction is incorrect
- gold-vs-wrong retrieval mismatches
Extracted failures:
- 26,353 hard failures
Stored as:
- structured JSON records
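
A rough sketch of the extraction step follows. The record fields (`query_id`, `gold_titles`, `wrong_top1`), the sample layout, and the `rank_fn` callable are all hypothetical; the source only states that failures are stored as structured JSON records.

```python
import json


def extract_hard_failures(samples, rank_fn):
    """Collect queries whose top-1 retrieved passage is not a gold passage.

    samples: iterable of dicts with 'id', 'question', 'titles', 'passages'
             and 'gold_titles' (an assumed layout, not the project's schema).
    rank_fn: callable (question, passages) -> passage indices, best first.
    """
    failures = []
    for sample in samples:
        order = rank_fn(sample["question"], sample["passages"])
        top1_title = sample["titles"][order[0]]
        if top1_title not in sample["gold_titles"]:
            failures.append({
                "query_id": sample["id"],
                "question": sample["question"],
                "gold_titles": sorted(sample["gold_titles"]),
                "wrong_top1": top1_title,
            })
    return failures


def to_json_lines(failures):
    """Serialize one failure record per line."""
    return "\n".join(json.dumps(f, sort_keys=True) for f in failures)
```

Keeping the gold titles next to the wrong top-1 title preserves the gold-vs-wrong mismatch in a single record.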
Early queries bypass episodic memory retrieval.
Purpose:
- prevent noisy memory injection
- stabilize early retrieval behavior
Rule:
- queries 1–5 skip episodic memory usage
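
The cold-start rule can be expressed as a single predicate. The function and constant names here are illustrative, and the query index is assumed to be 1-based.

```python
COLD_START_QUERIES = 5  # queries 1-5 skip episodic memory, per the rule above


def should_use_memory(query_index):
    """Return True once the cold-start window has passed (1-based index)."""
    return query_index > COLD_START_QUERIES
```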
The architecture uses three specialized agents.
Each agent:
- receives a shared `context.md`
- performs constrained reasoning
- outputs compressed summaries
This design minimizes:
- token overflow
- context fragmentation
- memory growth
Purpose:
- detect entity mismatches between:
- gold documents
- BM25 wrong retrievals
Output:
- `EntitySummary`
- ~60 tokens
Focus:
- bridge entities
- overlap reasoning
- retrieval mismatch diagnosis
Purpose:
- infer missing multi-hop reasoning chains
- reconstruct bridge relationships
Output:
- `ChainSummary`
- ~60 tokens
Focus:
- reasoning continuity
- bridge-hop alignment
- latent retrieval paths
Purpose:
- validate supporting evidence chunks
- compress useful retrieval evidence
Output:
- `ChunkSummary`
- ~80 tokens
Focus:
- chunk relevance
- support quality
- hallucination reduction
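
One way to keep the three agent outputs within their budgets is to clip each summary when it is created. This is a sketch under stated assumptions: tokens are approximated by whitespace-separated words, and the `AgentSummary` class and `clip_to_budget` helper are hypothetical names.

```python
from dataclasses import dataclass

# Approximate budgets from the agent descriptions above.
TOKEN_BUDGETS = {"EntitySummary": 60, "ChainSummary": 60, "ChunkSummary": 80}


def clip_to_budget(text, budget):
    """Keep at most `budget` whitespace tokens (a crude tokenizer stand-in)."""
    return " ".join(text.split()[:budget])


@dataclass
class AgentSummary:
    kind: str  # one of the TOKEN_BUDGETS keys
    text: str

    def __post_init__(self):
        # Enforce the budget at construction time.
        self.text = clip_to_budget(self.text, TOKEN_BUDGETS[self.kind])
```

A real deployment would count tokens with the model's own tokenizer, since word counts only approximate token counts.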
Before aggregation, outputs pass through verification checks:
- `@VERIFY_CHAIN`
- `@VERIFY_ENTITY`
- `@CROSS_CHECK`
Purpose:
- reject weak reasoning traces
- ensure structural consistency
- improve memory quality
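
The source names the verification checks but not their logic, so the predicates below are placeholders chosen purely for illustration. The sketch only shows how named checks could gate an output before aggregation; the `entities` and `chain_hops` fields are assumed.

```python
def verify_chain(outputs):
    """Placeholder: a multi-hop chain should contain at least two hops."""
    return len(outputs.get("chain_hops", [])) >= 2


def verify_entity(outputs):
    """Placeholder: at least one bridge entity must be present."""
    return bool(outputs.get("entities"))


def cross_check(outputs):
    """Placeholder: every listed entity must appear in some hop."""
    hop_entities = {e for hop in outputs.get("chain_hops", []) for e in hop}
    return set(outputs.get("entities", [])) <= hop_entities


CHECKS = {
    "@VERIFY_CHAIN": verify_chain,
    "@VERIFY_ENTITY": verify_entity,
    "@CROSS_CHECK": cross_check,
}


def run_verification(outputs, checks=CHECKS):
    """Return (passed, names_of_failed_checks)."""
    failed = [name for name, check in checks.items() if not check(outputs)]
    return (not failed, failed)
```

Returning the names of the failed checks makes it easier to see why a reasoning trace was rejected.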
The aggregator:
- combines reasoning summaries
- validates cross-agent agreement
- labels output quality
Responsibilities:
- routing
- quality scoring
- structured synthesis
Output budget:
- ~260 tokens
Only high-confidence outputs are persisted.
Condition:
`Q_final > 0.5 AND resolved`
Purpose:
- prevent noisy episodic memories
- maintain retrieval precision
- improve distillation quality
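
The persistence condition can be sketched as a small gate. How `Q_final` is computed is not described in the source, so it is treated here as an input; the function name is hypothetical.

```python
QUALITY_THRESHOLD = 0.5  # threshold from the persistence condition


def should_persist(q_final, resolved, threshold=QUALITY_THRESHOLD):
    """Persist only resolved outputs whose score is strictly above the threshold."""
    return bool(resolved) and q_final > threshold
```

The comparison is strict, so an output scoring exactly 0.5 is not stored.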
Stores:
- verified reasoning chains only
Condition:
`Q_final > 0.5`
Used for:
- episodic retrieval augmentation
- future reranking hints
Stores:
- temporary summaries
- failed attempts
- active reasoning traces
Purpose:
- lightweight short-term coordination
The inference pipeline uses both:
- retrieval
- episodic memory augmentation
Combines:
- BM25
- Vertex AI retrieval
Strategy:
- top-100 retrieval
- RRF fusion
Purpose:
- improve recall
- reduce lexical-only failures
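
Reciprocal rank fusion (RRF) scores each document as the sum of `1 / (k + rank)` over the input rankings. The sketch below assumes `k = 60`, a commonly used constant that the source does not specify, and breaks ties by document id for determinism.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    rankings: list of lists, each ordered best first (e.g. BM25 and dense results).
    k: smoothing constant (assumed value, not taken from the project).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

For example, `rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])` returns `["b", "a", "c"]`: `b` ranks second and first, which outweighs `a` ranking first and third.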
Uses:
- MongoDB Atlas Vector Search
Retrieves:
- top-3 memory hints
Purpose:
- inject historical failure corrections
- recover missing bridge reasoning
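
Retrieving the top-3 memory hints could be done with an Atlas `$vectorSearch` aggregation stage. In this sketch the index name, the `embedding` path, the `summary` field, and the `numCandidates` value are assumptions; only the limit of 3 comes from the description above.

```python
def build_memory_hint_pipeline(query_vector, index_name="episodic_memory_index",
                               path="embedding", limit=3, num_candidates=100):
    """Build an aggregation pipeline for Atlas Vector Search.

    index_name, path and the projected 'summary' field are hypothetical
    and must match the actual collection and search index.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Keep only the hint text and the similarity score.
        {"$project": {"_id": 0, "summary": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```

With `pymongo`, the pipeline would be passed to `collection.aggregate(...)` on the collection that holds the episodic memories.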
Final reranking stage:
- BERT cross-encoder distilled from episodic memory traces
Purpose:
- produce final prediction
- improve ranking precision
The system uses:
- structured markdown prompting
- constrained output schemas
- compressed reasoning summaries
Inference-time controls:
- `max_output_tokens`
- `top_k`
- `top_p`
- `stop_sequences`
This improves:
- output consistency
- token efficiency
- reasoning stability
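
As an illustration, the four controls could be assembled per role as a plain dictionary. The output budgets follow the approximate figures given for the agents and the aggregator; the `top_k`, `top_p`, and `stop_sequences` values, the role names, and the `controls_for` helper are assumptions.

```python
# Approximate output budgets from the agent and aggregator descriptions.
OUTPUT_BUDGETS = {"entity": 60, "chain": 60, "chunk": 80, "aggregator": 260}


def controls_for(role, top_k=40, top_p=0.9, stop_sequences=("\n\n",)):
    """Return inference-time controls for one role.

    The sampling values are illustrative defaults, not the project's settings.
    """
    return {
        "max_output_tokens": OUTPUT_BUDGETS[role],
        "top_k": top_k,
        "top_p": top_p,
        "stop_sequences": list(stop_sequences),
    }
```

The resulting dictionary would then be handed to whichever model client the pipeline uses as its generation configuration.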
| Component | Technology |
|---|---|
| LLM Agents | Gemini Flash |
| Aggregator | Gemini Pro |
| Dev Teacher Model | Groq 32B |
| Vector Memory | MongoDB Atlas |
| Retrieval | BM25 + Vertex AI |
| Reranking | BERT Cross-Encoder |
| Storage | MongoDB Atlas M0 |
| Dataset | HotpotQA Distractor |

Goals:
- Improve multi-hop retrieval
- Recover BM25 hard failures
- Build reusable episodic memory
- Maintain bounded token usage
- Enable scalable agent orchestration
- Support retrieval distillation pipelines
Future work:
- adaptive memory pruning
- reinforcement-style retrieval correction
- dynamic agent routing
- graph-based bridge reasoning
- retrieval confidence calibration
- memory-aware reranker finetuning
Related work:
- HotpotQA
- Retrieval-Augmented Generation (RAG)
- Self-RAG inspired summarization
- Episodic memory architectures
- Verification-aware retrieval systems
- Multi-agent reasoning pipelines
