A LangGraph implementation of the MetaRAG paper with a three-phase metacognitive regulation loop and hybrid dense-sparse retrieval, evaluated on HotpotQA and 2WikiMultiHopQA against all paper baselines.
Multi-hop question answering over retrieved documents requires more than a single retrieval-generation pass: models must recognize when their own answer is inadequate and understand why.
This work implements and extends the MetaRAG framework (Zhou et al., 2024), which introduces a three-phase metacognitive regulation loop (monitoring, evaluating, and planning) directly into the RAG pipeline. The system classifies answer failures into four knowledge-deficit categories (insufficient knowledge, internal-only, external-only, reasoning error) and applies category-specific remediation strategies including targeted re-retrieval and constrained writing directives.
Evaluated on 500 examples each from HotpotQA and 2WikiMultiHopQA under the gold-context protocol with DeepSeek V3.2, the implementation achieves 59.0 EM / 75.1 F1 on HotpotQA and 66.2 EM / 74.2 F1 on 2WikiMultiHopQA, surpassing the paper's reported MetaRAG results (37.8 / 49.9 and 42.8 / 50.8 respectively) by 21–23 EM points and 23–25 F1 points.
The performance differential is most plausibly attributable to DeepSeek V3.2's substantially greater reasoning capacity relative to the GPT-3.5-turbo model used in the paper, since the metacognitive loop design follows the paper's specification.
| Technology | Version | Responsibility |
|---|---|---|
| Python | ≥ 3.13 | Runtime |
| LangGraph | ≥ 0.2 | Agent graph orchestration; stateful multi-node graph with conditional edges |
| LangChain | ≥ 0.3 | LLM abstraction and message formatting |
| FastAPI | ≥ 0.115 | REST API server for document ingestion and query endpoints |
| Qdrant | ≥ 1.12 | Vector store for dense (embedding) retrieval |
| Elasticsearch | 8.17 | Inverted-index BM25 sparse retrieval |
| PostgreSQL | 16 (via asyncpg) | Document metadata and analytics storage |
| sentence-transformers | ≥ 3.3 | Local embedding (BAAI/bge-small-en-v1.5) and reranking (BAAI/bge-reranker-base) |
| PyMuPDF | ≥ 1.25 | PDF ingestion and text extraction |
| DeepSeek V3.2 | deepseek-chat | Primary and sole LLM (OpenAI-compatible API) |
| Pydantic | ≥ 2.10 | Configuration management and data validation |
| Docker Compose | - | Infrastructure orchestration (Qdrant, Elasticsearch, PostgreSQL, API) |
The system implements a LangGraph state machine with eleven processing nodes. In the default parallel mode, the post-write verification stack (claim extraction, citation verification, evidence graph, and LLM evaluation) collapses into a single parallel_verify_and_evaluate_node. In sequential mode (when DISABLE_PARALLEL=true), each step runs as a separate node. Each query traverses the graph as a typed AgentState dictionary, with conditional edges controlling multi-hop retrieval and the metacognitive remediation loop.
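The conditional edge that closes the metacognitive loop can be pictured as a plain routing function over the state. The sketch below is illustrative only: the field names (`satisfactory`, `converged`, `round`) and the returned labels are assumptions, not the actual contents of `AgentState` or `graph.py`.

```python
from typing import TypedDict


class LoopState(TypedDict):
    """Hypothetical subset of AgentState; the field names are assumptions."""
    satisfactory: bool  # did the monitoring gate pass?
    converged: bool     # did the Jaccard convergence check fire?
    round: int          # remediation rounds completed so far


MAX_ROUNDS = 3  # mirrors max_metacognitive_rounds


def route_after_diagnose(state: LoopState) -> str:
    """Return the label of the next step: stop, or run another remediation round."""
    if state["satisfactory"] or state["converged"]:
        return "end"
    if state["round"] >= MAX_ROUNDS:
        return "end"
    return "remediate"
```

In LangGraph, a callable of this shape is what `add_conditional_edges` takes as its routing function, with the returned labels mapped to node names.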
| Module | Location | Responsibility |
|---|---|---|
| AgentState | backend/app/agent/graph.py | Typed state dictionary shared across all graph nodes |
| plan_node | backend/app/agent/graph.py | Classifies query into simple/complex/multi-hop; selects retrieval routing config |
| retrieve_node | backend/app/agent/graph.py | Dispatches hybrid RRF search; applies cross-encoder reranking and retrieval guardrails |
| read_node | backend/app/agent/graph.py | Extracts evidence spans from retrieved documents |
| write_node | backend/app/agent/graph.py | Generates the grounded answer; appends writing_directive on metacognitive remediation rounds |
| parallel_verify_and_evaluate_node | backend/app/agent/graph.py | Concurrently runs claim extraction, citation verification, evidence graph, and LLM evaluation |
| diagnose_node | backend/app/agent/graph.py | Invokes the metacognitive monitor and evaluator-critic; populates diagnosis state |
| remediate_node | backend/app/agent/graph.py | Applies targeted remediation: targeted re-retrieval for INSUFFICIENT, writing directives for others |
| claim_extract_node | backend/app/agent/graph.py | Segments the generated answer into individual factual claims (sequential mode only) |
| citation_verify_node | backend/app/agent/graph.py | Computes citation precision, unsupported claim rate, and evidence alignment score (sequential mode only) |
| evidence_graph_node | backend/app/agent/graph.py | Constructs claim-to-document support graph for interpretability (sequential mode only) |
| evaluate_node | backend/app/agent/graph.py | LLM-based faithfulness, completeness, and confidence scoring (sequential mode only) |
| diagnose_answer | backend/app/agent/metacognitive.py | Three-phase metacognitive regulation: monitoring gate → evaluator-critic LLM → heuristic fallback |
| build_writing_directive | backend/app/agent/metacognitive.py | Maps diagnosis category to a concrete writing instruction |
| check_convergence | backend/app/agent/metacognitive.py | Jaccard similarity convergence detection between consecutive answers |
| plan_query | backend/app/agent/planner.py | LLM-based query complexity classification (simple/complex/multi-hop) with rule-based fallback |
| write_answer | backend/app/agent/writer.py | Grounded answer generation with inline citations and optional writing directive injection |
| read_documents | backend/app/agent/reader.py | Evidence extraction from retrieved documents |
| hybrid_search | backend/app/retrieval/hybrid.py | Reciprocal Rank Fusion of dense (Qdrant) and sparse (BM25/Elasticsearch) results |
| dense_search | backend/app/retrieval/dense.py | Qdrant vector search with BAAI/bge-small-en-v1.5 embeddings and document_id filtering |
| bm25_search | backend/app/retrieval/bm25_retrieval.py | Elasticsearch BM25 keyword search with document_id filtering |
| rerank | backend/app/retrieval/reranker.py | Cross-encoder reranking via BAAI/bge-reranker-base |
| filter_retrieved_docs | backend/app/retrieval/guardrails.py | Filters prompt injection patterns, adversarial content, and low-quality chunks |
| compute_retrieval_diagnostics | backend/app/retrieval/diagnostics.py | Observability metrics: query coverage, document diversity, retrieval redundancy, recall proxy |
| verify_citations | backend/app/verification/citation_verifier.py | Computes citation precision, unsupported claim rate, and evidence alignment score |
| extract_claims | backend/app/verification/claim_extractor.py | Segments the generated answer into individual factual claims |
| evaluate_answer | backend/app/optimization/evaluator.py | LLM-based faithfulness, completeness, and confidence scoring |
| build_evidence_graph | backend/app/research/evidence_graph.py | Lightweight claim-to-document support graph using token-overlap heuristic |
| log_run | backend/app/memory/strategy_memory.py | Persists run records to PostgreSQL for analytics (faithfulness, cost, latency, utility) |
| log_retrieval_diagnostics | backend/app/memory/strategy_memory.py | Persists retrieval observability metrics per query |
| log_provenance_snapshot | backend/app/memory/strategy_memory.py | Persists full answer provenance snapshots including diagnosis, citations, and evidence graph |
| Ingestion pipeline | backend/app/ingestion/pipeline.py | PDF/DOCX/HTML/TXT chunking → Qdrant dense upsert + Elasticsearch BM25 index |
| Settings | backend/app/config.py | Pydantic-settings configuration with .env override |
| LLM abstraction | backend/app/llm.py | DeepSeek LLM abstraction via OpenAI-compatible API |
| Cost tracking | backend/app/cost.py | Token cost estimation and DeepSeek balance monitoring |
| API routes | backend/app/api/routes.py | FastAPI routers: query, stream, document management, ingestion, and analytics endpoints |
| Benchmark runner | backend/benchmarks/runner.py | Async parallel evaluation with gold-context and open-domain modes |
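The Reciprocal Rank Fusion step named in the `hybrid_search` row can be sketched in a few lines. This is a minimal illustration of the general RRF technique, not the code in `hybrid.py`: the function name, the use of bare document ids, and the constant `k = 60` (the value commonly used for RRF) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one scored ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ids = ["d1", "d2", "d3"]   # e.g. ids returned by the vector search
sparse_ids = ["d3", "d1", "d4"]  # e.g. ids returned by BM25
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (`d1`, `d3`) outrank those found by only one retriever, which is the property that makes rank fusion attractive when dense and sparse scores are not directly comparable.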
The metacognitive loop is the core algorithmic contribution. It wraps the standard RAG generation step with a three-phase regulation pipeline that decides whether an answer is acceptable, diagnoses the specific failure mode if not, and applies a targeted corrective strategy.
After the LLM produces an initial answer, the system computes three scalar metrics: faithfulness and answer completeness from the LLM judge (evaluate_answer), and citation precision from the citation verifier (verify_citations). The monitoring gate (is_answer_satisfactory) compares these against configured thresholds:
faithfulness ≥ 0.70 AND completeness ≥ 0.60 AND citation_precision ≥ 0.40
If all three conditions are met, the answer is classified as satisfactory and the metacognitive loop terminates without an additional LLM call. In the benchmark evaluation, 46% of HotpotQA queries and 66% of 2WikiMultiHopQA queries passed this gate on the first attempt.
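The gate condition above amounts to three threshold comparisons. The following is a minimal sketch under stated assumptions: the `GateThresholds` container and the argument order are hypothetical, and only the function name and the three default values come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Defaults taken from the documented thresholds; the class itself is hypothetical."""
    faithfulness: float = 0.70
    completeness: float = 0.60
    citation_precision: float = 0.40


def is_answer_satisfactory(faithfulness, completeness, citation_precision,
                           thresholds=GateThresholds()):
    """Return True only when all three metrics meet or exceed their thresholds."""
    return (
        faithfulness >= thresholds.faithfulness
        and completeness >= thresholds.completeness
        and citation_precision >= thresholds.citation_precision
    )
```

For example, an answer scoring 0.90 / 0.55 / 0.80 fails the gate on completeness alone and proceeds to the evaluator-critic, even though the other two metrics are comfortably above their thresholds.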
An optional confidence-based fast path (fast_path_threshold) can additionally skip the expensive evaluator-critic invocation when the evaluator's self-reported confidence is high and metrics are within 0.15 of their thresholds. This optimization is disabled by default for reproducibility.
When the monitoring gate fails, the system invokes an LLM evaluator-critic with a structured prompt that performs two simultaneous assessments:
Procedural knowledge classification: independent assessment of two knowledge sources:
- Internal sufficiency: could the LLM answer reliably from training knowledge alone?
- External sufficiency: do the retrieved documents contain adequate, relevant information?
The combination of these two boolean signals produces one of four knowledge categories:
| Category | Internal | External | Condition |
|---|---|---|---|
| insufficient_knowledge | ✗ | ✗ | Neither source is adequate; targeted re-retrieval is required |
| internal_knowledge_only | ✓ | ✗ | LLM knows the answer but retrieved docs are noisy or misleading |
| external_knowledge_only | ✗ | ✓ | Docs contain the answer but the LLM's own knowledge is wrong or absent |
| reasoning_error | ✓ | ✓ | Both sources are adequate; the failure is in reasoning or structure |
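The mapping from the two sufficiency signals to a category is a two-bit truth table. A minimal sketch follows; the enum and function names are hypothetical, while the category strings and the conditions they correspond to are the ones listed in the table.

```python
from enum import Enum


class KnowledgeCategory(str, Enum):
    INSUFFICIENT = "insufficient_knowledge"
    INTERNAL_ONLY = "internal_knowledge_only"
    EXTERNAL_ONLY = "external_knowledge_only"
    REASONING_ERROR = "reasoning_error"


def classify_knowledge(internal_sufficient: bool, external_sufficient: bool) -> KnowledgeCategory:
    """Combine the two boolean signals into one of the four categories."""
    if internal_sufficient and external_sufficient:
        # Both sources are adequate, so the failure lies in reasoning or structure.
        return KnowledgeCategory.REASONING_ERROR
    if internal_sufficient:
        return KnowledgeCategory.INTERNAL_ONLY
    if external_sufficient:
        return KnowledgeCategory.EXTERNAL_ONLY
    # Neither source is adequate.
    return KnowledgeCategory.INSUFFICIENT
```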
Declarative knowledge classification: detection of three common error patterns:
| Error Type | Description |
|---|---|
| incomplete_reasoning | Missed reasoning steps, incomplete chain of thought, ignored evidence |
| answer_redundance | Repetitive or verbose; similar points stated multiple times |
| ambiguity_understanding | Misread the query's intent; answered a related but different question |
When the LLM evaluator-critic fails (network error, malformed JSON), a rule-based heuristic fallback classifies the diagnosis from the raw metric values.
The diagnosis category maps to a specific remediation strategy in remediate_node:
| Category | Remediation Action |
|---|---|
| insufficient_knowledge | Issues a targeted sub-query (suggested_query from the evaluator) to retrieve additional documents; augments the evidence pool before re-generation |
| internal_knowledge_only | Prepends directive: "Rely primarily on your own knowledge; only cite a reference if it clearly and directly supports a claim" |
| external_knowledge_only | Prepends directive: "Answer ONLY using information explicitly stated in the provided source documents; every factual claim MUST have an inline citation" |
| reasoning_error | Prepends error-specific guidance targeting the detected error type(s); falls back to chain-of-thought instruction if no specific type matches |
The writing_directive is passed as an additional system-level instruction into the next write_node invocation, shaping how the LLM re-generates the answer without altering the underlying evidence pool (except for insufficient_knowledge).
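The category-to-directive mapping can be sketched as a lookup with a fallback. The two quoted directives are the ones given in the remediation table; the signature, the `ERROR_GUIDANCE` wording, and the `COT_FALLBACK` text are hypothetical placeholders, not the strings used in `metacognitive.py`.

```python
from typing import Optional, Sequence

# Directives quoted from the remediation table.
DIRECTIVES = {
    "internal_knowledge_only": (
        "Rely primarily on your own knowledge; only cite a reference "
        "if it clearly and directly supports a claim"
    ),
    "external_knowledge_only": (
        "Answer ONLY using information explicitly stated in the provided "
        "source documents; every factual claim MUST have an inline citation"
    ),
}

# Hypothetical wording for the error-specific guidance.
ERROR_GUIDANCE = {
    "incomplete_reasoning": "Work through every reasoning step and use all relevant evidence.",
    "answer_redundance": "State each point once and keep the answer concise.",
    "ambiguity_understanding": "Re-read the question and answer exactly what it asks.",
}

# Hypothetical chain-of-thought fallback.
COT_FALLBACK = "Reason step by step before giving the final answer."


def build_writing_directive(category: str, error_types: Sequence[str] = ()) -> Optional[str]:
    """Return the directive for the next write round, or None if none applies."""
    if category == "insufficient_knowledge":
        return None  # handled by targeted re-retrieval, not by a directive
    if category in DIRECTIVES:
        return DIRECTIVES[category]
    if category == "reasoning_error":
        hints = [ERROR_GUIDANCE[e] for e in error_types if e in ERROR_GUIDANCE]
        return " ".join(hints) if hints else COT_FALLBACK
    raise ValueError(f"unknown diagnosis category: {category}")
```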
To prevent the loop from cycling without improvement, check_convergence computes the Jaccard similarity between the token sets of consecutive answers:
$$J(A_{t-1}, A_t) = \frac{|\mathrm{tokens}(A_{t-1}) \cap \mathrm{tokens}(A_t)|}{|\mathrm{tokens}(A_{t-1}) \cup \mathrm{tokens}(A_t)|}$$
If J ≥ 0.85 (configurable via metacognitive_convergence_threshold), the loop terminates: the answer has stabilized, and further rounds are unlikely to change it. The iteration limit (max_metacognitive_rounds = 3) provides a hard upper bound.
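The convergence test can be sketched directly from the Jaccard definition. The tokenization used here (lowercasing and splitting on word characters) is an assumption, as is the convention that two empty answers count as identical; only the function name `check_convergence` and the 0.85 default come from this README.

```python
import re


def _tokens(text):
    """Lowercased set of word tokens; the tokenization rule is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(prev_answer, curr_answer):
    """Jaccard similarity between the token sets of two answers."""
    a, b = _tokens(prev_answer), _tokens(curr_answer)
    if not a and not b:
        return 1.0  # two empty answers are treated as identical (assumed convention)
    return len(a & b) / len(a | b)


def check_convergence(prev_answer, curr_answer, threshold=0.85):
    """True when consecutive answers overlap enough to stop iterating."""
    return jaccard(prev_answer, curr_answer) >= threshold
```

Because the comparison is over token sets, word order and repetition are ignored: a reworded answer with the same vocabulary counts as converged, while swapping one of three distinct words gives a similarity of 0.5 (two shared tokens out of four), well below the threshold.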
Standard RAG performs one retrieval pass and one generation pass with no quality feedback. ReAct interleaves reasoning and retrieval via a thought-action-observation loop but has no explicit model of why a given answer is weak. This system differs in three ways:
- Typed failure diagnosis: rather than re-running retrieval blindly, the system determines whether the failure is a knowledge-access problem (insufficient, external_only) or a reasoning/generation problem (internal_only, reasoning_error) and applies different interventions for each.
- Writing directives as soft constraints: the planning phase modifies the generation prompt rather than the retrieval strategy for non-knowledge-gap failures, targeting the root cause without expensive additional retrieval.
- Convergence-aware iteration: the Jaccard check prevents over-thinking, consistent with the empirical finding in §6.5 of the paper that performance degrades when the model can no longer extract useful improvements from additional rounds.
All results use the gold-context evaluation protocol: the dataset's supporting documents are provided directly to the pipeline, bypassing open-domain retrieval. This protocol isolates reasoning and metacognitive regulation quality from retrieval quality.
Evaluation settings: 500 examples, seed=42, concurrency=20, DeepSeek V3.2 (deepseek-chat), max 3 metacognitive rounds, convergence threshold 0.85.
HotpotQA:
| Model | EM | F1 | Precision | Recall |
|---|---|---|---|---|
| Standard RAG | 24.6 | 33.0 | 34.1 | 34.5 |
| ReAct | 24.8 | 41.7 | 42.6 | 44.7 |
| Flare | 29.2 | 42.4 | 42.8 | 43.0 |
| IR-CoT | 31.4 | 40.3 | 41.6 | 41.2 |
| Self-Ask | 28.2 | 43.1 | 43.4 | 44.8 |
| Reflexion | 30.0 | 43.4 | 43.2 | 44.3 |
| MetaRAG (paper, GPT-3.5-turbo) | 37.8 | 49.9 | 52.1 | 50.9 |
| Ours (DeepSeek V3.2) | 59.0 | 75.1 | 77.1 | 78.2 |

2WikiMultiHopQA:
| Model | EM | F1 | Precision | Recall |
|---|---|---|---|---|
| Standard RAG | 18.8 | 25.2 | 25.6 | 26.2 |
| ReAct | 21.0 | 28.0 | 27.6 | 30.0 |
| Flare | 28.2 | 39.8 | 40.0 | 40.8 |
| IR-CoT | 30.8 | 42.6 | 42.3 | 40.9 |
| Self-Ask | 28.6 | 37.5 | 36.5 | 42.8 |
| Reflexion | 31.8 | 41.7 | 40.6 | 44.2 |
| MetaRAG (paper, GPT-3.5-turbo) | 42.8 | 50.8 | 50.7 | 52.2 |
| Ours (DeepSeek V3.2) | 66.2 | 74.2 | 72.7 | 78.4 |

| Dataset | EM (Ours / Paper) | F1 (Ours / Paper) | ΔEM | ΔF1 |
|---|---|---|---|---|
| HotpotQA | 59.0 / 37.8 | 75.1 / 49.9 | +21.2 | +25.2 |
| 2WikiMultiHopQA | 66.2 / 42.8 | 74.2 / 50.8 | +23.4 | +23.4 |
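For readers unfamiliar with the EM, F1, Precision, and Recall columns, the sketch below shows how such answer-level metrics are commonly computed for extractive QA. It assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the function names are hypothetical and this is not necessarily what `benchmarks/metrics.py` does.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def precision_recall_f1(prediction, gold):
    """Token-level precision, recall, and F1 between prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this definition a verbose but correct prediction such as "Eiffel Tower in Paris" against the gold answer "Eiffel Tower" scores 0 on EM but 0.5 precision, 1.0 recall, and about 0.67 F1, which is why F1 typically exceeds EM in the tables above.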
The 21–23 point EM improvement over the paper's MetaRAG results most plausibly reflects the capability gap between GPT-3.5-turbo (the paper's model) and DeepSeek V3.2: a stronger base model generates higher-quality initial answers and responds more accurately to metacognitive writing directives. The metacognitive statistics show how often the loop was needed: 46% of HotpotQA and 66% of 2WikiMultiHopQA queries passed the monitoring gate on the first attempt and required zero remediation rounds. A further 20% of HotpotQA and 16% of 2WikiMultiHopQA queries finished after one or two rounds, before the three-round limit, which is consistent with, but does not by itself establish, a well-calibrated Jaccard threshold (0.85). A larger share of HotpotQA queries exhausted all three rounds (34% vs. 18%), consistent with that dataset's harder comparative multi-hop questions. No errors occurred across the 1,000 queries, indicating that the pipeline was stable at concurrency 20.
| Metric | HotpotQA | 2WikiMultiHopQA |
|---|---|---|
| Avg rounds per query | 1.32 | 0.75 |
| No remediation (round = 0) | 230 / 500 (46.0%) | 332 / 500 (66.4%) |
| Converged early (0 < rounds < 3) | 101 / 500 (20.2%) | 78 / 500 (15.6%) |
| Hit max rounds (3) | 169 / 500 (33.8%) | 90 / 500 (18.0%) |
| Avg latency per query | 34,831 ms | 23,427 ms |
| Errors | 0 | 0 |
Meta-RAG/
├── backend/
│   ├── app/
│   │   ├── agent/
│   │   │   ├── graph.py              # LangGraph DAG: all nodes, edges, and conditional routing
│   │   │   ├── metacognitive.py      # Three-phase metacognitive regulation (monitoring, evaluating, planning)
│   │   │   ├── planner.py            # LLM-based query complexity classification
│   │   │   ├── reader.py             # Evidence extraction from retrieved documents
│   │   │   └── writer.py             # Answer generation with optional writing directive
│   │   ├── retrieval/
│   │   │   ├── hybrid.py             # Reciprocal Rank Fusion of dense + BM25 results
│   │   │   ├── dense.py              # Qdrant vector search with BGE-small embeddings
│   │   │   ├── bm25_retrieval.py     # Elasticsearch BM25 keyword search
│   │   │   ├── reranker.py           # Cross-encoder reranking (BGE-reranker-base)
│   │   │   ├── guardrails.py         # Chunk filtering: safety, quality, deduplication
│   │   │   └── diagnostics.py        # Retrieval quality metrics (coverage, diversity)
│   │   ├── optimization/
│   │   │   └── evaluator.py          # LLM faithfulness + completeness scoring
│   │   ├── verification/
│   │   │   ├── citation_verifier.py  # Citation precision and evidence alignment
│   │   │   └── claim_extractor.py    # Claim segmentation from generated answers
│   │   ├── research/
│   │   │   └── evidence_graph.py     # Claim-to-document support graph construction
│   │   ├── memory/
│   │   │   └── strategy_memory.py    # PostgreSQL analytics persistence
│   │   ├── ingestion/
│   │   │   └── pipeline.py           # Document chunking → Qdrant + Elasticsearch indexing
│   │   ├── api/
│   │   │   └── routes.py             # FastAPI routers: query, stream, documents, ingest, analytics
│   │   ├── models/
│   │   │   ├── schemas.py            # Pydantic request/response schemas
│   │   │   └── db_models.py          # SQLAlchemy ORM models
│   │   ├── config.py                 # Pydantic-settings configuration
│   │   ├── llm.py                    # DeepSeek LLM abstraction via OpenAI-compatible API
│   │   ├── cost.py                   # Token cost tracking
│   │   ├── database.py               # Async SQLAlchemy session management
│   │   ├── text_utils.py             # Shared text utilities
│   │   └── main.py                   # FastAPI application factory
│   ├── benchmarks/
│   │   ├── runner.py                 # Async benchmark runner with gold and open-domain modes
│   │   ├── datasets.py               # HotpotQA and 2WikiMultiHopQA data loaders
│   │   ├── metrics.py                # EM, F1, Precision, Recall computation
│   │   ├── compare.py                # Paper baseline comparison utilities
│   │   └── results/
│   │       └── reports/
│   │           └── final_report.md   # Official benchmark results
│   ├── tests/                        # pytest test suite
│   ├── pyproject.toml                # Project metadata and dependencies (uv)
│   └── Dockerfile                    # Backend container
├── frontend/                         # Web UI
├── docs/
│   ├── diagrams/                     # draw.io architecture diagrams (4 pipelines)
│   └── papers/                       # Source paper (MetaRAG, arXiv:2402.11626)
├── docker-compose.yml                # Qdrant + Elasticsearch + PostgreSQL + API
├── .env.example                      # Configuration template
└── README.md
- uv (Python package manager)
- Docker and Docker Compose
- A DeepSeek API key (platform.deepseek.com)
git clone <repository-url>
cd Meta-RAG
cp .env.example .env
# Edit .env: set DEEPSEEK_API_KEY
docker compose up -d qdrant elasticsearch postgres
Wait for Elasticsearch to become healthy (10–20 seconds):
docker compose ps   # all three services should show "healthy" or "running"
cd backend
uv sync
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
The API is available at http://localhost:8000. Interactive docs: http://localhost:8000/docs.
Upload PDFs or other documents via the /ingest endpoint, or use the ingestion API directly. For benchmark evaluation, ingestion is handled automatically.
Benchmarks require the infrastructure services (qdrant, elasticsearch, postgres) to be running. All commands are run from the backend/ directory.
# HotpotQA, 500 examples, 20 concurrent queries
uv run python -m benchmarks.runner --dataset hotpotqa --mode gold --n 500 --concurrency 20
# 2WikiMultiHopQA, 500 examples
uv run python -m benchmarks.runner --dataset 2wikimultihopqa --mode gold --n 500 --concurrency 20
# Both datasets sequentially
uv run python -m benchmarks.runner --dataset both --mode gold --n 500 --concurrency 20
# Ingests the benchmark corpus into Qdrant + Elasticsearch, then runs end-to-end
uv run python -m benchmarks.runner --dataset hotpotqa --mode open_domain --n 500 --concurrency 5
--dataset hotpotqa | 2wikimultihopqa | both
--mode gold | open_domain
--n Number of examples (default: 500; paper uses 500)
--seed Random seed for subsampling (default: 42)
--concurrency Parallel queries (default: 1; use 10–25 for speed)
--provider deepseek (overrides LLM_PROVIDER in .env)
--convergence-threshold Jaccard threshold for convergence detection (default: 0.85)
--fast-path-threshold Evaluator confidence for fast-path skip (default: disabled)
--disable-parallel Run post-write verification steps sequentially
--output-dir Results directory (default: benchmarks/results)
Results are saved as timestamped JSON files in benchmarks/results/evaluations/. Cost logs are written to benchmarks/results/cost_logs/cost_log.jsonl.
# Run the test suite
uv run pytest
# Build and start the full stack, including the API
docker compose up --build
All parameters are set via environment variables or the .env file (see .env.example). The Settings class in backend/app/config.py provides typed access with defaults.
| Parameter | Default | Description |
|---|---|---|
| LLM_PROVIDER | deepseek | LLM provider (deepseek is the only supported value) |
| DEEPSEEK_API_KEY | - | DeepSeek API key (required) |
| DEEPSEEK_MODEL | deepseek-chat | DeepSeek model identifier |
| DEEPSEEK_BASE_URL | https://api.deepseek.com | DeepSeek API base URL |
| DATABASE_URL | postgresql+asyncpg://ara:ara_secret@localhost:5432/ara_db | PostgreSQL connection string |
| QDRANT_HOST | localhost | Qdrant server hostname |
| QDRANT_PORT | 6333 | Qdrant HTTP (REST) port |
| QDRANT_COLLECTION | research_papers | Qdrant collection name for document chunks |
| ELASTICSEARCH_URL | http://localhost:9200 | Elasticsearch endpoint |
| ELASTICSEARCH_INDEX | research_chunks | Elasticsearch index name |
| EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | HuggingFace model for dense embeddings |
| RERANKER_MODEL | BAAI/bge-reranker-base | HuggingFace cross-encoder for reranking |
| MAX_METACOGNITIVE_ROUNDS | 3 | Hard cap on metacognitive loop iterations per query (Paper §6.5) |
| METACOGNITIVE_CONVERGENCE_THRESHOLD | 0.85 | Jaccard similarity threshold for early loop termination (Paper §6.5) |
| FAST_PATH_THRESHOLD | None | Evaluator confidence threshold to skip the LLM diagnosis call; None disables (Phase 3 optimization) |
| FAITHFULNESS_THRESHOLD | 0.7 | Minimum faithfulness score to pass the monitoring gate |
| COMPLETENESS_THRESHOLD | 0.6 | Minimum completeness score to pass the monitoring gate |
| CITATION_PRECISION_THRESHOLD | 0.4 | Minimum citation precision to pass the monitoring gate |
| EVIDENCE_COVERAGE_THRESHOLD | 0.45 | Minimum evidence coverage to suppress additional retrieval hops |
| MIN_RETRIEVAL_DIVERSITY | 0.25 | Minimum document diversity required in the retrieved set |
| EVALUATOR_CONFIDENCE_THRESHOLD | 0.4 | Minimum evaluator confidence to influence routing decisions |
| DISABLE_PARALLEL | False | Run claim extraction / citation verification / evaluation sequentially instead of concurrently |
The planner classifies each query into one of three complexity levels and maps it to a fixed retrieval configuration (top-k, reranking, and multi-hop limits):
| Complexity | top_k | Rerank | Multi-hop limit |
|---|---|---|---|
| simple | 5 | β | 1 |
| complex | 10 | β | 1 |
| multi-hop | 10 | β | 3 |
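The routing table can be expressed as a small lookup. This sketch covers only the top_k and multi-hop columns and leaves the rerank flag out; the `RetrievalConfig` class, the `routing_for` helper, and the choice to fall back to the `simple` configuration for unknown labels are assumptions, not behavior documented for `planner.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical container for the per-complexity retrieval settings."""
    top_k: int
    max_hops: int


ROUTING = {
    "simple": RetrievalConfig(top_k=5, max_hops=1),
    "complex": RetrievalConfig(top_k=10, max_hops=1),
    "multi-hop": RetrievalConfig(top_k=10, max_hops=3),
}


def routing_for(complexity: str) -> RetrievalConfig:
    """Look up the retrieval config; unknown labels use 'simple' (assumed fallback)."""
    return ROUTING.get(complexity, ROUTING["simple"])
```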
If this implementation is useful for your research, please cite the original MetaRAG paper:
@article{zhou2024metacognitive,
title = {Metacognitive Retrieval-Augmented Large Language Models},
author = {Zhou, Yujia and Liu, Zheng and Jin, Jiajie and Nie, Jian-Yun and Dou, Zhicheng},
journal = {arXiv preprint arXiv:2402.11626},
year = {2024},
url = {https://arxiv.org/abs/2402.11626}
}