Meta-RAG: Metacognitive Retrieval-Augmented Generation

A LangGraph implementation of the MetaRAG paper with a three-phase metacognitive regulation loop and hybrid dense–sparse retrieval — evaluated on HotpotQA and 2WikiMultiHopQA against all paper baselines.

1. Abstract

Multi-hop question answering over retrieved documents requires more than a single retrieval-generation pass: models must recognize when their own answer is inadequate and understand why.

This work implements and extends the MetaRAG framework (Zhou et al., 2024), which introduces a three-phase metacognitive regulation loop (monitoring, evaluating, and planning) directly into the RAG pipeline. The system classifies answer failures into four knowledge-deficit categories (insufficient knowledge, internal-only, external-only, reasoning error) and applies category-specific remediation strategies, including targeted re-retrieval and constrained writing directives.

Evaluated on 500 examples from HotpotQA and 2WikiMultiHopQA under the gold-context protocol with DeepSeek V3.2, the implementation achieves 59.0 EM / 75.1 F1 on HotpotQA and 66.2 EM / 74.2 F1 on 2WikiMultiHopQA, surpassing the paper's reported MetaRAG results (37.8 / 49.9 and 42.8 / 50.8 respectively) by 21–23 EM points and 23–25 F1 points.

The performance differential is most plausibly attributable to DeepSeek V3.2's substantially greater reasoning capacity relative to the GPT-3.5-turbo model used in the paper, since the metacognitive loop design follows the paper's specification. No same-model ablation was run to isolate the loop's own contribution.

2. Tech Stack

Technology	Version	Responsibility
Python	≥ 3.13	Runtime
LangGraph	≥ 0.2	Agent graph orchestration; stateful multi-node graph with conditional edges and a bounded remediation loop
LangChain	≥ 0.3	LLM abstraction and message formatting
FastAPI	≥ 0.115	REST API server for document ingestion and query endpoints
Qdrant	≥ 1.12	Vector store for dense (embedding) retrieval
Elasticsearch	8.17	Inverted-index BM25 sparse retrieval
PostgreSQL	16 (via asyncpg)	Document metadata and analytics storage
sentence-transformers	≥ 3.3	Local embedding (`BAAI/bge-small-en-v1.5`) and reranking (`BAAI/bge-reranker-base`)
PyMuPDF	≥ 1.25	PDF ingestion and text extraction
DeepSeek V3.2	`deepseek-chat`	Primary and sole LLM (OpenAI-compatible API)
Pydantic	≥ 2.10	Configuration management and data validation
Docker Compose	—	Infrastructure orchestration (Qdrant, Elasticsearch, PostgreSQL, API)

3. Architecture Overview

The system implements a LangGraph state machine with eleven processing nodes. In the default parallel mode, the post-write verification stack (claim extraction, citation verification, evidence graph, and LLM evaluation) collapses into a single parallel_verify_and_evaluate_node. In sequential mode (when DISABLE_PARALLEL=true), each step runs as a separate node. Each query traverses the graph as a typed AgentState dictionary, with conditional edges controlling multi-hop retrieval and the metacognitive remediation loop.
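The conditional edge that closes the remediation loop can be sketched in plain Python, without LangGraph. This is a minimal illustration, not the code in `backend/app/agent/graph.py`: the state fields (`satisfactory`, `converged`, `metacognitive_round`) and the function `route_after_diagnose` are hypothetical names chosen for this example.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """Illustrative subset of the shared state; field names are assumptions."""
    query: str
    answer: str
    satisfactory: bool         # result of the monitoring gate
    converged: bool            # result of the convergence check
    metacognitive_round: int   # remediation rounds completed so far


MAX_METACOGNITIVE_ROUNDS = 3  # documented default


def route_after_diagnose(state: AgentState) -> str:
    """Return the next node name: loop through remediation or stop."""
    if state.get("satisfactory", False):
        return "end"
    if state.get("converged", False):
        return "end"
    if state.get("metacognitive_round", 0) >= MAX_METACOGNITIVE_ROUNDS:
        return "end"
    return "remediate_node"
```

In a LangGraph graph, a function of this shape would typically be registered as the routing callable of a conditional edge, with its return values mapped to node names.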

End-to-End Pipeline

Component Table

Module	Location	Responsibility
`AgentState`	`backend/app/agent/graph.py`	Typed state dictionary shared across all graph nodes
`plan_node`	`backend/app/agent/graph.py`	Classifies query into simple/complex/multi-hop; selects retrieval routing config
`retrieve_node`	`backend/app/agent/graph.py`	Dispatches hybrid RRF search; applies cross-encoder reranking and retrieval guardrails
`read_node`	`backend/app/agent/graph.py`	Extracts evidence spans from retrieved documents
`write_node`	`backend/app/agent/graph.py`	Generates the grounded answer; appends `writing_directive` on metacognitive remediation rounds
`parallel_verify_and_evaluate_node`	`backend/app/agent/graph.py`	Concurrently runs claim extraction, citation verification, evidence graph, and LLM evaluation
`diagnose_node`	`backend/app/agent/graph.py`	Invokes the metacognitive monitor and evaluator-critic; populates diagnosis state
`remediate_node`	`backend/app/agent/graph.py`	Applies targeted remediation: targeted re-retrieval for `INSUFFICIENT`, writing directives for others
`claim_extract_node`	`backend/app/agent/graph.py`	Segments the generated answer into individual factual claims (sequential mode only)
`citation_verify_node`	`backend/app/agent/graph.py`	Computes citation precision, unsupported claim rate, and evidence alignment score (sequential mode only)
`evidence_graph_node`	`backend/app/agent/graph.py`	Constructs claim–document support graph for interpretability (sequential mode only)
`evaluate_node`	`backend/app/agent/graph.py`	LLM-based faithfulness, completeness, and confidence scoring (sequential mode only)
`diagnose_answer`	`backend/app/agent/metacognitive.py`	Three-phase metacognitive regulation: monitoring gate → evaluator-critic LLM → heuristic fallback
`build_writing_directive`	`backend/app/agent/metacognitive.py`	Maps diagnosis category to a concrete writing instruction
`check_convergence`	`backend/app/agent/metacognitive.py`	Jaccard similarity convergence detection between consecutive answers
`plan_query`	`backend/app/agent/planner.py`	LLM-based query complexity classification (simple/complex/multi-hop) with rule-based fallback
`write_answer`	`backend/app/agent/writer.py`	Grounded answer generation with inline citations and optional writing directive injection
`read_documents`	`backend/app/agent/reader.py`	Evidence extraction from retrieved documents
`hybrid_search`	`backend/app/retrieval/hybrid.py`	Reciprocal Rank Fusion of dense (Qdrant) and sparse (BM25/Elasticsearch) results
`dense_search`	`backend/app/retrieval/dense.py`	Qdrant vector search with `BAAI/bge-small-en-v1.5` embeddings and document_id filtering
`bm25_search`	`backend/app/retrieval/bm25_retrieval.py`	Elasticsearch BM25 keyword search with document_id filtering
`rerank`	`backend/app/retrieval/reranker.py`	Cross-encoder reranking via `BAAI/bge-reranker-base`
`filter_retrieved_docs`	`backend/app/retrieval/guardrails.py`	Filters prompt injection patterns, adversarial content, and low-quality chunks
`compute_retrieval_diagnostics`	`backend/app/retrieval/diagnostics.py`	Observability metrics: query coverage, document diversity, retrieval redundancy, recall proxy
`verify_citations`	`backend/app/verification/citation_verifier.py`	Computes citation precision, unsupported claim rate, and evidence alignment score
`extract_claims`	`backend/app/verification/claim_extractor.py`	Segments the generated answer into individual factual claims
`evaluate_answer`	`backend/app/optimization/evaluator.py`	LLM-based faithfulness, completeness, and confidence scoring
`build_evidence_graph`	`backend/app/research/evidence_graph.py`	Lightweight claim–document support graph using token-overlap heuristic
`log_run`	`backend/app/memory/strategy_memory.py`	Persists run records to PostgreSQL for analytics (faithfulness, cost, latency, utility)
`log_retrieval_diagnostics`	`backend/app/memory/strategy_memory.py`	Persists retrieval observability metrics per query
`log_provenance_snapshot`	`backend/app/memory/strategy_memory.py`	Persists full answer provenance snapshots including diagnosis, citations, and evidence graph
Ingestion pipeline	`backend/app/ingestion/pipeline.py`	PDF/DOCX/HTML/TXT chunking → Qdrant dense upsert + Elasticsearch BM25 index
Settings	`backend/app/config.py`	Pydantic-settings configuration with `.env` override
LLM abstraction	`backend/app/llm.py`	DeepSeek LLM abstraction via OpenAI-compatible API
Cost tracking	`backend/app/cost.py`	Token cost estimation and DeepSeek balance monitoring
API routes	`backend/app/api/routes.py`	FastAPI routers: query, stream, document management, ingestion, and analytics endpoints
Benchmark runner	`backend/benchmarks/runner.py`	Async parallel evaluation with gold-context and open-domain modes
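To make the `hybrid_search` row concrete, the sketch below shows Reciprocal Rank Fusion over two ranked lists of document IDs. It is an illustration under assumptions: the function name `reciprocal_rank_fusion` and the constant `k = 60` (the value commonly used in the RRF literature) are not taken from the repository, whose fusion parameters may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["d1", "d2", "d3"]    # e.g. from the vector store
sparse_ranking = ["d2", "d4", "d1"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])  # ['d2', 'd1', 'd4', 'd3']
```

Documents ranked well by both retrievers (`d2`, `d1`) rise above documents seen by only one, which is the property that makes RRF a convenient score-free way to merge dense and sparse results.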

4. Metacognitive Loop

The metacognitive loop is the core algorithmic contribution. It wraps the standard RAG generation step with a three-phase regulation pipeline that decides whether an answer is acceptable, diagnoses the specific failure mode if not, and applies a targeted corrective strategy.

Phase 1 — Monitoring: Fast-Path Satisfaction Gate (Paper §4.1)

After the LLM produces an initial answer, the system computes three scalar metrics: faithfulness and answer completeness, scored by an LLM judge (`evaluate_answer`), and citation precision, computed by the citation verifier (`verify_citations`). The monitoring gate (is_answer_satisfactory) compares these against configured thresholds:

faithfulness ≥ 0.70  AND  completeness ≥ 0.60  AND  citation_precision ≥ 0.40

If all three conditions are met, the answer is classified as satisfactory and the metacognitive loop terminates without an additional LLM call. In the benchmark evaluation, 46% of HotpotQA queries and 66% of 2WikiMultiHopQA queries passed this gate on the first attempt.

An optional confidence-based fast path (fast_path_threshold) can additionally skip the expensive evaluator-critic invocation when the evaluator's self-reported confidence is high and metrics are within 0.15 of their thresholds. This optimization is disabled by default for reproducibility.
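The gate and the optional fast path can be sketched as follows. This is a minimal illustration using the documented default thresholds; the `Thresholds` dataclass, the `can_skip_critic` helper, and the reading of "within 0.15 of their thresholds" as "no more than 0.15 below each threshold" are assumptions, not the repository's code.

```python
from dataclasses import dataclass
from typing import Optional

METRIC_NAMES = ("faithfulness", "completeness", "citation_precision")


@dataclass(frozen=True)
class Thresholds:
    faithfulness: float = 0.70
    completeness: float = 0.60
    citation_precision: float = 0.40


def is_answer_satisfactory(metrics: dict[str, float], t: Thresholds = Thresholds()) -> bool:
    """Monitoring gate: every metric must meet its threshold."""
    return all(metrics[name] >= getattr(t, name) for name in METRIC_NAMES)


def can_skip_critic(
    metrics: dict[str, float],
    confidence: float,
    fast_path_threshold: Optional[float],
    t: Thresholds = Thresholds(),
    margin: float = 0.15,
) -> bool:
    """Fast path (assumed semantics): high confidence and near-miss metrics."""
    if fast_path_threshold is None:  # disabled by default
        return False
    if confidence < fast_path_threshold:
        return False
    return all(metrics[name] >= getattr(t, name) - margin for name in METRIC_NAMES)
```

For example, metrics of 0.8 / 0.7 / 0.5 pass the gate outright, while 0.65 / 0.5 / 0.3 fail it but would qualify for the fast path if it were enabled and the evaluator's confidence exceeded the configured value.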

Phase 2 — Evaluating: Evaluator-Critic Diagnosis (Paper §4.2)

When the monitoring gate fails, the system invokes an LLM evaluator-critic with a structured prompt that performs two simultaneous assessments:

Procedural knowledge classification — independent assessment of two knowledge sources:

Internal sufficiency: could the LLM answer reliably from training knowledge alone?
External sufficiency: do the retrieved documents contain adequate, relevant information?

The combination of these two boolean signals produces one of four knowledge categories:

Category	Internal	External	Condition
`insufficient_knowledge`	✗	✗	Neither source is adequate; targeted re-retrieval is required
`internal_knowledge_only`	✓	✗	LLM knows the answer but retrieved docs are noisy or misleading
`external_knowledge_only`	✗	✓	Docs contain the answer but the LLM's own knowledge is wrong or absent
`reasoning_error`	✓	✓	Both sources are adequate; the failure is in reasoning or structure
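The table above is a two-input truth table, which can be written down directly. The function name `classify_knowledge` is hypothetical; only the four category identifiers come from this document.

```python
def classify_knowledge(internal_sufficient: bool, external_sufficient: bool) -> str:
    """Map the two sufficiency signals to one of the four knowledge categories."""
    if internal_sufficient and external_sufficient:
        return "reasoning_error"            # both sources adequate
    if internal_sufficient:
        return "internal_knowledge_only"    # retrieved docs are the weak link
    if external_sufficient:
        return "external_knowledge_only"    # model's own knowledge is the weak link
    return "insufficient_knowledge"         # neither source is adequate
```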

Declarative knowledge classification — detection of three common error patterns:

Error Type	Description
`incomplete_reasoning`	Missed reasoning steps, incomplete chain of thought, ignored evidence
`answer_redundance`	Repetitive or verbose; similar points stated multiple times
`ambiguity_understanding`	Misread the query's intent; answered a related but different question

When the LLM evaluator-critic fails (network error, malformed JSON), a rule-based heuristic fallback classifies the diagnosis from the raw metric values.

Phase 3 — Planning: Targeted Remediation (Paper §4.3)

The diagnosis category maps to a specific remediation strategy in remediate_node:

Category	Remediation Action
`insufficient_knowledge`	Issues a targeted sub-query (`suggested_query` from the evaluator) to retrieve additional documents; augments the evidence pool before re-generation
`internal_knowledge_only`	Prepends directive: "Rely primarily on your own knowledge; only cite a reference if it clearly and directly supports a claim"
`external_knowledge_only`	Prepends directive: "Answer ONLY using information explicitly stated in the provided source documents; every factual claim MUST have an inline citation"
`reasoning_error`	Prepends error-specific guidance targeting the detected error type(s); falls back to chain-of-thought instruction if no specific type matches

The writing_directive is passed as an additional system-level instruction into the next write_node invocation, shaping how the LLM re-generates the answer without altering the underlying evidence pool (except for insufficient_knowledge).
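A sketch of the category-to-directive mapping follows. The two fixed directives are quoted from the table above. The per-error guidance strings, the chain-of-thought fallback wording, and the choice to return `None` for `insufficient_knowledge` are placeholders invented for this example, and the real `build_writing_directive` may have a different signature.

```python
from typing import Optional

FIXED_DIRECTIVES = {
    "internal_knowledge_only": (
        "Rely primarily on your own knowledge; only cite a reference "
        "if it clearly and directly supports a claim"
    ),
    "external_knowledge_only": (
        "Answer ONLY using information explicitly stated in the provided source "
        "documents; every factual claim MUST have an inline citation"
    ),
}

# Placeholder wording (assumption): the repository's actual guidance text is not shown.
ERROR_GUIDANCE = {
    "incomplete_reasoning": "Work through every reasoning step and use all relevant evidence.",
    "answer_redundance": "State each point once; remove repetition.",
    "ambiguity_understanding": "Re-read the question and answer exactly what is asked.",
}
COT_FALLBACK = "Reason step by step before giving the final answer."


def build_writing_directive(category: str, error_types: Optional[list[str]] = None) -> Optional[str]:
    """Return the directive for the next write step, or None if none applies."""
    if category == "insufficient_knowledge":
        return None  # assumed: handled by re-retrieval rather than a directive
    if category in FIXED_DIRECTIVES:
        return FIXED_DIRECTIVES[category]
    if category == "reasoning_error":
        guidance = [ERROR_GUIDANCE[e] for e in (error_types or []) if e in ERROR_GUIDANCE]
        return " ".join(guidance) if guidance else COT_FALLBACK
    raise ValueError(f"unknown diagnosis category: {category}")
```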

Convergence Detection (Paper §6.5)

To prevent the loop from cycling without improvement, check_convergence computes the Jaccard similarity between the token sets of consecutive answers:

J(A_{t-1}, A_t) = |tokens(A_{t-1}) ∩ tokens(A_t)| / |tokens(A_{t-1}) ∪ tokens(A_t)|

If J ≥ 0.85 (configurable via metacognitive_convergence_threshold), the loop terminates: the model has stabilized and further rounds would not improve the answer. The iteration limit (max_metacognitive_rounds = 3) provides a hard upper bound.
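The check can be sketched in a few lines. The tokenization used here (lowercased word characters) is an assumption; the repository's `check_convergence` may tokenize differently, which would shift the similarity values slightly.

```python
import re


def tokens(text: str) -> set[str]:
    """Assumed tokenizer: lowercase runs of word characters."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two answers' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def check_convergence(previous: str, current: str, threshold: float = 0.85) -> bool:
    """True when consecutive answers are similar enough to stop iterating."""
    return jaccard(previous, current) >= threshold
```

For instance, the answers "a b c d" and "a b c e" share 3 of 5 distinct tokens, giving J = 0.6, so the loop would continue. Identical answers give J = 1.0 and terminate it.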

Why This Differs from Standard RAG and ReAct

Standard RAG performs one retrieval pass and one generation pass with no quality feedback. ReAct interleaves reasoning and retrieval via a thought–action–observation loop but has no explicit model of why a given answer is weak. This system differs in three ways:

Typed failure diagnosis: rather than re-running retrieval blindly, the system determines whether the failure is a knowledge-access problem (insufficient_knowledge, external_knowledge_only) or a reasoning/generation problem (internal_knowledge_only, reasoning_error) and applies different interventions for each.
Writing directives as soft constraints: the planning phase modifies the generation prompt rather than the retrieval strategy for non-knowledge-gap failures, targeting the root cause without expensive additional retrieval.
Convergence-aware iteration: the Jaccard check prevents over-thinking, consistent with Paper §6.5's empirical finding that performance degrades when the model can no longer extract useful improvements from additional rounds.

5. Benchmark Results

All results use the gold-context evaluation protocol: the dataset's supporting documents are provided directly to the pipeline, bypassing open-domain retrieval. This protocol isolates reasoning and metacognitive regulation quality from retrieval quality.

Evaluation settings: 500 examples, seed=42, concurrency=20, DeepSeek V3.2 (deepseek-chat), max 3 metacognitive rounds, convergence threshold 0.85.
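The EM, F1, precision, and recall columns reported in this section are answer-level string metrics. The sketch below assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles) and token-overlap scoring; whether `benchmarks/metrics.py` normalizes in exactly this way is an assumption. Scores here are in the range 0 to 1, whereas the tables report percentages.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def precision_recall_f1(prediction: str, gold: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

As a worked case, the prediction "Paris France" against the gold answer "Paris" has EM 0, precision 0.5, recall 1.0, and F1 of 2/3, which illustrates why F1 can sit well above EM for verbose answers.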

HotpotQA

Model	EM	F1	Precision	Recall
Standard RAG	24.6	33.0	34.1	34.5
ReAct	24.8	41.7	42.6	44.7
Flare	29.2	42.4	42.8	43.0
IR-CoT	31.4	40.3	41.6	41.2
Self-Ask	28.2	43.1	43.4	44.8
Reflexion	30.0	43.4	43.2	44.3
MetaRAG (paper, GPT-3.5-turbo)	37.8	49.9	52.1	50.9
Ours (DeepSeek V3.2)	59.0	75.1	77.1	78.2

2WikiMultiHopQA

Model	EM	F1	Precision	Recall
Standard RAG	18.8	25.2	25.6	26.2
ReAct	21.0	28.0	27.6	30.0
Flare	28.2	39.8	40.0	40.8
IR-CoT	30.8	42.6	42.3	40.9
Self-Ask	28.6	37.5	36.5	42.8
Reflexion	31.8	41.7	40.6	44.2
MetaRAG (paper, GPT-3.5-turbo)	42.8	50.8	50.7	52.2
Ours (DeepSeek V3.2)	66.2	74.2	72.7	78.4

Combined Summary

Dataset	EM (Ours / Paper)	F1 (Ours / Paper)	ΔEM	ΔF1
HotpotQA	59.0 / 37.8	75.1 / 49.9	+21.2	+25.2
2WikiMultiHopQA	66.2 / 42.8	74.2 / 50.8	+23.4	+23.4
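The deltas in the summary table can be reproduced from the EM and F1 values reported in this section. The dictionary layout and the `deltas` helper are only for illustration.

```python
# (EM, F1) pairs copied from the tables in this section
results = {
    "HotpotQA": {"ours": (59.0, 75.1), "paper": (37.8, 49.9)},
    "2WikiMultiHopQA": {"ours": (66.2, 74.2), "paper": (42.8, 50.8)},
}


def deltas(dataset: str) -> tuple[float, float]:
    """Return (delta EM, delta F1), rounded to one decimal place."""
    (em_ours, f1_ours) = results[dataset]["ours"]
    (em_paper, f1_paper) = results[dataset]["paper"]
    return round(em_ours - em_paper, 1), round(f1_ours - f1_paper, 1)


print(deltas("HotpotQA"))         # (21.2, 25.2)
print(deltas("2WikiMultiHopQA"))  # (23.4, 23.4)
```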

Analysis

The +21–23 EM improvement over the paper's MetaRAG baseline most plausibly reflects the capability gap between GPT-3.5-turbo (used in the 2024 paper) and DeepSeek V3.2: a stronger base model generates higher-quality initial answers and responds more accurately to metacognitive writing directives. The loop statistics show how often the loop actually intervenes: 46% of HotpotQA and 66% of 2WikiMultiHopQA queries passed the monitoring gate with zero remediation rounds. A further 20% of HotpotQA and 16% of 2WikiMultiHopQA queries stopped before the three-round limit, which suggests the Jaccard threshold (0.85) ends the loop early for a meaningful share of queries. HotpotQA had a higher rate of queries exhausting all three rounds (34% vs. 18%), consistent with its harder comparative multi-hop questions. Zero errors across 1,000 queries indicate that the pipeline is stable under concurrent execution.

Metacognitive Loop Statistics

Metric	HotpotQA	2WikiMultiHopQA
Avg rounds per query	1.32	0.75
No remediation (round = 0)	230 / 500 (46.0%)	332 / 500 (66.4%)
Converged early (0 < rounds < 3)	101 / 500 (20.2%)	78 / 500 (15.6%)
Hit max rounds (3)	169 / 500 (33.8%)	90 / 500 (18.0%)
Avg latency per query	34,831 ms	23,427 ms
Errors	0	0
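As a consistency check on the table above, the reported average rounds should fall between the bounds implied by the bucket counts. The sketch assumes that "converged early" queries used either 1 or 2 rounds and that "hit max rounds" queries used exactly 3; the helper name `avg_round_bounds` is hypothetical.

```python
TOTAL = 500

buckets = {
    "HotpotQA": {"none": 230, "early": 101, "max": 169, "reported_avg": 1.32},
    "2WikiMultiHopQA": {"none": 332, "early": 78, "max": 90, "reported_avg": 0.75},
}


def avg_round_bounds(b: dict, max_rounds: int = 3) -> tuple[float, float]:
    """Lower/upper bound on average rounds given the bucket counts."""
    low = (b["early"] * 1 + b["max"] * max_rounds) / TOTAL
    high = (b["early"] * (max_rounds - 1) + b["max"] * max_rounds) / TOTAL
    return low, high


for name, b in buckets.items():
    low, high = avg_round_bounds(b)
    print(f"{name}: {low:.3f} <= {b['reported_avg']} <= {high:.3f}")
```

Both reported averages (1.32 and 0.75) fall inside their respective ranges (1.216 to 1.418, and 0.696 to 0.852), so the bucket counts and the averages are mutually consistent.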

6. Project Structure

Meta-RAG/
├── backend/
│   ├── app/
│   │   ├── agent/
│   │   │   ├── graph.py           # LangGraph state graph: all nodes, edges, and conditional routing
│   │   │   ├── metacognitive.py  # Three-phase metacognitive regulation (monitoring, evaluating, planning)
│   │   │   ├── planner.py         # LLM-based query complexity classification
│   │   │   ├── reader.py          # Evidence extraction from retrieved documents
│   │   │   └── writer.py         # Answer generation with optional writing directive
│   │   ├── retrieval/
│   │   │   ├── hybrid.py          # Reciprocal Rank Fusion of dense + BM25 results
│   │   │   ├── dense.py          # Qdrant vector search with BGE-small embeddings
│   │   │   ├── bm25_retrieval.py # Elasticsearch BM25 keyword search
│   │   │   ├── reranker.py        # Cross-encoder reranking (BGE-reranker-base)
│   │   │   ├── guardrails.py      # Chunk filtering: safety, quality, deduplication
│   │   │   └── diagnostics.py    # Retrieval quality metrics (coverage, diversity)
│   │   ├── optimization/
│   │   │   └── evaluator.py      # LLM faithfulness + completeness scoring
│   │   ├── verification/
│   │   │   ├── citation_verifier.py  # Citation precision and evidence alignment
│   │   │   └── claim_extractor.py    # Claim segmentation from generated answers
│   │   ├── research/
│   │   │   └── evidence_graph.py    # Claim–document support graph construction
│   │   ├── memory/
│   │   │   └── strategy_memory.py  # PostgreSQL analytics persistence
│   │   ├── ingestion/
│   │   │   └── pipeline.py          # Document chunking → Qdrant + Elasticsearch indexing
│   │   ├── api/
│   │   │   └── routes.py            # FastAPI routers: query, stream, documents, ingest, analytics
│   │   ├── models/
│   │   │   ├── schemas.py          # Pydantic request/response schemas
│   │   │   └── db_models.py        # SQLAlchemy ORM models
│   │   ├── config.py               # Pydantic-settings configuration
│   │   ├── llm.py                  # DeepSeek LLM abstraction via OpenAI-compatible API
│   │   ├── cost.py                 # Token cost tracking
│   │   ├── database.py              # Async SQLAlchemy session management
│   │   ├── text_utils.py           # Shared text utilities
│   │   └── main.py                 # FastAPI application factory
│   ├── benchmarks/
│   │   ├── runner.py               # Async benchmark runner with gold and open-domain modes
│   │   ├── datasets.py             # HotpotQA and 2WikiMultiHopQA data loaders
│   │   ├── metrics.py              # EM, F1, Precision, Recall computation
│   │   ├── compare.py              # Paper baseline comparison utilities
│   │   └── results/
│   │       └── reports/
│   │           └── final_report.md # Official benchmark results
│   ├── tests/                      # pytest test suite
│   ├── pyproject.toml              # Project metadata and dependencies (uv)
│   └── Dockerfile                  # Backend container
├── frontend/                       # Web UI
├── docs/
│   ├── diagrams/                  # draw.io architecture diagrams (4 pipelines)
│   └── papers/                    # Source paper (MetaRAG, arXiv:2402.11626)
├── docker-compose.yml               # Qdrant + Elasticsearch + PostgreSQL + API
├── .env.example                    # Configuration template
└── README.md

7. Quickstart

Prerequisites

uv (Python package manager)
Docker and Docker Compose
A DeepSeek API key (platform.deepseek.com)

1. Clone and configure

git clone <repository-url>
cd Meta-RAG
cp .env.example .env
# Edit .env: set DEEPSEEK_API_KEY

2. Start infrastructure

docker compose up -d qdrant elasticsearch postgres

Wait for Elasticsearch to become healthy (10–20 seconds):

docker compose ps   # all three services should show "healthy" or "running"

3. Install backend dependencies

cd backend
uv sync

4. Start the API server

uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The API is available at http://localhost:8000. Interactive docs: http://localhost:8000/docs.

5. Ingest documents (optional for API usage)

Upload PDFs or other documents via the /ingest endpoint, or use the ingestion API directly. For benchmark evaluation, ingestion is handled automatically.

8. Running Benchmarks

Benchmarks require the infrastructure services (qdrant, elasticsearch, postgres) to be running. All commands are run from the backend/ directory.

Gold-context evaluation (recommended — isolates reasoning quality)

# HotpotQA, 500 examples, 20 concurrent queries
uv run python -m benchmarks.runner --dataset hotpotqa --mode gold --n 500 --concurrency 20

# 2WikiMultiHopQA, 500 examples
uv run python -m benchmarks.runner --dataset 2wikimultihopqa --mode gold --n 500 --concurrency 20

# Both datasets sequentially
uv run python -m benchmarks.runner --dataset both --mode gold --n 500 --concurrency 20

Open-domain evaluation (full retrieval pipeline)

# Ingests the benchmark corpus into Qdrant + Elasticsearch, then runs end-to-end
uv run python -m benchmarks.runner --dataset hotpotqa --mode open_domain --n 500 --concurrency 5

Full argument reference

--dataset       hotpotqa | 2wikimultihopqa | both
--mode          gold | open_domain
--n             Number of examples (default: 500; paper uses 500)
--seed          Random seed for subsampling (default: 42)
--concurrency   Parallel queries (default: 1; use 10–25 for speed)
--provider      deepseek (overrides LLM_PROVIDER in .env)
--convergence-threshold   Jaccard threshold for convergence detection (default: 0.85)
--fast-path-threshold     Evaluator confidence for fast-path skip (default: disabled)
--disable-parallel        Run post-write verification steps sequentially
--output-dir    Results directory (default: benchmarks/results)

Results are saved as timestamped JSON files in benchmarks/results/evaluations/. Cost logs are written to benchmarks/results/cost_logs/cost_log.jsonl.

Running tests

uv run pytest

Docker (full stack including API)

docker compose up --build

9. Configuration Reference

All parameters are set via environment variables or the .env file (see .env.example). The Settings class in backend/app/config.py provides typed access with defaults.

Parameter	Default	Description
`LLM_PROVIDER`	`deepseek`	LLM provider (set to `deepseek`; no alternatives)
`DEEPSEEK_API_KEY`	—	DeepSeek API key (required)
`DEEPSEEK_MODEL`	`deepseek-chat`	DeepSeek model identifier
`DEEPSEEK_BASE_URL`	`https://api.deepseek.com`	DeepSeek API base URL
`DATABASE_URL`	`postgresql+asyncpg://ara:ara_secret@localhost:5432/ara_db`	PostgreSQL connection string
`QDRANT_HOST`	`localhost`	Qdrant server hostname
`QDRANT_PORT`	`6333`	Qdrant HTTP (REST) port
`QDRANT_COLLECTION`	`research_papers`	Qdrant collection name for document chunks
`ELASTICSEARCH_URL`	`http://localhost:9200`	Elasticsearch endpoint
`ELASTICSEARCH_INDEX`	`research_chunks`	Elasticsearch index name
`EMBEDDING_MODEL`	`BAAI/bge-small-en-v1.5`	HuggingFace model for dense embeddings
`RERANKER_MODEL`	`BAAI/bge-reranker-base`	HuggingFace cross-encoder for reranking
`MAX_METACOGNITIVE_ROUNDS`	`3`	Hard cap on metacognitive loop iterations per query (Paper §6.5)
`METACOGNITIVE_CONVERGENCE_THRESHOLD`	`0.85`	Jaccard similarity threshold for early loop termination (Paper §6.5)
`FAST_PATH_THRESHOLD`	`None`	Evaluator confidence threshold to skip the LLM diagnosis call; `None` disables (Phase 3 optimization)
`FAITHFULNESS_THRESHOLD`	`0.7`	Minimum faithfulness score to pass the monitoring gate
`COMPLETENESS_THRESHOLD`	`0.6`	Minimum completeness score to pass the monitoring gate
`CITATION_PRECISION_THRESHOLD`	`0.4`	Minimum citation precision to pass the monitoring gate
`EVIDENCE_COVERAGE_THRESHOLD`	`0.45`	Minimum evidence coverage to suppress additional retrieval hops
`MIN_RETRIEVAL_DIVERSITY`	`0.25`	Minimum document diversity required in the retrieved set
`EVALUATOR_CONFIDENCE_THRESHOLD`	`0.4`	Minimum evaluator confidence to influence routing decisions
`DISABLE_PARALLEL`	`False`	Run claim extraction / citation verification / evaluation sequentially instead of concurrently
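To illustrate how environment variables override these defaults, here is a dependency-free stand-in for a handful of the parameters. It deliberately uses only the standard library rather than pydantic-settings, so it is not the `Settings` class from `backend/app/config.py`; the helper names and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional


def _env_float(name: str, default: Optional[float]) -> Optional[float]:
    """Read a float from the environment; empty or 'None' means use the default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip().lower() in {"", "none"}:
        return default
    return float(raw)


@dataclass(frozen=True)
class LoopSettings:
    max_metacognitive_rounds: int
    metacognitive_convergence_threshold: float
    fast_path_threshold: Optional[float]
    disable_parallel: bool


def load_settings() -> LoopSettings:
    """Build settings from the environment, falling back to documented defaults."""
    return LoopSettings(
        max_metacognitive_rounds=int(os.environ.get("MAX_METACOGNITIVE_ROUNDS", "3")),
        metacognitive_convergence_threshold=_env_float("METACOGNITIVE_CONVERGENCE_THRESHOLD", 0.85),
        fast_path_threshold=_env_float("FAST_PATH_THRESHOLD", None),
        disable_parallel=os.environ.get("DISABLE_PARALLEL", "false").strip().lower()
        in {"1", "true", "yes"},
    )
```

With no variables set, `load_settings()` returns the documented defaults. Exporting `DISABLE_PARALLEL=true` before starting the process switches the sketch to sequential verification.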

Retrieval Routing Configurations

The planner classifies each query into one of three complexity levels and maps it to a fixed retrieval configuration (top-k, reranking, and multi-hop limits):

Complexity	top_k	Rerank	Multi-hop limit
`simple`	5	✗	1
`complex`	10	✓	1
`multi-hop`	10	✓	3
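The routing table above amounts to a static lookup. A minimal sketch, with the `RetrievalConfig` dataclass and `select_config` function as hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    top_k: int      # documents requested per retrieval call
    rerank: bool    # whether the cross-encoder reranker runs
    max_hops: int   # upper bound on retrieval hops


# Values copied from the routing table
ROUTING = {
    "simple": RetrievalConfig(top_k=5, rerank=False, max_hops=1),
    "complex": RetrievalConfig(top_k=10, rerank=True, max_hops=1),
    "multi-hop": RetrievalConfig(top_k=10, rerank=True, max_hops=3),
}


def select_config(complexity: str) -> RetrievalConfig:
    """Look up the fixed retrieval configuration for a planner label."""
    try:
        return ROUTING[complexity]
    except KeyError:
        raise ValueError(f"unknown complexity level: {complexity}") from None
```

How the real planner handles an unrecognized label is not documented here; raising an error is simply the most conservative choice for a sketch.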

10. Citation

If this implementation is useful for your research, please cite the original MetaRAG paper:

@article{zhou2024metacognitive,
  title   = {Metacognitive Retrieval-Augmented Large Language Models},
  author  = {Zhou, Yujia and Liu, Zheng and Jin, Jiajie and Nie, Jian-Yun and Dou, Zhicheng},
  journal = {arXiv preprint arXiv:2402.11626},
  year    = {2024},
  url     = {https://arxiv.org/abs/2402.11626}
}
