
YuITC/Metacognitive-Agentic-RAG-System


Meta-RAG: Metacognitive Retrieval-Augmented Generation

A LangGraph implementation of the MetaRAG paper with a three-phase metacognitive regulation loop and hybrid dense–sparse retrieval, evaluated on HotpotQA and 2WikiMultiHopQA against all paper baselines.



1. Abstract

Multi-hop question answering over retrieved documents requires more than a single retrieval-generation pass: models must recognize when their own answer is inadequate and understand why.

This work implements and extends the MetaRAG framework (Zhou et al., 2024), which introduces a three-phase metacognitive regulation loop (monitoring, evaluating, and planning) directly into the RAG pipeline. The system classifies answer failures into four knowledge-deficit categories (insufficient knowledge, internal-only, external-only, reasoning error) and applies category-specific remediation strategies, including targeted re-retrieval and constrained writing directives.

Evaluated on 500 examples from HotpotQA and 2WikiMultiHopQA under the gold-context protocol with DeepSeek V3.2, the implementation achieves 59.0 EM / 75.1 F1 on HotpotQA and 66.2 EM / 74.2 F1 on 2WikiMultiHopQA, surpassing the paper's reported MetaRAG results (37.8 / 49.9 and 42.8 / 50.8 respectively) by 21–23 EM points and 23–25 F1 points.

The performance differential is attributable to DeepSeek V3.2's substantially greater reasoning capacity relative to the GPT-3.5-turbo baseline used in the paper; the metacognitive loop design follows the paper's specification faithfully.


2. Tech Stack

| Technology | Version | Responsibility |
| --- | --- | --- |
| Python | ≥ 3.13 | Runtime |
| LangGraph | ≥ 0.2 | Agent graph orchestration; stateful multi-node DAG with conditional edges |
| LangChain | ≥ 0.3 | LLM abstraction and message formatting |
| FastAPI | ≥ 0.115 | REST API server for document ingestion and query endpoints |
| Qdrant | ≥ 1.12 | Vector store for dense (embedding) retrieval |
| Elasticsearch | 8.17 | Inverted-index BM25 sparse retrieval |
| PostgreSQL | 16 (via asyncpg) | Document metadata and analytics storage |
| sentence-transformers | ≥ 3.3 | Local embedding (BAAI/bge-small-en-v1.5) and reranking (BAAI/bge-reranker-base) |
| PyMuPDF | ≥ 1.25 | PDF ingestion and text extraction |
| DeepSeek V3.2 | deepseek-chat | Primary and sole LLM (OpenAI-compatible API) |
| Pydantic | ≥ 2.10 | Configuration management and data validation |
| Docker Compose | n/a | Infrastructure orchestration (Qdrant, Elasticsearch, PostgreSQL, API) |

3. Architecture Overview

The system implements a LangGraph state machine with eleven processing nodes. In the default parallel mode, the post-write verification stack (claim extraction, citation verification, evidence graph, and LLM evaluation) collapses into a single parallel_verify_and_evaluate_node. In sequential mode (when DISABLE_PARALLEL=true), each step runs as a separate node. Each query traverses the graph as a typed AgentState dictionary, with conditional edges controlling multi-hop retrieval and the metacognitive remediation loop.
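The parallel mode described above can be pictured with a minimal asyncio sketch. The four coroutines here are hypothetical stand-ins with made-up return values, not the repository's functions; only the idea of running the post-write steps concurrently and merging their outputs into the state comes from the text.

```python
import asyncio

# Hypothetical stand-ins for the four post-write steps. The real steps call
# the claim extractor, citation verifier, evidence-graph builder, and LLM judge.
async def extract_claims(answer: str) -> list:
    return [part.strip() for part in answer.split(".") if part.strip()]

async def verify_citations(answer: str) -> dict:
    return {"citation_precision": 1.0}  # placeholder value

async def build_evidence_graph(answer: str) -> dict:
    return {"edges": []}  # placeholder value

async def evaluate_answer(answer: str) -> dict:
    return {"faithfulness": 0.9, "completeness": 0.8}  # placeholder values

async def parallel_verify_and_evaluate(state: dict) -> dict:
    """Run the four steps concurrently and merge their outputs into the state."""
    answer = state["answer"]
    claims, citations, graph, scores = await asyncio.gather(
        extract_claims(answer),
        verify_citations(answer),
        build_evidence_graph(answer),
        evaluate_answer(answer),
    )
    return {
        **state,
        "claims": claims,
        "citations": citations,
        "evidence_graph": graph,
        "evaluation": scores,
    }

result = asyncio.run(parallel_verify_and_evaluate({"answer": "A. B."}))
```

In sequential mode the same four calls would simply be awaited one after another, each as its own graph node.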

End-to-End Pipeline

[Figure: end-to-end pipeline diagram]

Component Table

| Module | Location | Responsibility |
| --- | --- | --- |
| AgentState | backend/app/agent/graph.py | Typed state dictionary shared across all graph nodes |
| plan_node | backend/app/agent/graph.py | Classifies query into simple/complex/multi-hop; selects retrieval routing config |
| retrieve_node | backend/app/agent/graph.py | Dispatches hybrid RRF search; applies cross-encoder reranking and retrieval guardrails |
| read_node | backend/app/agent/graph.py | Extracts evidence spans from retrieved documents |
| write_node | backend/app/agent/graph.py | Generates the grounded answer; appends writing_directive on metacognitive remediation rounds |
| parallel_verify_and_evaluate_node | backend/app/agent/graph.py | Concurrently runs claim extraction, citation verification, evidence graph, and LLM evaluation |
| diagnose_node | backend/app/agent/graph.py | Invokes the metacognitive monitor and evaluator-critic; populates diagnosis state |
| remediate_node | backend/app/agent/graph.py | Applies targeted remediation: targeted re-retrieval for INSUFFICIENT, writing directives for others |
| claim_extract_node | backend/app/agent/graph.py | Segments the generated answer into individual factual claims (sequential mode only) |
| citation_verify_node | backend/app/agent/graph.py | Computes citation precision, unsupported claim rate, and evidence alignment score (sequential mode only) |
| evidence_graph_node | backend/app/agent/graph.py | Constructs claim–document support graph for interpretability (sequential mode only) |
| evaluate_node | backend/app/agent/graph.py | LLM-based faithfulness, completeness, and confidence scoring (sequential mode only) |
| diagnose_answer | backend/app/agent/metacognitive.py | Three-phase metacognitive regulation: monitoring gate → evaluator-critic LLM → heuristic fallback |
| build_writing_directive | backend/app/agent/metacognitive.py | Maps diagnosis category to a concrete writing instruction |
| check_convergence | backend/app/agent/metacognitive.py | Jaccard similarity convergence detection between consecutive answers |
| plan_query | backend/app/agent/planner.py | LLM-based query complexity classification (simple/complex/multi-hop) with rule-based fallback |
| write_answer | backend/app/agent/writer.py | Grounded answer generation with inline citations and optional writing directive injection |
| read_documents | backend/app/agent/reader.py | Evidence extraction from retrieved documents |
| hybrid_search | backend/app/retrieval/hybrid.py | Reciprocal Rank Fusion of dense (Qdrant) and sparse (BM25/Elasticsearch) results |
| dense_search | backend/app/retrieval/dense.py | Qdrant vector search with BAAI/bge-small-en-v1.5 embeddings and document_id filtering |
| bm25_search | backend/app/retrieval/bm25_retrieval.py | Elasticsearch BM25 keyword search with document_id filtering |
| rerank | backend/app/retrieval/reranker.py | Cross-encoder reranking via BAAI/bge-reranker-base |
| filter_retrieved_docs | backend/app/retrieval/guardrails.py | Filters prompt injection patterns, adversarial content, and low-quality chunks |
| compute_retrieval_diagnostics | backend/app/retrieval/diagnostics.py | Observability metrics: query coverage, document diversity, retrieval redundancy, recall proxy |
| verify_citations | backend/app/verification/citation_verifier.py | Computes citation precision, unsupported claim rate, and evidence alignment score |
| extract_claims | backend/app/verification/claim_extractor.py | Segments the generated answer into individual factual claims |
| evaluate_answer | backend/app/optimization/evaluator.py | LLM-based faithfulness, completeness, and confidence scoring |
| build_evidence_graph | backend/app/research/evidence_graph.py | Lightweight claim–document support graph using token-overlap heuristic |
| log_run | backend/app/memory/strategy_memory.py | Persists run records to PostgreSQL for analytics (faithfulness, cost, latency, utility) |
| log_retrieval_diagnostics | backend/app/memory/strategy_memory.py | Persists retrieval observability metrics per query |
| log_provenance_snapshot | backend/app/memory/strategy_memory.py | Persists full answer provenance snapshots including diagnosis, citations, and evidence graph |
| Ingestion pipeline | backend/app/ingestion/pipeline.py | PDF/DOCX/HTML/TXT chunking → Qdrant dense upsert + Elasticsearch BM25 index |
| Settings | backend/app/config.py | Pydantic-settings configuration with .env override |
| LLM abstraction | backend/app/llm.py | DeepSeek LLM abstraction via OpenAI-compatible API |
| Cost tracking | backend/app/cost.py | Token cost estimation and DeepSeek balance monitoring |
| API routes | backend/app/api/routes.py | FastAPI routers: query, stream, document management, ingestion, and analytics endpoints |
| Benchmark runner | backend/benchmarks/runner.py | Async parallel evaluation with gold-context and open-domain modes |
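The Reciprocal Rank Fusion step listed for hybrid_search can be sketched as follows. This is the textbook RRF formula, not the repository's code: the function name, the input shape (ranked lists of document IDs), and the constant k = 60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. k = 60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists (d1 and d2 here) accumulate two contributions and therefore rank ahead of documents found by only one retriever.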

4. Metacognitive Loop

The metacognitive loop is the core algorithmic contribution. It wraps the standard RAG generation step with a three-phase regulation pipeline that decides whether an answer is acceptable, diagnoses the specific failure mode if not, and applies a targeted corrective strategy.

Phase 1, Monitoring: Fast-Path Satisfaction Gate (Paper §4.1)

After the LLM produces an initial answer, the system evaluates three scalar metrics via an LLM judge: faithfulness, answer completeness, and citation precision. The monitoring gate (is_answer_satisfactory) compares these against configured thresholds:

faithfulness ≥ 0.70  AND  completeness ≥ 0.60  AND  citation_precision ≥ 0.40

If all three conditions are met, the answer is classified as satisfactory and the metacognitive loop terminates without an additional LLM call. In the benchmark evaluation, 46% of HotpotQA queries and 66% of 2WikiMultiHopQA queries passed this gate on the first attempt.

An optional confidence-based fast path (fast_path_threshold) can additionally skip the expensive evaluator-critic invocation when the evaluator's self-reported confidence is high and metrics are within 0.15 of their thresholds. This optimization is disabled by default for reproducibility.
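A minimal sketch of the monitoring gate, assuming the three metrics arrive as a plain dictionary of floats. The name is_answer_satisfactory and the default thresholds come from this README; the signature and the GateThresholds container are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateThresholds:
    # Defaults taken from the thresholds stated in this README.
    faithfulness: float = 0.70
    completeness: float = 0.60
    citation_precision: float = 0.40

def is_answer_satisfactory(metrics, thresholds=GateThresholds()):
    """Return True only if all three metrics meet their thresholds.

    A missing metric is treated as 0.0, so it fails the gate (an assumption).
    """
    return (
        metrics.get("faithfulness", 0.0) >= thresholds.faithfulness
        and metrics.get("completeness", 0.0) >= thresholds.completeness
        and metrics.get("citation_precision", 0.0) >= thresholds.citation_precision
    )

passing = {"faithfulness": 0.9, "completeness": 0.7, "citation_precision": 0.5}
failing = {"faithfulness": 0.9, "completeness": 0.5, "citation_precision": 0.5}
```

With these inputs, passing clears the gate and failing does not, because its completeness of 0.5 is below 0.60.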

Phase 2, Evaluating: Evaluator-Critic Diagnosis (Paper §4.2)

When the monitoring gate fails, the system invokes an LLM evaluator-critic with a structured prompt that performs two simultaneous assessments:

Procedural knowledge classification, an independent assessment of two knowledge sources:

  • Internal sufficiency: could the LLM answer reliably from training knowledge alone?
  • External sufficiency: do the retrieved documents contain adequate, relevant information?

The combination of these two boolean signals produces one of four knowledge categories:

| Category | Internal | External | Condition |
| --- | --- | --- | --- |
| insufficient_knowledge | ✗ | ✗ | Neither source is adequate; targeted re-retrieval is required |
| internal_knowledge_only | ✓ | ✗ | LLM knows the answer but retrieved docs are noisy or misleading |
| external_knowledge_only | ✗ | ✓ | Docs contain the answer but the LLM's own knowledge is wrong or absent |
| reasoning_error | ✓ | ✓ | Both sources are adequate; the failure is in reasoning or structure |
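The mapping from the two sufficiency signals to a category can be written as a small function. This is a sketch of the table's logic only; the function name and boolean parameters are hypothetical, and in the system the signals come from the evaluator-critic LLM.

```python
def classify_knowledge(internal_sufficient: bool, external_sufficient: bool) -> str:
    """Map the two boolean sufficiency signals to one of four categories."""
    if internal_sufficient and external_sufficient:
        return "reasoning_error"          # both sources adequate
    if internal_sufficient:
        return "internal_knowledge_only"  # retrieved docs are inadequate
    if external_sufficient:
        return "external_knowledge_only"  # model's own knowledge is inadequate
    return "insufficient_knowledge"       # neither source is adequate

all_categories = {
    (internal, external): classify_knowledge(internal, external)
    for internal in (False, True)
    for external in (False, True)
}
```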

Declarative knowledge classification, the detection of three common error patterns:

| Error Type | Description |
| --- | --- |
| incomplete_reasoning | Missed reasoning steps, incomplete chain of thought, ignored evidence |
| answer_redundance | Repetitive or verbose; similar points stated multiple times |
| ambiguity_understanding | Misread the query's intent; answered a related but different question |

When the LLM evaluator-critic fails (network error, malformed JSON), a rule-based heuristic fallback classifies the diagnosis from the raw metric values.

Phase 3, Planning: Targeted Remediation (Paper §4.3)

The diagnosis category maps to a specific remediation strategy in remediate_node:

| Category | Remediation Action |
| --- | --- |
| insufficient_knowledge | Issues a targeted sub-query (suggested_query from the evaluator) to retrieve additional documents; augments the evidence pool before re-generation |
| internal_knowledge_only | Prepends directive: "Rely primarily on your own knowledge; only cite a reference if it clearly and directly supports a claim" |
| external_knowledge_only | Prepends directive: "Answer ONLY using information explicitly stated in the provided source documents; every factual claim MUST have an inline citation" |
| reasoning_error | Prepends error-specific guidance targeting the detected error type(s); falls back to chain-of-thought instruction if no specific type matches |

The writing_directive is passed as an additional system-level instruction into the next write_node invocation, shaping how the LLM re-generates the answer without altering the underlying evidence pool (except for insufficient_knowledge).
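The category-to-directive mapping can be sketched as below. The two quoted directives are copied from the table above; the error-specific guidance strings and the chain-of-thought fallback wording are illustrative placeholders, since the README does not show the repository's actual text, and the signature of build_writing_directive is an assumption.

```python
from typing import Iterable, Optional

INTERNAL_ONLY = (
    "Rely primarily on your own knowledge; only cite a reference "
    "if it clearly and directly supports a claim"
)
EXTERNAL_ONLY = (
    "Answer ONLY using information explicitly stated in the provided source "
    "documents; every factual claim MUST have an inline citation"
)
# Illustrative wording only: the repository's guidance strings are not shown here.
ERROR_GUIDANCE = {
    "incomplete_reasoning": "Work through every reasoning step and use all relevant evidence.",
    "answer_redundance": "State each point once and keep the answer concise.",
    "ambiguity_understanding": "Re-read the question and answer exactly what it asks.",
}
COT_FALLBACK = "Reason step by step before stating the final answer."

def build_writing_directive(category: str, error_types: Iterable[str] = ()) -> Optional[str]:
    """Return the directive for a diagnosis category, or None when the
    remediation is re-retrieval rather than a prompt change."""
    if category == "insufficient_knowledge":
        return None
    if category == "internal_knowledge_only":
        return INTERNAL_ONLY
    if category == "external_knowledge_only":
        return EXTERNAL_ONLY
    if category == "reasoning_error":
        hints = [ERROR_GUIDANCE[e] for e in error_types if e in ERROR_GUIDANCE]
        return " ".join(hints) if hints else COT_FALLBACK
    raise ValueError(f"unknown diagnosis category: {category!r}")
```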

Convergence Detection (Paper §6.5)

To prevent the loop from cycling without improvement, check_convergence computes the Jaccard similarity between the token sets of consecutive answers:

J(A_{t-1}, A_t) = |tokens(A_{t-1}) ∩ tokens(A_t)| / |tokens(A_{t-1}) ∪ tokens(A_t)|

If J ≥ 0.85 (configurable via metacognitive_convergence_threshold), the loop terminates, on the assumption that the answer has stabilized and further rounds are unlikely to improve it. The iteration limit (max_metacognitive_rounds = 3) provides a hard upper bound.
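A minimal sketch of the convergence check. The name check_convergence and the 0.85 default come from this README; the tokenization (lowercased word characters) and the handling of two empty answers are assumptions.

```python
import re

def _tokens(text: str) -> set:
    # Assumed tokenization: lowercase, split on non-word characters.
    return set(re.findall(r"\w+", text.lower()))

def jaccard(previous: str, current: str) -> float:
    """Jaccard similarity between the token sets of two answers."""
    a, b = _tokens(previous), _tokens(current)
    if not a and not b:
        return 1.0  # two empty answers are treated as identical (assumption)
    return len(a & b) / len(a | b)

def check_convergence(previous: str, current: str, threshold: float = 0.85) -> bool:
    """True when consecutive answers are similar enough to stop the loop."""
    return jaccard(previous, current) >= threshold

similarity = jaccard("a b c d", "a b c e")  # 3 shared tokens out of 5 distinct
```

Here the two strings share three of five distinct tokens, so the similarity is 0.6 and the loop would continue under the default threshold.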

Why This Differs from Standard RAG and ReAct

Standard RAG performs one retrieval pass and one generation pass with no quality feedback. ReAct interleaves reasoning and retrieval via a thought–action–observation loop but has no explicit model of why a given answer is weak. This system differs in three ways:

  1. Typed failure diagnosis: rather than re-running retrieval blindly, the system determines whether the failure is a knowledge-access problem (insufficient, external_only) or a reasoning/generation problem (internal_only, reasoning_error) and applies different interventions for each.
  2. Writing directives as soft constraints: the planning phase modifies the generation prompt rather than the retrieval strategy for non-knowledge-gap failures, targeting the root cause without expensive additional retrieval.
  3. Convergence-aware iteration: the Jaccard check prevents over-thinking, consistent with Paper §6.5's empirical finding that performance degrades when the model can no longer extract useful improvements from additional rounds.

5. Benchmark Results

All results use the gold-context evaluation protocol: the dataset's supporting documents are provided directly to the pipeline, bypassing open-domain retrieval. This protocol isolates reasoning and metacognitive regulation quality from retrieval quality.

Evaluation settings: 500 examples, seed=42, concurrency=20, DeepSeek V3.2 (deepseek-chat), max 3 metacognitive rounds, convergence threshold 0.85.
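The tables in this section report EM, F1, precision, and recall. A sketch of how these are conventionally computed for HotpotQA-style short answers is given here, assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles). Whether backend/benchmarks/metrics.py follows exactly this convention is an assumption, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def precision_recall_f1(prediction: str, gold: str):
    """Token-level precision, recall, and F1 between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "eiffel tower")
p, r, f1 = precision_recall_f1("Barack Obama Jr", "Barack Obama")
```

In the second example the prediction has one extra token, so precision is 2/3, recall is 1, and F1 is 0.8, while EM would be 0.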

HotpotQA

| Model | EM | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Standard RAG | 24.6 | 33.0 | 34.1 | 34.5 |
| ReAct | 24.8 | 41.7 | 42.6 | 44.7 |
| Flare | 29.2 | 42.4 | 42.8 | 43.0 |
| IR-CoT | 31.4 | 40.3 | 41.6 | 41.2 |
| Self-Ask | 28.2 | 43.1 | 43.4 | 44.8 |
| Reflexion | 30.0 | 43.4 | 43.2 | 44.3 |
| MetaRAG (paper, GPT-3.5-turbo) | 37.8 | 49.9 | 52.1 | 50.9 |
| Ours (DeepSeek V3.2) | 59.0 | 75.1 | 77.1 | 78.2 |

2WikiMultiHopQA

| Model | EM | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Standard RAG | 18.8 | 25.2 | 25.6 | 26.2 |
| ReAct | 21.0 | 28.0 | 27.6 | 30.0 |
| Flare | 28.2 | 39.8 | 40.0 | 40.8 |
| IR-CoT | 30.8 | 42.6 | 42.3 | 40.9 |
| Self-Ask | 28.6 | 37.5 | 36.5 | 42.8 |
| Reflexion | 31.8 | 41.7 | 40.6 | 44.2 |
| MetaRAG (paper, GPT-3.5-turbo) | 42.8 | 50.8 | 50.7 | 52.2 |
| Ours (DeepSeek V3.2) | 66.2 | 74.2 | 72.7 | 78.4 |

Combined Summary

| Dataset | EM (Ours / Paper) | F1 (Ours / Paper) | ΔEM | ΔF1 |
| --- | --- | --- | --- | --- |
| HotpotQA | 59.0 / 37.8 | 75.1 / 49.9 | +21.2 | +25.2 |
| 2WikiMultiHopQA | 66.2 / 42.8 | 74.2 / 50.8 | +23.4 | +23.4 |
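The deltas in the summary can be reproduced directly from the reported scores. The snippet below is plain arithmetic on the numbers in this README; the dictionary layout and helper name are only illustrative.

```python
# (EM, F1) pairs as reported in the tables above.
results = {
    "HotpotQA": {"ours": (59.0, 75.1), "paper": (37.8, 49.9)},
    "2WikiMultiHopQA": {"ours": (66.2, 74.2), "paper": (42.8, 50.8)},
}

def deltas(ours, paper):
    """Per-metric improvement, rounded to one decimal like the table."""
    return tuple(round(o - p, 1) for o, p in zip(ours, paper))

for dataset, scores in results.items():
    d_em, d_f1 = deltas(scores["ours"], scores["paper"])
    print(f"{dataset}: delta EM = +{d_em}, delta F1 = +{d_f1}")
```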

Analysis

The +21.2 and +23.4 EM improvements over the paper's MetaRAG results most plausibly reflect the capability gap between the GPT-3.5-turbo model used in the paper (2024) and DeepSeek V3.2: a stronger base model generates higher-quality initial answers and responds more accurately to metacognitive writing directives.

The loop statistics show how often regulation was needed. 46% of HotpotQA and 66% of 2WikiMultiHopQA queries passed the monitoring gate with zero remediation rounds, so the fast-path gate spared a large share of queries from further LLM calls. A further 20% of HotpotQA and 16% of 2WikiMultiHopQA queries stopped before the three-round limit, which suggests the Jaccard threshold (0.85) is reasonably calibrated. HotpotQA's harder comparative multi-hop questions drove a higher rate of queries exhausting all three rounds (34% vs. 18%), consistent with the dataset's greater reasoning complexity. No errors occurred across the 1,000 queries, which suggests the pipeline is stable under concurrent execution.

Metacognitive Loop Statistics

| Metric | HotpotQA | 2WikiMultiHopQA |
| --- | --- | --- |
| Avg rounds per query | 1.32 | 0.75 |
| No remediation (round = 0) | 230 / 500 (46.0%) | 332 / 500 (66.4%) |
| Converged early (0 < rounds < 3) | 101 / 500 (20.2%) | 78 / 500 (15.6%) |
| Hit max rounds (3) | 169 / 500 (33.8%) | 90 / 500 (18.0%) |
| Avg latency per query | 34,831 ms | 23,427 ms |
| Errors | 0 | 0 |

6. Project Structure

Meta-RAG/
├── backend/
│   ├── app/
│   │   ├── agent/
│   │   │   ├── graph.py          # LangGraph DAG: all nodes, edges, and conditional routing
│   │   │   ├── metacognitive.py  # Three-phase metacognitive regulation (monitoring, evaluating, planning)
│   │   │   ├── planner.py        # LLM-based query complexity classification
│   │   │   ├── reader.py         # Evidence extraction from retrieved documents
│   │   │   └── writer.py         # Answer generation with optional writing directive
│   │   ├── retrieval/
│   │   │   ├── hybrid.py         # Reciprocal Rank Fusion of dense + BM25 results
│   │   │   ├── dense.py          # Qdrant vector search with BGE-small embeddings
│   │   │   ├── bm25_retrieval.py # Elasticsearch BM25 keyword search
│   │   │   ├── reranker.py       # Cross-encoder reranking (BGE-reranker-base)
│   │   │   ├── guardrails.py     # Chunk filtering: safety, quality, deduplication
│   │   │   └── diagnostics.py    # Retrieval quality metrics (coverage, diversity)
│   │   ├── optimization/
│   │   │   └── evaluator.py      # LLM faithfulness + completeness scoring
│   │   ├── verification/
│   │   │   ├── citation_verifier.py  # Citation precision and evidence alignment
│   │   │   └── claim_extractor.py    # Claim segmentation from generated answers
│   │   ├── research/
│   │   │   └── evidence_graph.py     # Claim–document support graph construction
│   │   ├── memory/
│   │   │   └── strategy_memory.py    # PostgreSQL analytics persistence
│   │   ├── ingestion/
│   │   │   └── pipeline.py           # Document chunking → Qdrant + Elasticsearch indexing
│   │   ├── api/
│   │   │   └── routes.py             # FastAPI routers: query, stream, documents, ingest, analytics
│   │   ├── models/
│   │   │   ├── schemas.py            # Pydantic request/response schemas
│   │   │   └── db_models.py          # SQLAlchemy ORM models
│   │   ├── config.py                 # Pydantic-settings configuration
│   │   ├── llm.py                    # DeepSeek LLM abstraction via OpenAI-compatible API
│   │   ├── cost.py                   # Token cost tracking
│   │   ├── database.py               # Async SQLAlchemy session management
│   │   ├── text_utils.py             # Shared text utilities
│   │   └── main.py                   # FastAPI application factory
│   ├── benchmarks/
│   │   ├── runner.py                 # Async benchmark runner with gold and open-domain modes
│   │   ├── datasets.py               # HotpotQA and 2WikiMultiHopQA data loaders
│   │   ├── metrics.py                # EM, F1, Precision, Recall computation
│   │   ├── compare.py                # Paper baseline comparison utilities
│   │   └── results/
│   │       └── reports/
│   │           └── final_report.md   # Official benchmark results
│   ├── tests/                        # pytest test suite
│   ├── pyproject.toml                # Project metadata and dependencies (uv)
│   └── Dockerfile                    # Backend container
├── frontend/                         # Web UI
├── docs/
│   ├── diagrams/                     # draw.io architecture diagrams (4 pipelines)
│   └── papers/                       # Source paper (MetaRAG, arXiv:2402.11626)
├── docker-compose.yml                # Qdrant + Elasticsearch + PostgreSQL + API
├── .env.example                      # Configuration template
└── README.md

7. Quickstart

Prerequisites

1. Clone and configure

git clone <repository-url>
cd Meta-RAG
cp .env.example .env
# Edit .env: set DEEPSEEK_API_KEY

2. Start infrastructure

docker compose up -d qdrant elasticsearch postgres

Wait for Elasticsearch to become healthy (10–20 seconds):

docker compose ps   # all three services should show "healthy" or "running"

3. Install backend dependencies

cd backend
uv sync

4. Start the API server

uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The API is available at http://localhost:8000. Interactive docs: http://localhost:8000/docs.

5. Ingest documents (optional for API usage)

Upload PDFs or other documents via the /ingest endpoint, or use the ingestion API directly. For benchmark evaluation, ingestion is handled automatically.


8. Running Benchmarks

Benchmarks require the infrastructure services (qdrant, elasticsearch, postgres) to be running. All commands are run from the backend/ directory.

Gold-context evaluation (recommended; isolates reasoning quality)

# HotpotQA, 500 examples, 20 concurrent queries
uv run python -m benchmarks.runner --dataset hotpotqa --mode gold --n 500 --concurrency 20

# 2WikiMultiHopQA, 500 examples
uv run python -m benchmarks.runner --dataset 2wikimultihopqa --mode gold --n 500 --concurrency 20

# Both datasets sequentially
uv run python -m benchmarks.runner --dataset both --mode gold --n 500 --concurrency 20

Open-domain evaluation (full retrieval pipeline)

# Ingests the benchmark corpus into Qdrant + Elasticsearch, then runs end-to-end
uv run python -m benchmarks.runner --dataset hotpotqa --mode open_domain --n 500 --concurrency 5

Full argument reference

--dataset       hotpotqa | 2wikimultihopqa | both
--mode          gold | open_domain
--n             Number of examples (default: 500; paper uses 500)
--seed          Random seed for subsampling (default: 42)
--concurrency   Parallel queries (default: 1; use 10–25 for speed)
--provider      deepseek (overrides LLM_PROVIDER in .env)
--convergence-threshold   Jaccard threshold for convergence detection (default: 0.85)
--fast-path-threshold     Evaluator confidence for fast-path skip (default: disabled)
--disable-parallel        Run post-write verification steps sequentially
--output-dir    Results directory (default: benchmarks/results)

Results are saved as timestamped JSON files in benchmarks/results/evaluations/. Cost logs are written to benchmarks/results/cost_logs/cost_log.jsonl.

Running tests

uv run pytest

Docker (full stack including API)

docker compose up --build

9. Configuration Reference

All parameters are set via environment variables or the .env file (see .env.example). The Settings class in backend/app/config.py provides typed access with defaults.

| Parameter | Default | Description |
| --- | --- | --- |
| LLM_PROVIDER | deepseek | LLM provider (set to deepseek; no alternatives) |
| DEEPSEEK_API_KEY | (none) | DeepSeek API key (required) |
| DEEPSEEK_MODEL | deepseek-chat | DeepSeek model identifier |
| DEEPSEEK_BASE_URL | https://api.deepseek.com | DeepSeek API base URL |
| DATABASE_URL | postgresql+asyncpg://ara:ara_secret@localhost:5432/ara_db | PostgreSQL connection string |
| QDRANT_HOST | localhost | Qdrant server hostname |
| QDRANT_PORT | 6333 | Qdrant HTTP (REST) port; Qdrant's gRPC interface defaults to 6334 |
| QDRANT_COLLECTION | research_papers | Qdrant collection name for document chunks |
| ELASTICSEARCH_URL | http://localhost:9200 | Elasticsearch endpoint |
| ELASTICSEARCH_INDEX | research_chunks | Elasticsearch index name |
| EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | HuggingFace model for dense embeddings |
| RERANKER_MODEL | BAAI/bge-reranker-base | HuggingFace cross-encoder for reranking |
| MAX_METACOGNITIVE_ROUNDS | 3 | Hard cap on metacognitive loop iterations per query (Paper §6.5) |
| METACOGNITIVE_CONVERGENCE_THRESHOLD | 0.85 | Jaccard similarity threshold for early loop termination (Paper §6.5) |
| FAST_PATH_THRESHOLD | None | Evaluator confidence threshold to skip the LLM diagnosis call; None disables (Phase 3 optimization) |
| FAITHFULNESS_THRESHOLD | 0.7 | Minimum faithfulness score to pass the monitoring gate |
| COMPLETENESS_THRESHOLD | 0.6 | Minimum completeness score to pass the monitoring gate |
| CITATION_PRECISION_THRESHOLD | 0.4 | Minimum citation precision to pass the monitoring gate |
| EVIDENCE_COVERAGE_THRESHOLD | 0.45 | Minimum evidence coverage to suppress additional retrieval hops |
| MIN_RETRIEVAL_DIVERSITY | 0.25 | Minimum document diversity required in the retrieved set |
| EVALUATOR_CONFIDENCE_THRESHOLD | 0.4 | Minimum evaluator confidence to influence routing decisions |
| DISABLE_PARALLEL | False | Run claim extraction / citation verification / evaluation sequentially instead of concurrently |
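To illustrate how environment variables map onto typed settings with defaults, here is a dependency-free sketch using only the standard library. It is a stand-in for illustration, not the repository's Settings class (which uses pydantic-settings); LoopSettings, load_loop_settings, and the accepted boolean spellings are hypothetical, and only four of the parameters above are covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class LoopSettings:
    max_metacognitive_rounds: int
    metacognitive_convergence_threshold: float
    fast_path_threshold: Optional[float]
    disable_parallel: bool

def load_loop_settings(env: Mapping[str, str] = os.environ) -> LoopSettings:
    """Read loop-related parameters, falling back to the documented defaults."""
    fast_path = env.get("FAST_PATH_THRESHOLD")
    return LoopSettings(
        max_metacognitive_rounds=int(env.get("MAX_METACOGNITIVE_ROUNDS", "3")),
        metacognitive_convergence_threshold=float(
            env.get("METACOGNITIVE_CONVERGENCE_THRESHOLD", "0.85")
        ),
        # An unset or empty value keeps the fast path disabled.
        fast_path_threshold=float(fast_path) if fast_path else None,
        # Accepted truthy spellings are an assumption.
        disable_parallel=env.get("DISABLE_PARALLEL", "false").lower() in {"1", "true", "yes"},
    )

defaults = load_loop_settings({})
overridden = load_loop_settings({"DISABLE_PARALLEL": "true", "FAST_PATH_THRESHOLD": "0.9"})
```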

Retrieval Routing Configurations

The planner classifies each query into one of three complexity levels and maps it to a fixed retrieval configuration (top-k, reranking, and multi-hop limits):

| Complexity | top_k | Rerank | Multi-hop limit |
| --- | --- | --- | --- |
| simple | 5 | ✗ | 1 |
| complex | 10 | ✓ | 1 |
| multi-hop | 10 | ✓ | 3 |
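The routing table can be expressed as a simple lookup. The values are those in the table; the RetrievalConfig type, the ROUTING name, and the choice to raise on an unknown label are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    top_k: int      # number of chunks to retrieve
    rerank: bool    # whether to apply the cross-encoder reranker
    max_hops: int   # multi-hop limit

# Values taken from the routing table in this README.
ROUTING = {
    "simple": RetrievalConfig(top_k=5, rerank=False, max_hops=1),
    "complex": RetrievalConfig(top_k=10, rerank=True, max_hops=1),
    "multi-hop": RetrievalConfig(top_k=10, rerank=True, max_hops=3),
}

def select_routing(complexity: str) -> RetrievalConfig:
    """Look up the fixed retrieval configuration for a complexity label."""
    try:
        return ROUTING[complexity]
    except KeyError:
        raise ValueError(f"unknown complexity level: {complexity!r}") from None

config = select_routing("multi-hop")
```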

10. Citation

If this implementation is useful for your research, please cite the original MetaRAG paper:

@article{zhou2024metacognitive,
  title   = {Metacognitive Retrieval-Augmented Large Language Models},
  author  = {Zhou, Yujia and Liu, Zheng and Jin, Jiajie and Nie, Jian-Yun and Dou, Zhicheng},
  journal = {arXiv preprint arXiv:2402.11626},
  year    = {2024},
  url     = {https://arxiv.org/abs/2402.11626}
}

About

📚 Metacognitive Retrieval-Augmented Generation with Adaptive Strategy Selection. Improved version of the paper "Metacognitive Retrieval-Augmented Large Language Models".
