An open knowledge evaluation benchmark for Large Language Models.
BeQu systematically measures how well LLMs can elicit factual knowledge, evaluating both precision (are the elicited triples correct?) and recall (do they cover reference knowledge?) across 20 models, 10,000 entities, and 5 experimental scenarios.
Traditional NLP benchmarks rely on fixed question-answering pairs. BeQu goes further: instead of asking specific questions, it prompts LLMs to freely elicit structured knowledge about entities, then measures coverage and correctness against a reference corpus built from Wikipedia and the web.
Key capabilities:
- Prompt LLMs to triples (subject, predicate, object) for any set of Wikipedia entities
- Build reference corpora from Wikipedia articles and web search results
- Evaluate elicited triples via RAG-based LLM verification (precision) and coverage checking (recall)
- Compare results across models, domains, and prompt strategies
BeyondQuestions/
├── BeQu.py # Main CLI entry point
├── experiment_tracker.py # Experiment deduplication
├── combine_results.py # Result aggregation
│
├── ELICITATION/ # Knowledge elicitation pipeline
│ ├── main.py # OpenAI Batch API orchestration
│ ├── local.py # OpenRouter / local API wrapper
│ ├── gpt_kbc.py # GPT KBC runner (workspace isolation)
│ ├── template_utils.py # Jinja2 prompt template loading
│ └── templates/prompts/ # Prompt templates
│ ├── GPTKB.jinja # Direct triple elicitation
│ ├── GPTKB_2x_repetition.jinja
│ ├── GPTKB_3x_repetition.jinja
│ ├── wikidata_schema.jinja
│ ├── schemaorg_schema.jinja
│ └── LMCRAWL/ # Two-step elicitation
│ ├── predicates.jinja # Step 1: elicit predicates
│ └── objects.jinja # Step 2: elicit objects
│
├── EVALUATION/ # Triple evaluation pipeline
│ ├── main.py # Evaluation coordinator
│ ├── request.py # LLM verification (JSON schema output)
│ ├── process_request.py # Batch evaluation
│ ├── wikipedia_triple_extractor.py # Reference construction
│ ├── rag_retriever.py # Dense embedding retrieval
│ ├── retrieve_passages.py # Passage utilities
│ ├── wikipedia_utils.py # Wikipedia API wrapper
│ ├── wikidata_utils.py # Wikidata API wrapper
│ └── network_utils.py # Retry decorator
│
├── ENTITY LISTS/ # Test entity sets
│ ├── RANDOM/ # 10K random Wikipedia entities
│ ├── DOMAINS/ # Entities grouped by domain
│ └── POPULARITY/ # Entities grouped by page-view popularity
│
├── REFERENCE CORPUS/
│ ├── random/GT.json.gz
│ ├── domains/GT.json.gz
│ └── popularity/GT.json.gz
│
├── ELICITED TRIPLES/ # Model outputs (20 models)
└── RESULTS/ # Evaluation results
└── combined_results_manual.csv
LLMs are prompted to generate knowledge for each entity using configurable Jinja2 templates.
For each entity, the pipeline:
- Fetches the full Wikipedia article text
- Retrieves the top-20 web search results via Brave Search API
- Uses an LLM to extract triples from all sources
- Caches everything (text, URLs, triples) in a json file
Precision (RAG-based):
- For each elicited triple, concatenate subject + predicate + object as a query
- Retrieve the top-10 most similar passages from the reference corpus using dense embeddings (
text-embedding-3-small) - A judge LLM classifies: entailment / contradiction / neutral
- Precision = fraction of triples classified as entailment
Recall (Coverage Check):
- For each ground truth triple, check if any elicited triple covers it
- Verify with the judge LLM using all ground truth sources
- Recall = fraction of ground truth triples covered
Result categories per triple:
| Category | Meaning |
|---|---|
| Entailment | Verified as factually correct |
| Contradiction | Contradicts reference |
| Neutral | Truth cannot be determined |
BeQu supports 6 configurations to test different aspects of LLM knowledge:
| Setting | Description |
|---|---|
random |
10,000 random Wikipedia entities (baseline) |
domains |
1,000 entities across Wikipedia topic domains |
ranges |
Vary the number of expected triples per entity |
non_existing |
Made-up entities to test hallucination robustness |
multiple_prompts |
Run multiple elicitation templates per model |
reasoning_effort |
Vary LLM reasoning effort (low / medium / high) |
BeQu has been evaluated on 20 models:
Commercial:
| Provider | Model |
|---|---|
| OpenAI | GPT-5, GPT-oss-120b |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 |
| Gemini 3.1 Pro, Gemini 3 Flash, Gemma 3 (4B/12B/27B) | |
| xAI | Grok 4.1 Fast |
Open / API-Served:
| Provider | Model |
|---|---|
| Meta | Llama-4-Scout-17B-16E-Instruct |
| Mistral AI | mistral-large-2512 |
| DeepSeek | DeepSeek-V3.2 |
| Qwen | Qwen 3.5-27B |
| Moonshot | Kimi-K2.5 |
| MiniMax | MiniMax-M2.5 |
- Python 3.8+
git clone [ANONYMIZED]
cd BeyondQuestions
# Pull large files (reference corpus, detailed results)
git lfs pull
pip install openai anthropic requests pandas loguru tqdm \
sentence-transformers torch jinja2 fireAll workflows are driven through the BeQu.py CLI.
Construct the reference corpus from Wikipedia + web search without running any model:
python BeQu.py \
--entities_file_path "path/to/json/file" \
--ground_truth_dir_path "path/to/output" \
--build_ground_truth_onlypython BeQu.py \
--entities_file_path "path/to/json/file" \
--api [YOUR_API] \
--model_elicitation [MODEL] \
--prompt_template_dir_elicitation ELICITATION/templates/prompts/ \
--reasoning_effort_elicitation [EFFORT] \
--elicited_triples_dir "ELICITED TRIPLES" \
--ground_truth_dir_path "path/to/reference" \
--results_dir_path RESULTS \
--llm_judge [JUDGE] \
--sample_size 500 \
--seed 42python BeQu.py \
--skip_elicitation \
--elicited_triples_dir "ELICITED TRIPLES" \
--ground_truth_dir_path "path/to/reference" \
--results_dir_path RESULTSpython BeQu.py \
--evaluate_all_models \
--elicited_triples_dir "ELICITED TRIPLES" \
--results_dir_path RESULTSpython combine_results.py --results_dir RESULTS[
{"title": "Albert Einstein", "length": 5974},
{"title": "Marie Curie", "length": 4801}
]subject,predicate,object
Albert_Einstein,instanceOf,physicist
Albert_Einstein,birthPlace,Ulm
Albert_Einstein,fieldOfWork,theoretical physics
Model,Metric,Setting,Total #Triples,Entailment,Contradiction,Neutral
claude-sonnet-4-6,Precision (RAG),random,500,320,15,165
Model,Setting,Domain,Entailment_F1,Entailment_Precision,Entailment_Recall
claude-sonnet-4-6,random,,0.85,0.87,0.83
Aggregated results are available in RESULTS/combined_results_manual.csv.
Individual per-model evaluation outputs are stored under RESULTS/{model}/{setting}/ and include:
| File | Contents |
|---|---|
results.csv |
Aggregate precision / recall metrics |
results_by_category.csv |
Metrics broken down by Wikipedia domain |
results_by_popularity.csv |
Metrics broken down by popularity bucket |
results_detailed.csv |
Per-triple entailment labels (Git LFS) |
- Experiment deduplication: Configuration hashes (SHA-256) are stored in
.experiment_tracking.jsonso runs are never repeated accidentally. - Memory efficiency: Evaluation results are streamed to CSV incrementally rather than held in memory.
- Parallelism: Several models can be evaluated concurrently; elicitation threads are configurable.
- Reproducibility: Fix
--seedand--sample_sizeto reproduce any published result exactly.
This project is released under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. You are free to share and adapt the material for any purpose, provided appropriate credit is given.
Please cite our work: [TBD]