No existing benchmark tests whether an LLM + MCP tool can trace the ripple effects of a breaking code change across multiple repositories. We built our own.
We assembled 82,894 source files across 25 Kubernetes and observability repositories, wrote 100 cross-repo impact questions, and ran them against 11 LLMs via the ByteBell MCP knowledge graph — consuming over 1.7 billion tokens (~$350 USD total).
python3 -m venv .
source bin/activate
pip install requests python-dotenv psutil

Create .env with your OpenRouter API key:
OPENROUTER_API_KEY=sk-or-v1-your-key-here
Edit mcp_config.json to point to your ByteBell MCP server:
{
"mcpServers": {
"bytebell": {
"url": "http://your-server:3100/mcp?access_token=your_token"
}
}
}

Edit models.json to configure which models to evaluate and their pricing.
The primary dataset lives in results/KubeCluster40/ and contains 45 questions across two categories:
| Prefix | Count | Description |
|---|---|---|
| OBS | 34 | Observability — interface/function changes in Prometheus, OpenTelemetry Collector, Thanos, Grafana, Jaeger, Loki, Mimir, and Tempo that ripple across the observability stack |
| MIXED | 11 | Mixed infrastructure — breaking changes to shared Kubernetes interfaces (e.g. SharedInformer, Querier) that affect both infrastructure tools and observability platforms |
Each question asks: "If you add/modify method X on interface Y in repo Z, which files across repos A, B, C, D would need to implement or adapt to this change?"
Each question_<ID>/ folder contains:
question_MIXED_TC001/
question.json # The question text
anthropic_claude-haiku-4.5.json # Claude Haiku's answer + tool calls + cost
deepseek_deepseek-chat-v3.1.json # DeepSeek's answer
google_gemini-3-flash-preview.json # Gemini Flash's answer
openai_gpt-5.1-codex-max.json # GPT-5.1 Codex Max's answer
openai_gpt-5.1-codex-mini.json # GPT-5.1 Codex Mini's answer
x-ai_grok-code-fast-1.json # Grok Code's answer
xiaomi_mimo-v2-flash.json # MiMo's answer
claude_opus_aicopilot.json # Claude Opus (via AICopilot) answer
claude_opus_4.6_direct_data_access.json # Ground truth (direct repo access via Claude Code)
ground_truth.json # Copy of above, used as reference by the judge
evaluation.json # Model metadata + relevance scores
analysis.json # LLM judge comparative scores
Each <model>.json contains the full result of running that model against the question via the ByteBell MCP knowledge graph:
{
"model": "openai/gpt-5.1-codex-max",
"answer": "## Architecture Overview\n...",
"llm_condensed_answer": "SUMMARY: ...\nFILES:\n- repo/path — reason\n...",
"cost": {
"input_tokens": 983242,
"output_tokens": 8207,
"total_tokens": 991449,
"cost_usd": 1.311123
},
"status": "success",
"latency_seconds": 137.96,
"tool_calls_count": 24,
"agent_steps": 25,
"tool_calls": [...]
}

- answer — the model's full verbose response (often 3000-6000 tokens)
- llm_condensed_answer — a structured summary extracted by a cheap model (see below)
- cost — token counts and USD cost for this single question
Raw model answers are verbose — tables, architecture overviews, detailed explanations. Before sending answers to the LLM judge, each answer is condensed using the smoke_test_model (currently xiaomi/mimo-v2-flash, the cheapest model at $0.09/M input) into a structured format:
SUMMARY: The change to SharedInformer affects 4 repos. ArgoCD and cert-manager
have custom informer factories that implement the interface directly...
FILES:
- argo-cd/pkg/client/informers/externalversions/factory.go — implements SharedInformerFactory
- cert-manager/internal/informers/core_filteredsecrets.go — dual-cache wrapper implements SharedInformer
- cert-manager/pkg/client/informers/externalversions/factory.go — generated informer factory
- opentelemetry-operator/main.go — uses controller-runtime which wraps SharedInformer
...
This condensation happens in two places:
- At generation time (mcp_context_generation.py) — stored as llm_condensed_answer in each model result file
- At evaluation time (evaluate.py) — backfills any missing condensations before sending to the judge
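The condensed format above is simple enough to parse mechanically. As an illustration, here is a minimal sketch of splitting it back into its SUMMARY and FILES parts — parse_condensed_answer is a hypothetical helper, not the actual code in evaluate.py:

```python
import re

def parse_condensed_answer(text: str) -> dict:
    """Split a condensed answer into SUMMARY text and FILES entries.

    Hypothetical helper for illustration; the real parsing in
    evaluate.py may differ.
    """
    m = re.search(r"SUMMARY:\s*(.*?)(?=\nFILES:|\Z)", text, re.S)
    summary = m.group(1).strip() if m else ""
    files = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            # Entries look like "- repo/path — reason"
            parts = re.split(r"\s+[—-]\s+", line[2:], maxsplit=1)
            files.append({"path": parts[0],
                          "reason": parts[1] if len(parts) > 1 else ""})
    return {"summary": summary, "files": files}
```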
The LLM judge (judge_model in models.json, currently anthropic/claude-sonnet-4.6) scores each model against four weighted criteria (50/20/20/10). Before scoring, every file path in each model's answer is verified against the actual filesystem in dataset/Kubecluster/ — the judge sees exactly which paths exist and which are hallucinated:
| Weight | Criteria | What It Measures |
|---|---|---|
| 50% | Ground Truth Recall | What fraction of the ground truth expected files did the model find? |
| 20% | Extra Correct Files | Bonus for listing additional files beyond ground truth that actually exist on disk and are relevant (test files, configs, etc. all count) |
| 20% | Reasoning Quality | Did the model explain why each file is affected — interface implementation, dependency chain, data flow? |
| 10% | Hallucination Penalty | Deduction for file paths that do not exist on disk. Zero hallucinated paths = full points. |
Each model is scored independently as an accuracy percentage (0-100%). Scores are not normalized across models — multiple models can receive the same score.
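The weighting itself reduces to a simple linear combination. A sketch, assuming each criterion has already been graded as a 0.0-1.0 fraction (the judge model's actual rubric for producing those fractions is more nuanced):

```python
def judge_score(recall: float, extra_correct: float,
                reasoning: float, hallucination_free: float) -> float:
    """Combine the four judging criteria with the 50/20/20/10 weights.

    Inputs are 0.0-1.0 fractions; hallucination_free is 1.0 when zero
    hallucinated paths were found. Returns a 0-100 accuracy percentage.
    """
    score = (0.50 * recall
             + 0.20 * extra_correct
             + 0.20 * reasoning
             + 0.10 * hallucination_free)
    return round(100 * score, 1)
```

For example, a model that recalls 80% of the ground truth files, gets half marks on extras and reasoning, and hallucinates nothing would score 70.0.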
After evaluation, analysis_summary.json contains the final leaderboard with accuracy vs. cost:
Model | Avg % | Cost $ | %/$ | Judged
---------------------------------------------------------+---------+------------+--------+-------
anthropic/claude-sonnet-4.6 + ByteBell MCP | 77.4% | $ 159.3442 | 0.49 | 45
anthropic/claude-haiku-4.5 + ByteBell MCP | 70.2% | $ 52.2822 | 1.34 | 55
openai/gpt-5.1-codex-max + ByteBell MCP | 70.0% | $ 50.1175 | 1.40 | 45
openai/gpt-5.2-codex + ByteBell MCP | 66.3% | $ 7.8490 | 8.45 | 3
google/gemini-3-flash-preview + ByteBell MCP | 62.2% | $ 13.9109 | 4.47 | 45
deepseek/deepseek-chat-v3.1 + ByteBell MCP | 57.8% | $ 3.5141 | 16.44 | 45
x-ai/grok-code-fast-1 + ByteBell MCP | 56.8% | $ 6.1030 | 9.31 | 45
xiaomi/mimo-v2-flash + ByteBell MCP | 55.5% | $ 4.4196 | 12.56 | 45
openai/gpt-5.1-codex-mini + ByteBell MCP | 53.5% | $ 11.4772 | 4.66 | 45
minimax/minimax-m2.5 + ByteBell MCP | 52.5% | $ 11.7227 | 4.48 | 43
claude-opus-4/aicopilot (no MCP) | 32.7% | $ 0.0000 | — | 40
Key metrics per model:
- Avg % — mean independent accuracy percentage across all judged questions (higher = better)
- Cost $ — total USD spent across all questions (input + output tokens)
- %/$ — accuracy percentage per dollar spent (higher = more accuracy per dollar)
- Judged — number of questions where the model produced a scoreable answer
The other models in this benchmark answer questions by calling MCP tools to search a knowledge graph. To establish a ground truth baseline, we took a fundamentally different approach — giving the model direct access to the raw source code.
We opened Claude Code (Anthropic's CLI agent) on a machine with all 25 repositories cloned locally in dataset/Kubecluster/. Claude Code had full filesystem access to all 82,894 source files. For each of the 45 questions, we asked it to search the actual codebases using grep, glob, and file reads to identify every affected file, then write a structured answer with:
- An architecture overview explaining the interface/type and its implementations
- A detailed analysis with specific file paths, line numbers, and code patterns
- An expected_files array listing every affected file with its repo, path, and reason for inclusion
This produces claude_opus_4.6_direct_data_access.json in each question folder — one per question, 45 total. These files are then copied to ground_truth.json and used as the authoritative reference when the LLM judge scores other models.
Key difference from MCP-based models: The MCP models search a pre-built knowledge graph with limited context windows. The ground truth was generated with direct access to every file in every repo — no knowledge graph abstraction, no token limits on search results, no tool call overhead. This makes it as close to a human expert's answer as an automated process can get.
Ground truth file format:
{
"model": "anthropic/claude-opus-4.6-direct-data-access",
"answer": "## Architecture Overview\n...",
"llm_condensed_answer": "SUMMARY: ...\nFILES:\n- repo/path — reason\n...",
"expected_files": [
{"repo": "prometheus", "files": ["storage/interface.go"], "reason": "Querier interface definition"},
{"repo": "thanos", "files": ["pkg/query/querier.go"], "reason": "Thanos querier Select() implementation"}
],
"cost": {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0, "cost_usd": 0.0},
"status": "success"
}

Cost fields are zero because the generation happened locally via Claude Code, not through the OpenRouter API.
question.json ──► mcp_context_generation.py ──► <model>.json (full answer + condensed)
│
▼
evaluate.py --force
│
┌──────────────┼──────────────┐
▼ ▼ ▼
evaluation.json analysis.json analysis_summary.json
(model metadata (judge scores (leaderboard with
+ relevance) per question) cost vs accuracy)
Downloads all 25 Kubernetes and observability repositories that make up the benchmark dataset into dataset/Kubecluster/.
Each repository is shallow-cloned (--depth 1) to minimize disk usage. Repos that already exist locally are skipped, so the script is safe to re-run — it will only fetch what's missing. On failure, it reports which repos could not be cloned and exits with code 1.
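The shallow-clone-with-skip behavior can be sketched as follows — clone_repo is a hypothetical helper for illustration, not the actual function in download_dataset.py:

```python
import subprocess
from pathlib import Path

def clone_repo(org_repo: str, dest_root: str = "dataset/Kubecluster") -> bool:
    """Shallow-clone one GitHub repo, skipping it if already present.

    Sketch of the behavior described above; returns True on success
    or skip, False if git exits non-zero.
    """
    name = org_repo.split("/")[1]
    dest = Path(dest_root) / name
    if dest.exists():
        print(f"skip {name} (already cloned)")
        return True
    result = subprocess.run(
        ["git", "clone", "--depth", "1",
         f"https://github.com/{org_repo}.git", str(dest)])
    return result.returncode == 0
```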
The 25 repositories:
| Repository | GitHub |
|---|---|
| argo-cd | argoproj/argo-cd |
| autoscaler | kubernetes/autoscaler |
| cert-manager | cert-manager/cert-manager |
| cilium | cilium/cilium |
| crossplane | crossplane/crossplane |
| external-dns | kubernetes-sigs/external-dns |
| external-secrets | external-secrets/external-secrets |
| flux2 | fluxcd/flux2 |
| gatekeeper | open-policy-agent/gatekeeper |
| grafana | grafana/grafana |
| helm | helm/helm |
| ingress-nginx | kubernetes/ingress-nginx |
| istio | istio/istio |
| jaeger | jaegertracing/jaeger |
| karpenter | aws/karpenter-provider-aws |
| kubernetes | kubernetes/kubernetes |
| kustomize | kubernetes-sigs/kustomize |
| loki | grafana/loki |
| mimir | grafana/mimir |
| opentelemetry-collector | open-telemetry/opentelemetry-collector |
| opentelemetry-collector-contrib | open-telemetry/opentelemetry-collector-contrib |
| opentelemetry-operator | open-telemetry/opentelemetry-operator |
| prometheus | prometheus/prometheus |
| tempo | grafana/tempo |
| thanos | thanos-io/thanos |
python3 src/download_dataset.py

No arguments. Clones everything into dataset/Kubecluster/.
Pure MCP stress test (no LLM involved). Hammers the graph_search tool with concurrent threads to find the maximum concurrency the server can handle within a given RAM budget.
How it works:
The test runs in two phases:
- Phase 1 — Discovery: Starts at 5 threads and increases by 5 each round (10-second probe per round). Each thread repeatedly calls graph_search on the MCP server. A background ServerMonitor thread samples the server process's CPU% and RSS memory every second using psutil. If the server's RSS exceeds --server-mem-limit or any call errors out, discovery stops and the last safe thread count is recorded.
- Phase 2 — Stability Soak: Runs the maximum safe thread count for 30 seconds of sustained load to confirm the server stays healthy under continuous pressure. Reports per-thread latency breakdown (avg, min, max, p50, p99) and server resource usage (CPU cores, peak RSS, samples).
The monitor finds the MCP server process by looking up which PID is listening on the MCP port via lsof.
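The port-to-PID lookup can be sketched like this — the lsof flags shown are one plausible invocation, not necessarily the script's exact one:

```python
import subprocess

def parse_lsof_pids(output: str):
    """lsof -t prints one PID per line; take the first, if any."""
    pids = output.split()
    return int(pids[0]) if pids else None

def find_server_pid(port: int):
    """Look up which PID is listening on the given TCP port via lsof.

    Sketch of the monitor's discovery step; returns None if nothing
    is listening (or lsof prints no PIDs).
    """
    out = subprocess.run(
        ["lsof", "-t", f"-iTCP:{port}", "-sTCP:LISTEN"],
        capture_output=True, text=True)
    return parse_lsof_pids(out.stdout)
```

Once the PID is known, psutil.Process(pid).memory_info().rss gives the RSS sample the monitor compares against --server-mem-limit.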
python3 src/mcp_stress.py \
--mcp-config mcp_config.json \
--server-mem-limit 4000 \
  --max-duration 300

| Flag | Short | Default | Description |
|---|---|---|---|
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--server-mem-limit` | | 4000 | Max server RSS in MB |
| `--start-threads` | | 5 | Initial thread count |
| `--thread-step` | | 5 | Threads to add each discovery round |
| `--probe-duration` | | 10 | Seconds per discovery probe round |
| `--soak-duration` | | 30 | Seconds for stability soak phase |
| `--query` | | SharedInformer | Search query string |
| `--channels` | | classes imports | Search channels |
| `--timeout` | | 30 | Read timeout per MCP call in seconds |
| `--sample-interval` | | 1.0 | Server monitor sample interval in seconds |
| `--max-duration` | | 300 | Hard time limit for the entire test |
End-to-end smoke test that auto-discovers the maximum concurrent thread count the LLM can handle, then answers ALL questions at that concurrency while monitoring server memory.
How it works:
- Phase 1 — Discovery: Starts at 3 threads, bumps +1 each round. Each thread picks a random question from the provided file and sends it through the full LLM agent loop (the LLM calls MCP tools to search the codebase, gets results, and formulates an answer). If any thread errors (LLM timeout, refusal, crash), discovery stops and the last clean thread count is recorded.
- Phase 2 — Answer All: Takes every question from the file, shuffles them, and batches them through the max safe thread count. Each batch runs in parallel. Progress is logged per batch. The test stops if the server memory limit is breached or errors appear.
Outputs a timestamped JSON file with full results including per-thread latency, tokens, cost, and server resource usage.
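The shuffle-and-batch phase can be sketched as follows, with run_one standing in for the real per-question agent call (hypothetical interface):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions: list, max_threads: int, run_one) -> list:
    """Phase 2 sketch: shuffle the questions, then run them through
    the max safe thread count in parallel batches.

    run_one(question) is a placeholder for the full LLM agent loop.
    """
    shuffled = questions[:]
    random.shuffle(shuffled)
    results = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for start in range(0, len(shuffled), max_threads):
            batch = shuffled[start:start + max_threads]
            results.extend(pool.map(run_one, batch))
            print(f"batch done: {len(results)}/{len(shuffled)}")
    return results
```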
python3 src/smoke_test.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json
# With a time cap:
python3 src/smoke_test.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json \
  --max-duration 300

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions` | `-q` | required | Path to questions JSON file |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--start-threads` | | 3 | Initial thread count for discovery |
| `--thread-step` | | 1 | Threads to add each round |
| `--max-duration` | | unlimited | Hard time limit in seconds |
| `--model` | | from models.json | OpenRouter model name |
| `--api-key` | | env | OpenRouter API key |
| `--max-steps` | | 25 | Max agent steps per question |
| `--timeout` | | 120 | Read timeout per MCP call in seconds |
| `--server-mem-limit` | | 4000 | Max server RSS in MB (stops test if exceeded) |
| `--seed` | | none | Random seed for reproducibility |
Core infrastructure. Provides the MCPClient, LLMClient, and run_agent loop used by all other scripts. Also works standalone to run questions from a JSON file against a single model via MCP.
The agent loop sends a detailed system prompt instructing the LLM to plan, discover files across repos using MCP tools (server_info, list_knowledge, graph_search, retrieve_file), follow dependency chains, and produce a structured Markdown answer with an architecture overview, detailed analysis, and a table of all relevant files. Tool calls within a single step execute in parallel.
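The loop's shape can be sketched like this — llm.step and mcp.call are hypothetical interfaces standing in for LLMClient and MCPClient, and the message format is simplified from real OpenAI function-calling payloads:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(llm, mcp, question: str, max_steps: int = 25):
    """Minimal sketch of the agent loop described above.

    llm.step(messages) is assumed to return either tool_calls or a
    final answer; mcp.call(name, args) executes one MCP tool. Both
    are stand-ins, not the actual evals.py API.
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm.step(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # final answer produced
        # Tool calls within a single step execute in parallel
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(
                lambda tc: mcp.call(tc["name"], tc["args"]),
                reply["tool_calls"]))
        for tc, res in zip(reply["tool_calls"], results):
            messages.append({"role": "tool", "name": tc["name"],
                             "content": res})
    return None  # hit --max-steps without a final answer
```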
python3 src/evals.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json \
  --model deepseek/deepseek-chat-v3.1

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions` | `-q` | required | Path to questions JSON file |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--output-dir` | `-o` | results | Output directory for results |
| `--data-dir` | `-d` | results | Directory for per-question result files |
| `--model` | | deepseek/deepseek-chat-v3.1 | OpenRouter model name |
| `--api-key` | | env | OpenRouter API key |
| `--max-steps` | | 40 | Max agent steps per question |
| `--timeout` | | 300 | Read timeout per MCP call in seconds |
| `--delay` | | 1.0 | Delay between questions in seconds |
| `--start` | | 0 | Start index (slice questions) |
| `--end` | | all | End index (slice questions) |
| `--verbose` | `-v` | off | Enable verbose logging |
Reads questions from a folder of question_*/question.json files and runs them against every model in models.json in parallel (one thread pool per model). Results are saved back into the same question folders. Cached successful answers are skipped on re-runs — only failed/blank ones are retried. Stops immediately on 402 payment errors.
After all models finish, prints a comparison table with success count, errors, average latency, tokens, cost, and wall-clock time per model.
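The cache check that decides whether a model still needs to answer a question can be sketched as follows — needs_run is a hypothetical helper illustrating the rule (exists, parses, succeeded, non-blank answer), not the script's actual code:

```python
import json
from pathlib import Path

def needs_run(question_dir: Path, model: str) -> bool:
    """Return True if this model still needs to answer this question.

    A result counts as cached only if the file exists, parses as
    JSON, has status "success", and has a non-blank answer.
    """
    path = question_dir / (model.replace("/", "_") + ".json")
    if not path.exists():
        return True
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return True
    return data.get("status") != "success" or not data.get("answer", "").strip()
```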
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
--mcp-config mcp_config.json \
--threads 3
# Only run specific models:
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
--mcp-config mcp_config.json \
  --models "xiaomi/mimo-v2-flash" "deepseek/deepseek-chat-v3.1"

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions-dir` | `-q` | required | Path to folder containing question_*/question.json files |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--models` | | all from models.json | Specific model(s) to run |
| `--threads` | `-t` | 3 | Concurrent threads per model |
| `--max-steps` | | 25 | Max agent steps per question |
| `--timeout` | | 120 | Read timeout per MCP call in seconds |
| `--api-key` | | env | OpenRouter API key |
| `--num-questions` | `-n` | all | Number of questions to run |
| `--seed` | | none | Random seed for reproducibility |
Evaluates LLM answers by checking whether the file paths mentioned in each answer physically exist in dataset/Kubecluster/. For each question folder in the given results directory, it reads every model answer file, extracts file paths from markdown tables and inline backtick references, resolves repo names (handling aliases like "argocd" -> "argo-cd"), and checks the filesystem.
Computes two scores per model answer:
- relevance_score (0-10, higher = better): Combines file accuracy (fraction of mentioned files that exist), answer substance (length/structure), and file coverage (number of real files found).
- hallucination_score (0-10, higher = worse): Fraction of mentioned file paths that don't physically exist in the dataset.
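As an illustration of the hallucination check, here is a minimal sketch that pulls backtick-quoted repo/path references out of an answer, resolves one alias, and scores the fraction missing from disk — the regex, alias map, and function name are simplifications, not evaluate.py's actual implementation:

```python
import re
from pathlib import Path

# Illustrative alias map; the real one handles more repo name variants
REPO_ALIASES = {"argocd": "argo-cd"}

def hallucination_score(answer: str,
                        dataset_root: str = "dataset/Kubecluster") -> float:
    """Score the fraction of mentioned file paths that do not exist
    on disk: 0.0 = all real, 10.0 = all hallucinated.
    """
    paths = re.findall(r"`([\w.-]+/[\w./-]+)`", answer)
    if not paths:
        return 0.0
    missing = 0
    for p in paths:
        repo, _, rest = p.partition("/")
        repo = REPO_ALIASES.get(repo, repo)  # e.g. "argocd" -> "argo-cd"
        if not (Path(dataset_root) / repo / rest).exists():
            missing += 1
    return round(10 * missing / len(paths), 1)
```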
Outputs per question folder:
- evaluation.json — file-existence scores per model
- analysis.json — LLM judge comparison: semantic relevance score (1-10) with justification per model (requires OPENROUTER_API_KEY)
Aggregated output:
- analysis_summary.json — written to the results directory root; contains a per-model summary table (avg LLM judge score, questions judged) and a per-question breakdown with every model's score and justification. The judge model is configured via judge_model in models.json.
python3 src/evaluate.py --results-dir results
# Or for a specific subfolder:
python3 src/evaluate.py --results-dir results/KubeCluster40

| Flag | Short | Default | Description |
|---|---|---|---|
| `--results-dir` | `-r` | required | Path to results folder containing question_*/ subfolders |
Aggregates per-model scores from all evaluation.json and analysis.json files into a single metrics.json. Reports average relevance, average hallucination, average LLM judge score, file accuracy percentage, questions answered, and questions errored per model, sorted by LLM judge score descending.
python3 src/aggregate_metrics.py --results-dir results
# Or for a specific subfolder:
python3 src/aggregate_metrics.py --results-dir results/KubeCluster40

| Flag | Short | Default | Description |
|---|---|---|---|
| `--results-dir` | `-r` | required | Path to results folder containing question_*/ subfolders |
Counts errored or blank answer files per model across all results.
python3 src/count_errors.py

Classifies and counts error types (402 Payment Required, 429 Rate Limited, Timeout, 5xx Server Error, blank answers, etc.) across all failed answer files, with per-model breakdown.
python3 src/error_analysis.py

One-time script that replaces CRW_TC042-TC071 in cross_repo_whole.json with 30 new observability cross-repo questions (OBS_TC001-OBS_TC030) covering Prometheus, OpenTelemetry Collector, Thanos, Grafana, Jaeger, Loki, Mimir, and Tempo.
python3 src/replace_questions.py

Questions live in folders like results/KubeCluster40/ with per-question subfolders. Model results are saved back into the same folders:
results/KubeCluster40/
question_MIXED_TC001/
question.json # The question + expected answer + expected files
xiaomi_mimo-v2-flash.json # Model answer + tool calls + cost
deepseek_deepseek-chat-v3.1.json # Another model's answer
openai_gpt-oss-120b.json # One file per model in models.json
...
evaluation.json # File-existence scores per model
analysis.json # LLM-based relevance comparison per model
question_MIXED_TC002/
question.json
xiaomi_mimo-v2-flash.json
deepseek_deepseek-chat-v3.1.json
...
evaluation.json
analysis.json
...
20250601_120000-mcp_context_generation.json # Timestamped run summary
metrics.json # Aggregate scores (after running aggregate_metrics.py)
Each question_<ID>/ folder contains:
| File | Description |
|---|---|
| `question.json` | The question text, expected answer, expected files, and source repo |
| `<model>.json` | One per model — the model's answer, tool calls, token usage, cost, latency, and status |
| `evaluation.json` | File-existence check results: relevance score (0-10), hallucination score (0-10), found/missing files per model |
| `analysis.json` | LLM judge comparison: semantic relevance score (1-10) with justification per model |
{
"id": "SA_TC001",
"question": "If the ServiceAccount struct in kubernetes/...",
"expected_answer": "The change affects...",
"expected_files": [
{"repo": "istio", "files": ["pilot/pkg/..."], "reason": "..."}
],
"repo": "kubernetes"
}

{
"model": "deepseek/deepseek-chat-v3.1",
"answer": "## Architecture Overview\n...",
"cost": {
"input_tokens": 45000,
"output_tokens": 3200,
"total_tokens": 48200,
"cost_usd": 0.0141
},
"status": "success",
"latency_seconds": 42.5,
"tool_calls_count": 12,
"agent_steps": 8,
"tool_calls": [...]
}

For each question, analysis.json contains an LLM-generated comparison of each model's answer against the expected answer:
{
"question_id": "SA_TC001",
"question": "...",
"expected_answer": "...",
"model_analyses": [
{
"model": "deepseek/deepseek-chat-v3.1",
"relevance": 8,
"justification": "The answer correctly identifies 6 of 7 expected files..."
}
]
}

1. Run mcp_context_generation.py -q results/KubeCluster40 → question_*/<model>.json
2. Run evaluate.py -r results/KubeCluster40 → question_*/evaluation.json + analysis.json
3. Run aggregate_metrics.py -r results/KubeCluster40 → metrics.json
100 test cases in cross_repo_whole.json:
| Prefix | Count | Description |
|---|---|---|
| CRW | 41 | Cross-Repo Wide — struct/interface/function changes in core Kubernetes that break downstream consumers |
| OBS | 30 | Observability — cross-repo impact across Prometheus, Thanos, Mimir, Loki, Tempo, Jaeger, Grafana, and OTel Collector |
| KM | 14 | Kubernetes Modification — struct/interface changes in Kubernetes core packages |
| SA | 13 | Source Across — breaking changes from multiple source repos with broad cross-repo impact |
| NK | 2 | Non-Kubernetes — changes originating in non-core repos (kustomize, helm) |
To add more queries for the ByteBell MCP to resolve, create new question folders following the same format as results/KubeCluster40/question_*/question.json.
- Pick a unique question ID (e.g. CUSTOM_TC001).
- Create a folder inside your questions directory: question_CUSTOM_TC001/
- Add a question.json inside it with this format:
{
"id": "CUSTOM_TC001",
"question": "Your cross-repo impact question here...",
"expected_answer": "Description of the expected impact...",
"expected_files": [
{"repo": "argo-cd", "path": "controller/appcontroller.go", "why": "Reason this file is affected"},
{"repo": "prometheus", "path": "discovery/kubernetes/pod.go", "why": "Reason this file is affected"}
],
"repo": "source-repo-name"
}

- Run mcp_context_generation.py with the path to your folder:
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
  --mcp-config mcp_config.json

You can add any number of question folders. The script discovers all question_*/question.json files automatically and skips questions that already have cached successful answers.
You don't have to use KubeCluster40. You can create any folder with question_*/question.json subfolders and point the script at it:
# Create a new question set
mkdir -p results/ObservabilityQuestions/question_OBS_NEW_001
# Add question.json ...
mkdir -p results/ObservabilityQuestions/question_OBS_NEW_002
# Add question.json ...
# Run against your custom folder
python3 src/mcp_context_generation.py \
--questions-dir results/ObservabilityQuestions \
  --mcp-config mcp_config.json

This lets you maintain separate question sets for different domains, test scenarios, or evaluation runs. Each folder is self-contained — questions and model results all live together under the same directory.
- Connects to the ByteBell MCP server over StreamableHTTP
- Fetches available MCP tools (server_info, list_knowledge, graph_search, graph_traverse, retrieve_file)
- For each question, runs an agent loop:
  - LLM receives the question + MCP tools in OpenAI function-calling format
  - LLM calls tools to search across repos
  - Tool calls execute in parallel via ThreadPoolExecutor
  - Loop continues until the LLM produces a final answer or hits --max-steps
- Results are saved incrementally after each question
- Evaluation checks mentioned file paths against the actual dataset filesystem
- Analysis uses an LLM judge to compare answers against expected answers for semantic relevance
Pure Python — no LangChain, no mcp_use. Direct HTTP calls to OpenRouter and the MCP server.