No existing benchmark tests whether an LLM + MCP tool can trace the ripple effects of a breaking code change across multiple repositories. We built our own.
We assembled 82,894 source files across 25 Kubernetes and observability repositories, wrote 100 cross-repo impact questions, and ran them against 11 LLMs via the ByteBell MCP knowledge graph — consuming over 1.7 billion tokens (~$350 USD total).
python3 -m venv .
source bin/activate
pip install requests python-dotenv psutil

Create .env with your OpenRouter API key:
OPENROUTER_API_KEY=sk-or-v1-your-key-here
Edit mcp_config.json to point to your ByteBell MCP server:
{
"mcpServers": {
"bytebell": {
"url": "http://your-server:3100/mcp?access_token=your_token"
}
}
}

Edit models.json to configure which models to evaluate and their pricing.
The primary dataset lives in results/KubeCluster40/ and contains 45 questions across two categories:
| Prefix | Count | Description |
|---|---|---|
| OBS | 34 | Observability — interface/function changes in Prometheus, OpenTelemetry Collector, Thanos, Grafana, Jaeger, Loki, Mimir, and Tempo that ripple across the observability stack |
| MIXED | 11 | Mixed infrastructure — breaking changes to shared Kubernetes interfaces (e.g. SharedInformer, Querier) that affect both infrastructure tools and observability platforms |
Each question asks: "If you add/modify method X on interface Y in repo Z, which files across repos A, B, C, D would need to implement or adapt to this change?"
Each question_<ID>/ folder contains:
question_MIXED_TC001/
question.json # The question text
anthropic_claude-haiku-4.5.json # Claude Haiku's answer + tool calls + cost
deepseek_deepseek-chat-v3.1.json # DeepSeek's answer
google_gemini-3-flash-preview.json # Gemini Flash's answer
openai_gpt-5.1-codex-max.json # GPT-5.1 Codex Max's answer
openai_gpt-5.1-codex-mini.json # GPT-5.1 Codex Mini's answer
x-ai_grok-code-fast-1.json # Grok Code's answer
xiaomi_mimo-v2-flash.json # MiMo's answer
claude_opus_aicopilot.json # Claude Opus (via AICopilot) answer
claude_opus_4.6_direct_data_access.json # Ground truth (direct repo access via Claude Code)
ground_truth.json # Copy of above, used as reference by the judge
evaluation.json # Model metadata + relevance scores
analysis.json # LLM judge comparative scores
Each <model>.json contains the full result of running that model against the question via the ByteBell MCP knowledge graph:
{
"model": "openai/gpt-5.1-codex-max",
"answer": "## Architecture Overview\n...",
"llm_condensed_answer": "SUMMARY: ...\nFILES:\n- repo/path — reason\n...",
"cost": {
"input_tokens": 983242,
"output_tokens": 8207,
"total_tokens": 991449,
"cost_usd": 1.311123
},
"status": "success",
"latency_seconds": 137.96,
"tool_calls_count": 24,
"agent_steps": 25,
"tool_calls": [...]
}

- answer — the model's full verbose response (often 3000-6000 tokens)
- llm_condensed_answer — a structured summary extracted by a cheap model (see below)
- cost — token counts and USD cost for this single question
Raw model answers are verbose — tables, architecture overviews, detailed explanations. Before sending answers to the LLM judge, each answer is condensed using the smoke_test_model (currently xiaomi/mimo-v2-flash, the cheapest model at $0.09/M input) into a structured format:
SUMMARY: The change to SharedInformer affects 4 repos. ArgoCD and cert-manager
have custom informer factories that implement the interface directly...
FILES:
- argo-cd/pkg/client/informers/externalversions/factory.go — implements SharedInformerFactory
- cert-manager/internal/informers/core_filteredsecrets.go — dual-cache wrapper implements SharedInformer
- cert-manager/pkg/client/informers/externalversions/factory.go — generated informer factory
- opentelemetry-operator/main.go — uses controller-runtime which wraps SharedInformer
...
This condensation happens in two places:
- At generation time (mcp_context_generation.py) — stored as llm_condensed_answer in each model result file
- At evaluation time (evaluate.py) — backfills any missing condensations before sending to the judge
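The condensed format above is simple enough to parse mechanically. As an illustration, here is a minimal sketch of splitting it back into its SUMMARY and FILES parts — parse_condensed_answer is a hypothetical helper, not the actual code in evaluate.py:

```python
import re

def parse_condensed_answer(text: str) -> dict:
    """Split a condensed answer into SUMMARY text and FILES entries.

    Hypothetical helper for illustration; the real parsing in
    evaluate.py may differ.
    """
    m = re.search(r"SUMMARY:\s*(.*?)(?=\nFILES:|\Z)", text, re.S)
    summary = m.group(1).strip() if m else ""
    files = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            # Entries look like "- repo/path — reason"
            parts = re.split(r"\s+[—-]\s+", line[2:], maxsplit=1)
            files.append({"path": parts[0],
                          "reason": parts[1] if len(parts) > 1 else ""})
    return {"summary": summary, "files": files}
```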
The LLM judge (judge_model in models.json, currently anthropic/claude-sonnet-4.6) scores each model against four weighted criteria (50/20/20/10). Before scoring, every file path in each model's answer is verified against the actual filesystem in dataset/Kubecluster/ — the judge sees exactly which paths exist and which are hallucinated:
| Weight | Criteria | What It Measures |
|---|---|---|
| 50% | Ground Truth Recall | What fraction of the ground truth expected files did the model find? |
| 20% | Extra Correct Files | Bonus for listing additional files beyond ground truth that actually exist on disk and are relevant (test files, configs, etc. all count) |
| 20% | Reasoning Quality | Did the model explain why each file is affected — interface implementation, dependency chain, data flow? |
| 10% | Hallucination Penalty | Deduction for file paths that do not exist on disk. Zero hallucinated paths = full points. |
Each model is scored independently as an accuracy percentage (0-100%). Scores are not normalized across models — multiple models can receive the same score.
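The weighting itself reduces to a simple linear combination. A sketch, assuming each criterion has already been graded as a 0.0-1.0 fraction (the judge model's actual rubric for producing those fractions is more nuanced):

```python
def judge_score(recall: float, extra_correct: float,
                reasoning: float, hallucination_free: float) -> float:
    """Combine the four judging criteria with the 50/20/20/10 weights.

    Inputs are 0.0-1.0 fractions; hallucination_free is 1.0 when zero
    hallucinated paths were found. Returns a 0-100 accuracy percentage.
    """
    score = (0.50 * recall
             + 0.20 * extra_correct
             + 0.20 * reasoning
             + 0.10 * hallucination_free)
    return round(100 * score, 1)
```

For example, a model that recalls 80% of the ground truth files, gets half marks on extras and reasoning, and hallucinates nothing would score 70.0.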
After evaluation, analysis_summary.json contains the final leaderboard with accuracy vs. cost:
Model | Avg % | Cost $ | %/$ | Judged
---------------------------------------------------------+---------+------------+--------+-------
anthropic/claude-sonnet-4.6 + ByteBell MCP | 77.4% | $ 159.3442 | 0.49 | 45
anthropic/claude-haiku-4.5 + ByteBell MCP | 70.2% | $ 52.2822 | 1.34 | 55
openai/gpt-5.1-codex-max + ByteBell MCP | 70.0% | $ 50.1175 | 1.40 | 45
openai/gpt-5.2-codex + ByteBell MCP | 66.3% | $ 7.8490 | 8.45 | 3
google/gemini-3-flash-preview + ByteBell MCP | 62.2% | $ 13.9109 | 4.47 | 45
deepseek/deepseek-chat-v3.1 + ByteBell MCP | 57.8% | $ 3.5141 | 16.44 | 45
x-ai/grok-code-fast-1 + ByteBell MCP | 56.8% | $ 6.1030 | 9.31 | 45
xiaomi/mimo-v2-flash + ByteBell MCP | 55.5% | $ 4.4196 | 12.56 | 45
openai/gpt-5.1-codex-mini + ByteBell MCP | 53.5% | $ 11.4772 | 4.66 | 45
minimax/minimax-m2.5 + ByteBell MCP | 52.5% | $ 11.7227 | 4.48 | 43
claude-opus-4/aicopilot (no MCP) | 32.7% | $ 0.0000 | — | 40
Key metrics per model:
- Avg % — mean independent accuracy percentage across all judged questions (higher = better)
- Cost $ — total USD spent across all questions (input + output tokens)
- %/$ — accuracy percentage per dollar spent (higher = more accuracy per dollar)
- Judged — number of questions where the model produced a scoreable answer
The other models in this benchmark answer questions by calling MCP tools to search a knowledge graph. To establish a ground truth baseline, we took a fundamentally different approach — giving the model direct access to the raw source code.
We opened Claude Code (Anthropic's CLI agent) on a machine with all 25 repositories cloned locally in dataset/Kubecluster/. Claude Code had full filesystem access to all 82,894 source files. For each of the 45 questions, we asked it to search the actual codebases using grep, glob, and file reads to identify every affected file, then write a structured answer with:
- An architecture overview explaining the interface/type and its implementations
- A detailed analysis with specific file paths, line numbers, and code patterns
- An expected_files array listing every affected file with its repo, path, and reason for inclusion
This produces claude_opus_4.6_direct_data_access.json in each question folder — one per question, 45 total. These files are then copied to ground_truth.json and used as the authoritative reference when the LLM judge scores other models.
Key difference from MCP-based models: The MCP models search a pre-built knowledge graph with limited context windows. The ground truth was generated with direct access to every file in every repo — no knowledge graph abstraction, no token limits on search results, no tool call overhead. This makes it as close to a human expert's answer as an automated process can get.
Ground truth file format:
{
"model": "anthropic/claude-opus-4.6-direct-data-access",
"answer": "## Architecture Overview\n...",
"llm_condensed_answer": "SUMMARY: ...\nFILES:\n- repo/path — reason\n...",
"expected_files": [
{"repo": "prometheus", "files": ["storage/interface.go"], "reason": "Querier interface definition"},
{"repo": "thanos", "files": ["pkg/query/querier.go"], "reason": "Thanos querier Select() implementation"}
],
"cost": {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0, "cost_usd": 0.0},
"status": "success"
}

Cost fields are zero because the generation happened locally via Claude Code, not through the OpenRouter API.
question.json ──► mcp_context_generation.py ──► <model>.json (full answer + condensed)
│
▼
evaluate.py --force
│
┌──────────────┼──────────────┐
▼ ▼ ▼
evaluation.json analysis.json analysis_summary.json
(model metadata (judge scores (leaderboard with
+ relevance) per question) cost vs accuracy)
Downloads all 25 Kubernetes and observability repositories that make up the benchmark dataset into dataset/Kubecluster/.
Each repository is shallow-cloned (--depth 1) to minimize disk usage. Repos that already exist locally are skipped, so the script is safe to re-run — it will only fetch what's missing. On failure, it reports which repos could not be cloned and exits with code 1.
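The shallow-clone-with-skip behavior can be sketched as follows — clone_repo is a hypothetical helper for illustration, not the actual function in download_dataset.py:

```python
import subprocess
from pathlib import Path

def clone_repo(org_repo: str, dest_root: str = "dataset/Kubecluster") -> bool:
    """Shallow-clone one GitHub repo, skipping it if already present.

    Sketch of the behavior described above; returns True on success
    or skip, False if git exits non-zero.
    """
    name = org_repo.split("/")[1]
    dest = Path(dest_root) / name
    if dest.exists():
        print(f"skip {name} (already cloned)")
        return True
    result = subprocess.run(
        ["git", "clone", "--depth", "1",
         f"https://github.com/{org_repo}.git", str(dest)])
    return result.returncode == 0
```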
The 25 repositories:
| Repository | GitHub |
|---|---|
| argo-cd | argoproj/argo-cd |
| autoscaler | kubernetes/autoscaler |
| cert-manager | cert-manager/cert-manager |
| cilium | cilium/cilium |
| crossplane | crossplane/crossplane |
| external-dns | kubernetes-sigs/external-dns |
| external-secrets | external-secrets/external-secrets |
| flux2 | fluxcd/flux2 |
| gatekeeper | open-policy-agent/gatekeeper |
| grafana | grafana/grafana |
| helm | helm/helm |
| ingress-nginx | kubernetes/ingress-nginx |
| istio | istio/istio |
| jaeger | jaegertracing/jaeger |
| karpenter | aws/karpenter-provider-aws |
| kubernetes | kubernetes/kubernetes |
| kustomize | kubernetes-sigs/kustomize |
| loki | grafana/loki |
| mimir | grafana/mimir |
| opentelemetry-collector | open-telemetry/opentelemetry-collector |
| opentelemetry-collector-contrib | open-telemetry/opentelemetry-collector-contrib |
| opentelemetry-operator | open-telemetry/opentelemetry-operator |
| prometheus | prometheus/prometheus |
| tempo | grafana/tempo |
| thanos | thanos-io/thanos |
python3 src/download_dataset.py

No arguments. Clones everything into dataset/Kubecluster/.
Pure MCP stress test (no LLM involved). Hammers the graph_search tool with concurrent threads to find the maximum concurrency the server can handle within a given RAM budget.
How it works:
The test runs in two phases:
- Phase 1 — Discovery: Starts at 5 threads and increases by 5 each round (10-second probe per round). Each thread repeatedly calls graph_search on the MCP server. A background ServerMonitor thread samples the server process's CPU% and RSS memory every second using psutil. If the server's RSS exceeds --server-mem-limit or any call errors out, discovery stops and the last safe thread count is recorded.
- Phase 2 — Stability Soak: Runs the maximum safe thread count for 30 seconds of sustained load to confirm the server stays healthy under continuous pressure. Reports per-thread latency breakdown (avg, min, max, p50, p99) and server resource usage (CPU cores, peak RSS, samples).
The monitor finds the MCP server process by looking up which PID is listening on the MCP port via lsof.
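The port-to-PID lookup can be sketched like this — the lsof flags shown are one plausible invocation, not necessarily the script's exact one:

```python
import subprocess

def parse_lsof_pids(output: str):
    """lsof -t prints one PID per line; take the first, if any."""
    pids = output.split()
    return int(pids[0]) if pids else None

def find_server_pid(port: int):
    """Look up which PID is listening on the given TCP port via lsof.

    Sketch of the monitor's discovery step; returns None if nothing
    is listening (or lsof prints no PIDs).
    """
    out = subprocess.run(
        ["lsof", "-t", f"-iTCP:{port}", "-sTCP:LISTEN"],
        capture_output=True, text=True)
    return parse_lsof_pids(out.stdout)
```

Once the PID is known, psutil.Process(pid).memory_info().rss gives the RSS sample the monitor compares against --server-mem-limit.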
python3 src/mcp_stress.py \
--mcp-config mcp_config.json \
--server-mem-limit 4000 \
  --max-duration 300

| Flag | Short | Default | Description |
|---|---|---|---|
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--server-mem-limit` | | 4000 | Max server RSS in MB |
| `--start-threads` | | 5 | Initial thread count |
| `--thread-step` | | 5 | Threads to add each discovery round |
| `--probe-duration` | | 10 | Seconds per discovery probe round |
| `--soak-duration` | | 30 | Seconds for stability soak phase |
| `--query` | | SharedInformer | Search query string |
| `--channels` | | classes imports | Search channels |
| `--timeout` | | 30 | Read timeout per MCP call in seconds |
| `--sample-interval` | | 1.0 | Server monitor sample interval in seconds |
| `--max-duration` | | 300 | Hard time limit for the entire test |
End-to-end smoke test that auto-discovers the maximum concurrent thread count the LLM can handle, then answers ALL questions at that concurrency while monitoring server memory.
How it works:
- Phase 1 — Discovery: Starts at 3 threads, bumps +1 each round. Each thread picks a random question from the provided file and sends it through the full LLM agent loop (the LLM calls MCP tools to search the codebase, gets results, and formulates an answer). If any thread errors (LLM timeout, refusal, crash), discovery stops and the last clean thread count is recorded.
- Phase 2 — Answer All: Takes every question from the file, shuffles them, and batches them through the max safe thread count. Each batch runs in parallel. Progress is logged per batch. The test stops if the server memory limit is breached or errors appear.
Outputs a timestamped JSON file with full results including per-thread latency, tokens, cost, and server resource usage.
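The shuffle-and-batch phase can be sketched as follows, with run_one standing in for the real per-question agent call (hypothetical interface):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions: list, max_threads: int, run_one) -> list:
    """Phase 2 sketch: shuffle the questions, then run them through
    the max safe thread count in parallel batches.

    run_one(question) is a placeholder for the full LLM agent loop.
    """
    shuffled = questions[:]
    random.shuffle(shuffled)
    results = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for start in range(0, len(shuffled), max_threads):
            batch = shuffled[start:start + max_threads]
            results.extend(pool.map(run_one, batch))
            print(f"batch done: {len(results)}/{len(shuffled)}")
    return results
```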
python3 src/smoke_test.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json
# With a time cap:
python3 src/smoke_test.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json \
  --max-duration 300

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions` | `-q` | required | Path to questions JSON file |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--start-threads` | | 3 | Initial thread count for discovery |
| `--thread-step` | | 1 | Threads to add each round |
| `--max-duration` | | unlimited | Hard time limit in seconds |
| `--model` | | from models.json | OpenRouter model name |
| `--api-key` | | env | OpenRouter API key |
| `--max-steps` | | 25 | Max agent steps per question |
| `--timeout` | | 120 | Read timeout per MCP call in seconds |
| `--server-mem-limit` | | 4000 | Max server RSS in MB (stops test if exceeded) |
| `--seed` | | none | Random seed for reproducibility |
Core infrastructure. Provides the MCPClient, LLMClient, and run_agent loop used by all other scripts. Also works standalone to run questions from a JSON file against a single model via MCP.
The agent loop sends a detailed system prompt instructing the LLM to plan, discover files across repos using MCP tools (server_info, list_knowledge, graph_search, retrieve_file), follow dependency chains, and produce a structured Markdown answer with an architecture overview, detailed analysis, and a table of all relevant files. Tool calls within a single step execute in parallel.
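The loop's shape can be sketched like this — llm.step and mcp.call are hypothetical interfaces standing in for LLMClient and MCPClient, and the message format is simplified from real OpenAI function-calling payloads:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(llm, mcp, question: str, max_steps: int = 25):
    """Minimal sketch of the agent loop described above.

    llm.step(messages) is assumed to return either tool_calls or a
    final answer; mcp.call(name, args) executes one MCP tool. Both
    are stand-ins, not the actual evals.py API.
    """
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm.step(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # final answer produced
        # Tool calls within a single step execute in parallel
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(
                lambda tc: mcp.call(tc["name"], tc["args"]),
                reply["tool_calls"]))
        for tc, res in zip(reply["tool_calls"], results):
            messages.append({"role": "tool", "name": tc["name"],
                             "content": res})
    return None  # hit --max-steps without a final answer
```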
python3 src/evals.py \
--questions cross_repo_whole.json \
--mcp-config mcp_config.json \
  --model deepseek/deepseek-chat-v3.1

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions` | `-q` | required | Path to questions JSON file |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--output-dir` | `-o` | results | Output directory for results |
| `--data-dir` | `-d` | results | Directory for per-question result files |
| `--model` | | deepseek/deepseek-chat-v3.1 | OpenRouter model name |
| `--api-key` | | env | OpenRouter API key |
| `--max-steps` | | 40 | Max agent steps per question |
| `--timeout` | | 300 | Read timeout per MCP call in seconds |
| `--delay` | | 1.0 | Delay between questions in seconds |
| `--start` | | 0 | Start index (slice questions) |
| `--end` | | all | End index (slice questions) |
| `--verbose` | `-v` | off | Enable verbose logging |
Reads questions from a folder of question_*/question.json files and runs them against every model in models.json in parallel (one thread pool per model). Results are saved back into the same question folders. Cached successful answers are skipped on re-runs — only failed/blank ones are retried. Stops immediately on 402 payment errors.
After all models finish, prints a comparison table with success count, errors, average latency, tokens, cost, and wall-clock time per model.
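The cache check that decides whether a model still needs to answer a question can be sketched as follows — needs_run is a hypothetical helper illustrating the rule (exists, parses, succeeded, non-blank answer), not the script's actual code:

```python
import json
from pathlib import Path

def needs_run(question_dir: Path, model: str) -> bool:
    """Return True if this model still needs to answer this question.

    A result counts as cached only if the file exists, parses as
    JSON, has status "success", and has a non-blank answer.
    """
    path = question_dir / (model.replace("/", "_") + ".json")
    if not path.exists():
        return True
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return True
    return data.get("status") != "success" or not data.get("answer", "").strip()
```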
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
--mcp-config mcp_config.json \
--threads 3
# Only run specific models:
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
--mcp-config mcp_config.json \
  --models "xiaomi/mimo-v2-flash" "deepseek/deepseek-chat-v3.1"

| Flag | Short | Default | Description |
|---|---|---|---|
| `--questions-dir` | `-q` | required | Path to folder containing question_*/question.json files |
| `--mcp-config` | `-m` | required | Path to MCP config JSON file |
| `--models` | | all from models.json | Specific model(s) to run |
| `--threads` | `-t` | 3 | Concurrent threads per model |
| `--max-steps` | | 25 | Max agent steps per question |
| `--timeout` | | 120 | Read timeout per MCP call in seconds |
| `--api-key` | | env | OpenRouter API key |
| `--num-questions` | `-n` | all | Number of questions to run |
| `--seed` | | none | Random seed for reproducibility |
Evaluates LLM answers by checking whether the file paths mentioned in each answer physically exist in dataset/Kubecluster/. For each question folder in the given results directory, it reads every model answer file, extracts file paths from markdown tables and inline backtick references, resolves repo names (handling aliases like "argocd" -> "argo-cd"), and checks the filesystem.
Computes two scores per model answer:
- relevance_score (0-10, higher = better): Combines file accuracy (fraction of mentioned files that exist), answer substance (length/structure), and file coverage (number of real files found).
- hallucination_score (0-10, higher = worse): Fraction of mentioned file paths that don't physically exist in the dataset.
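As an illustration of the hallucination check, here is a minimal sketch that pulls backtick-quoted repo/path references out of an answer, resolves one alias, and scores the fraction missing from disk — the regex, alias map, and function name are simplifications, not evaluate.py's actual implementation:

```python
import re
from pathlib import Path

# Illustrative alias map; the real one handles more repo name variants
REPO_ALIASES = {"argocd": "argo-cd"}

def hallucination_score(answer: str,
                        dataset_root: str = "dataset/Kubecluster") -> float:
    """Score the fraction of mentioned file paths that do not exist
    on disk: 0.0 = all real, 10.0 = all hallucinated.
    """
    paths = re.findall(r"`([\w.-]+/[\w./-]+)`", answer)
    if not paths:
        return 0.0
    missing = 0
    for p in paths:
        repo, _, rest = p.partition("/")
        repo = REPO_ALIASES.get(repo, repo)  # e.g. "argocd" -> "argo-cd"
        if not (Path(dataset_root) / repo / rest).exists():
            missing += 1
    return round(10 * missing / len(paths), 1)
```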
Outputs per question folder:
- evaluation.json — file-existence scores per model
- analysis.json — LLM judge comparison: semantic relevance score (1-10) with justification per model (requires OPENROUTER_API_KEY)
Aggregated output:
- analysis_summary.json — written to the results directory root; contains a per-model summary table (avg LLM judge score, questions judged) and a per-question breakdown with every model's score and justification. The judge model is configured via judge_model in models.json.
python3 src/evaluate.py --results-dir results
# Or for a specific subfolder:
python3 src/evaluate.py --results-dir results/KubeCluster40

| Flag | Short | Default | Description |
|---|---|---|---|
| `--results-dir` | `-r` | required | Path to results folder containing question_*/ subfolders |
Aggregates per-model scores from all evaluation.json and analysis.json files into a single metrics.json. Reports average relevance, average hallucination, average LLM judge score, file accuracy percentage, questions answered, and questions errored per model, sorted by LLM judge score descending.
python3 src/aggregate_metrics.py --results-dir results
# Or for a specific subfolder:
python3 src/aggregate_metrics.py --results-dir results/KubeCluster40

| Flag | Short | Default | Description |
|---|---|---|---|
| `--results-dir` | `-r` | required | Path to results folder containing question_*/ subfolders |
Counts errored or blank answer files per model across all results.
python3 src/count_errors.py

Classifies and counts error types (402 Payment Required, 429 Rate Limited, Timeout, 5xx Server Error, blank answers, etc.) across all failed answer files, with per-model breakdown.
python3 src/error_analysis.py

One-time script that replaces CRW_TC042-TC071 in cross_repo_whole.json with 30 new observability cross-repo questions (OBS_TC001-OBS_TC030) covering Prometheus, OpenTelemetry Collector, Thanos, Grafana, Jaeger, Loki, Mimir, and Tempo.
python3 src/replace_questions.py

Questions live in folders like results/KubeCluster40/ with per-question subfolders. Model results are saved back into the same folders:
results/KubeCluster40/
question_MIXED_TC001/
question.json # The question + expected answer + expected files
xiaomi_mimo-v2-flash.json # Model answer + tool calls + cost
deepseek_deepseek-chat-v3.1.json # Another model's answer
openai_gpt-oss-120b.json # One file per model in models.json
...
evaluation.json # File-existence scores per model
analysis.json # LLM-based relevance comparison per model
question_MIXED_TC002/
question.json
xiaomi_mimo-v2-flash.json
deepseek_deepseek-chat-v3.1.json
...
evaluation.json
analysis.json
...
20250601_120000-mcp_context_generation.json # Timestamped run summary
metrics.json # Aggregate scores (after running aggregate_metrics.py)
Each question_<ID>/ folder contains:
| File | Description |
|---|---|
| `question.json` | The question text, expected answer, expected files, and source repo |
| `<model>.json` | One per model — the model's answer, tool calls, token usage, cost, latency, and status |
| `evaluation.json` | File-existence check results: relevance score (0-10), hallucination score (0-10), found/missing files per model |
| `analysis.json` | LLM judge comparison: semantic relevance score (1-10) with justification per model |
{
"id": "SA_TC001",
"question": "If the ServiceAccount struct in kubernetes/...",
"expected_answer": "The change affects...",
"expected_files": [
{"repo": "istio", "files": ["pilot/pkg/..."], "reason": "..."}
],
"repo": "kubernetes"
}

{
"model": "deepseek/deepseek-chat-v3.1",
"answer": "## Architecture Overview\n...",
"cost": {
"input_tokens": 45000,
"output_tokens": 3200,
"total_tokens": 48200,
"cost_usd": 0.0141
},
"status": "success",
"latency_seconds": 42.5,
"tool_calls_count": 12,
"agent_steps": 8,
"tool_calls": [...]
}

For each question, analysis.json contains an LLM-generated comparison of each model's answer against the expected answer:
{
"question_id": "SA_TC001",
"question": "...",
"expected_answer": "...",
"model_analyses": [
{
"model": "deepseek/deepseek-chat-v3.1",
"relevance": 8,
"justification": "The answer correctly identifies 6 of 7 expected files..."
}
]
}

1. Run mcp_context_generation.py -q results/KubeCluster40 → question_*/<model>.json
2. Run evaluate.py -r results/KubeCluster40 → question_*/evaluation.json + analysis.json
3. Run aggregate_metrics.py -r results/KubeCluster40 → metrics.json
100 test cases in cross_repo_whole.json:
| Prefix | Count | Description |
|---|---|---|
| CRW | 41 | Cross-Repo Wide — struct/interface/function changes in core Kubernetes that break downstream consumers |
| OBS | 30 | Observability — cross-repo impact across Prometheus, Thanos, Mimir, Loki, Tempo, Jaeger, Grafana, and OTel Collector |
| KM | 14 | Kubernetes Modification — struct/interface changes in Kubernetes core packages |
| SA | 13 | Source Across — breaking changes from multiple source repos with broad cross-repo impact |
| NK | 2 | Non-Kubernetes — changes originating in non-core repos (kustomize, helm) |
To add more queries for the ByteBell MCP to resolve, create new question folders following the same format as results/KubeCluster40/question_*/question.json.
- Pick a unique question ID (e.g. CUSTOM_TC001).
- Create a folder inside your questions directory: question_CUSTOM_TC001/
- Add a question.json inside it with this format:
{
"id": "CUSTOM_TC001",
"question": "Your cross-repo impact question here...",
"expected_answer": "Description of the expected impact...",
"expected_files": [
{"repo": "argo-cd", "path": "controller/appcontroller.go", "why": "Reason this file is affected"},
{"repo": "prometheus", "path": "discovery/kubernetes/pod.go", "why": "Reason this file is affected"}
],
"repo": "source-repo-name"
}

- Run mcp_context_generation.py with the path to your folder:
python3 src/mcp_context_generation.py \
--questions-dir results/KubeCluster40 \
  --mcp-config mcp_config.json

You can add any number of question folders. The script discovers all question_*/question.json files automatically and skips questions that already have cached successful answers.
You don't have to use KubeCluster40. You can create any folder with question_*/question.json subfolders and point the script at it:
# Create a new question set
mkdir -p results/ObservabilityQuestions/question_OBS_NEW_001
# Add question.json ...
mkdir -p results/ObservabilityQuestions/question_OBS_NEW_002
# Add question.json ...
# Run against your custom folder
python3 src/mcp_context_generation.py \
--questions-dir results/ObservabilityQuestions \
  --mcp-config mcp_config.json

This lets you maintain separate question sets for different domains, test scenarios, or evaluation runs. Each folder is self-contained — questions and model results all live together under the same directory.
- Connects to the ByteBell MCP server over StreamableHTTP
- Fetches available MCP tools (server_info, list_knowledge, graph_search, graph_traverse, retrieve_file)
- For each question, runs an agent loop:
  - LLM receives the question + MCP tools in OpenAI function-calling format
  - LLM calls tools to search across repos
  - Tool calls execute in parallel via ThreadPoolExecutor
  - Loop continues until the LLM produces a final answer or hits --max-steps
- Results are saved incrementally after each question
- Evaluation checks mentioned file paths against the actual dataset filesystem
- Analysis uses an LLM judge to compare answers against expected answers for semantic relevance
Pure Python — no LangChain, no mcp_use. Direct HTTP calls to OpenRouter and the MCP server.