This repo contains everything for our dead code detection benchmark: raw data from previous runs, the blog post draft, analysis of false positive patterns, and instructions for running new benchmarks.
dead-code-benchmark-blog/
  README.md                                   # This file (planning, methodology, run instructions)
  blog/
    2026-03-09-dead-code-graphs-ai-agents.md  # Blog post draft
  runs/
    feb-06-initial/                           # First antiwork/helper benchmark
    feb-17-18-full-suite/                     # 16-run benchmark (blogpost-deadcode.zip)
    feb-20-with-synthetic/                    # 10-task MCP vs baseline (fri-feb-20.zip)
    feb-23-post-fixes/                        # 16-task MCP-only post-fixes (sun-feb-23.zip)
    mar-09-post-parser-improvements/          # NEW: Run after barrel re-export + 7 new phases
  analysis/
    fp-taxonomy.md                            # False positive root cause taxonomy
    parser-improvement-plan.md                # Prioritized improvement roadmap
    benchmark-improvement-system.md           # Continuous improvement loop design
| Date | Run | Tasks | Key Finding |
|---|---|---|---|
| Feb 6 | Initial | antiwork/helper + synthetic | 91.7% recall, 100% precision on real codebase (resolved!) |
| Feb 9 | Replication | Same as above | Identical results -- confirmed reproducibility |
| Feb 17-18 | Full suite | 16 runs (synthetic + 12 real PRs) | MCP wins 9/16, 0 resolved on large repos |
| Feb 20 | MCP vs baseline | 10 real PRs | Baseline scored 0% on all 10 tasks. MCP: 75-100% recall on 6/10 |
| Feb 23 | Post-fixes (MCP only) | 16 PRs | Directus 80% recall, some improvement |
| Mar 9 | Post parser improvements | TBD | Run this now -- see instructions below |
| Task | Best Recall | Best Precision | Run |
|---|---|---|---|
| directus_pr26311 | 100% | 0.6% | Feb 20 |
| podman_pr16084 | 100% | 0.3% | Feb 20 |
| latitude_pr2300 | 100% | 0.3% | Feb 20 |
| tyr_pr258 | 95.5% | 3.8% | Feb 20 |
| antiwork/helper | 91.7% | 100% | Feb 6 (RESOLVED) |
| mimir_pr3613 | 77.8% | 0.6% | Feb 20 |
| otel_js_pr5444 | 75.0% | 1.3% | Feb 20 |
| maskbook_pr12361 | 63.6% | 0.2% | Feb 20 |
These changes landed but have NOT been measured in a full benchmark yet:
- Barrel re-export filtering (Mar 6, #444/#469)
  - Internal testing: recall 59.0% -> 85.2% across 5 repos
  - tyr: 36.4% -> 86.4%, mimir: 77.8% -> 88.9%, directus: 80% -> 93.3%, latitude: 80% -> 100%
  - Propagates `isReExport` flag, excludes re-export-only consumers from import counts
  - Step 2b fixpoint loop: cascading dead import detection
- 7 new pipeline phases (Feb 26-Mar 6, #469)
  - Phase 2b: Vue SFC import extraction
  - Phase 2c: Dynamic `import()`/`require()` filesystem loading detection
  - Phase 4c: Library package detection (types/, module/, packages/)
  - Phase 4d: Cross-package import resolution for monorepos
  - Phase 6b: Class-to-method reachability propagation
  - Phase 6d: Exported object literal property reachability
  - Directus: 2,808 candidates -> 2,178 (-22.4%), preserved 22/23 known TPs
- Class rescue via new/instanceof/extends (Mar 6, #497)
  - Non-exported classes used via `new`, `instanceof`, `extends` no longer flagged dead
  - Example: `BlobNotFoundError` in job-worker.service.ts
- Same-file rescue patterns (broadened, #443)
  - Callback references: `items.map(processItem)`
  - Object literal methods: `{ handler: myFunction }`
  - JSX component references: `<TopLinks />`
  - Module-scope calls: `const X = getZipLimits()`
  - Default parameters: `function foo(cb = myHelper) {}`
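
The cascading dead import detection in the Step 2b fixpoint loop can be pictured as repeatedly discounting importers that are themselves dead until nothing changes. The following is a minimal sketch of that idea only, not the pipeline's actual code: the graph shape, the `find_dead_files` name, and the entry-point set are all hypothetical.

```python
from typing import Dict, Set


def find_dead_files(imports: Dict[str, Set[str]], entry_points: Set[str]) -> Set[str]:
    """Return files with no live importers, iterating to a fixpoint.

    `imports` maps each file to the set of files it imports (assumed shape).
    Every file in the project is assumed to appear as a key.
    """
    dead: Set[str] = set()
    changed = True
    while changed:  # stop once a full pass finds no new dead file
        changed = False
        for target in imports:
            if target in dead or target in entry_points:
                continue
            # Only importers that are still alive count as usage.
            live_importers = {
                src
                for src, deps in imports.items()
                if target in deps and src not in dead and src != target
            }
            if not live_importers:
                dead.add(target)
                changed = True
    return dead


graph = {
    "main.ts": {"a.ts"},
    "a.ts": set(),
    "orphan.ts": {"helper.ts"},  # nothing imports orphan.ts
    "helper.ts": set(),          # imported only by orphan.ts
}
dead = find_dead_files(graph, entry_points={"main.ts"})
print(sorted(dead))  # ['helper.ts', 'orphan.ts']
```

A single pass that only counts importers would keep `helper.ts` alive because `orphan.ts` imports it; the loop is what lets the dead status cascade.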
- Docker Desktop running
- Environment variables set:
  export ANTHROPIC_API_KEY="sk-ant-..."
  export SUPERMODEL_API_KEY="smsk_live_..."
- mcpbr-eval repo set up (see below)
# The benchmark repo
cd ~/mcpbr-eval
git checkout feat/supermodel-benchmark
# Install dependencies
uv pip install -e ".[dev]"
# Verify it works
uv run pytest -m "not integration"

Option A: Run all 17 tasks (MCP-only, ~$5-10, ~2 hours)

This is the fastest way to get "after" numbers for the blog post. No baseline run is needed because the Feb 20 baseline already scored 0% on all 10 tasks.
cd ~/mcpbr-eval
uv run mcpbr run --config config/supermodel-deadcode-pr.yaml --mcp-only -vv

Option B: Run specific high-signal tasks (recommended first)

Start with the 5 tasks that had internal recall improvements:
cd ~/mcpbr-eval
# These are the tasks with known recall improvements from parser fixes
for task in tyr_pr258 jslpsolver_pr159 mimir_pr3613 directus_pr26311 latitude_pr2300; do
  echo "=== Running $task ==="
  uv run mcpbr run --config config/supermodel-deadcode-pr.yaml --mcp-only -t "$task" -vv --no-incremental
done

Option C: Run with baseline comparison (~$15-20, ~4 hours)

Full A/B comparison. Only needed if we want fresh baseline numbers.
cd ~/mcpbr-eval
uv run mcpbr run --config config/supermodel-deadcode-pr.yaml -vv

Results will be in the latest `.mcpbr_run_*` directory:

# Find the latest run
ls -td ~/mcpbr-eval/.mcpbr_run_* | head -1
# Copy results to this repo
LATEST=$(ls -td ~/mcpbr-eval/.mcpbr_run_* | head -1)
cp -r "$LATEST" ~/dead-code-benchmark-blog/runs/mar-09-post-parser-improvements/
# Quick summary
cat "$LATEST/results.json" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"MCP: {d['summary']['mcp']['resolved']}/{d['summary']['mcp']['total']} resolved\")
print(f\"Cost: \${d['summary']['mcp']['total_cost']:.2f}\")
for t in d['tasks']:
    m = t['mcp']
    print(f\"  {t['instance_id']}: P={m.get('precision',0):.1%} R={m.get('recall',0):.1%} F1={m.get('f1_score',0):.1%} ({m.get('true_positives',0)} TP, {m.get('false_positives',0)} FP)\")
"

The benchmark config is at ~/mcpbr-eval/config/supermodel-deadcode-pr.yaml. Key settings:

| Setting | Value | Notes |
|---|---|---|
| `model` | `claude-sonnet-4-20250514` | Claude Sonnet 4 |
| `agent_harness` | `claude-code` | Uses Claude Code as agent |
| `max_iterations` | 30 | Agent turn limit |
| `timeout_seconds` | 1200 | 20 min per task |
| `supermodel_api_base` | staging.api.supermodeltools.com | Uses staging API |
| `resolved_threshold` | 0.8 | P>=80% AND R>=80% to "resolve" |
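
To make the `resolved_threshold` rule concrete, here is a small sketch of the P>=80% AND R>=80% check applied to two rows from the best-results table. The function names are hypothetical and this is not mcpbr's own implementation, only an illustration of why a task with 100% recall can still be far from resolved.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_resolved(precision: float, recall: float, threshold: float = 0.8) -> bool:
    """A task resolves only if BOTH metrics reach the threshold."""
    return precision >= threshold and recall >= threshold


# antiwork/helper (Feb 6): 100% precision, 91.7% recall
helper_resolved = is_resolved(precision=1.0, recall=0.917)
# directus_pr26311 (Feb 20): 0.6% precision, 100% recall
directus_resolved = is_resolved(precision=0.006, recall=1.0)
directus_f1 = f1_score(0.006, 1.0)

print(helper_resolved)        # True
print(directus_resolved)      # False
print(f"{directus_f1:.1%}")   # 1.2%
```

Because F1 is a harmonic mean, the 0.6% precision dominates: perfect recall on Directus still yields an F1 of roughly 1.2%.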
From the config file comments and benchmark experience:
| Task | Quality | Notes |
|---|---|---|
| tyr_pr258 | Best | 95.5% API recall (21/22 GT), clean TypeScript monorepo |
| directus_pr26311 | Good | 15 GT items, multi-package structure |
| mimir_pr3613 | Good | 9 GT items, Statistics Norway portal |
| latitude_pr2300 | Good | 5 GT items, monorepo with packages |
| jslpsolver_pr159 | Good | 10 GT items, smaller repo (baseline previously outperformed) |
| podman_pr16084 | OK | Only 2 GT items (small sample) |
| otel_js_pr5444 | OK | 4 GT items, non-determinism observed |
| maskbook_pr12361 | OK | 22 GT items but 6000+ candidates (analysis dump risk) |
| gemini_cli_pr18681 | Poor | 33% API recall, test-import gap |
| prisma_pr28485 | Poor | 0% recall (GT items only referenced by tests) |
| n8n_pr23572 | Invalid | Feature removal, not dead code (GT items are alive) |
| typescript_pr56817 | Blocked | Too large for staging API without cached_analysis |
| Priority | Root Cause | % of FPs | Status (Mar 9) |
|---|---|---|---|
| P0 | Import resolution: too many root files (145 in Tyr, should be ~3) | ~51% | Partially fixed (barrel re-exports done, framework wiring still missing) |
| P1 | `export type { X } from` re-exports not tracked | ~15% | Not fixed |
| P2 | `export default X` not detected as export | ~5% | Not fixed |
| P3 | JSX usage not recognized as calls (React components) | ~20% in React repos | Partially fixed (same-file rescue catches <Foo />, cross-file misses) |
| P4 | Test/script/config file imports not scanned | 11/28 scream test FPs | Not fixed |
| P5 | Python `isExported` broken (only checks `__all__`) | Disables Python pipeline | Not fixed |
| P6 | Type/interface structural typing references | ~21% in TS repos | Hard to fix |
Import resolution failure (P0) is the dominant false positive source. When import resolution fails to trace how a file is consumed (e.g., Express `app.use(router)`, dynamic `require()`, framework wiring), the file becomes a "root file" with no importers. All its exports then get flagged as "exported but file never imported."
In the Tyr benchmark, there were 145 root files (should be ~3-5). This single issue produced ~396 of 775 candidates (51%).
What's fixed: Barrel re-exports now correctly traced. Step 2b fixpoint loop handles cascading dead imports.
What's still broken: Framework-level wiring (Express router mounting, NestJS module registration, dynamic requires).
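
The root-file effect can be illustrated with a toy import graph. This is a sketch under assumptions, not the pipeline's logic: the `root_files` helper, the file names, and the graph shape are invented for illustration. It shows how one untraced edge turns a wired-in file into an extra root.

```python
from typing import Dict, Set


def root_files(imports: Dict[str, Set[str]]) -> Set[str]:
    """Files that no other file imports, according to the resolved graph."""
    imported = {dep for deps in imports.values() for dep in deps}
    return set(imports) - imported


# Resolved graph where the edge app.ts -> routes.ts was NOT traced
# (for example, because the wiring happens through a framework call).
untraced = {
    "server.ts": {"app.ts"},
    "app.ts": set(),
    "routes.ts": {"handlers.ts"},
    "handlers.ts": set(),
}
# Same project with the missing edge recovered.
traced = {**untraced, "app.ts": {"routes.ts"}}

roots_untraced = root_files(untraced)
roots_traced = root_files(traced)
print(sorted(roots_untraced))  # ['routes.ts', 'server.ts']
print(sorted(roots_traced))    # ['server.ts']
```

Every export of a spurious root such as `routes.ts` would then be reported as a candidate, which is how a handful of missed edges can inflate the candidate count.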
Symbols imported only from test files, build scripts, or config files (e.g., `vitest.config.ts`, `rollup.config.mjs`) appear to have "no importers" because the scanner doesn't cover those file types. This was 11 of 28 FPs in the jslpsolver scream test.
This (P4) is the highest-ROI unfixed item. Scanning test file imports would eliminate a large share of FPs (11 of 28 in the jslpsolver scream test alone) with minimal regression risk.
1. Run benchmark -> measure P/R/F1 per task
2. Sample FPs -> categorize root causes (using taxonomy above)
3. Pick highest-impact root cause -> implement fix
4. Run benchmark again -> verify improvement, check for regressions
5. Repeat
| Tier | Speed | Cost | What It Tests | When to Run |
|---|---|---|---|---|
| Tier 1 | Seconds | Free | Unit tests: does parser extract `export default`? | Every commit |
| Tier 2 | Minutes | Free | Candidate-level regression: run analysis on 5 repos, diff candidate counts | Every PR |
| Tier 3 | Hours | $5-50 | Full agent benchmark via mcpbr: measures whole pipeline end-to-end | Weekly / pre-release |
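
A Tier 2 candidate-level check could look roughly like the sketch below: compare candidate sets before and after a parser change and confirm no known true positive was lost. The function names and the set-of-symbol-names input are assumptions for illustration; how the candidate lists are obtained from the API is left out.

```python
from typing import Dict, Set


def pct_change(before_count: int, after_count: int) -> float:
    """Relative change in candidate count, in percent, rounded to 0.1."""
    if before_count == 0:
        return 0.0
    return round((after_count - before_count) / before_count * 100, 1)


def candidate_check(before: Set[str], after: Set[str], known_tps: Set[str]) -> Dict:
    """Pass when the candidate count did not grow and no known TP disappeared."""
    lost = (known_tps & before) - after  # TPs detected before but not after
    return {
        "before": len(before),
        "after": len(after),
        "change_pct": pct_change(len(before), len(after)),
        "lost_tps": sorted(lost),
        "ok": len(after) <= len(before) and not lost,
    }


# Toy data: two FPs removed, both known TPs still present.
before = {"deadA", "deadB", "fp1", "fp2", "fp3"}
after = {"deadA", "deadB", "fp1"}
report = candidate_check(before, after, known_tps={"deadA", "deadB"})
print(report["change_pct"], report["ok"])  # -40.0 True

# The Directus counts reported in this README: 2,808 -> 2,178 candidates.
print(pct_change(2808, 2178))  # -22.4
```

Tracking `lost_tps` separately from the count matters: a drop in candidates is only good news if the known true positives survive it.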
Before running an expensive Tier 3 benchmark, verify parser changes with a candidate-level check:
# 1. Check out the benchmark repo at pre-PR commit
# 2. Run Supermodel dead code analysis (just the API, no agent)
# 3. Count candidates and check if known TPs are still detected
# 4. Compare candidate count to baseline
# Example for Directus:
# Before fix: 2,808 candidates, 22/23 TPs detected
# After fix: 2,178 candidates (-22.4%), 22/23 TPs still detected
# = Good: fewer FPs, no TP regression

Formalize the 12 PRs as a regression test suite:

{
  "tasks": [
    {
      "id": "tyr_pr258",
      "repo": "uncovering-world/track-your-regions",
      "merge_commit": "6f480121...",
      "ground_truth_count": 22,
      "best_api_recall": 0.955,
      "best_agent_recall": 0.955,
      "known_misses": ["one item missed by API"]
    }
  ]
}

After each benchmark run, automatically categorize FPs:
- For each FP, check: imported by test file? Framework entry point? Type re-export? Generated directory?
- Tag each FP with root cause category
- Track category volumes over time
- Build a precision/recall dashboard per root cause
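
The tagging step above might be sketched as a small rule chain plus a counter. Everything here is hypothetical: the FP record fields (`file`, `importers`, `is_type_reexport`), the path heuristics, and the category labels are placeholders loosely mapped to the taxonomy, not an existing tool.

```python
from collections import Counter
from typing import Dict, List


def is_test_or_config(path: str) -> bool:
    """Rough path heuristic for test, spec, and config files (assumed patterns)."""
    return (
        path.startswith(("test/", "tests/"))
        or ".test." in path
        or ".spec." in path
        or ".config." in path
    )


def categorize_fp(fp: Dict) -> str:
    """Assign one root-cause tag to a false positive record."""
    importers: List[str] = fp.get("importers", [])
    path: str = fp.get("file", "")
    if importers and all(is_test_or_config(p) for p in importers):
        return "P4_test_or_config_import"
    if fp.get("is_type_reexport"):
        return "P1_type_reexport"
    if "generated/" in path:
        return "generated_dir"
    if not importers:
        return "P0_root_file"
    return "uncategorized"


fps = [
    {"file": "src/solver.ts", "importers": ["test/solver.test.ts"]},
    {"file": "src/routes.ts", "importers": []},
    {"file": "src/types.ts", "importers": ["src/index.ts"], "is_type_reexport": True},
    {"file": "src/util.ts", "importers": ["rollup.config.mjs"]},
]
volumes = Counter(categorize_fp(fp) for fp in fps)
print(volumes.most_common())
```

Storing one such `Counter` per run would give the per-category volumes needed to track trends over time.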
| What | Where |
|---|---|
| mcpbr-eval (benchmark runner) | ~/mcpbr-eval (branch: feat/supermodel-benchmark) |
| Benchmark config | ~/mcpbr-eval/config/supermodel-deadcode-pr.yaml |
| Supermodel benchmark code | ~/mcpbr-eval/src/mcpbr/benchmarks/supermodel/ |
| Dead code endpoint impl | ~/mcpbr-eval/src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py |
| Parser (tree-sitter) | ~/supermodel-public-api/src/data-plane/src/parsers/tree-sitter.ts |
| Dead code pipeline | ~/supermodel-public-api/src/data-plane/src/services/job-worker.service.ts:1344-3181 |
| Blog post draft | ~/jonathanpopham.github.io/_drafts/2026-03-09-dead-code-graphs-ai-agents.md |
| Previous run data (Downloads) | ~/Downloads/blogpost-deadcode/, fri-feb-20-*, sun-feb-23-* |