A benchmark for evaluating how well coding agents understand and navigate code across large, distributed enterprise codebases.
Existing benchmarks (SWE-bench, ITBench, DevOps-Gym) test agents in isolated, single-repo settings. EnterpriseBench tests what enterprise developers actually do: trace dependencies across repositories, investigate incidents that span services, and produce diverse artifacts — not just code patches.
- 112 tasks across 10 task types, organized into 7 enterprise workflow suites
- Real OSS codebases only — no synthetic or toy repositories
- Multi-repo by default — 51% of tasks span 2-5 repos connected by real dependency chains
- Diverse artifacts — code patches, incident reports, runbooks, configs, security assessments, call graphs
- Checkpoint-based scoring — 2-5 graduated checkpoints per task for partial credit
- Tool-access as a controlled variable — baseline, MCP-only, and hybrid modes
- Multi-session support — single-shot, chained, event-replay, and resume session types
| Suite | Tasks | Example Work |
|---|---|---|
| Dependency Management | 27 | Trace breaking changes across grpc-go -> etcd -> kubernetes |
| Customer Escalation | 24 | Map error messages to root causes across service boundaries |
| Technical Debt | 19 | Orchestrate refactors and dead code removal across repos |
| Feature Delivery | 18 | Assess schema evolution impact across monorepo packages |
| Incident Response | 12 | Investigate alerts spanning multiple interconnected services |
| Platform Engineering | 10 | Detect configuration drift between IaC definitions and live state |
| Security Operations | 2 | Assess vulnerabilities across dependency trees |
| Type | Description |
|---|---|
api_contract |
Trace breaking API changes across consumer repos |
config_drift |
Detect divergence between config sources and deployments |
db_schema_evolution |
Assess schema change impact across dependent services |
dead_code_necropsy |
Identify unreachable code and feature flag remnants |
dependency_graph |
Traverse and reason about dependency chains |
error_provenance |
Trace error messages to their originating code paths |
incident_investigation |
Root-cause analysis across multi-service incidents |
monorepo_boundary |
Navigate cross-package dependencies in monorepos |
refactor_orchestration |
Plan and execute multi-repo refactoring |
support_code_mapping |
Map customer-facing behavior to underlying implementation |
| Stratum | Share | Tasks |
|---|---|---|
| Calibration | 12.5% | 14 |
| Large single-repo | 25.9% | 29 |
| Dual-repo | 25.0% | 28 |
| Tri-repo | 14.3% | 16 |
| Multi-repo (4-5) | 11.6% | 13 |
| Monorepo cross-package | 10.7% | 12 |
EnterpriseBench/
├── benchmarks/ # 112 active task definitions by suite
│ ├── dependency_management/
│ ├── customer_escalation/
│ ├── feature_delivery/
│ ├── technical_debt/
│ ├── incident_response/
│ ├── platform_engineering/
│ ├── security_operations/
│ └── _archived/ # 28 retired tasks (preserved for reference)
├── lib/
│ └── eb_verify/ # Centralized verification library
│ └── plugins/ # 9 artifact validators
├── schemas/
│ └── task.schema.json # Task definition schema
├── scripts/
│ ├── mining/ # Task sourcing from OSS history
│ ├── sandbox/ # Multi-repo sandbox management
│ ├── orchestration/ # Session chaining, event replay
│ └── validation/ # Task mix and cross-repo validators
├── configs/ # Run configurations and repo version pins
├── results/ # Run results and sample outputs
├── tests/ # 779+ tests across 19 modules
└── docs/ # Architecture, task authoring guide, design docs
Each task is defined in TOML with structured metadata:
difficulty_stratum = "dual_repo"
verification_modes = ["deterministic"]
[task]
id = "api-contract-grpc-metadata-001"
suite = "dependency_management"
task_type = "api_contract"
difficulty = "hard"
session_type = "single"
description = "Trace impact of gRPC-Go metadata context separation on etcd consumers"
prompt = """
The gRPC-Go team has merged a change that separates incoming and outgoing
metadata in the context. Find all files in etcd that use the old metadata
APIs and will break...
"""
[[repos]]
url = "https://github.com/grpc/grpc-go"
rev = "v1.60.0"
role = "dependency"
[[repos]]
url = "https://github.com/etcd-io/etcd"
rev = "v3.5.10"
role = "primary"
[[checkpoints]]
id = "cp1"
description = "Identify all breaking API changes"
points = 25
verifier = "answer"See schemas/task.schema.json for the full schema and docs/TASK_AUTHORING_GUIDE.md for authoring instructions.
EnterpriseBench uses a single centralized verification library (eb_verify) with a plugin architecture. Nine artifact-type-aware validators handle the diverse outputs agents produce:
| Plugin | Validates |
|---|---|
answer |
Structured JSON answers with expected fields |
code_patch |
Git diffs against ground-truth patches |
config_validator |
Configuration file correctness |
incident_report |
Incident analysis structure and content |
runbook |
Operational runbook completeness |
security_assessment |
Security finding accuracy |
reproduction_script |
Bug reproduction script execution |
topological_order |
Dependency ordering correctness |
call_graph |
Function call graph accuracy |
Ground truth uses a layered approach: deterministic checks, LLM-based curation, and solve-verification.
Tasks use real open-source dependency chains rather than artificially combined repos:
- Go:
grpc-go->etcd->kubernetes - Python:
requests->boto3->awscli - Java:
protobuf-java->grpc-java->envoy-control-plane - TypeScript:
typescript->eslint->next.js - Cross-language:
protobufschema -> Go server -> Python client -> TypeScript frontend
Four atomic multi-repo patterns: propagate, investigate, enforce, orchestrate.
Tasks are executed in Docker sandboxes with repos cloned into /workspace/{repo-name}/. Each task can run in three tool-access modes:
| Mode | Description |
|---|---|
baseline |
Agent uses only local tools (grep, find, etc.) |
mcp_only |
Agent uses Sourcegraph MCP for code search and navigation |
hybrid |
Agent has access to both local tools and MCP |
# Run a single task
python scripts/orchestration/run_task.py --task api-contract-grpc-metadata-001 --mode baseline
# Validate task mix against PRD targets
python scripts/validation/task_mix_validator.py- Architecture — system design and verification flow
- Task Type PRD — detailed definitions for all 10 task types
- Task Authoring Guide — how to add new tasks
- Convergence Report — architecture decisions from structured debate
Apache 2.0 — see LICENSE.