EB: python3 verifier sweep (6 suites) + 88/88 sg-evals mirror batch + catch-up since 0rv.8 by sjarmak · Pull Request #1 · sjarmak/EnterpriseBench

sjarmak · 2026-06-04T21:47:04Z

Brings main current from the 2026-04-29 snapshot (8a78319, phase-8 paper draft). 16 commits, ~149 files.

Headline — python3 verifier sweep (9d5abab)
Converts python3-based checkpoint verifiers to bash+jq+grep across 6 of 7 suites. Task containers ship no python3, so 267 python3-invoking checks were false-failing exit-127 in every mode (masking 37 tasks at 0.0). Validated: 0 graded divergences on 110+ scenarios.

88/88 sg-evals mirror batch (a3a41fc..18a5a58)
Wave 1-4 registration of the Stephanie-authorized 88-pin mirror set (BATCH-COMPLETE), with remediation logs.

Accumulated EB work (catch-up)

32738b2 vendor benchmark_qa_core + unified ScoreResult
03e866d fail-loud on container EACCES no-op runs
9b37679 cross-repo gold context for 7 CRNT-failing tasks
28e4564 curate expected_solution.json (incident_response)
2809c80 add support-mapping-dual-httpx-httpcore task
76f5469 F2 leakage-check: skip generic build-manifest basenames
916c8ab dep-traversal-007 difficulty fix
ac55823 etcd consumer fix (api-contract-003)
a66e8ab scrub internal refs + untrack .beads for public release
534fbb1 Snorkel collaboration / division-of-labor proposal (docs)
b2d2e48 align validator report bead IDs

Approved for publish by repo owner (full-scope catch-up).

…task, 0rv.azt)

New customer-escalation support_code_mapping task spanning encode/httpx 0.27.0 and encode/httpcore 1.0.5. Customer scenario: httpx.ReadTimeout fires despite timeout=None. The answer requires tracing the timeout marshaling chain across the httpx public Timeout config, the httpcore.ConnectionPool / AsyncHTTP11Connection per-op enforcement, and the HTTPCORE_EXC_MAP exception remap.

3 distinct-dimension checks (mirrors spring-kafka template):
- error_source: file discovery via eb_verify file_extraction plugin
- error_chain: keyword-graded chain traversal across both repos
- trigger_conditions: keyword-graded named-condition recall

Validated: schema PASS, CRNT PASS (2 repos × required_files), expected_solution schema PASS.

Smoke-tested with passing/failing/partial fixtures: checks score 1.0/1.0/1.0 vs 0.0/0.0/0.0 and the partial fixture (files-only) yields 1.0/0.0/0.0, proving the 3 dimensions are independent.

Note: this task lands AC#1-4 of bead 0rv.azt. AC#5 (Go% <40% on multi-repo subset) is unreachable in 1 task: Go% drops only 46.2% -> 45.8% because 11 new dual-repo Go tasks landed after the 0rv.25 measurement (40.7%). Mayor mailed with shortlist of follow-ups.

Adds expected_solution.json for 11 tracked tasks under benchmarks/incident_response/ that previously lacked one. With this change the full suite (26 tasks) validates with --check-paths against scripts/validation/validate_expected_solutions.py.

Coverage:
- 2 incident-investigation-002, -003 (standard checkpoints, scaffolded from ground_truth)
- 4 incident-investigation-dual-{flux,istio,kafka,prometheus}-001 (standard checkpoints with hand-curated root_cause / error_chain / remediation from expected_answer)
- 1 incident-investigation-tri-containerd-001 (standard checkpoints)
- 1 incident-investigation-quad-containerd-001 (custom checkpoints: kubelet seccomp conversion through containerd CRI, runc BPF filter, moby default profile)
- 1 ansible-galaxy-tar-regression-prove-001 (custom regression-test write-and-fail checkpoints)
- 1 ccx-incident-032 (custom source_files/error_chain/config checkpoints for Envoy connection-pool overflow)
- 1 event-replay-click-ci-001 (custom triage/communication/remediation checkpoints for the markupsafe soft_str removal cascade)

Each checkpoint provides:
- expected_solution: prose synthesizing the canonical answer from ground_truth.json (root_cause, error_chain, affected_services, remediation) and the checkpoint's verifier semantics.
- evaluation_criteria: 3-5 concrete signals naming specific files/functions that an agent's incident report must reach.

All file-path-like fragments resolve at the pinned SHA via the GitHub Contents API (--check-paths green).

Closes EnterpriseBench-auu.

Implements EnterpriseBench-1av (proxy of dr-2vydrm.3): the EB rig adapter for the codeprobe-shared benchmark_qa_core library, plus the unified cross-benchmark ScoreResult contract.

Changes:
* `lib/eb_verify/_vendor/benchmark_qa_core/`: file-vendored from codeprobe SHA 047df83 (`feature/codeprobe-ceuu-benchmark-qa-core`). Imports rewritten to relative form; sha256s recorded in VENDOR.md.
* `lib/eb_verify/qa_adapter.py`: bridges EB's task.toml + ground_truth.json + expected_solution.json into the lib's flat input shapes. Synthesises scoring_method from `verification_modes`, buckets oracle files by their declared repo, and reads instruction.md as the agent-visible aux file.
* `lib/eb_verify/schema_validator.py`: adds Layer 3 `_run_qa_layer` and `validate_task(qa_strict=…, workspace_root=…)`. Default is warn-only; strict mode promotes lib `error`-severity findings to validation errors.
* `lib/eb_verify/scoring.py`: adds `ScoreResult`, `ScoreDiagnostics`, `VerificationResult.to_score_result()`, and `write_score_result()` for the cross-benchmark JSON contract (reward / scorer_family=checklist / sub_scores / diagnostics{task_time_seconds, token_cost_usd, ir_metrics, artifact_results}).
* `lib/eb_verify/runner.py`: `run_all` now plumbs ScoreResult emission alongside reward.txt and accepts harness-supplied `task_time_seconds` / `token_cost_usd`.
* `lib/eb_verify/cli.py`: `validate` subcommand learns `--qa-strict` and `--workspace`.
* `tests/test_qa_adapter.py` (17) and `tests/test_score_result.py` (17): new coverage; existing 80 tests in test_scoring/test_schema_validator/test_runner/test_cli stay green.

Corpus run (180 active tasks, strict mode):
* 179 valid · 1 Layer-2 invalid (dep-traversal-007 stratum mismatch) · 0 Layer-3 errors
* 358 warnings: 330 EB_A0 (expected; repos not cloned outside sandbox) and 28 F2 leakage candidates (22 distinct tasks).

Follow-ups filed:
* EnterpriseBench-3sk: fix dep-traversal-007 stratum/repo-count.
* EnterpriseBench-pyh: triage 11 specific-path F2 leakage warnings.
* EnterpriseBench-7en: add basename filter for generic build manifests (go.mod, package.json, pom.xml, …) so they don't trigger F2.

Report: docs/qa/eb_validator_run_2026_04_30.md

Test verification:

    PYTHONPATH=lib python3 -m pytest \
      tests/test_scoring.py tests/test_schema_validator.py \
      tests/test_score_result.py tests/test_qa_adapter.py \
      tests/test_runner.py tests/test_cli.py
    114 passed in 0.27s

Local-only branch per bead constraints (no push to public sjarmak/EnterpriseBench).
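
The cross-benchmark JSON contract named above can be pictured roughly as follows. This is a minimal sketch, not the code in `lib/eb_verify/scoring.py`: the field names are taken from the commit message, while the types, defaults, and the `to_json` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ScoreDiagnostics:
    # Field names come from the commit message; the types are assumed.
    task_time_seconds: Optional[float] = None
    token_cost_usd: Optional[float] = None
    ir_metrics: dict[str, Any] = field(default_factory=dict)
    artifact_results: dict[str, Any] = field(default_factory=dict)


@dataclass
class ScoreResult:
    reward: float
    scorer_family: str = "checklist"
    sub_scores: dict[str, float] = field(default_factory=dict)
    diagnostics: ScoreDiagnostics = field(default_factory=ScoreDiagnostics)

    def to_json(self) -> str:
        # Hypothetical helper; the real module exposes write_score_result().
        return json.dumps(asdict(self), sort_keys=True)


result = ScoreResult(
    reward=0.5,
    sub_scores={"error_source": 1.0, "error_chain": 0.0},
    diagnostics=ScoreDiagnostics(task_time_seconds=12.0),
)
payload = json.loads(result.to_json())
```

Keeping the diagnostics in a nested record means harness-supplied values such as `task_time_seconds` can stay optional without changing the top-level shape.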

The original commit (32738b2) referenced follow-up beads 3sk/pyh/7en that lived in a different rig/store. Refile the same three follow-ups in the current rig and update the report to reference the real IDs:
- EnterpriseBench-046 (dep-traversal-007 stratum fix)
- EnterpriseBench-1is (F2 specific-path triage, 11 tasks)
- EnterpriseBench-zco (basename filter for generic build manifests)

Layer-2 strict-mode validation flagged difficulty_stratum 'multi_repo' (expects 3-5 repos) against the task's actual 2 repos (grpc-node, kubernetes-client-js), the only Layer-2 strict failure in the corpus. The task wires exactly two consumer repos in an 'investigate' pattern; the prompt's Google Cloud mention is narrative and no third repo is configured. dual_repo (2 repos) is the semantically correct label and matches actual state.

Acceptance:

    PYTHONPATH=lib python3 -m eb_verify.cli validate --qa-strict benchmarks/dependency_management/dep-traversal-007/task.toml
    -> VALID
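
The mismatch above comes down to a range check on the repo count. A minimal sketch, assuming only the two ranges quoted in this commit message (the `STRATUM_REPO_RANGE` table and `stratum_matches` function are hypothetical names, and the real validator may define more strata):

```python
# Assumed mapping: only these two ranges are stated in the commit message.
STRATUM_REPO_RANGE = {
    "dual_repo": (2, 2),
    "multi_repo": (3, 5),
}


def stratum_matches(stratum: str, repo_count: int) -> bool:
    """Return True when repo_count falls inside the stratum's declared range."""
    low, high = STRATUM_REPO_RANGE[stratum]
    return low <= repo_count <= high


# dep-traversal-007 wires two repos (grpc-node, kubernetes-client-js).
before_fix = stratum_matches("multi_repo", 2)
after_fix = stratum_matches("dual_repo", 2)
```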

F2 aux-file leakage fired false positives when instruction.md named a generic build-manifest basename (go.mod, package.json, pom.xml, go.sum, setup.py, ...) that matched an oracle file. These basenames are non-discriminative (every repo of a language has one), so naming them in the prompt does not leak which file the agent should investigate.

check_aux_file_leakage now skips F2 for tokens that are BARE generic manifest basenames; tokens with directory components (e.g. envoy/go.mod) stay specific and still raise F2. Clears all 10 corpus false positives; remaining corpus F2s are genuine specific-path/specific-file leaks.

This is an EB-only vendored patch (codeprobe upstream mirror tracked as dr-2vydrm.1); VENDOR.md records the intentional drift and the new leakage.py checksum.

Tests: bare-manifest token suppressed + specific-path token still flagged, at both the adapter and lib level.

Acceptance (EnterpriseBench-zco):
- Generic-manifest tokens no longer raise F2 ✓
- Specific-path tokens still raise F2 ✓
- Tests added for both cases ✓
- VENDOR.md drift note updated ✓
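
The bare-versus-specific distinction can be sketched as below. This is an illustration of the rule as described, not the vendored `check_aux_file_leakage`: the function names, the manifest set (limited to the five basenames the message lists), and the suffix-matching rule are all assumptions.

```python
# Illustrative subset: the commit message names these five and elides the rest.
GENERIC_MANIFESTS = {"go.mod", "go.sum", "package.json", "pom.xml", "setup.py"}


def is_bare_generic_manifest(token: str) -> bool:
    """True for a generic manifest basename with no directory component."""
    return "/" not in token and token in GENERIC_MANIFESTS


def f2_leaks(prompt_tokens: list[str], oracle_files: list[str]) -> list[str]:
    """Return prompt tokens that still count as F2 leakage candidates."""
    leaks = []
    for token in prompt_tokens:
        if is_bare_generic_manifest(token):
            continue  # non-discriminative: every repo of the language has one
        # Assumed matching rule: exact path or path-suffix match.
        if any(o == token or o.endswith("/" + token) for o in oracle_files):
            leaks.append(token)
    return leaks


leaks = f2_leaks(
    ["go.mod", "envoy/go.mod", "README.md"],
    ["envoy/go.mod", "src/pool.cc"],
)
```

Here the bare `go.mod` token is skipped while `envoy/go.mod` is still reported, which mirrors the two acceptance cases.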

Seven multi-repo tasks declared repos with zero required_files in task.toml [ground_truth], failing the Cross-Repo Necessity Test. Added the missing entries grounded in each task's ground_truth.json (curated producer/consumer files) or the dependency-manifest convention (go.mod / clientconn.go) for module-resolution and orchestration tasks.

CRNT multi-repo: 122 -> 129/129 pass. Soundness gate: 180/180.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
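
The failure condition described here (a declared repo that contributes no required file) can be sketched as a one-line filter. The data shape is an assumption: the actual layout of `[ground_truth]` in task.toml is not shown in this PR, and the repo and file names below are illustrative.

```python
def repos_missing_required_files(
    declared_repos: list[str], required_files: dict[str, list[str]]
) -> list[str]:
    """List declared repos that contribute no required file."""
    return [repo for repo in declared_repos if not required_files.get(repo)]


# Hypothetical task: one repo has an anchor file, the other has none yet.
missing = repos_missing_required_files(
    ["grpc-go", "etcd"],
    {"grpc-go": ["clientconn.go"], "etcd": []},
)
```

A task passes this part of the check only when the returned list is empty.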

- Remove references to the private codeprobe repo from vendored benchmark_qa_core docstrings, qa_adapter, scoring, and the QA run doc; reframe VENDOR.md as first-party code under the repo's Apache 2.0 LICENSE.
- Untrack .beads/ (internal issue tracker + orchestration formulas/hooks); it syncs via Dolt, not git. Add .beads/ to .gitignore.
- Fix stale 'unpublished' wording in AGENTS.md: CSB is public on Sourcegraph.

No secret leakage (gitleaks clean on tree + history). Apache 2.0 covers the first-party vendored code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Internal draft: EB's deterministic Tier-1 + harness as the cost-reducing input to Snorkel's expert gold-context work; public-by-design release model (held-out scored split + refresh) and contamination strategy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Investigated whether etcd (v3.3.10) genuinely consumes the grpc-go transport package moved to internal. etcd's own functional/tester/stresser_key.go imports google.golang.org/grpc/transport and uses transport.ErrConnClosing, so it breaks under the move: etcd is a real consumer. Replace the placeholder go.mod anchor with the verified file and record it in consumer_affected_files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The agent user could not read /workspace/instruction.md or its .mcp.json inside the container (EACCES), so the agent never started, yet the run recorded success=true, num_turns=0, mcp_calls=0, task_score=0.0. These fake 0s corrupted the MCP-vs-baseline comparison (50 no-op runs / 24 tasks in validity audit uu8z). (bead EnterpriseBench-s58f)

Three parts:

1. Container file permissions + fail-loud gate
   - New _chown_to_agent(): chowns only existing paths to agent:agent and LOGS failures instead of the old silent `2>/dev/null; true` mask. Drops /workspace/.mcp.json and /workspace/agent_output from the setup chown (created later by their own steps).
   - New _assert_agent_readable(): pre-agent gate that runs `test -r` as the agent user for instruction.md (+ the MCP configs in MCP modes). On failure the run is recorded status=INVALID, success=False, failure_class=infra_perms, never a fake 0.0.

2. Run-validity classification (classify_run_validity)
   - INVALID: num_turns==0, MCP-config/EACCES parse error, or an mcp_only run whose transport never handshaked (made 0 MCP calls). success=False.
   - FALLBACK: MCP-mode run with real turns but 0 MCP calls (legit fs fallback), preserved as a distinct, real scoreable run (success=True).
   - VALID: baseline, or any MCP run with >=1 MCP call.
   - _scan_mcp_config_error() detects the audited stderr signatures.
   - _configure_mcp() now returns the handshake result.

3. results.json / task_metrics.json schema
   - Added 'status', 'account', and 'mcp_handshake_ok' fields so runs are attributable and self-classifying.

Tests: new tests/test_run_validity_classification.py (26 cases) covers the decision matrix, the EACCES no-op -> INVALID reproduction, the stderr scanner, and the schema fields. Extended test_mcp_config.py to mock the new gate. Orchestration suite green (98 passed); no regression in the wider suite (347 pre-existing content failures unchanged).
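
The decision matrix in part 2 can be summarised as a small function. This is a sketch of one reading of the commit message, not the body of `classify_run_validity`: the `Run` record and its field names are hypothetical, any mode other than `baseline` is assumed to be an MCP mode, and `mcp_hybrid` in the usage is an invented mode name.

```python
from dataclasses import dataclass


@dataclass
class Run:
    mode: str                        # "baseline", "mcp_only", or another MCP mode
    num_turns: int
    mcp_calls: int
    mcp_handshake_ok: bool = True
    mcp_config_error: bool = False   # e.g. an EACCES signature found in stderr


def classify(run: Run) -> str:
    """Return INVALID, FALLBACK, or VALID following the split described above."""
    if run.num_turns == 0 or run.mcp_config_error:
        return "INVALID"      # the agent never really ran
    if run.mode == "baseline" or run.mcp_calls >= 1:
        return "VALID"
    if run.mode == "mcp_only" and not run.mcp_handshake_ok:
        return "INVALID"      # transport never handshaked, 0 MCP calls
    return "FALLBACK"         # real turns, 0 MCP calls: filesystem fallback


eacces_noop = classify(Run(mode="mcp_only", num_turns=0, mcp_calls=0))
baseline = classify(Run(mode="baseline", num_turns=5, mcp_calls=0))
fallback = classify(Run(mode="mcp_hybrid", num_turns=5, mcp_calls=0))
```

Checking `num_turns == 0` first is what turns the audited EACCES no-op into INVALID instead of a scored 0.0.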

Task containers (golang/java/etc. images) ship bash+grep+jq but not python3, so checks running `python3 -c` / `python3 -m eb_verify...` exited 127 and recorded false checkpoint failures, corrupting scores in every mode and re-hit by every backfill/remediation wave.

Reimplement the embedded python in bash+jq+grep with identical output JSON. Interpreter substitution only: no checkpoint-threshold or ground-truth changes.

- 175 tracked check scripts converted across all 7 suites
- Includes the previously-validated jackson (check_integration_path) and spring (check_parallelism) fixes
- `python3 -m eb_verify.plugins.file_extraction` invocations ported to jq, validated against the live plugin
- Every conversion validated python-vs-bash on synthesized inputs against the real ground_truth.json (jq -S byte-compare across pass/partial/fail + missing-file branches)
- Independently spot-checked on a 15-file cross-family sample (110+ scenarios): 0 divergences in graded fields (score/passed/detail)
- Preserves python quirks exactly: banker's-rounding (round-half-to-even), json.dumps float repr (1.0/0.0), literal substring matching for regex-looking terms, str-iterates-as-chars, dict/list/str coercion
- check_test_fails.sh left as-is: it runs `python3 -m pytest` in a python-language task whose image ships python3 by design (not a 127 case, and not bash-convertible)

93 further conversions exist in-tree for not-yet-committed (untracked) task dirs; they land with their parent dirs.

Bead: EnterpriseBench-lrm
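
For reference, the Python behaviours listed under "quirks" look like this on the Python side. The snippet is an illustration written for this note, not code from the converted check scripts, and it says nothing about how the bash+jq ports reproduce each behaviour.

```python
import json

# Banker's rounding: exact ties go to the nearest even integer.
rounded = [round(0.5), round(1.5), round(2.5)]

# json.dumps keeps the float repr, so 1.0 is not collapsed to 1.
encoded = json.dumps({"score": 1.0, "passed": True})

# Iterating a str yields single characters, not the whole string.
chars = list("ab")

# Regex-looking terms are matched as literal substrings with `in`.
literal_hit = "a.b" in "xa.by"
literal_miss = "a.b" in "xaXby"
```

A port that treats `a.b` as a regular expression would match the second string as well, which is the kind of divergence the byte-compare validation is meant to catch.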

…iation log

Wave-1 of the Stephanie-authorized 88-pin mirror remediation (bead EnterpriseBench-768k.2): first 12 tasks of the 51 with truly-missing mirrors, in deterministic truly-missing-table order.

- All 12 tasks validated CREATE-as-is (0 repins): every pin resolves upstream and matches its task gold; all 22 were unregistered.
- 22 unique sg-evals mirrors created (public + anon-cloneable); registered here in repo_versions.json (last_verified 2026-06-04). Canonical sort incidentally fixed pre-existing containerd/httpx ordering drift.
- Per-task decisions + verification in results/analysis/mirror_remediation_log_2026_06.md.
- Cutoff: last task #12 support-mapping-dual-openssl-curl-cipher-001; wave-2 resumes at #13 support-mapping-dual-django-wagtail-treepath-001.

…mediation log

Second wave of the Stephanie-authorized 88-pin mirror remediation (parent EnterpriseBench-768k, bead 768k.3 / workflow EnterpriseBench-4epp). Processes the 13th–24th distinct tasks by first-appearance order in the truly-missing table, resuming at support-mapping-dual-django-wagtail-treepath-001.

Outcome: 12/12 tasks → 10 CREATE-as-is, 2 REPIN. 25 unique sg-evals mirrors created (all public, anonymously cloneable, non-empty); 25 (url, pinned_rev) entries registered in configs/repo_versions.json (205→230, last_verified=2026-06-04).

REPINs (defensibility record: original pins do not resolve to any upstream tag/commit; both corrected to the same release):
- api-contract-dual-pydantic-fastapi-001: pydantic v2.0.0 -> v2.0 (pydantic 2.0.0 release is git-tagged v2.0; v2.0.0 -> HTTP 422)
- config-drift-tri-javax-jakarta-spring-001: hibernate-orm 6.0.0.Final -> 6.0.0 (Hibernate 6.0.0.Final release is git-tagged 6.0.0; 6.0.0.Final -> HTTP 422)

Both tasks' gold/checkpoints are version-generic, so neither repin alters ground truth.

Verification: visibility=PUBLIC 25/25, isEmpty=false 25/25, anon info/refs=200 25/25.

Not modified (out of scope / concurrent work): sg_indexing_list.json, mirror_creation_manifest.json (SG indexing is out-of-band).

Wave-3 resume: support-mapping-dual-httpx-httpcore-001 (27 tasks remaining).

Third incremental child of EnterpriseBench-768k (Stephanie-authorized 88-pin mirror remediation). Wave-3 = tasks #25–#36 of the 51 truly-missing tasks, resuming at support-mapping-dual-httpx-httpcore-001 per wave-2's cutoff.

Outcome: 12/12 tasks -> 12 CREATE-as-is, 0 REPIN. Every pin resolved upstream on its declared tag form (no .Final-style corrections needed this wave). 19 unique new mirrors created (public + anon-cloneable + non-empty); 0 needed the push-protection private-toggle workaround.

Slash tags handled (logging-log4j2 rel/2.20.0->rel_2.20.0, rel/2.23.1->rel_2.23.1); underscore tag slf4j v_2.0.13. Shared/skip-existing: jaeger--v1.51.0 (#30,#31), spring-boot--v3.1.0 (already done wave-2).

repo_versions.json: +19 (url, pinned_rev) entries, last_verified=2026-06-04, count 230->249, canonical case-sensitive sort preserved, 0 dup keys.

Verification: visibility=PUBLIC 19/19, isEmpty=false 19/19, anon info/refs=200 19/19.

Not modified (out of scope / concurrent work): sg_indexing_list.json, mirror_creation_manifest.json (SG indexing is out-of-band).

Running tally: 36/51 tasks, 66/88 mirrors (22 wave-1 + 25 wave-2 + 19 wave-3).

Wave-4 resume: incident-investigation-dual-nats-001 (15 tasks remaining).
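
The slash-tag handling can be sketched as a naming helper. This is inferred from the examples in the message (`jaeger--v1.51.0`, `rel/2.20.0->rel_2.20.0`) and is an assumption: the `mirror_name` function is hypothetical, and the combined `logging-log4j2--rel_2.20.0` form does not appear in the source.

```python
def mirror_name(repo: str, tag: str) -> str:
    """Build a '<repo>--<tag>' name, replacing '/' in the tag with '_'."""
    return f"{repo}--{tag.replace('/', '_')}"


plain = mirror_name("jaeger", "v1.51.0")
slashed = mirror_name("logging-log4j2", "rel/2.20.0")
```

Tags that already contain an underscore, such as slf4j's `v_2.0.13`, pass through unchanged under this rule.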

…tasks) + BATCH-COMPLETE

Closes the Stephanie-authorized 88-pin truly-missing mirror remediation (parent EnterpriseBench-768k, bead EnterpriseBench-768k.5 / convoy dc546).

Wave-4 = tasks #37-#51 (resume incident-investigation-dual-nats-001): 15/15 tasks CREATE-as-is, 0 REPIN. 22 new sg-evals mirrors created (public + anon-cloneable + non-empty), registered in repo_versions.json (249->271). 3 used the push-protection private-toggle workaround (next.js v14.1.4, nomad v1.9.3, vault v1.18.1), all restored public.

Final batch tally: 51/51 tasks, 88/88 mirrors (22+25+19+22 across waves 1-4). The 5 dropped false-positives never created. Branch-local; no push/PR.

sjarmak and others added 16 commits April 29, 2026 10:55

sjarmak wants to merge 16 commits into main from fix/eb-lrm-python3-verifier-sweep
