EB: python3 verifier sweep (6 suites) + 88/88 sg-evals mirror batch + catch-up since 0rv.8 #1
Open
sjarmak wants to merge 16 commits into
…task, 0rv.azt)
New customer-escalation support_code_mapping task spanning encode/httpx 0.27.0 and encode/httpcore 1.0.5. Customer scenario: httpx.ReadTimeout fires despite timeout=None. The answer requires tracing the timeout marshaling chain across the httpx public Timeout config, the httpcore.ConnectionPool / AsyncHTTP11Connection per-op enforcement, and the HTTPCORE_EXC_MAP exception remap.
3 distinct-dimension checks (mirrors spring-kafka template):
- error_source: file discovery via eb_verify file_extraction plugin
- error_chain: keyword-graded chain traversal across both repos
- trigger_conditions: keyword-graded named-condition recall
Validated: schema PASS, CRNT PASS (2 repos × required_files), expected_solution schema PASS. Smoke-tested with passing/failing/partial fixtures: checks score 1.0/1.0/1.0 vs 0.0/0.0/0.0, and the partial fixture (files-only) yields 1.0/0.0/0.0, proving the 3 dimensions are independent.
Note: this task lands AC#1-4 of bead 0rv.azt. AC#5 (Go% <40% on multi-repo subset) is unreachable in 1 task: Go% drops only 46.2% -> 45.8% because 11 new dual-repo Go tasks landed after the 0rv.25 measurement (40.7%). Mayor mailed with shortlist of follow-ups.
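The independence of the three dimensions can be illustrated with a minimal sketch. This is not the eb_verify implementation: the file list, keyword lists, report shape (`files`, `text`), and function names are all assumptions made for illustration, and the task's real gold data lives in its ground truth.

```python
from typing import Iterable

# Illustrative placeholders only; not the task's actual gold lists.
EXPECTED_FILES = ["httpx/_config.py", "httpcore/_async/http11.py"]
CHAIN_TERMS = ["Timeout", "ConnectionPool", "AsyncHTTP11Connection", "HTTPCORE_EXC_MAP"]
TRIGGER_TERMS = ["timeout=None", "ReadTimeout"]


def fraction_found(haystack: str, needles: Iterable[str]) -> float:
    """Share of needles that occur as literal, case-insensitive substrings."""
    terms = [n.lower() for n in needles]
    if not terms:
        return 0.0
    text = haystack.lower()
    return sum(t in text for t in terms) / len(terms)


def grade(report: dict) -> dict:
    """Score each dimension from its own input so one cannot mask another."""
    files = set(report.get("files", []))
    found = sum(f in files for f in EXPECTED_FILES) / len(EXPECTED_FILES)
    narrative = report.get("text", "")
    return {
        "error_source": found,
        "error_chain": fraction_found(narrative, CHAIN_TERMS),
        "trigger_conditions": fraction_found(narrative, TRIGGER_TERMS),
    }


# A files-only fixture: correct file discovery, no narrative at all.
files_only = {"files": list(EXPECTED_FILES), "text": ""}
print(grade(files_only))
```

Under these assumptions the files-only fixture scores full marks on `error_source` and zero on the two keyword-graded dimensions, which is the 1.0/0.0/0.0 pattern described above.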
Adds expected_solution.json for 11 tracked tasks under benchmarks/incident_response/
that previously lacked one. With this change the full suite (26 tasks)
validates with --check-paths against scripts/validation/validate_expected_solutions.py.
Coverage:
- 2 incident-investigation-002, -003 (standard checkpoints, scaffolded from ground_truth)
- 4 incident-investigation-dual-{flux,istio,kafka,prometheus}-001 (standard checkpoints
with hand-curated root_cause / error_chain / remediation from expected_answer)
- 1 incident-investigation-tri-containerd-001 (standard checkpoints)
- 1 incident-investigation-quad-containerd-001 (custom checkpoints — kubelet seccomp
conversion through containerd CRI, runc BPF filter, moby default profile)
- 1 ansible-galaxy-tar-regression-prove-001 (custom regression-test write-and-fail
checkpoints)
- 1 ccx-incident-032 (custom source_files/error_chain/config checkpoints for Envoy
connection-pool overflow)
- 1 event-replay-click-ci-001 (custom triage/communication/remediation checkpoints
for the markupsafe soft_str removal cascade)
Each checkpoint provides:
- expected_solution: prose synthesizing the canonical answer from
ground_truth.json (root_cause, error_chain, affected_services, remediation)
and the checkpoint's verifier semantics.
- evaluation_criteria: 3-5 concrete signals naming specific files/functions
that an agent's incident report must reach. All file-path-like fragments
resolve at the pinned SHA via the GitHub Contents API (--check-paths green).
Closes EnterpriseBench-auu.
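As a rough illustration of the per-checkpoint shape described above, the sketch below checks that a checkpoint carries non-empty `expected_solution` prose and 3-5 `evaluation_criteria`. The dict layout and the function name are assumptions; the real rules are in scripts/validation/validate_expected_solutions.py and may differ.

```python
def checkpoint_problems(checkpoint: dict) -> list:
    """Return human-readable problems for one checkpoint (empty list = OK)."""
    problems = []
    solution = checkpoint.get("expected_solution", "")
    if not isinstance(solution, str) or not solution.strip():
        problems.append("expected_solution must be non-empty prose")
    criteria = checkpoint.get("evaluation_criteria", [])
    if not isinstance(criteria, list) or not 3 <= len(criteria) <= 5:
        problems.append("evaluation_criteria must list 3-5 signals")
    elif any(not str(c).strip() for c in criteria):
        problems.append("evaluation_criteria contains an empty signal")
    return problems


# Hypothetical checkpoint contents, for illustration only.
good = {
    "expected_solution": "Root cause, error chain, and remediation in prose.",
    "evaluation_criteria": ["names file A", "names function B", "states remediation C"],
}
bad = {"expected_solution": "  ", "evaluation_criteria": ["only one signal"]}

print(checkpoint_problems(good))
print(checkpoint_problems(bad))
```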
Implements EnterpriseBench-1av (proxy of dr-2vydrm.3): the EB rig adapter
for the codeprobe-shared benchmark_qa_core library, plus the unified
cross-benchmark ScoreResult contract.
Changes:
* `lib/eb_verify/_vendor/benchmark_qa_core/` — file-vendored from codeprobe
SHA 047df83 (`feature/codeprobe-ceuu-benchmark-qa-core`). Imports
rewritten to relative form; sha256s recorded in VENDOR.md.
* `lib/eb_verify/qa_adapter.py` — bridges EB's task.toml + ground_truth.json
+ expected_solution.json into the lib's flat input shapes. Synthesises
scoring_method from `verification_modes`, buckets oracle files by their
declared repo, and reads instruction.md as the agent-visible aux file.
* `lib/eb_verify/schema_validator.py` — adds Layer 3 `_run_qa_layer` and
`validate_task(qa_strict=…, workspace_root=…)`. Default is warn-only;
strict mode promotes lib `error`-severity findings to validation errors.
* `lib/eb_verify/scoring.py` — adds `ScoreResult`, `ScoreDiagnostics`,
`VerificationResult.to_score_result()`, and `write_score_result()` for
the cross-benchmark JSON contract (reward / scorer_family=checklist /
sub_scores / diagnostics{task_time_seconds, token_cost_usd, ir_metrics,
artifact_results}).
* `lib/eb_verify/runner.py` — `run_all` now plumbs ScoreResult emission
alongside reward.txt and accepts harness-supplied `task_time_seconds` /
`token_cost_usd`.
* `lib/eb_verify/cli.py` — `validate` subcommand learns `--qa-strict`
and `--workspace`.
* `tests/test_qa_adapter.py` (17) and `tests/test_score_result.py` (17) —
new coverage; existing 80 tests in test_scoring/test_schema_validator/
test_runner/test_cli stay green.
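To make the cross-benchmark JSON contract concrete, here is a minimal sketch of the field layout named above (reward, scorer_family, sub_scores, diagnostics). The field types, defaults, and the `to_json` helper are assumptions; the actual `ScoreResult` and `ScoreDiagnostics` in `lib/eb_verify/scoring.py` may be shaped differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ScoreDiagnostics:
    # Harness-supplied values; None when the harness did not report them.
    task_time_seconds: Optional[float] = None
    token_cost_usd: Optional[float] = None
    ir_metrics: Dict[str, Any] = field(default_factory=dict)
    artifact_results: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class ScoreResult:
    reward: float
    sub_scores: Dict[str, float]
    scorer_family: str = "checklist"
    diagnostics: ScoreDiagnostics = field(default_factory=ScoreDiagnostics)

    def to_json(self) -> str:
        """Serialize with sorted keys so output is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical scores for illustration.
result = ScoreResult(
    reward=0.5,
    sub_scores={"error_source": 1.0, "error_chain": 0.0},
    diagnostics=ScoreDiagnostics(task_time_seconds=12.5),
)
print(result.to_json())
```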
Corpus run (180 active tasks, strict mode):
* 179 valid · 1 Layer-2 invalid (dep-traversal-007 stratum mismatch) ·
0 Layer-3 errors
* 358 warnings — 330 EB_A0 (expected; repos not cloned outside sandbox)
and 28 F2 leakage candidates (22 distinct tasks).
Follow-ups filed:
* EnterpriseBench-3sk — fix dep-traversal-007 stratum/repo-count.
* EnterpriseBench-pyh — triage 11 specific-path F2 leakage warnings.
* EnterpriseBench-7en — add basename filter for generic build manifests
(go.mod, package.json, pom.xml, …) so they don't trigger F2.
Report: docs/qa/eb_validator_run_2026_04_30.md
Test verification:
PYTHONPATH=lib python3 -m pytest \
tests/test_scoring.py tests/test_schema_validator.py \
tests/test_score_result.py tests/test_qa_adapter.py \
tests/test_runner.py tests/test_cli.py
114 passed in 0.27s
Local-only branch per bead constraints (no push to public sjarmak/EnterpriseBench).
The original commit (32738b2) referenced follow-up beads 3sk/pyh/7en that lived in a different rig/store. Refile the same three follow-ups in the current rig and update the report to reference the real IDs:
- EnterpriseBench-046 (dep-traversal-007 stratum fix)
- EnterpriseBench-1is (F2 specific-path triage, 11 tasks)
- EnterpriseBench-zco (basename filter for generic build manifests)
Layer-2 strict-mode validation flagged difficulty_stratum 'multi_repo' (expects 3-5 repos) against the task's actual 2 repos (grpc-node, kubernetes-client-js), the only Layer-2 strict failure in the corpus. The task wires exactly two consumer repos in an 'investigate' pattern; the prompt's Google Cloud mention is narrative and no third repo is configured. dual_repo (2 repos) is the semantically correct label and matches actual state.
Acceptance:
PYTHONPATH=lib python3 -m eb_verify.cli validate --qa-strict benchmarks/dependency_management/dep-traversal-007/task.toml
-> VALID
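The stratum/repo-count rule behind this fix can be sketched as follows. Only the two ranges stated above (dual_repo = 2 repos, multi_repo = 3-5 repos) come from the commit message; the table name, function name, and handling of unknown strata are assumptions.

```python
from typing import List, Optional

# Ranges taken from the message above; any other strata are not modeled here.
STRATUM_REPO_RANGE = {"dual_repo": (2, 2), "multi_repo": (3, 5)}


def stratum_mismatch(stratum: str, repos: List[str]) -> Optional[str]:
    """Return an error string when the repo count falls outside the stratum's range."""
    if stratum not in STRATUM_REPO_RANGE:
        return None  # unknown stratum: out of scope for this sketch
    low, high = STRATUM_REPO_RANGE[stratum]
    if low <= len(repos) <= high:
        return None
    return f"{stratum} expects {low}-{high} repos, task declares {len(repos)}"


repos = ["grpc-node", "kubernetes-client-js"]
print(stratum_mismatch("multi_repo", repos))  # mismatch message
print(stratum_mismatch("dual_repo", repos))   # None
```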
F2 aux-file leakage fired false positives when instruction.md named a generic build-manifest basename (go.mod, package.json, pom.xml, go.sum, setup.py, ...) that matched an oracle file. These basenames are non-discriminative (every repo of a language has one), so naming them in the prompt does not leak which file the agent should investigate.
check_aux_file_leakage now skips F2 for tokens that are BARE generic manifest basenames; tokens with directory components (e.g. envoy/go.mod) stay specific and still raise F2. Clears all 10 corpus false positives; remaining corpus F2s are genuine specific-path/specific-file leaks.
This is an EB-only vendored patch (codeprobe upstream mirror tracked as dr-2vydrm.1); VENDOR.md records the intentional drift and the new leakage.py checksum.
Tests: bare-manifest token suppressed + specific-path token still flagged, at both the adapter and lib level.
Acceptance (EnterpriseBench-zco):
- Generic-manifest tokens no longer raise F2 ✓
- Specific-path tokens still raise F2 ✓
- Tests added for both cases ✓
- VENDOR.md drift note updated ✓
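A minimal sketch of the bare-basename rule follows. It is not the vendored `check_aux_file_leakage`: the manifest set contains only the names listed above, and the matching logic (a token leaks if it equals an oracle path or is a path suffix of one) is an assumption made for illustration.

```python
from typing import Iterable, List

# Only the basenames named in the message above; the real list may be longer.
GENERIC_MANIFESTS = frozenset({"go.mod", "package.json", "pom.xml", "go.sum", "setup.py"})


def is_bare_generic_manifest(token: str) -> bool:
    """True for a generic manifest name with no directory component."""
    return "/" not in token and token in GENERIC_MANIFESTS


def leaked_tokens(prompt_tokens: Iterable[str], oracle_files: List[str]) -> List[str]:
    """Tokens from the prompt that point at an oracle file (assumed F2 matching)."""
    leaks = []
    for token in prompt_tokens:
        if is_bare_generic_manifest(token):
            continue  # non-discriminative: every repo of the language has one
        if any(path == token or path.endswith("/" + token) for path in oracle_files):
            leaks.append(token)
    return leaks


# Hypothetical oracle files and prompt tokens.
oracle = ["envoy/go.mod", "src/pool.cc"]
print(leaked_tokens(["go.mod", "envoy/go.mod", "README.md"], oracle))
```

Here the bare `go.mod` token is skipped even though it would otherwise match `envoy/go.mod`, while the directory-qualified `envoy/go.mod` is still reported.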
Seven multi-repo tasks declared repos with zero required_files in task.toml [ground_truth], failing the Cross-Repo Necessity Test. Added the missing entries grounded in each task's ground_truth.json (curated producer/consumer files) or the dependency-manifest convention (go.mod / clientconn.go) for module-resolution and orchestration tasks.
CRNT multi-repo: 122 -> 129/129 pass. Soundness gate: 180/180.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
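The failure mode fixed here (a declared repo with no required_files) can be sketched as a simple check. The `repos` mapping below is a hypothetical stand-in for the task.toml [ground_truth] section, whose real layout is not shown in this message, and the full Cross-Repo Necessity Test may check more than this.

```python
from typing import Dict, List


def repos_missing_required_files(ground_truth: Dict) -> List[str]:
    """Names of declared repos that list no required_files."""
    repos = ground_truth.get("repos", {})
    return sorted(name for name, spec in repos.items() if not spec.get("required_files"))


def passes_necessity_sketch(ground_truth: Dict) -> bool:
    """Assumed rule: at least two repos, and every repo anchors at least one file."""
    repos = ground_truth.get("repos", {})
    return len(repos) >= 2 and not repos_missing_required_files(ground_truth)


# Hypothetical task data: repo names and files are placeholders.
before = {"repos": {"producer": {"required_files": ["clientconn.go"]},
                    "consumer": {"required_files": []}}}
after = {"repos": {"producer": {"required_files": ["clientconn.go"]},
                   "consumer": {"required_files": ["go.mod"]}}}

print(repos_missing_required_files(before), passes_necessity_sketch(before))
print(repos_missing_required_files(after), passes_necessity_sketch(after))
```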
- Remove references to the private codeprobe repo from vendored benchmark_qa_core docstrings, qa_adapter, scoring, and the QA run doc; reframe VENDOR.md as first-party code under the repo's Apache 2.0 LICENSE.
- Untrack .beads/ (internal issue tracker + orchestration formulas/hooks); it syncs via Dolt, not git. Add .beads/ to .gitignore.
- Fix stale 'unpublished' wording in AGENTS.md: CSB is public on Sourcegraph.
No secret leakage (gitleaks clean on tree + history). Apache 2.0 covers the first-party vendored code.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Internal draft: EB's deterministic Tier-1 + harness as the cost-reducing input to Snorkel's expert gold-context work; public-by-design release model (held-out scored split + refresh) and contamination strategy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Investigated whether etcd (v3.3.10) genuinely consumes the grpc-go transport package moved to internal. etcd's own functional/tester/stresser_key.go imports google.golang.org/grpc/transport and uses transport.ErrConnClosing, so it breaks under the move: etcd is a real consumer. Replace the placeholder go.mod anchor with the verified file and record it in consumer_affected_files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent user could not read /workspace/instruction.md or its .mcp.json
inside the container (EACCES), so the agent never started — yet the run
recorded success=true, num_turns=0, mcp_calls=0, task_score=0.0. These
fake 0s corrupted the MCP-vs-baseline comparison (50 no-op runs / 24 tasks
in validity audit uu8z). (bead EnterpriseBench-s58f)
Three parts:
1. Container file permissions + fail-loud gate
- New _chown_to_agent(): chowns only existing paths to agent:agent and
LOGS failures instead of the old silent `2>/dev/null; true` mask. Drops
/workspace/.mcp.json and /workspace/agent_output from the setup chown
(created later by their own steps).
- New _assert_agent_readable(): pre-agent gate that runs `test -r` as the
agent user for instruction.md (+ the MCP configs in MCP modes). On
failure the run is recorded status=INVALID, success=False,
failure_class=infra_perms — never a fake 0.0.
2. Run-validity classification (classify_run_validity)
- INVALID: num_turns==0, MCP-config/EACCES parse error, or an mcp_only
run whose transport never handshaked (made 0 MCP calls). success=False.
- FALLBACK: MCP-mode run with real turns but 0 MCP calls (legit fs
fallback) — preserved as a distinct, real scoreable run (success=True).
- VALID: baseline, or any MCP run with >=1 MCP call.
- _scan_mcp_config_error() detects the audited stderr signatures.
- _configure_mcp() now returns the handshake result.
3. results.json / task_metrics.json schema
- Added 'status', 'account', and 'mcp_handshake_ok' fields so runs are
attributable and self-classifying.
Tests: new tests/test_run_validity_classification.py (26 cases) covers the
decision matrix, the EACCES no-op -> INVALID reproduction, the stderr
scanner, and the schema fields. Extended test_mcp_config.py to mock the new
gate. Orchestration suite green (98 passed); no regression in the wider
suite (347 pre-existing content failures unchanged).
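The decision matrix in part 2 can be sketched roughly as below. This is one reading of the rules above, not the committed classify_run_validity: the parameter names, the `mcp_hybrid` mode name, and the choice to treat an mcp_only run as INVALID only when its handshake failed are assumptions.

```python
from enum import Enum


class RunStatus(str, Enum):
    VALID = "VALID"
    FALLBACK = "FALLBACK"
    INVALID = "INVALID"


def classify_sketch(mode: str, num_turns: int, mcp_calls: int,
                    config_error: bool = False, handshake_ok: bool = True) -> RunStatus:
    """Classify one run; any mode other than 'baseline' is treated as an MCP mode."""
    if num_turns == 0 or config_error:
        return RunStatus.INVALID  # agent never started, or MCP config/EACCES error
    if mode == "baseline" or mcp_calls >= 1:
        return RunStatus.VALID
    if mode == "mcp_only" and not handshake_ok:
        return RunStatus.INVALID  # transport never handshaked, so 0 MCP calls
    return RunStatus.FALLBACK     # real turns, 0 MCP calls: filesystem fallback


def to_record(status: RunStatus) -> dict:
    """Only INVALID runs are recorded as unsuccessful."""
    return {"status": status.value, "success": status is not RunStatus.INVALID}


# The EACCES no-op case: zero turns must never be scored as a real 0.0.
print(to_record(classify_sketch("mcp_only", num_turns=0, mcp_calls=0)))
print(to_record(classify_sketch("mcp_hybrid", num_turns=5, mcp_calls=0)))
```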
Task containers (golang/java/etc. images) ship bash+grep+jq but not python3, so checks running `python3 -c` / `python3 -m eb_verify...` exited 127 and recorded false checkpoint failures, corrupting scores in every mode and re-hit by every backfill/remediation wave. Reimplement the embedded python in bash+jq+grep with identical output JSON. Interpreter substitution only: no checkpoint-threshold or ground-truth changes.
- 175 tracked check scripts converted across all 7 suites
- Includes the previously-validated jackson (check_integration_path) and spring (check_parallelism) fixes
- `python3 -m eb_verify.plugins.file_extraction` invocations ported to jq, validated against the live plugin
- Every conversion validated python-vs-bash on synthesized inputs against the real ground_truth.json (jq -S byte-compare across pass/partial/fail + missing-file branches)
- Independently spot-checked on a 15-file cross-family sample (110+ scenarios): 0 divergences in graded fields (score/passed/detail)
- Preserves python quirks exactly: banker's-rounding (round-half-to-even), json.dumps float repr (1.0/0.0), literal substring matching for regex-looking terms, str-iterates-as-chars, dict/list/str coercion
- check_test_fails.sh left as-is: it runs `python3 -m pytest` in a python-language task whose image ships python3 by design (not a 127 case, and not bash-convertible)
93 further conversions exist in-tree for not-yet-committed (untracked) task dirs; they land with their parent dirs.
Bead: EnterpriseBench-lrm
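The Python behaviors that the bash+jq ports have to reproduce can be seen in a few lines of standard-library code. The snippet only demonstrates the language quirks named above; the variable names are illustrative and nothing here is taken from the converted check scripts.

```python
import json
from decimal import ROUND_HALF_UP, Decimal

ties = (0.5, 1.5, 2.5)

# Built-in round() resolves exact ties to the nearest even integer.
half_even = [round(x) for x in ties]

# For contrast, explicit round-half-up, which a naive port might implement.
half_up = [int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP)) for x in ties]

# json.dumps keeps the trailing ".0" on floats and lowercases booleans.
score_json = json.dumps({"score": 1.0, "passed": True})

# The `in` operator is a literal substring test: "." is not a regex wildcard.
literal_hit = "a.b" in "axb"

# Iterating a str yields characters, so a str passed where a list is expected
# is walked character by character.
chars = list("ab")

print(half_even, half_up)
print(score_json)
print(literal_hit, chars)
```

The contrast between `half_even` and `half_up` is the case most likely to produce a one-off score difference in a port, since the two only diverge on exact ties.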
…iation log
Wave-1 of the Stephanie-authorized 88-pin mirror remediation (bead EnterpriseBench-768k.2): first 12 tasks of the 51 with truly-missing mirrors, in deterministic truly-missing-table order.
- All 12 tasks validated CREATE-as-is (0 repins): every pin resolves upstream and matches its task gold; all 22 were unregistered.
- 22 unique sg-evals mirrors created (public + anon-cloneable); registered here in repo_versions.json (last_verified 2026-06-04). Canonical sort incidentally fixed pre-existing containerd/httpx ordering drift.
- Per-task decisions + verification in results/analysis/mirror_remediation_log_2026_06.md.
- Cutoff: last task #12 support-mapping-dual-openssl-curl-cipher-001; wave-2 resumes at #13 support-mapping-dual-django-wagtail-treepath-001.
…mediation log
Second wave of the Stephanie-authorized 88-pin mirror remediation (parent EnterpriseBench-768k, bead 768k.3 / workflow EnterpriseBench-4epp). Processes the 13th–24th distinct tasks by first-appearance order in the truly-missing table, resuming at support-mapping-dual-django-wagtail-treepath-001.
Outcome: 12/12 tasks → 10 CREATE-as-is, 2 REPIN. 25 unique sg-evals mirrors created (all public, anonymously cloneable, non-empty); 25 (url, pinned_rev) entries registered in configs/repo_versions.json (205→230, last_verified=2026-06-04).
REPINs (defensibility record: original pins do not resolve to any upstream tag/commit; both corrected to the same release):
- api-contract-dual-pydantic-fastapi-001: pydantic v2.0.0 -> v2.0 (pydantic 2.0.0 release is git-tagged v2.0; v2.0.0 -> HTTP 422)
- config-drift-tri-javax-jakarta-spring-001: hibernate-orm 6.0.0.Final -> 6.0.0 (Hibernate 6.0.0.Final release is git-tagged 6.0.0; 6.0.0.Final -> HTTP 422)
Both tasks' gold/checkpoints are version-generic, so neither repin alters ground truth.
Verification: visibility=PUBLIC 25/25, isEmpty=false 25/25, anon info/refs=200 25/25.
Not modified (out of scope / concurrent work): sg_indexing_list.json, mirror_creation_manifest.json (SG indexing is out-of-band).
Wave-3 resume: support-mapping-dual-httpx-httpcore-001 (27 tasks remaining).
Third incremental child of EnterpriseBench-768k (Stephanie-authorized 88-pin mirror remediation). Wave-3 = tasks #25–#36 of the 51 truly-missing tasks, resuming at support-mapping-dual-httpx-httpcore-001 per wave-2's cutoff.
Outcome: 12/12 tasks -> 12 CREATE-as-is, 0 REPIN. Every pin resolved upstream on its declared tag form (no .Final-style corrections needed this wave). 19 unique new mirrors created (public + anon-cloneable + non-empty); 0 needed the push-protection private-toggle workaround.
Slash tags handled (logging-log4j2 rel/2.20.0->rel_2.20.0, rel/2.23.1->rel_2.23.1); underscore tag slf4j v_2.0.13. Shared/skip-existing: jaeger--v1.51.0 (#30,#31), spring-boot--v3.1.0 (already done wave-2).
repo_versions.json: +19 (url, pinned_rev) entries, last_verified=2026-06-04, count 230->249, canonical case-sensitive sort preserved, 0 dup keys.
Verification: visibility=PUBLIC 19/19, isEmpty=false 19/19, anon info/refs=200 19/19.
Not modified (out of scope / concurrent work): sg_indexing_list.json, mirror_creation_manifest.json (SG indexing is out-of-band).
Running tally: 36/51 tasks, 66/88 mirrors (22 wave-1 + 25 wave-2 + 19 wave-3).
Wave-4 resume: incident-investigation-dual-nats-001 (15 tasks remaining).
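The slash-tag handling and the registry invariants (case-sensitive sort, no duplicate keys) can be sketched as below. The `repo--tag` pattern is inferred from the mirror names quoted above, and treating `(url, pinned_rev)` as the dedupe key is an assumption; the URLs are placeholders, not the real registry entries.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, str]  # (url, pinned_rev)


def mirror_name(repo: str, tag: str) -> str:
    """Build a mirror name, replacing '/' in tags since it cannot appear in a repo name."""
    return f"{repo}--{tag.replace('/', '_')}"


def register(existing: Iterable[Entry], new: Iterable[Entry]) -> List[Entry]:
    """Merge entries, drop duplicates, and return them in case-sensitive order."""
    merged = set(existing) | set(new)
    # Plain tuple/str comparison is code-point order, so uppercase sorts first.
    return sorted(merged)


print(mirror_name("logging-log4j2", "rel/2.20.0"))
print(mirror_name("jaeger", "v1.51.0"))

# Placeholder URLs for illustration only.
registry = register(
    [("example.invalid/alpha", "v1"), ("example.invalid/Zeta", "v2")],
    [("example.invalid/alpha", "v1"), ("example.invalid/beta", "v3")],
)
print(registry)
```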
…tasks) + BATCH-COMPLETE
Closes the Stephanie-authorized 88-pin truly-missing mirror remediation (parent EnterpriseBench-768k, bead EnterpriseBench-768k.5 / convoy dc546).
Wave-4 = tasks #37-#51 (resume incident-investigation-dual-nats-001): 15/15 tasks CREATE-as-is, 0 REPIN. 22 new sg-evals mirrors created (public + anon-cloneable + non-empty), registered in repo_versions.json (249->271). 3 used the push-protection private-toggle workaround (next.js v14.1.4, nomad v1.9.3, vault v1.18.1), all restored public.
Final batch tally: 51/51 tasks, 88/88 mirrors (22+25+19+22 across waves 1-4). The 5 dropped false-positives never created.
Branch-local; no push/PR.
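The batch totals can be cross-checked from the per-wave figures reported in the four wave messages. The numbers below are copied from those messages; the data layout is illustrative, and the registry count before wave 1 is not stated anywhere, so the running count starts at the 205 reported at the start of wave 2.

```python
from itertools import accumulate

# (tasks processed, mirrors created) per wave, as reported in the wave messages.
WAVES = {1: (12, 22), 2: (12, 25), 3: (12, 19), 4: (15, 22)}

total_tasks = sum(tasks for tasks, _ in WAVES.values())
total_mirrors = sum(mirrors for _, mirrors in WAVES.values())

# repo_versions.json entry count, starting from the wave-2 baseline of 205.
registry_counts = list(accumulate((WAVES[w][1] for w in (2, 3, 4)), initial=205))

print(total_tasks, total_mirrors)
print(registry_counts)
```

The sums reproduce the reported 51 tasks and 88 mirrors, and the running count matches the 205→230→249→271 progression.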
Brings `main` current from the 2026-04-29 snapshot (8a78319, phase-8 paper draft). 16 commits, ~149 files.
Headline: python3 verifier sweep (9d5abab)
Converts python3-based checkpoint verifiers to bash+jq+grep across 6 of 7 suites. Task containers ship no python3, so 267 python3-invoking checks were false-failing exit-127 in every mode (masking 37 tasks at 0.0). Validated: 0 graded divergences on 110+ scenarios.
88/88 sg-evals mirror batch (a3a41fc..18a5a58)
Wave 1-4 registration of the Stephanie-authorized 88-pin mirror set (BATCH-COMPLETE), with remediation logs.
Accumulated EB work (catch-up)
- 32738b2 vendor benchmark_qa_core + unified ScoreResult
- 03e866d fail-loud on container EACCES no-op runs
- 9b37679 cross-repo gold context for 7 CRNT-failing tasks
- 28e4564 curate expected_solution.json (incident_response)
- 2809c80 add support-mapping-dual-httpx-httpcore task
- 76f5469 F2 leakage-check: skip generic build-manifest basenames
- 916c8ab dep-traversal-007 difficulty fix
- ac55823 etcd consumer fix (api-contract-003)
- a66e8ab scrub internal refs + untrack .beads for public release
- 534fbb1 Snorkel collaboration / division-of-labor proposal (docs)
- b2d2e48 align validator report bead IDs
Approved for publish by repo owner (full-scope catch-up).