EB: python3 verifier sweep (6 suites) + 88/88 sg-evals mirror batch + catch-up since 0rv.8#1

Open
sjarmak wants to merge 16 commits into main from fix/eb-lrm-python3-verifier-sweep

sjarmak (Owner) commented Jun 4, 2026

Brings main current from the 2026-04-29 snapshot (8a78319, phase-8 paper draft). 16 commits, ~149 files.

Headline — python3 verifier sweep (9d5abab)
Converts python3-based checkpoint verifiers to bash+jq+grep across 6 of 7 suites. Task containers ship no python3, so 267 python3-invoking checks exited 127 and were recorded as failures in every mode (masking 37 tasks at 0.0). Validated: 0 divergences in graded fields across 110+ scenarios.

88/88 sg-evals mirror batch (a3a41fc..18a5a58)
Waves 1-4 register the Stephanie-authorized 88-pin mirror set (BATCH-COMPLETE), with remediation logs.

Accumulated EB work (catch-up)

  • 32738b2 vendor benchmark_qa_core + unified ScoreResult
  • 03e866d fail-loud on container EACCES no-op runs
  • 9b37679 cross-repo gold context for 7 CRNT-failing tasks
  • 28e4564 curate expected_solution.json (incident_response)
  • 2809c80 add support-mapping-dual-httpx-httpcore task
  • 76f5469 F2 leakage-check: skip generic build-manifest basenames
  • 916c8ab dep-traversal-007 difficulty fix
  • ac55823 etcd consumer fix (api-contract-003)
  • a66e8ab scrub internal refs + untrack .beads for public release
  • 534fbb1 Snorkel collaboration / division-of-labor proposal (docs)
  • b2d2e48 align validator report bead IDs

Approved for publish by repo owner (full-scope catch-up).

sjarmak and others added 16 commits April 29, 2026 10:55
…task, 0rv.azt)

New customer-escalation support_code_mapping task spanning encode/httpx 0.27.0
and encode/httpcore 1.0.5. Customer scenario: httpx.ReadTimeout fires despite
timeout=None — answer requires tracing the timeout marshaling chain across the
httpx public Timeout config, the httpcore.ConnectionPool / AsyncHTTP11Connection
per-op enforcement, and the HTTPCORE_EXC_MAP exception remap.

3 distinct-dimension checks (mirrors spring-kafka template):
- error_source: file discovery via eb_verify file_extraction plugin
- error_chain: keyword-graded chain traversal across both repos
- trigger_conditions: keyword-graded named-condition recall

Validated: schema PASS, CRNT PASS (2 repos × required_files), expected_solution
schema PASS. Smoke-tested with passing/failing/partial fixtures — checks score
1.0/1.0/1.0 vs 0.0/0.0/0.0 and the partial fixture (files-only) yields
1.0/0.0/0.0, proving the 3 dimensions are independent.
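
The independence of the three dimensions can be illustrated with a small sketch. It assumes a hypothetical keyword grader (`keyword_score`, `grade`) and an illustrative rubric; the real error_source check uses the eb_verify file_extraction plugin rather than keywords, and the task's gold terms are not reproduced here.

```python
from typing import Dict, List


def keyword_score(answer: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def grade(report: Dict[str, str], rubric: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each dimension only from its own section of the report."""
    return {dim: keyword_score(report.get(dim, ""), kws) for dim, kws in rubric.items()}


# Hypothetical rubric: the keywords are illustrative, not the task's gold terms.
RUBRIC = {
    "error_source": ["httpx/_config.py"],
    "error_chain": ["ConnectionPool", "HTTPCORE_EXC_MAP"],
    "trigger_conditions": ["timeout=None"],
}

# A files-only report fills one dimension and leaves the other two empty.
partial = {"error_source": "The timeout is built in httpx/_config.py."}
scores = grade(partial, RUBRIC)
```

Because each dimension reads only its own section, the files-only fixture scores on error_source and nowhere else, which is the 1.0/0.0/0.0 pattern reported above.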

Note: this task lands AC#1-4 of bead 0rv.azt. AC#5 (Go% <40% on multi-repo
subset) cannot be reached with a single task: Go% drops only 46.2% -> 45.8%
because 11 new dual-repo Go tasks landed after the 0rv.25 measurement (40.7%).
Mayor mailed with shortlist of follow-ups.

Adds expected_solution.json for 11 tracked tasks under benchmarks/incident_response/
that previously lacked one. With this change the full suite (26 tasks)
validates with --check-paths against scripts/validation/validate_expected_solutions.py.

Coverage:
- 2 incident-investigation-002, -003 (standard checkpoints, scaffolded from ground_truth)
- 4 incident-investigation-dual-{flux,istio,kafka,prometheus}-001 (standard checkpoints
  with hand-curated root_cause / error_chain / remediation from expected_answer)
- 1 incident-investigation-tri-containerd-001 (standard checkpoints)
- 1 incident-investigation-quad-containerd-001 (custom checkpoints — kubelet seccomp
  conversion through containerd CRI, runc BPF filter, moby default profile)
- 1 ansible-galaxy-tar-regression-prove-001 (custom regression-test write-and-fail
  checkpoints)
- 1 ccx-incident-032 (custom source_files/error_chain/config checkpoints for Envoy
  connection-pool overflow)
- 1 event-replay-click-ci-001 (custom triage/communication/remediation checkpoints
  for the markupsafe soft_str removal cascade)

Each checkpoint provides:
- expected_solution: prose synthesizing the canonical answer from
  ground_truth.json (root_cause, error_chain, affected_services, remediation)
  and the checkpoint's verifier semantics.
- evaluation_criteria: 3-5 concrete signals naming specific files/functions
  that an agent's incident report must reach. All file-path-like fragments
  resolve at the pinned SHA via the GitHub Contents API (--check-paths green).
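
A minimal sketch of the per-checkpoint shape described above, assuming each checkpoint is a dict with `expected_solution` and `evaluation_criteria` keys. The function name and the exact layout are hypothetical; the authoritative check is scripts/validation/validate_expected_solutions.py, which is not reproduced here.

```python
from typing import Dict, List


def checkpoint_problems(checkpoint: Dict[str, object]) -> List[str]:
    """Return human-readable problems for one checkpoint entry (empty if fine)."""
    problems: List[str] = []
    solution = checkpoint.get("expected_solution")
    if not isinstance(solution, str) or not solution.strip():
        problems.append("expected_solution must be non-empty prose")
    criteria = checkpoint.get("evaluation_criteria")
    if not isinstance(criteria, list) or not 3 <= len(criteria) <= 5:
        problems.append("evaluation_criteria must list 3-5 signals")
    return problems


# Illustrative entries only; the text is a placeholder, not curated gold content.
good = {
    "expected_solution": "Placeholder prose naming the root cause.",
    "evaluation_criteria": ["signal one", "signal two", "signal three"],
}
thin = {"expected_solution": "", "evaluation_criteria": ["only one signal"]}
```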

Closes EnterpriseBench-auu.

Implements EnterpriseBench-1av (proxy of dr-2vydrm.3): the EB rig adapter
for the codeprobe-shared benchmark_qa_core library, plus the unified
cross-benchmark ScoreResult contract.

Changes:

* `lib/eb_verify/_vendor/benchmark_qa_core/` — file-vendored from codeprobe
  SHA 047df83 (`feature/codeprobe-ceuu-benchmark-qa-core`). Imports
  rewritten to relative form; sha256s recorded in VENDOR.md.
* `lib/eb_verify/qa_adapter.py` — bridges EB's task.toml + ground_truth.json
  + expected_solution.json into the lib's flat input shapes. Synthesises
  scoring_method from `verification_modes`, buckets oracle files by their
  declared repo, and reads instruction.md as the agent-visible aux file.
* `lib/eb_verify/schema_validator.py` — adds Layer 3 `_run_qa_layer` and
  `validate_task(qa_strict=…, workspace_root=…)`. Default is warn-only;
  strict mode promotes lib `error`-severity findings to validation errors.
* `lib/eb_verify/scoring.py` — adds `ScoreResult`, `ScoreDiagnostics`,
  `VerificationResult.to_score_result()`, and `write_score_result()` for
  the cross-benchmark JSON contract (reward / scorer_family=checklist /
  sub_scores / diagnostics{task_time_seconds, token_cost_usd, ir_metrics,
  artifact_results}).
* `lib/eb_verify/runner.py` — `run_all` now plumbs ScoreResult emission
  alongside reward.txt and accepts harness-supplied `task_time_seconds` /
  `token_cost_usd`.
* `lib/eb_verify/cli.py` — `validate` subcommand learns `--qa-strict`
  and `--workspace`.
* `tests/test_qa_adapter.py` (17) and `tests/test_score_result.py` (17) —
  new coverage; existing 80 tests in test_scoring/test_schema_validator/
  test_runner/test_cli stay green.
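
For orientation, the cross-benchmark contract can be sketched as two dataclasses. This is an assumption-laden outline built only from the field names listed above: the `*Sketch` class names, the field types, and the defaults are hypothetical and may differ from `lib/eb_verify/scoring.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ScoreDiagnosticsSketch:
    """Hypothetical stand-in for ScoreDiagnostics; types are assumed."""
    task_time_seconds: Optional[float] = None
    token_cost_usd: Optional[float] = None
    ir_metrics: Dict[str, Any] = field(default_factory=dict)
    artifact_results: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ScoreResultSketch:
    """Hypothetical stand-in for ScoreResult; types and defaults are assumed."""
    reward: float
    scorer_family: str = "checklist"
    sub_scores: Dict[str, float] = field(default_factory=dict)
    diagnostics: ScoreDiagnosticsSketch = field(default_factory=ScoreDiagnosticsSketch)

    def to_json(self) -> str:
        # asdict() recurses into the nested diagnostics dataclass.
        return json.dumps(asdict(self), sort_keys=True)


example = ScoreResultSketch(
    reward=0.5,
    sub_scores={"error_source": 1.0, "error_chain": 0.0},
    diagnostics=ScoreDiagnosticsSketch(task_time_seconds=12.5),
)
payload = json.loads(example.to_json())
```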

Corpus run (180 active tasks, strict mode):

* 179 valid · 1 Layer-2 invalid (dep-traversal-007 stratum mismatch) ·
  0 Layer-3 errors
* 358 warnings — 330 EB_A0 (expected; repos not cloned outside sandbox)
  and 28 F2 leakage candidates (22 distinct tasks).

Follow-ups filed:

* EnterpriseBench-3sk — fix dep-traversal-007 stratum/repo-count.
* EnterpriseBench-pyh — triage 11 specific-path F2 leakage warnings.
* EnterpriseBench-7en — add basename filter for generic build manifests
  (go.mod, package.json, pom.xml, …) so they don't trigger F2.

Report: docs/qa/eb_validator_run_2026_04_30.md

Test verification:

  PYTHONPATH=lib python3 -m pytest \
    tests/test_scoring.py tests/test_schema_validator.py \
    tests/test_score_result.py tests/test_qa_adapter.py \
    tests/test_runner.py tests/test_cli.py
  114 passed in 0.27s

Local-only branch per bead constraints (no push to public sjarmak/EnterpriseBench).

The original commit (32738b2) referenced follow-up beads 3sk/pyh/7en
that lived in a different rig/store. Refile the same three follow-ups
in the current rig and update the report to reference the real IDs:

  - EnterpriseBench-046 (dep-traversal-007 stratum fix)
  - EnterpriseBench-1is (F2 specific-path triage, 11 tasks)
  - EnterpriseBench-zco (basename filter for generic build manifests)

Layer-2 strict-mode validation flagged difficulty_stratum 'multi_repo'
(expects 3-5 repos) against the task's actual 2 repos (grpc-node,
kubernetes-client-js), the only Layer-2 strict failure in the corpus.

The task wires exactly two consumer repos in an 'investigate' pattern;
the prompt's Google Cloud mention is narrative and no third repo is
configured. dual_repo (2 repos) is the semantically correct label and
matches actual state.

Acceptance: PYTHONPATH=lib python3 -m eb_verify.cli validate --qa-strict
benchmarks/dependency_management/dep-traversal-007/task.toml -> VALID
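
The stratum rule behind this fix can be sketched as a range check. Only the two ranges named in this message are encoded (dual_repo = 2 repos, multi_repo = 3-5 repos); the table name, function name, and message format are hypothetical and not taken from schema_validator.py.

```python
from typing import Dict, Optional, Tuple

# Only the two ranges stated in the commit message; other strata are omitted.
STRATUM_REPO_RANGE: Dict[str, Tuple[int, int]] = {
    "dual_repo": (2, 2),
    "multi_repo": (3, 5),
}


def stratum_mismatch(stratum: str, repo_count: int) -> Optional[str]:
    """Return a description of the mismatch, or None if the label fits."""
    bounds = STRATUM_REPO_RANGE.get(stratum)
    if bounds is None:
        return None  # unknown stratum: outside the scope of this sketch
    low, high = bounds
    if low <= repo_count <= high:
        return None
    return f"{stratum} expects {low}-{high} repos, task has {repo_count}"


before = stratum_mismatch("multi_repo", 2)  # the label before this fix
after = stratum_mismatch("dual_repo", 2)    # the label after this fix
```
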
F2 aux-file leakage fired false positives when instruction.md named a
generic build-manifest basename (go.mod, package.json, pom.xml, go.sum,
setup.py, ...) that matched an oracle file. These basenames are
non-discriminative — every repo of a language has one — so naming them in
the prompt does not leak which file the agent should investigate.

check_aux_file_leakage now skips F2 for tokens that are BARE generic
manifest basenames; tokens with directory components (e.g. envoy/go.mod)
stay specific and still raise F2. Clears all 10 corpus false positives;
remaining corpus F2s are genuine specific-path/specific-file leaks.
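
A minimal sketch of the skip rule, assuming tokens are plain path-like strings. The basename set holds only the names listed above (the real list is longer), and both function names are hypothetical rather than the actual `check_aux_file_leakage` internals.

```python
from typing import Set

# Partial list: only the basenames named in this commit message.
GENERIC_MANIFEST_BASENAMES = frozenset(
    {"go.mod", "package.json", "pom.xml", "go.sum", "setup.py"}
)


def is_bare_generic_manifest(token: str) -> bool:
    """True only for a manifest basename with no directory component."""
    normalized = token.replace("\\", "/")
    return "/" not in normalized and normalized in GENERIC_MANIFEST_BASENAMES


def should_raise_f2(token: str, oracle_basenames: Set[str]) -> bool:
    """Flag a prompt token that names an oracle file, unless it is a bare manifest."""
    basename = token.replace("\\", "/").rsplit("/", 1)[-1]
    if basename not in oracle_basenames:
        return False
    return not is_bare_generic_manifest(token)


oracle = {"go.mod"}
bare = should_raise_f2("go.mod", oracle)            # suppressed
specific = should_raise_f2("envoy/go.mod", oracle)  # still flagged
```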

This is an EB-only vendored patch (codeprobe upstream mirror tracked as
dr-2vydrm.1); VENDOR.md records the intentional drift and the new
leakage.py checksum.

Tests: bare-manifest token suppressed + specific-path token still flagged,
at both the adapter and lib level.

Acceptance (EnterpriseBench-zco):
- Generic-manifest tokens no longer raise F2 ✓
- Specific-path tokens still raise F2 ✓
- Tests added for both cases ✓
- VENDOR.md drift note updated ✓

Seven multi-repo tasks declared repos with zero required_files in
task.toml [ground_truth], failing the Cross-Repo Necessity Test. Added
the missing entries grounded in each task's ground_truth.json (curated
producer/consumer files) or the dependency-manifest convention
(go.mod / clientconn.go) for module-resolution and orchestration tasks.

CRNT multi-repo: 122 -> 129/129 pass. Soundness gate: 180/180.
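
The failure mode fixed here (a declared repo with zero required_files) can be sketched as below. The mapping shape and the function name are assumptions for illustration; this covers only that one facet of the Cross-Repo Necessity Test, and the repo names are placeholders.

```python
from typing import Dict, List


def repos_missing_required_files(required_files: Dict[str, List[str]]) -> List[str]:
    """List declared repos that contribute no required file."""
    return sorted(repo for repo, files in required_files.items() if not files)


# Placeholder repo names; clientconn.go is the manifest-convention file named above.
declared = {
    "example-org/producer": ["clientconn.go"],
    "example-org/consumer": [],
}
missing = repos_missing_required_files(declared)
```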

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Remove references to the private codeprobe repo from vendored
  benchmark_qa_core docstrings, qa_adapter, scoring, and the QA run doc;
  reframe VENDOR.md as first-party code under the repo's Apache 2.0 LICENSE.
- Untrack .beads/ (internal issue tracker + orchestration formulas/hooks);
  it syncs via Dolt, not git. Add .beads/ to .gitignore.
- Fix stale 'unpublished' wording in AGENTS.md — CSB is public on Sourcegraph.

No secret leakage (gitleaks clean on tree + history). Apache 2.0 covers
the first-party vendored code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Internal draft: EB's deterministic Tier-1 + harness as the cost-reducing
input to Snorkel's expert gold-context work; public-by-design release model
(held-out scored split + refresh) and contamination strategy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Investigated whether etcd (v3.3.10) genuinely consumes the grpc-go
transport package moved to internal. etcd's own functional/tester/
stresser_key.go imports google.golang.org/grpc/transport and uses
transport.ErrConnClosing, so it breaks under the move — etcd is a real
consumer. Replace the placeholder go.mod anchor with the verified file
and record it in consumer_affected_files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The agent user could not read /workspace/instruction.md or its .mcp.json
inside the container (EACCES), so the agent never started — yet the run
recorded success=true, num_turns=0, mcp_calls=0, task_score=0.0. These
fake 0s corrupted the MCP-vs-baseline comparison (50 no-op runs / 24 tasks
in validity audit uu8z). (bead EnterpriseBench-s58f)

Three parts:

1. Container file permissions + fail-loud gate
   - New _chown_to_agent(): chowns only existing paths to agent:agent and
     LOGS failures instead of the old silent `2>/dev/null; true` mask. Drops
     /workspace/.mcp.json and /workspace/agent_output from the setup chown
     (created later by their own steps).
   - New _assert_agent_readable(): pre-agent gate that runs `test -r` as the
     agent user for instruction.md (+ the MCP configs in MCP modes). On
     failure the run is recorded status=INVALID, success=False,
     failure_class=infra_perms — never a fake 0.0.

2. Run-validity classification (classify_run_validity)
   - INVALID: num_turns==0, MCP-config/EACCES parse error, or an mcp_only
     run whose transport never handshaked (made 0 MCP calls). success=False.
   - FALLBACK: MCP-mode run with real turns but 0 MCP calls (legit fs
     fallback) — preserved as a distinct, real scoreable run (success=True).
   - VALID: baseline, or any MCP run with >=1 MCP call.
   - _scan_mcp_config_error() detects the audited stderr signatures.
   - _configure_mcp() now returns the handshake result.

3. results.json / task_metrics.json schema
   - Added 'status', 'account', and 'mcp_handshake_ok' fields so runs are
     attributable and self-classifying.

Tests: new tests/test_run_validity_classification.py (26 cases) covers the
decision matrix, the EACCES no-op -> INVALID reproduction, the stderr
scanner, and the schema fields. Extended test_mcp_config.py to mock the new
gate. Orchestration suite green (98 passed); no regression in the wider
suite (347 pre-existing content failures unchanged).
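
The decision matrix in part 2 can be sketched as a pure function. This follows the three rules as written above, but the signature, the `RunStatus` enum, and the `mcp_hybrid` mode name are hypothetical; the real `classify_run_validity` may take different inputs.

```python
from enum import Enum


class RunStatus(str, Enum):
    VALID = "VALID"
    FALLBACK = "FALLBACK"
    INVALID = "INVALID"


def classify_sketch(mode: str, num_turns: int, mcp_calls: int,
                    config_error: bool = False) -> RunStatus:
    """Classify one run; `mode` is "baseline", "mcp_only", or another MCP mode."""
    if num_turns == 0 or config_error:
        return RunStatus.INVALID  # agent never started, or config/EACCES error
    if mode == "baseline" or mcp_calls >= 1:
        return RunStatus.VALID
    if mode == "mcp_only":
        return RunStatus.INVALID  # transport never handshaked
    return RunStatus.FALLBACK     # real turns, filesystem fallback


eacces_noop = classify_sketch("mcp_only", num_turns=0, mcp_calls=0)
fs_fallback = classify_sketch("mcp_hybrid", num_turns=5, mcp_calls=0)
```

The EACCES no-op lands in INVALID rather than scoring 0.0, while a run that did real work without MCP stays scoreable as FALLBACK.
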
Task containers (golang/java/etc. images) ship bash+grep+jq but not
python3, so checks running `python3 -c` / `python3 -m eb_verify...`
exited 127 and recorded false checkpoint failures — corrupting scores
in every mode and re-hit by every backfill/remediation wave.
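
A small reproduction of the exit-127 symptom, assuming a POSIX `sh` is available. The helper name and the binary name are hypothetical (the binary is chosen so that it does not exist on PATH); the point is only that a missing interpreter yields status 127, which a naive grader cannot tell apart from a failed check.

```python
import subprocess


def run_check(command: str) -> dict:
    """Run a check command through sh and label a missing interpreter explicitly."""
    proc = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return {
        "exit_code": proc.returncode,
        # 127 is the shell's "command not found" status, not a graded failure.
        "missing_command": proc.returncode == 127,
    }


# Hypothetical binary name, chosen so that it will not exist on PATH.
outcome = run_check("no-such-interpreter-eb-sketch -c 'print(1)'")
```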

Reimplement the embedded python in bash+jq+grep with identical output
JSON. Interpreter substitution only: no checkpoint-threshold or
ground-truth changes.

- 175 tracked check scripts converted across all 7 suites
- Includes the previously-validated jackson (check_integration_path) and
  spring (check_parallelism) fixes
- `python3 -m eb_verify.plugins.file_extraction` invocations ported to
  jq, validated against the live plugin
- Every conversion validated python-vs-bash on synthesized inputs against
  the real ground_truth.json (jq -S byte-compare across pass/partial/fail
  + missing-file branches)
- Independently spot-checked on a 15-file cross-family sample (110+
  scenarios): 0 divergences in graded fields (score/passed/detail)
- Preserves python quirks exactly: banker's-rounding (round-half-to-even),
  json.dumps float repr (1.0/0.0), literal substring matching for
  regex-looking terms, str-iterates-as-chars, dict/list/str coercion
- check_test_fails.sh left as-is: it runs `python3 -m pytest` in a
  python-language task whose image ships python3 by design (not a 127
  case, and not bash-convertible)
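
The Python behaviours that the bash ports had to reproduce can be seen in a few lines. This snippet only demonstrates the quirks themselves; it is not taken from any converted check script.

```python
import json
import re

# Banker's rounding: ties go to the nearest even integer.
rounding = [round(0.5), round(1.5), round(2.5)]

# json.dumps keeps the float repr, so a score of 1.0 is not emitted as 1.
float_repr = json.dumps({"score": 1.0, "passed": True})

# Substring matching treats regex-looking terms literally.
literal_hit = "a.c" in "abc"
regex_hit = re.search("a.c", "abc") is not None

# Iterating a str where a list was expected yields single characters.
chars = [ch for ch in "ab"]
```

A bash port that used `grep` without `-F`, or printed `1` instead of `1.0`, would diverge from the Python output on exactly these cases.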

93 further conversions exist in-tree for not-yet-committed (untracked)
task dirs; they land with their parent dirs.

Bead: EnterpriseBench-lrm

…iation log

Wave-1 of the Stephanie-authorized 88-pin mirror remediation
(bead EnterpriseBench-768k.2): first 12 tasks of the 51 with truly-missing
mirrors, in deterministic truly-missing-table order.

- All 12 tasks validated CREATE-as-is (0 repins): every pin resolves
  upstream and matches its task gold; all 22 mirrors were unregistered.
- 22 unique sg-evals mirrors created (public + anon-cloneable); registered
  here in repo_versions.json (last_verified 2026-06-04). Canonical sort
  incidentally fixed pre-existing containerd/httpx ordering drift.
- Per-task decisions + verification in
  results/analysis/mirror_remediation_log_2026_06.md.
- Cutoff: last task #12 support-mapping-dual-openssl-curl-cipher-001;
  wave-2 resumes at #13 support-mapping-dual-django-wagtail-treepath-001.

…mediation log

Second wave of the Stephanie-authorized 88-pin mirror remediation
(parent EnterpriseBench-768k, bead 768k.3 / workflow EnterpriseBench-4epp).
Processes the 13th–24th distinct tasks by first-appearance order in the
truly-missing table, resuming at support-mapping-dual-django-wagtail-treepath-001.

Outcome: 12/12 tasks → 10 CREATE-as-is, 2 REPIN. 25 unique sg-evals mirrors
created (all public, anonymously cloneable, non-empty); 25 (url, pinned_rev)
entries registered in configs/repo_versions.json (205→230, last_verified=2026-06-04).

REPINs (defensibility record — original pins do not resolve to any upstream
tag/commit; both corrected to the same release):
- api-contract-dual-pydantic-fastapi-001: pydantic v2.0.0 -> v2.0
  (pydantic 2.0.0 release is git-tagged v2.0; v2.0.0 -> HTTP 422)
- config-drift-tri-javax-jakarta-spring-001: hibernate-orm 6.0.0.Final -> 6.0.0
  (Hibernate 6.0.0.Final release is git-tagged 6.0.0; 6.0.0.Final -> HTTP 422)
Both tasks' gold/checkpoints are version-generic, so neither repin alters ground truth.

Verification: visibility=PUBLIC 25/25, isEmpty=false 25/25, anon info/refs=200 25/25.
Not modified (out of scope / concurrent work): sg_indexing_list.json,
mirror_creation_manifest.json (SG indexing is out-of-band).

Wave-3 resume: support-mapping-dual-httpx-httpcore-001 (27 tasks remaining).

Third incremental child of EnterpriseBench-768k (Stephanie-authorized 88-pin
mirror remediation). Wave-3 = tasks #25–#36 of the 51 truly-missing tasks,
resuming at support-mapping-dual-httpx-httpcore-001 per wave-2's cutoff.

Outcome: 12/12 tasks -> 12 CREATE-as-is, 0 REPIN. Every pin resolved upstream
on its declared tag form (no .Final-style corrections needed this wave).
19 unique new mirrors created (public + anon-cloneable + non-empty); 0 needed
the push-protection private-toggle workaround. Slash tags handled
(logging-log4j2 rel/2.20.0->rel_2.20.0, rel/2.23.1->rel_2.23.1); underscore tag
slf4j v_2.0.13. Shared/skip-existing: jaeger--v1.51.0 (#30,#31),
spring-boot--v3.1.0 (already done wave-2).
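
The tag handling above can be sketched as a naming helper. It assumes mirrors are named `<repo>--<tag>` with slashes in the tag replaced by underscores, which is inferred from the examples in this message (jaeger--v1.51.0, rel/2.20.0->rel_2.20.0); both function names are hypothetical.

```python
def mirror_tag(tag: str) -> str:
    """Make a tag safe for a repository name by replacing slashes."""
    return tag.replace("/", "_")


def mirror_name(repo: str, tag: str) -> str:
    """Assumed convention: repo name, double hyphen, sanitized tag."""
    return f"{repo}--{mirror_tag(tag)}"


plain = mirror_name("jaeger", "v1.51.0")
slashed = mirror_tag("rel/2.20.0")
underscore = mirror_tag("v_2.0.13")  # already safe, left unchanged
```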

repo_versions.json: +19 (url, pinned_rev) entries, last_verified=2026-06-04,
count 230->249, canonical case-sensitive sort preserved, 0 dup keys.
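
A sketch of the registration invariants (no duplicate keys, case-sensitive sort), assuming entries are keyed by the (url, pinned_rev) pair. The in-memory list shape, the function name, and the example URLs are hypothetical; the real layout of repo_versions.json is not shown in this message.

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (url, pinned_rev)


def register(existing: List[Entry], new: List[Entry]) -> List[Entry]:
    """Merge new entries, rejecting duplicates, and return them sorted."""
    seen = set(existing)
    if len(seen) != len(existing):
        raise ValueError("existing registry already contains duplicate keys")
    for entry in new:
        if entry in seen:
            raise ValueError(f"duplicate key: {entry}")
        seen.add(entry)
    # Default str ordering is by code point, so the sort is case-sensitive.
    return sorted(seen)


registry = register(
    [("https://example.com/org/beta", "v1.0.0")],
    [("https://example.com/org/Alpha", "v2.0.0"), ("https://example.com/org/alpha", "v2.0.0")],
)
```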

Verification: visibility=PUBLIC 19/19, isEmpty=false 19/19,
anon info/refs=200 19/19.
Not modified (out of scope / concurrent work): sg_indexing_list.json,
mirror_creation_manifest.json (SG indexing is out-of-band).

Running tally: 36/51 tasks, 66/88 mirrors (22 wave-1 + 25 wave-2 + 19 wave-3).
Wave-4 resume: incident-investigation-dual-nats-001 (15 tasks remaining).

…tasks) + BATCH-COMPLETE

Closes the Stephanie-authorized 88-pin truly-missing mirror remediation
(parent EnterpriseBench-768k, bead EnterpriseBench-768k.5 / convoy dc546).

Wave-4 = tasks #37-#51 (resume incident-investigation-dual-nats-001):
15/15 tasks CREATE-as-is, 0 REPIN. 22 new sg-evals mirrors created
(public + anon-cloneable + non-empty), registered in repo_versions.json
(249->271). 3 used the push-protection private-toggle workaround
(next.js v14.1.4, nomad v1.9.3, vault v1.18.1), all restored public.

Final batch tally: 51/51 tasks, 88/88 mirrors (22+25+19+22 across waves 1-4).
The 5 dropped false positives were never created. Branch-local; no push/PR.
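
The per-wave figures reported in the four wave commits can be cross-checked with simple arithmetic. The numbers below are copied from those messages; the variable names are illustrative.

```python
# (tasks, new mirrors) per wave, as reported in the wave commit messages.
waves = {"wave-1": (12, 22), "wave-2": (12, 25), "wave-3": (12, 19), "wave-4": (15, 22)}

total_tasks = sum(tasks for tasks, _ in waves.values())
total_mirrors = sum(mirrors for _, mirrors in waves.values())

# repo_versions.json entry count: 205 after wave-1, then +25, +19, +22.
count = 205
counts_after = []
for name in ("wave-2", "wave-3", "wave-4"):
    count += waves[name][1]
    counts_after.append(count)
```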