fix(evals): unstale 3 release-gate eval hard-fails behind legit refactors (soc-2gd6 #eval-hard-fails) #402

Merged: boshu2 merged 1 commit into main from fix/eval-lane-hard-fails-soc-2gd6 on May 22, 2026

@boshu2 commented May 22, 2026

Why

The v2.42.0 release gate (scripts/ci-local-release.sh) was red on 8 evals. The 3 score-0/near-0 hard fails are all eval-staleness behind legitimate recent refactors — verified, not gaming or security weakening. Operator decision: update eval to match source of truth (executable > contract).

| Eval | Was | Cause | Fix |
|---|---|---|---|
| hook-manifest-command-counts | 0 | session-pr-counter.sh (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| push-worktree landing-plane | 0.14 | #387 tiered-AGENTS split moved "Landing the Plane" to AGENTS-WORKFLOW.md (+ dropped 2 lines) | redirect eval target AGENTS.md→AGENTS-WORKFLOW.md + restore the 2 dropped policy lines |
| security-toolchain ci-soft-gate-policy | 0 | gate is intentionally HARD (no continue-on-error); job already runs security-gate.sh --mode quick + uploads artifacts | drop the stale continue-on-error requirement (security stays HARD) |
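
The count bump in the first row can be pictured with a minimal sketch. The PR's own check appears to be jq-based, so the Python below is only an illustration: the manifest shape and the names `count_commands` and `check_counts` are hypothetical, not taken from the repo. It also assumes that the larger number (44) counts command entries and the smaller one (37) counts distinct hook scripts, which the PR does not spell out.

```python
# Hypothetical manifest shape (an assumption, not the repo's real file):
# {"hooks": {"<event>": [{"command": "<script path> [args]"}, ...]}}


def count_commands(manifest: dict) -> tuple[int, int]:
    """Return (total command entries, distinct script paths)."""
    commands = [
        entry["command"]
        for entries in manifest.get("hooks", {}).values()
        for entry in entries
    ]
    # The first whitespace-separated token is treated as the script path.
    scripts = {command.split()[0] for command in commands}
    return len(commands), len(scripts)


def check_counts(manifest: dict, expected_commands: int, expected_scripts: int) -> str:
    """Fail loudly when the hardcoded expectations drift from the manifest."""
    actual = count_commands(manifest)
    if actual != (expected_commands, expected_scripts):
        raise AssertionError(
            f"expected {(expected_commands, expected_scripts)}, got {actual}"
        )
    return "hook-manifest-counts-ok"
```

Under these assumptions, adding one legitimate hook script raises both numbers by one, so an eval that still expects the old pair fails until its constants are bumped.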

Security note: security-toolchain-gate stays a HARD blocking gate. Only the stale "soft gate" assertion was removed from the eval; the actual scan + artifact upload + summary-blocking are unchanged.
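
As a rough illustration of what a hard-gate policy check could assert, the sketch below scans the text of a CI job definition. It is not the repo's smoke script: `check_hard_gate` is a hypothetical name, the check is plain text matching rather than real YAML parsing, and treating the substring `upload-artifact` as proof of an artifact upload is an assumption.

```python
import re


def check_hard_gate(job_yaml: str) -> str:
    """Text-level policy check on one CI job definition (sketch only)."""
    # A hard gate must not be allowed to fail silently.
    if re.search(r"^\s*continue-on-error:\s*true\b", job_yaml, flags=re.MULTILINE):
        raise AssertionError("gate is soft: continue-on-error is enabled")
    # The scan command named in the PR description must still be invoked.
    if "security-gate.sh --mode quick" not in job_yaml:
        raise AssertionError("quick security scan is not invoked")
    # Assumption: artifact upload is detectable by this substring.
    if "upload-artifact" not in job_yaml:
        raise AssertionError("scan artifacts are not uploaded")
    return "security-toolchain-ci-policy-ok"
```

The point of the design is that a policy eval should fail when the gate is softened, which is the opposite of the stale assertion this PR removes.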

How tested

  • hook-manifest jq → hook-manifest-counts-ok
  • security smoke ci-policy → security-toolchain-ci-policy-ok
  • all 7 landing-plane strings present in AGENTS-WORKFLOW.md
  • shellcheck clean on edited smoke
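
The "all 7 landing-plane strings present" check can be sketched as a simple substring scan. This is a hedged illustration, not the eval's actual scoring code: `missing_strings` and `presence_score` are hypothetical names, the seven strings are placeholders, and scoring as the fraction of strings found is an assumption. Under that assumption, the earlier 0.14 score would be consistent with one of seven strings matching the old target (1/7 ≈ 0.14).

```python
def missing_strings(text: str, required: list[str]) -> list[str]:
    """Return the required strings that do not occur in text."""
    return [s for s in required if s not in text]


def presence_score(text: str, required: list[str]) -> float:
    """Fraction of required strings found in text (assumed scoring rule)."""
    if not required:
        raise ValueError("required must not be empty")
    found = len(required) - len(missing_strings(text, required))
    return found / len(required)


# Seven placeholder strings standing in for the real policy lines.
required = [f"policy line {i}" for i in range(1, 8)]

# A target that contains only one of the seven strings.
stale_target = "policy line 1"
print(round(presence_score(stale_target, required), 2))  # 0.14
```

Pointing the same check at a file that contains all seven strings yields 1.0 and an empty `missing_strings` list.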

Scope honesty

This fixes the 3 hard fails only. The release gate still has 5 minor evals (0.71–0.99) + the vil/release-smoke lane — a separate remediation, deliberately NOT in this PR (no green-washing).

Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.

Fitness: release-gate eval hard-fails 3 → 0.

Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh

…tors (soc-2gd6)

The v2.42.0 release gate was red on 3 score-0/near-0 evals. All three are eval-staleness behind legitimate recent changes — verified, NOT gaming or security weakening (operator chose "update eval to match source of truth"):

| Eval | Was | Cause | Fix |
|---|---|---|---|
| hook-manifest-command-counts | 0 | session-pr-counter.sh (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| push-worktree landing-plane | 0.14 | #387 tiered-AGENTS split moved the "Landing the Plane" section to AGENTS-WORKFLOW.md (and dropped 2 lines) | redirect eval target AGENTS.md→AGENTS-WORKFLOW.md + restore the 2 dropped policy lines |
| security-toolchain ci-soft-gate-policy | 0 | the gate is intentionally HARD (no continue-on-error); the job already runs security-gate.sh --mode quick + uploads artifacts | drop the stale continue-on-error requirement from the eval (security stays HARD) |

Security note: the security-toolchain-gate stays a HARD blocking gate. The only eval bit removed was the stale "soft gate" assertion; the actual scan (security-gate.sh --mode quick) + artifact upload + summary-blocking are unchanged.

How tested:
- hook-manifest jq check → hook-manifest-counts-ok
- security smoke ci-policy → security-toolchain-ci-policy-ok
- all 7 landing-plane strings present in AGENTS-WORKFLOW.md
- shellcheck clean on the edited smoke

Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.

Fitness: release-gate eval hard-fails 3 → 0. (5 minor evals 0.71-0.99 + the vil lane remain — separate remediation, NOT in this PR.)

Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh

@github-actions (bot) added the docs label on May 22, 2026
@boshu2 merged commit ce9ec94 into main on May 22, 2026
71 checks passed
@boshu2 deleted the fix/eval-lane-hard-fails-soc-2gd6 branch on May 22, 2026 16:12