Extend the AI-coding-verifier corpus: email tool, CI-gate removal, suppression by pengfei-threemoonslab · Pull Request #155 · ThreeMoonsLab/agents-shipgate

pengfei-threemoonslab · 2026-05-31T02:13:36Z

What

Extends the benchmark/ai-coding-verifier corpus (the deterministic base/head merge-verdict scenarios in tests/test_verifier_scenarios.py) with three canonical capability transitions the product claims to handle but the corpus didn't yet cover.

The corpus deliberately asserts semantic verifier.json fields against the real engine rather than committing golden trees (per its README: "rather than committing fragile golden trees") — so this follows that pattern, not a parallel golden-file one.
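The contrast between the two styles can be sketched roughly as follows. This is an illustration under assumptions, not the corpus's actual helper: the `changes` list, its `kind` and `subject` keys, the `engine_version` field, and the `assert_semantic` function are all hypothetical; only `can_merge_without_human`, `action_added`, and the tool name come from this description.

```python
import json


def assert_semantic(verifier: dict, *, can_merge: bool, kinds: set) -> None:
    """Assert only the fields a scenario cares about (hypothetical helper)."""
    assert verifier["can_merge_without_human"] is can_merge
    found = {change["kind"] for change in verifier.get("changes", [])}
    missing = kinds - found
    assert not missing, f"missing change kinds: {sorted(missing)}"


# Two hypothetical verifier.json payloads for the same scenario. The second
# carries an extra field, as a later engine version might (assumed shape).
run_a = {
    "can_merge_without_human": False,
    "changes": [{"kind": "action_added", "subject": "messaging.send_customer_email"}],
}
run_b = {
    "can_merge_without_human": False,
    "changes": [{"kind": "action_added", "subject": "messaging.send_customer_email"}],
    "engine_version": "hypothetical-next",
}

# A golden-file style comparison treats the two runs as different...
golden_matches = json.dumps(run_a, sort_keys=True) == json.dumps(run_b, sort_keys=True)

# ...while the semantic assertion accepts both without raising.
for run in (run_a, run_b):
    assert_semantic(run, can_merge=False, kinds={"action_added"})
```

The semantic check survives unrelated additions to the output, which is the fragility the README quote is pointing at.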

New scenarios

| Scenario | Diff | Asserts |
| --- | --- | --- |
| `agent_adds_email_tool` | adds an external-comms `messaging.send_customer_email` tool | the email action is detected as `action_added`; `can_merge_without_human: false` |
| `agent_removes_ci_gate` | deletes `.github/workflows/agents-shipgate.yml` | `trust_root_touched`/`policy_weakened`; not auto-mergeable (the flagship anti-bypass case) |
| `agent_adds_suppression` | adds a `checks.ignore` to `shipgate.yaml` | `trust_root_touched`; the agent can't silently suppress and self-merge |
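
The three rows above could be expressed as a table-driven check along these lines. This is a minimal sketch, assuming a `verifier.json` with a `changes` list whose entries carry a `kind` key; that shape, the `Expectation` class, and the `check` function are hypothetical and not taken from `tests/test_verifier_scenarios.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    any_of_kinds: frozenset        # at least one of these change kinds must appear
    can_merge_without_human: bool  # expected value of the verdict flag


# Expectations transcribed from the scenario table (data layout is assumed).
EXPECTED = {
    "agent_adds_email_tool": Expectation(frozenset({"action_added"}), False),
    "agent_removes_ci_gate": Expectation(
        frozenset({"trust_root_touched", "policy_weakened"}), False
    ),
    "agent_adds_suppression": Expectation(frozenset({"trust_root_touched"}), False),
}


def check(scenario: str, verifier: dict) -> list:
    """Return human-readable failures; an empty list means the scenario passes."""
    expected = EXPECTED[scenario]
    failures = []
    kinds = {change["kind"] for change in verifier.get("changes", [])}
    if not kinds & expected.any_of_kinds:
        failures.append(f"{scenario}: none of {sorted(expected.any_of_kinds)} found")
    if verifier.get("can_merge_without_human") is not expected.can_merge_without_human:
        failures.append(f"{scenario}: unexpected can_merge_without_human")
    return failures


# Hypothetical engine output for the CI-gate removal scenario.
sample = {"can_merge_without_human": False, "changes": [{"kind": "policy_weakened"}]}
print(check("agent_removes_ci_gate", sample))
```

Returning a list of failures rather than raising on the first one makes it easier to see every mismatch for a scenario in a single run.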

All three assertions reflect the real engine output (confirmed by running). One honest note, recorded in the suppression test: adding a `checks.ignore` for a check with no active blocker surfaces as `trust_root_touched`, not `policy_weakened`. That is defensible, since the effective gate isn't weakened, and the change is still routed to a human. If a suppression of an active blocker should register as `policy_weakened`, that's a small follow-up worth considering.
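
The distinction in this note, together with the possible follow-up, can be pictured as a toy rule. This is purely illustrative and is not the engine's logic: the `classify_suppression` function and the idea of passing in a set of active blocker IDs are assumptions made for the example. The check ID used is the one named in the suppression scenario.

```python
def classify_suppression(ignored_check: str, active_blockers: set) -> dict:
    """Toy rule (not the real engine): label a newly added checks.ignore entry.

    Suppressing a check that is currently blocking weakens the effective gate;
    suppressing an inactive check only touches a trust root. Both outcomes are
    routed to a human in this sketch.
    """
    if ignored_check in active_blockers:
        signal = "policy_weakened"
    else:
        signal = "trust_root_touched"
    return {"signal": signal, "can_merge_without_human": False}


# Case covered by the new scenario: the ignored check has no active blocker.
inactive = classify_suppression("SHIP-POLICY-APPROVAL-MISSING", active_blockers=set())

# Case the follow-up would cover: the ignored check is currently blocking.
active = classify_suppression(
    "SHIP-POLICY-APPROVAL-MISSING",
    active_blockers={"SHIP-POLICY-APPROVAL-MISSING"},
)
```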

Verification

tests/test_verifier_scenarios.py — 7 scenarios pass; ruff clean. Test-only + README; no engine change.

🤖 Generated with Claude Code

…ppression

The benchmark/ai-coding-verifier corpus deliberately asserts base/head scenarios against the real engine (no fragile golden trees). It covered refund + policy-edit + two docs-only cases; add three canonical capability transitions:

- agent_adds_email_tool: an external-communication action is a gated capability change (action_added detected; not auto-mergeable).
- agent_removes_ci_gate: deleting the Shipgate CI workflow touches a trust root / weakens policy and routes to human review; the gate cannot be removed to self-merge (the flagship anti-bypass case).
- agent_adds_suppression: adding a checks.ignore touches a trust root; the agent cannot silently suppress and self-merge. (Surfaces as trust_root_touched, not policy_weakened, because the suppressed check has no active blocker here.)

All assertions reflect real engine output (confirmed by running). README table updated. Test-only + docs; no engine change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses review of #155: the three new scenarios passed on generic signals (trust_root_touched, "email" in subject), so a regression in the specific check each scenario is named for would not be caught. Tightened to the actual check each transition fires (confirmed by probing the real engine):

- agent_adds_email_tool: merge_verdict == blocked + blocker SHIP-ACTION-EXTERNAL-COMMUNICATION-AUDIT-MISSING.
- agent_removes_ci_gate (renamed _blocks): merge_verdict == blocked + blocker SHIP-VERIFY-CI-GATE-REMOVED.
- agent_adds_suppression: merge_verdict == human_review_required + review_item SHIP-VERIFY-BASELINE-OR-WAIVER-EXPANDED + policy_broadened change naming suppression:SHIP-POLICY-APPROVAL-MISSING.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
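
Why the tightening matters can be shown with a small sketch. The payload shape here is an assumption (a top-level `merge_verdict`, a `blockers` list with `id` keys, and a `changes` list with `kind` keys), and both helper functions are hypothetical; the verdict values and the blocker ID are the ones listed in this commit message.

```python
def generic_check(verifier: dict) -> bool:
    """Looser style: any trust-root signal is enough to pass."""
    kinds = {change["kind"] for change in verifier.get("changes", [])}
    return "trust_root_touched" in kinds


def specific_check(verifier: dict, *, verdict: str, blocker_id: str) -> bool:
    """Tightened style: the exact verdict and the named blocker must be present."""
    blocker_ids = {blocker["id"] for blocker in verifier.get("blockers", [])}
    return verifier.get("merge_verdict") == verdict and blocker_id in blocker_ids


# Hypothetical healthy output for the CI-gate removal scenario.
healthy = {
    "merge_verdict": "blocked",
    "changes": [{"kind": "trust_root_touched"}],
    "blockers": [{"id": "SHIP-VERIFY-CI-GATE-REMOVED"}],
}

# Hypothetical regression: the CI-gate check stops firing, but the generic
# trust-root signal is still emitted.
regressed = {
    "merge_verdict": "human_review_required",
    "changes": [{"kind": "trust_root_touched"}],
    "blockers": [],
}

for name, payload in (("healthy", healthy), ("regressed", regressed)):
    print(
        name,
        generic_check(payload),
        specific_check(payload, verdict="blocked", blocker_id="SHIP-VERIFY-CI-GATE-REMOVED"),
    )
```

In this sketch the generic check passes for both payloads, so only the specific check notices that the named blocker has disappeared.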

pengfei-threemoonslab and others added 2 commits May 30, 2026 19:13

pengfei-threemoonslab merged commit 1c2b896 into main May 31, 2026
1 check passed

pengfei-threemoonslab deleted the feat/verifier-corpus-scenarios branch May 31, 2026 05:33

pengfei-threemoonslab merged 2 commits into main from feat/verifier-corpus-scenarios
