Extend the AI-coding-verifier corpus: email tool, CI-gate removal, suppression #155

Merged
pengfei-threemoonslab merged 2 commits into main from feat/verifier-corpus-scenarios on May 31, 2026

@pengfei-threemoonslab
Contributor

What

Extends the benchmark/ai-coding-verifier corpus (the deterministic base/head merge-verdict scenarios in tests/test_verifier_scenarios.py) with three canonical capability transitions the product claims to handle but the corpus didn't yet cover.

The corpus deliberately asserts semantic verifier.json fields against the real engine instead of committing golden trees (its README: "rather than committing fragile golden trees"), so the new scenarios follow that pattern rather than introducing a parallel golden-file one.
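For readers unfamiliar with the pattern, here is a minimal sketch of the difference between a semantic-field assertion and a golden comparison. The payload layout, the `generated_at` field, and the helper name `assert_semantic_fields` are assumptions made for illustration; they are not taken from the real verifier.json schema or from tests/test_verifier_scenarios.py.

```python
import copy


def assert_semantic_fields(verdict: dict, expected: dict) -> None:
    """Check only the fields a scenario cares about; ignore everything else."""
    for key, want in expected.items():
        assert key in verdict, f"missing field: {key}"
        assert verdict[key] == want, f"{key}: got {verdict[key]!r}, want {want!r}"


# Hypothetical verifier.json payload (field layout is assumed, not the real schema).
verdict = {
    "can_merge_without_human": False,
    "changes": [{"kind": "action_added", "subject": "messaging.send_customer_email"}],
    "generated_at": "2026-05-30T19:13:00Z",  # incidental field, assumed for the demo
}

# A golden tree is a full snapshot of the output.
golden = copy.deepcopy(verdict)

# An incidental change that has nothing to do with the merge decision.
verdict["generated_at"] = "2026-05-31T05:33:00Z"

# The golden comparison now fails even though the decision is unchanged.
assert verdict != golden

# The semantic assertion still holds because it names only the decision field.
assert_semantic_fields(verdict, {"can_merge_without_human": False})
```

The trade-off is that a semantic assertion only protects the fields it names, so each scenario has to name the fields that actually carry its meaning.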

New scenarios

| Scenario | Diff | Asserts |
| --- | --- | --- |
| agent_adds_email_tool | adds an external-comms messaging.send_customer_email tool | the email action is detected as action_added; can_merge_without_human: false |
| agent_removes_ci_gate | deletes .github/workflows/agents-shipgate.yml | trust_root_touched/policy_weakened; not auto-mergeable (the flagship anti-bypass case) |
| agent_adds_suppression | adds a checks.ignore to shipgate.yaml | trust_root_touched; the agent can't silently suppress and self-merge |

All three assertions reflect the real engine output (confirmed by running). One honest note, recorded in the suppression test: adding a checks.ignore for a check with no active blocker surfaces as trust_root_touched, not policy_weakened. That is defensible (the effective gate isn't weakened), and the change is still routed to a human. If a suppression of an active blocker should register as policy_weakened, that is a small follow-up worth considering.
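To make the distinction in that note concrete, the follow-up could be as small as the rule sketched here. This is a hypothetical illustration, not the engine's code: the function name `classify_suppression` and its inputs are assumptions, and only the two labels and the check ID come from this PR.

```python
def classify_suppression(suppressed_check_id: str, active_blockers: set[str]) -> str:
    """Hypothetical rule: label a newly added checks.ignore entry.

    Suppressing a check that is currently blocking the merge weakens the
    effective gate; suppressing a check with no active blocker only touches
    a trust root. Either label would still be routed to a human.
    """
    if suppressed_check_id in active_blockers:
        return "policy_weakened"
    return "trust_root_touched"


# The case covered by the scenario: the suppressed check has no active blocker.
no_blocker = classify_suppression("SHIP-POLICY-APPROVAL-MISSING", set())

# The case the follow-up would add: the suppressed check is actively blocking.
with_blocker = classify_suppression(
    "SHIP-POLICY-APPROVAL-MISSING", {"SHIP-POLICY-APPROVAL-MISSING"}
)
```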

Verification

tests/test_verifier_scenarios.py: 7 scenarios pass; ruff clean. Test-only plus README; no engine change.

🤖 Generated with Claude Code

pengfei-threemoonslab and others added 2 commits May 30, 2026 19:13
…ppression

The benchmark/ai-coding-verifier corpus deliberately asserts base/head scenarios
against the real engine (no fragile golden trees). It covered refund +
policy-edit + two docs-only cases; add three canonical capability transitions:

- agent_adds_email_tool: an external-communication action is a gated capability
  change (action_added detected; not auto-mergeable).
- agent_removes_ci_gate: deleting the Shipgate CI workflow touches a trust root
  / weakens policy and routes to human review — the gate cannot be removed to
  self-merge (the flagship anti-bypass case).
- agent_adds_suppression: adding a checks.ignore touches a trust root; the agent
  cannot silently suppress and self-merge. (Surfaces as trust_root_touched, not
  policy_weakened, because the suppressed check has no active blocker here.)

All assertions reflect real engine output (confirmed by running). README table
updated. Test-only + docs; no engine change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses review of #155: the three new scenarios passed on generic signals
(trust_root_touched, "email" in subject), so a regression in the specific check
each scenario is named for would not be caught. Tightened to the actual check
each transition fires (confirmed by probing the real engine):

- agent_adds_email_tool: merge_verdict == blocked + blocker
  SHIP-ACTION-EXTERNAL-COMMUNICATION-AUDIT-MISSING.
- agent_removes_ci_gate (renamed _blocks): merge_verdict == blocked + blocker
  SHIP-VERIFY-CI-GATE-REMOVED.
- agent_adds_suppression: merge_verdict == human_review_required + review_item
  SHIP-VERIFY-BASELINE-OR-WAIVER-EXPANDED + policy_broadened change naming
  suppression:SHIP-POLICY-APPROVAL-MISSING.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
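The review point behind the second commit can be illustrated with a small sketch: a generic signal keeps passing after the specific check regresses, while the tightened assertion catches it. The payload keys (`merge_verdict`, `blockers`, `changes`, `kind`) and both helper names are assumptions for illustration; the verdict values and the check ID are the ones named in the commit message.

```python
def passes_generic(verdict: dict) -> bool:
    """First-pass style check: any trust-root signal is enough."""
    return any(c["kind"] == "trust_root_touched" for c in verdict["changes"])


def passes_specific(verdict: dict, want_verdict: str, want_blocker: str) -> bool:
    """Tightened check: exact merge verdict plus the named blocker ID."""
    return (
        verdict["merge_verdict"] == want_verdict
        and want_blocker in verdict["blockers"]
    )


# Hypothetical healthy output for agent_removes_ci_gate (layout assumed).
healthy = {
    "merge_verdict": "blocked",
    "blockers": ["SHIP-VERIFY-CI-GATE-REMOVED"],
    "changes": [{"kind": "trust_root_touched"}],
}

# Simulated regression: the CI-gate check stops firing, but deleting the
# workflow file still trips the generic trust-root signal.
regressed = {
    "merge_verdict": "human_review_required",
    "blockers": [],
    "changes": [{"kind": "trust_root_touched"}],
}

generic_results = (passes_generic(healthy), passes_generic(regressed))
specific_results = (
    passes_specific(healthy, "blocked", "SHIP-VERIFY-CI-GATE-REMOVED"),
    passes_specific(regressed, "blocked", "SHIP-VERIFY-CI-GATE-REMOVED"),
)
```

Under these assumptions the generic check accepts both payloads, so it cannot distinguish the regression; the specific check accepts only the healthy one.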
@pengfei-threemoonslab pengfei-threemoonslab merged commit 1c2b896 into main May 31, 2026
1 check passed
@pengfei-threemoonslab pengfei-threemoonslab deleted the feat/verifier-corpus-scenarios branch May 31, 2026 05:33