Surface the self-approval prohibition at the top of verifier.json #148

Merged
pengfei-threemoonslab merged 2 commits into main from feat/verifier-self-approve-signal on May 30, 2026
@pengfei-threemoonslab
Contributor

What

Promotes the self-approval prohibition to the top of verifier.json. When a PR edits the rules that evaluate it (a weakened release policy or a touched trust root), a coding agent must never silently self-approve, since that would be reward hacking. #146 carried that message inside a fix_task instruction; this PR surfaces it in the two fields an agent reads first: headline and human_review.why.

Why

The verifier already detects policy_weakened / trust_root_touched and routes them to a human, but the agent-facing headline and human_review.why still showed the generic scan headline. An agent skimming the top of the artifact wouldn't see the most important fact: you cannot clear your own gate.

Changes

  • _self_approval_note() — the explicit "a coding agent cannot self-approve that change — a human must review it" message. policy_weakened takes precedence over trust_root_touched; clean reviews get no note.
  • headline leads with the note when present (ahead of agent_summary.headline).
  • human_review.why leads with the note, and a note forces human_review.required = True regardless of the verdict path — defense in depth so a weakened policy can never be marked agent-clearable.
  • 8 unit tests (tests/test_self_approval_signal.py).

Verification

Full suite 2346 passed, 4 skipped, 0 failed; generate_schemas.py --check clean (no schema change — additive logic over the existing capability_review flags); ruff clean.

🤖 Generated with Claude Code

pengfei-threemoonslab and others added 2 commits May 29, 2026 23:56
When a PR weakens the release policy or touches a trust root, a coding agent
must not silently self-approve a change to its own gate. That prohibition was
only present inside a fix_task instruction (PR #146); promote it to the two
fields an agent reads first.

- Add _self_approval_note(): the explicit "a coding agent cannot self-approve
  that change - a human must review it" message for policy_weakened (taking
  precedence) and trust_root_touched.
- verifier.json headline leads with the note when present.
- human_review.why leads with the note, and a self-approval note forces
  human_review.required=True regardless of the verdict path.

Full suite: 2346 passed, 4 skipped. No schema change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ew fix)

Addresses review of #148: a self-approval note forced human_review.required=True,
but can_merge_without_human and first_next_action still keyed only off
merge_verdict, so the defensive (mergeable + note) path could emit "human review
required" and "safe to merge" at once.

- _can_merge_without_human returns False whenever a self-approval note exists.
- _first_next_action routes to a human review (never the "safe to merge" action)
  when a self-approval note is present, including the fix_task-None defensive
  case.
- Both thread capability_review from _build_verifier. Clean mergeable behavior
  (no note) is unchanged; covered by a regression test.

Full suite: 2349 passed, 4 skipped. No schema change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pengfei-threemoonslab pengfei-threemoonslab merged commit 48686d9 into main May 30, 2026
1 check passed
@pengfei-threemoonslab pengfei-threemoonslab deleted the feat/verifier-self-approve-signal branch May 30, 2026 22:24