OpenDIKW · helebest · Jun 29, 2026 · Jun 29, 2026
diff --git a/.claude/agents/fixer.md b/.claude/agents/fixer.md
@@ -20,3 +20,5 @@ Forbidden — these "fix the test, not the code" and are never allowed:
 - Lowering the `vite.config.ts` coverage thresholds.
 
 If the only path to green is one of the forbidden moves, STOP: report the real root cause and why it can't be fixed cleanly. Do not weaken the check — escalate instead.
+
+This list is not honor-system only: the `gate-integrity` CI job (`npm run check:gate`) mechanically fails any PR that lowers a coverage threshold, raises a bundle budget, raises e2e retries, or deletes/disables a test / strips its assertions — so a forbidden move won't merge even if attempted. Fix the cause; don't route around the gate.
diff --git a/.claude/skills/dikw-web-delivery-workflow/SKILL.md b/.claude/skills/dikw-web-delivery-workflow/SKILL.md
@@ -52,7 +52,10 @@ a fresh agent that didn't write the code).
    the **same** change — not "later". (Disk `.md` is English-only in this repo.)
 
 8. **Final gate + PR.** `npm.cmd run verify` (lint + format:check + typecheck + coverage + build + e2e)
-   green, then `npm.cmd run check:bundle` (gzip budget; also runs in CI). Bump
+   green, then `npm.cmd run check:bundle` (gzip budget) and `npm.cmd run check:gate`
+   (reward-hacking gate; both also run in CI — the latter as the required
+   `gate-integrity` job). If `check:gate` flags a *deliberate* weakening, a maintainer
+   adds the `gate-change` label to the PR; never route around it. Bump
    `package.json` version (3-digit SemVer) and add a `CHANGELOG.md` entry when
    warranted. Branch with a descriptive name, commit `<type>(<scope>): <subject>`,
    push, `gh pr create`.

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -165,6 +165,35 @@ jobs:
           severity: HIGH,CRITICAL
           version: ${{ env.TRIVY_VERSION }}
 
+  gate-integrity:
+    # Reward-hacking gate: fails a PR that weakens the verification ITSELF — lowers a
+    # coverage threshold (vite.config.ts), raises a bundle budget (check-bundle.mjs),
+    # raises e2e retries (playwright.config.ts), deletes/disables a test or strips its
+    # assertions, or edits the gate/CI machinery (this script, .github/workflows/**,
+    # fixer.md's forbidden list). A maintainer can allow a deliberate change with the
+    # visible, auditable `gate-change` PR label. PR-scoped on purpose: it is a required
+    # status check, so nothing merges to main without it, and the label context only
+    # exists on a PR. Pure Node script (no deps) — no `npm ci` needed.
+    name: Gate integrity (no reward-hacking)
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 5
+    steps:
+      - uses: actions/checkout@v7
+        with:
+          # Full history so `<base>...HEAD` can resolve the merge base.
+          fetch-depth: 0
+
+      - uses: actions/setup-node@v6
+        with:
+          node-version: "24"
+
+      - name: Check gate integrity
+        env:
+          GATE_BASE_REF: ${{ github.event.pull_request.base.sha }}
+          GATE_HAS_OVERRIDE: ${{ contains(github.event.pull_request.labels.*.name, 'gate-change') }}
+        run: node scripts/check-gate-integrity.mjs
+
   release:
     name: Release tag (dikw-web-v*)
     # Cuts a GitHub Release tagged dikw-web-v<package.json version> once the whole

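The `gate-integrity` job above passes two environment variables to the script. The sketch below shows one way a script could interpret them; it is not taken from `scripts/check-gate-integrity.mjs`, and the helper name `readGateEnv` is hypothetical. It assumes the `origin/main` local fallback described in `CLAUDE.md`, and relies on GitHub Actions rendering the boolean result of `contains(...)` into the env var as the string `true` or `false`.

```javascript
// Hypothetical helper, for illustration only (not the real gate script).
// Turns the env passed by the CI job into the values a gate check needs.
function readGateEnv(env) {
  // In CI, GATE_BASE_REF is the PR base SHA; locally it is unset, so this
  // sketch falls back to origin/main (an assumption based on CLAUDE.md).
  const baseRef = env.GATE_BASE_REF || "origin/main";

  // The workflow expression yields the string "true" or "false", so compare
  // strings explicitly instead of relying on truthiness ("false" is truthy).
  const hasOverride =
    String(env.GATE_HAS_OVERRIDE ?? "").toLowerCase() === "true";

  // Three-dot range: diff HEAD against the merge base of baseRef and HEAD.
  return { baseRef, hasOverride, diffRange: `${baseRef}...HEAD` };
}
```

The explicit string comparison matters because any non-empty string, including `"false"`, is truthy in JavaScript; a plain `if (env.GATE_HAS_OVERRIDE)` would treat every PR as labelled.
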
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,25 @@ file format introduced in `[0.0.1.0]` was dropped.
 
 ## [Unreleased]
 
+## [0.8.7] - 2026-06-29
+
+### Added
+
+- **Reward-hacking gate (`gate-integrity` CI job / `npm run check:gate`).** A
+  deterministic check the in-loop agent cannot fool, hardening the delivery loop
+  against an agent weakening the verification to make a check go green.
+  `scripts/check-gate-integrity.mjs` diffs the branch against its merge base and
+  fails the PR if a coverage threshold was lowered, the coverage `exclude` list
+  grew, a bundle budget was raised, e2e `retries` were raised, a test was
+  deleted/disabled or stripped of assertions, or the gate/CI machinery itself
+  (the script, `.github/workflows/**`, `fixer.md`'s forbidden list) was edited.
+  The good direction (raising a threshold, adding tests) is always allowed; a
+  deliberate weakening is allowed only when a maintainer adds the visible
+  `gate-change` PR label. Until now "don't weaken the tests" was prose only in
+  `CLAUDE.md` / `docs/review-rubric.md` / `fixer.md` — the weakest defense. The job
+  is PR-scoped and a required status check. See
+  `docs/adr/0005-delivery-loop-hardening.md`.
+
 ## [0.8.6] - 2026-06-29
 
 ### Fixed

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -72,7 +72,7 @@ End-to-end loop from request to landed PR. Run autonomously for behavior changes
 4. **Final pass.** Run `/code-review`, scored against `docs/review-rubric.md` (the project-specific principles), and resolve every finding before continuing.
 5. **Verify in the browser.** For UI changes, invoke the `dikw-web-verify-frontend` skill: navigate the changed routes via Chrome MCP, confirm a clean runtime console on real data, exercise the affected interactions, and run the `docs/ui-checklist.md` rubric in light + dark — confirm the change actually rendered as intended, not just that unit tests pass.
 6. **Update markdown docs.** Walk `CLAUDE.md`, `README.md`, and the relevant `docs/*.md` against the diff; any contract, behavior, command, or doc index that drifted must be updated in the same change. Don't leave docs to "catch up later".
-7. **Create the PR.** Branch with a descriptive name, commit with `<type>(<scope>): <subject>` matching the project's existing convention (see recent `git log`), push, then `gh pr create`. CI auto-runs lint + format:check + typecheck + coverage + build + e2e + bundle budget + security scans (npm audit, gitleaks, Trivy, CodeQL). Bump `package.json.version` manually (standard 3-digit SemVer) when the change warrants it, and add an entry to `CHANGELOG.md` under the matching version heading. On merge to `main`, CI's `release` job auto-cuts a GitHub Release tagged `dikw-web-v<version>` from `package.json.version` (idempotent — only a version bump creates a new tag; notes come from the matching CHANGELOG section via `scripts/changelog-notes.mjs`), so a deliberate version bump is what publishes a release.
+7. **Create the PR.** Branch with a descriptive name, commit with `<type>(<scope>): <subject>` matching the project's existing convention (see recent `git log`), push, then `gh pr create`. CI auto-runs lint + format:check + typecheck + coverage + build + e2e + bundle budget + the `gate-integrity` reward-hacking gate (`check:gate`) + security scans (npm audit, gitleaks, Trivy, CodeQL). Bump `package.json.version` manually (standard 3-digit SemVer) when the change warrants it, and add an entry to `CHANGELOG.md` under the matching version heading. On merge to `main`, CI's `release` job auto-cuts a GitHub Release tagged `dikw-web-v<version>` from `package.json.version` (idempotent — only a version bump creates a new tag; notes come from the matching CHANGELOG section via `scripts/changelog-notes.mjs`), so a deliberate version bump is what publishes a release.
 8. **Monitor CI and PR comments; resolve as they surface, then merge.** After pushing, actively watch both signals — don't passively wait, and don't batch resolution to merge time.
    - **CI rollup**: `gh pr checks <N>` (or `--watch` to block until terminal). Failing job logs: `gh run view <run-id> --log-failed`. Flaky e2e gets **one** rerun, not five (see [[project_flaky_graph_e2e]] in memory for which test).
    - **PR review prose**: `gh api repos/{owner}/{repo}/pulls/{N}/reviews` for review bodies, `.../pulls/{N}/comments` for inline threads, `.../issues/{N}/comments` for top-level CodeRabbit summaries. `gh pr checks` shows pass/fail only, not the prose.
@@ -98,6 +98,7 @@ Windows shell: use `npm.cmd` (not `npm`) when invoking from PowerShell.
 - `npm.cmd run build` — typecheck, `vite build` (browser bundle to `dist/`), then `build:server` (esbuild bundles `server/agent/standalone.ts` to `dist-server/standalone.mjs` with `--packages=external`, since ADK + MikroORM + native sqlite3 can't be bundled — so the sidecar imports its deps from a production `node_modules` at runtime). `npm.cmd start` runs that standalone sidecar.
 - `npm.cmd run verify` — full gate: lint + format:check + typecheck + coverage + build + e2e. Run before committing behavior changes.
 - `npm.cmd run check:bundle` — gzip bundle budget (entry JS / total JS / CSS) against `dist/`; runs in CI after the verify gate. Raise the budgets in `scripts/check-bundle.mjs` deliberately, like the coverage thresholds — don't bump to pass.
+- `npm.cmd run check:gate` — **reward-hacking gate** (`scripts/check-gate-integrity.mjs`): diffs the branch against its merge base (`origin/main` locally; the PR base in CI) and fails if the verification *itself* was weakened — a lowered coverage threshold, a grown coverage `exclude`, a raised bundle budget, raised e2e `retries`, a deleted/disabled test or removed assertions, or any edit to the gate/CI machinery (the script, `.github/workflows/**`, `fixer.md`'s forbidden list). The good direction (raising a threshold, adding a test) is always allowed; a deliberate weakening is allowed only when a maintainer adds the visible `gate-change` label to the PR (`GATE_HAS_OVERRIDE`). Runs as the **PR-scoped required CI job `gate-integrity`** — it is the deterministic backstop for the prose "don't weaken the tests" rule in `fixer.md` / `docs/review-rubric.md`. See `docs/adr/0005-delivery-loop-hardening.md`.
 - `npm.cmd run smoke:core` — live-core `/v1` contract smoke (`scripts/smoke-core.mjs`, the `dikw-web-smoke-core` skill). Not a CI gate; needs a reachable core. Run after a `dikw-core` bump or before a demo.
 - `npm.cmd run live:verify` — full live integration verification of the working tree against a **real `dikw-core`** (GHCR image, Postgres backend) on dynamic ports: boot → seed the write pipeline (import→ingest→synth→lint, reusing `buildImportBundle` + `DikwClient`) → read-contract smoke → browser read-route e2e (Playwright `live` project) → agent↔core check → teardown. Needs Docker + `.env.core` (LLM/embedding keys, git-ignored; copy `.env.core.example`). Not a CI gate (boots a container, calls live LLMs); `live-integration.yml` runs it on dispatch/nightly/label. Sub-commands: `live:up` / `live:seed` / `live:smoke` / `live:down` (`-- --volumes` to drop data). `-- --keep` leaves the stack up. See `docs/integration-verification.md`.
 

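The "deleted/disabled test or removed assertions" part of `check:gate` is described as a regex-count heuristic. A minimal sketch of that idea follows, assuming the base and head contents of one modified test file are already available as strings. The function names and exact patterns are illustrative assumptions, not the contents of `scripts/check-gate-integrity.mjs`.

```javascript
// Illustrative patterns (assumed, not copied from the real script).
// Matches .skip / .only / .todo and the xit / xdescribe aliases.
const SKIP_MARKER = /\.(?:skip|only|todo)\b|\bx(?:it|describe)\b/g;
// Matches the start of an expect( call.
const EXPECT_CALL = /\bexpect\s*\(/g;

// Count non-overlapping matches of a global regex in a string.
function count(text, pattern) {
  return (text.match(pattern) ?? []).length;
}

// Compare one test file before and after the change. Only the direction
// that weakens the suite produces a finding; adding assertions or
// removing skip markers is always fine.
function testFileFindings(baseText, headText) {
  const findings = [];
  if (count(headText, SKIP_MARKER) > count(baseText, SKIP_MARKER)) {
    findings.push("test-skip-added");
  }
  if (count(headText, EXPECT_CALL) < count(baseText, EXPECT_CALL)) {
    findings.push("test-assertions-removed");
  }
  return findings;
}
```

Because this only counts matches, a skip marker inside a string literal would also be counted, which is the kind of false positive the `gate-change` label exists to cover.
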
diff --git a/docs/adr/0005-delivery-loop-hardening.md b/docs/adr/0005-delivery-loop-hardening.md
@@ -0,0 +1,88 @@
+# 5. Delivery-loop hardening
+
+This ADR records the decision to harden dikw-web's request→merged-PR **delivery
+loop** against the failure modes of agentic, low-supervision development. It is the
+umbrella record for a set of changes landing as separate PRs; this first revision
+covers the **reward-hacking gate** (`gate-integrity` / `npm run check:gate`).
+
+## Status
+
+Accepted (2026-06-29). Living document — later sections are added as the remaining
+hardening items land (trustworthy green signal, measured perf/a11y, automated
+CI-watch-and-fix, loop observability).
+
+## Context
+
+The loop already encodes most manual + pre-merge checks as skills (the
+`dikw-web-delivery-workflow`, `dikw-web-verify-frontend`, `dikw-web-smoke-core`
+skills; the `docs/ui-checklist.md` + `docs/review-rubric.md` rubrics) and runs an
+independent review on a different model (`/codex:review`) before merge. The CI
+`verify` gate (lint / format / typecheck / coverage / build / e2e) plus
+`check:bundle` is deterministic and blocks merge.
+
+The gap is the one an autonomous agent exploits under pressure to make a check go
+green: **reward hacking**. The verification's own knobs live in editable source —
+coverage thresholds in `vite.config.ts`, bundle budgets in `scripts/check-bundle.mjs`,
+e2e retries in `playwright.config.ts` — and a test can be deleted, `.skip`-ped, or
+stripped of assertions. The only thing standing in the way was prose: "don't weaken
+the tests" in `CLAUDE.md`, `docs/review-rubric.md`, and the `fixer` agent's forbidden
+list. Prose is the weakest possible defense: the agent that edits the threshold also
+passes the check that reads it.
+
+## Decision
+
+Add a deterministic gate the in-loop agent **cannot fool**:
+`scripts/check-gate-integrity.mjs`, run as the PR-scoped required CI job
+`gate-integrity` and mirrored locally by `npm run check:gate`.
+
+It diffs the branch against its merge base (`<base>...HEAD`) and fails if the
+verification itself was weakened:
+
+- **coverage-threshold-lowered** — any of statements/branches/functions/lines in
+  `vite.config.ts` dropped.
+- **coverage-exclude-grown** — the coverage `exclude` array gained entries (drops
+  files out of the denominator).
+- **bundle-budget-raised** — any `*GzipKB` budget in `check-bundle.mjs` rose.
+- **e2e-retries-raised** — the CI branch of `retries` in `playwright.config.ts` rose
+  (a higher retry count masks new flakes).
+- **test-file-deleted** / **test-skip-added** / **test-assertions-removed** — a test
+  file was removed, gained a `.skip`/`.only`/`.todo`/`xit`/`xdescribe` marker, or lost
+  `expect()` calls. Only *in-place* modifications are diffed; a new file is new
+  coverage, and a rename-that-guts shows up as a separate delete (still caught).
+- **gate-machinery-modified** — any edit to the gate script itself,
+  `.github/workflows/**`, or `fixer.md`'s forbidden list. This is the **self-guard**:
+  an agent cannot quietly delete a check or drop the CI job, because doing so trips the
+  gate.
+
+The "good direction" — raising a threshold, tightening a budget, adding tests — is
+always allowed without ceremony. A *deliberate* weakening is allowed only when a
+maintainer attaches the visible, auditable **`gate-change`** label to the PR
+(`GATE_HAS_OVERRIDE=true`); the job then passes but prints what it allowed, for the
+audit trail. The reviewer's remaining job is to judge whether a labelled change is
+justified.
+
+This mirrors the "a second check the agent does not control" pattern: the gate is the
+mechanical backstop for the prose rule, and it guards its own machinery so it cannot
+be edited away.
+
+## Consequences
+
+- The PR that *introduces or changes* the gate machinery (including this one) trips
+  `gate-machinery-modified` by design and must carry the `gate-change` label. That is
+  the intended, visible audit point, not a bug.
+- The skip/assertion heuristics are deliberately dumb (regex counts), so they can have
+  false positives, e.g. a test file that contains skip markers as string fixtures. That
+  case is handled by only diffing in-place modifications with a non-null base. The
+  `gate-change` label is the escape hatch; the gate is a backstop, not a proof system.
+- The job is PR-scoped (`if: github.event_name == 'pull_request'`) because the label
+  context only exists on a PR, and a required status check already blocks every merge
+  to `main`. It is therefore **not** added to `release.needs` (a skipped dependency
+  would skip `release`).
+
+## Excluded (deliberately, for "simplicity first")
+
+The fuller autonomous-loop machinery — an on-disk `STATUS.md` + `loop_state.json`
+state protocol, container `--network none` isolation, per-token cost accounting — is
+out of scope. dikw-web's loop is interactive Claude Code with worktree isolation
+already available for background jobs and no destructive operations; that machinery
+would add complexity without proportional value here.
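The direction-aware rule in the Decision section (lowering a threshold or raising a budget fails, the opposite passes, and the label turns a failure into an audited pass) can be sketched as below. This is an assumption-laden illustration, not the real script: `weakenedKeys` and `gateVerdict` are hypothetical names, the inputs are assumed to be already parsed into plain objects, and treating a removed knob as weakened is this sketch's own choice.

```javascript
// Hypothetical sketch of the direction-aware comparison.
// weakerWhen = "lower"  -> a smaller head value weakens the check (coverage).
// weakerWhen = "higher" -> a larger head value weakens the check (budgets, retries).
function weakenedKeys(base, head, weakerWhen) {
  return Object.keys(base).filter((key) => {
    // Assumption: a knob that disappears entirely counts as weakened.
    if (!(key in head)) return true;
    return weakerWhen === "lower"
      ? head[key] < base[key]
      : head[key] > base[key];
  });
}

// With no findings the gate passes. With findings it passes only when the
// override label is present, and reports what it allowed for the audit trail.
function gateVerdict(findings, hasOverride) {
  if (findings.length === 0) return { ok: true, allowed: [] };
  return hasOverride
    ? { ok: true, allowed: findings }
    : { ok: false, allowed: [] };
}
```

For example, with the documented thresholds `{ statements: 60, branches: 45, functions: 55, lines: 60 }` as the base, a head that sets `branches` to 40 is reported by `weakenedKeys(base, head, "lower")`, while a head that raises `branches` to 50 is not.
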
diff --git a/docs/review-rubric.md b/docs/review-rubric.md
@@ -19,7 +19,11 @@ Each item is pass/fail against the **diff under review**.
   orphaned imports/vars are removed.
 - [ ] **Goal-driven / TDD.** Behavior changes land test-first (see `docs/tdd.md`).
   Coverage thresholds in `vite.config.ts` (60/45/55/60) are **not lowered** to
-  pass — tests are added/repaired instead.
+  pass — tests are added/repaired instead. This is now machine-enforced by the
+  `gate-integrity` CI job (`npm run check:gate`): lowering a threshold, raising a
+  bundle budget, raising e2e retries, or deleting/disabling a test fails the PR
+  unless a maintainer attaches the `gate-change` label. The reviewer's job here is
+  to judge whether a labelled `gate-change` is actually justified.
 
 ## Repo-specific traps (these don't trip generic reviewers)
 

diff --git a/docs/tdd.md b/docs/tdd.md
@@ -217,6 +217,8 @@ Locale and theme regressions should be caught at the browser boundary:
 
 Initial thresholds are intentionally modest: statements/lines 60%, functions 55%, branches 45%. Raise thresholds only when the suite gains durable behavior coverage. Do not lower thresholds to merge a feature; add or repair tests instead.
 
+"Do not lower thresholds" is no longer just a rule of discipline — the `gate-integrity` CI job (`npm run check:gate`, `scripts/check-gate-integrity.mjs`) enforces it mechanically. It diffs the PR against its merge base and fails if the verification itself was weakened: a lowered coverage threshold, a grown coverage `exclude` list, a raised bundle budget, raised e2e retries, a deleted/disabled test, removed assertions, or any edit to the gate/CI machinery. A deliberate, reviewed change is allowed only when a maintainer attaches the visible `gate-change` label to the PR. See `docs/adr/0005-delivery-loop-hardening.md`.
+
 ## Real Core Smoke Testing
 
 Mocked E2E is the default gate. Manual smoke against a real local `dikw-core` remains useful before demos, but it should not block normal TDD work because local data and providers vary.

diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "dikw-web",
-  "version": "0.8.6",
+  "version": "0.8.7",
   "private": true,
   "type": "module",
   "engines": {
@@ -27,6 +27,7 @@
     "live:smoke": "node scripts/live-core/smoke.mjs",
     "live:verify": "node scripts/live-core/run.mjs",
     "check:bundle": "node scripts/check-bundle.mjs",
+    "check:gate": "node scripts/check-gate-integrity.mjs",
     "verify": "npm run lint && npm run format:check && npm run typecheck && npm run test:coverage && npm run build && npm run test:e2e"
   },
   "dependencies": {