2 changes: 2 additions & 0 deletions .claude/agents/fixer.md
@@ -20,3 +20,5 @@ Forbidden — these "fix the test, not the code" and are never allowed:
- Lowering the `vite.config.ts` coverage thresholds.

If the only path to green is one of the forbidden moves, STOP: report the real root cause and why it can't be fixed cleanly. Do not weaken the check — escalate instead.

This list is not honor-system only: the `gate-integrity` CI job (`npm run check:gate`) mechanically fails any PR that lowers a coverage threshold, raises a bundle budget, raises e2e retries, or deletes/disables a test / strips its assertions — so a forbidden move won't merge even if attempted. Fix the cause; don't route around the gate.
5 changes: 4 additions & 1 deletion .claude/skills/dikw-web-delivery-workflow/SKILL.md
@@ -52,7 +52,10 @@ a fresh agent that didn't write the code).
the **same** change — not "later". (Disk `.md` is English-only in this repo.)

8. **Final gate + PR.** `npm.cmd run verify` (lint + format:check + typecheck + coverage + build + e2e)
- green, then `npm.cmd run check:bundle` (gzip budget; also runs in CI). Bump
+ green, then `npm.cmd run check:bundle` (gzip budget) and `npm.cmd run check:gate`
+ (reward-hacking gate; both also run in CI — the latter as the required
+ `gate-integrity` job). If `check:gate` flags a *deliberate* weakening, a maintainer
+ adds the `gate-change` label to the PR; never route around it. Bump
`package.json` version (3-digit SemVer) and add a `CHANGELOG.md` entry when
warranted. Branch with a descriptive name, commit `<type>(<scope>): <subject>`,
push, `gh pr create`.
29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -165,6 +165,35 @@ jobs:
severity: HIGH,CRITICAL
version: ${{ env.TRIVY_VERSION }}

gate-integrity:
# Reward-hacking gate: fails a PR that weakens the verification ITSELF — lowers a
# coverage threshold (vite.config.ts), raises a bundle budget (check-bundle.mjs),
# raises e2e retries (playwright.config.ts), deletes/disables a test or strips its
# assertions, or edits the gate/CI machinery (this script, .github/workflows/**,
# fixer.md's forbidden list). A maintainer can allow a deliberate change with the
# visible, auditable `gate-change` PR label. PR-scoped on purpose: it is a required
# status check, so nothing merges to main without it, and the label context only
# exists on a PR. Pure Node script (no deps) — no `npm ci` needed.
name: Gate integrity (no reward-hacking)
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v7
with:
# Full history so `<base>...HEAD` can resolve the merge base.
fetch-depth: 0

- uses: actions/setup-node@v6
with:
node-version: "24"

- name: Check gate integrity
env:
GATE_BASE_REF: ${{ github.event.pull_request.base.sha }}
GATE_HAS_OVERRIDE: ${{ contains(github.event.pull_request.labels.*.name, 'gate-change') }}
run: node scripts/check-gate-integrity.mjs

release:
name: Release tag (dikw-web-v*)
# Cuts a GitHub Release tagged dikw-web-v<package.json version> once the whole
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,25 @@ file format introduced in `[0.0.1.0]` was dropped.

## [Unreleased]

## [0.8.7] - 2026-06-29

### Added

- **Reward-hacking gate (`gate-integrity` CI job / `npm run check:gate`).** A
deterministic check the in-loop agent cannot fool, hardening the delivery loop
against an agent weakening the verification to make a check go green.
`scripts/check-gate-integrity.mjs` diffs the branch against its merge base and
fails the PR if a coverage threshold was lowered, the coverage `exclude` list
grew, a bundle budget was raised, e2e `retries` were raised, a test was
deleted/disabled or stripped of assertions, or the gate/CI machinery itself
(the script, `.github/workflows/**`, `fixer.md`'s forbidden list) was edited.
The good direction (raising a threshold, adding tests) is always allowed; a
deliberate weakening is allowed only when a maintainer adds the visible
`gate-change` PR label. Until now "don't weaken the tests" was prose only in
`CLAUDE.md` / `docs/review-rubric.md` / `fixer.md` — the weakest defense. The job
is PR-scoped and a required status check. See
`docs/adr/0005-delivery-loop-hardening.md`.

## [0.8.6] - 2026-06-29

### Fixed
3 changes: 2 additions & 1 deletion CLAUDE.md
@@ -72,7 +72,7 @@ End-to-end loop from request to landed PR. Run autonomously for behavior changes
4. **Final pass.** Run `/code-review`, scored against `docs/review-rubric.md` (the project-specific principles), and resolve every finding before continuing.
5. **Verify in the browser.** For UI changes, invoke the `dikw-web-verify-frontend` skill: navigate the changed routes via Chrome MCP, confirm a clean runtime console on real data, exercise the affected interactions, and run the `docs/ui-checklist.md` rubric in light + dark — confirm the change actually rendered as intended, not just that unit tests pass.
6. **Update markdown docs.** Walk `CLAUDE.md`, `README.md`, and the relevant `docs/*.md` against the diff; any contract, behavior, command, or doc index that drifted must be updated in the same change. Don't leave docs to "catch up later".
- 7. **Create the PR.** Branch with a descriptive name, commit with `<type>(<scope>): <subject>` matching the project's existing convention (see recent `git log`), push, then `gh pr create`. CI auto-runs lint + format:check + typecheck + coverage + build + e2e + bundle budget + security scans (npm audit, gitleaks, Trivy, CodeQL). Bump `package.json.version` manually (standard 3-digit SemVer) when the change warrants it, and add an entry to `CHANGELOG.md` under the matching version heading. On merge to `main`, CI's `release` job auto-cuts a GitHub Release tagged `dikw-web-v<version>` from `package.json.version` (idempotent — only a version bump creates a new tag; notes come from the matching CHANGELOG section via `scripts/changelog-notes.mjs`), so a deliberate version bump is what publishes a release.
+ 7. **Create the PR.** Branch with a descriptive name, commit with `<type>(<scope>): <subject>` matching the project's existing convention (see recent `git log`), push, then `gh pr create`. CI auto-runs lint + format:check + typecheck + coverage + build + e2e + bundle budget + the `gate-integrity` reward-hacking gate (`check:gate`) + security scans (npm audit, gitleaks, Trivy, CodeQL). Bump `package.json.version` manually (standard 3-digit SemVer) when the change warrants it, and add an entry to `CHANGELOG.md` under the matching version heading. On merge to `main`, CI's `release` job auto-cuts a GitHub Release tagged `dikw-web-v<version>` from `package.json.version` (idempotent — only a version bump creates a new tag; notes come from the matching CHANGELOG section via `scripts/changelog-notes.mjs`), so a deliberate version bump is what publishes a release.
8. **Monitor CI and PR comments; resolve as they surface, then merge.** After pushing, actively watch both signals — don't passively wait, and don't batch resolution to merge time.
- **CI rollup**: `gh pr checks <N>` (or `--watch` to block until terminal). Failing job logs: `gh run view <run-id> --log-failed`. Flaky e2e gets **one** rerun, not five (see [[project_flaky_graph_e2e]] in memory for which test).
- **PR review prose**: `gh api repos/{owner}/{repo}/pulls/{N}/reviews` for review bodies, `.../pulls/{N}/comments` for inline threads, `.../issues/{N}/comments` for top-level CodeRabbit summaries. `gh pr checks` shows pass/fail only, not the prose.
@@ -98,6 +98,7 @@ Windows shell: use `npm.cmd` (not `npm`) when invoking from PowerShell.
- `npm.cmd run build` — typecheck, `vite build` (browser bundle to `dist/`), then `build:server` (esbuild bundles `server/agent/standalone.ts` to `dist-server/standalone.mjs` with `--packages=external`, since ADK + MikroORM + native sqlite3 can't be bundled — so the sidecar imports its deps from a production `node_modules` at runtime). `npm.cmd start` runs that standalone sidecar.
- `npm.cmd run verify` — full gate: lint + format:check + typecheck + coverage + build + e2e. Run before committing behavior changes.
- `npm.cmd run check:bundle` — gzip bundle budget (entry JS / total JS / CSS) against `dist/`; runs in CI after the verify gate. Raise the budgets in `scripts/check-bundle.mjs` deliberately, like the coverage thresholds — don't bump to pass.
- `npm.cmd run check:gate` — **reward-hacking gate** (`scripts/check-gate-integrity.mjs`): diffs the branch against its merge base (`origin/main` locally; the PR base in CI) and fails if the verification *itself* was weakened — a lowered coverage threshold, a grown coverage `exclude`, a raised bundle budget, raised e2e `retries`, a deleted/disabled test or removed assertions, or any edit to the gate/CI machinery (the script, `.github/workflows/**`, `fixer.md`'s forbidden list). The good direction (raising a threshold, adding a test) is always allowed; a deliberate weakening is allowed only when a maintainer adds the visible `gate-change` label to the PR (`GATE_HAS_OVERRIDE`). Runs as the **PR-scoped required CI job `gate-integrity`** — it is the deterministic backstop for the prose "don't weaken the tests" rule in `fixer.md` / `docs/review-rubric.md`. See `docs/adr/0005-delivery-loop-hardening.md`.
- `npm.cmd run smoke:core` — live-core `/v1` contract smoke (`scripts/smoke-core.mjs`, the `dikw-web-smoke-core` skill). Not a CI gate; needs a reachable core. Run after a `dikw-core` bump or before a demo.
- `npm.cmd run live:verify` — full live integration verification of the working tree against a **real `dikw-core`** (GHCR image, Postgres backend) on dynamic ports: boot → seed the write pipeline (import→ingest→synth→lint, reusing `buildImportBundle` + `DikwClient`) → read-contract smoke → browser read-route e2e (Playwright `live` project) → agent↔core check → teardown. Needs Docker + `.env.core` (LLM/embedding keys, git-ignored; copy `.env.core.example`). Not a CI gate (boots a container, calls live LLMs); `live-integration.yml` runs it on dispatch/nightly/label. Sub-commands: `live:up` / `live:seed` / `live:smoke` / `live:down` (`-- --volumes` to drop data). `-- --keep` leaves the stack up. See `docs/integration-verification.md`.

88 changes: 88 additions & 0 deletions docs/adr/0005-delivery-loop-hardening.md
@@ -0,0 +1,88 @@
# 5. Delivery-loop hardening

This ADR records the decision to harden dikw-web's request→merged-PR **delivery
loop** against the failure modes of agentic, low-supervision development. It is the
umbrella record for a set of changes landing as separate PRs; this first revision
covers the **reward-hacking gate** (`gate-integrity` / `npm run check:gate`).

## Status

Accepted (2026-06-29). Living document — later sections are added as the remaining
hardening items land (trustworthy green signal, measured perf/a11y, automated
CI-watch-and-fix, loop observability).

## Context

The loop already encodes most manual + pre-merge checks as skills (the
`dikw-web-delivery-workflow`, `dikw-web-verify-frontend`, `dikw-web-smoke-core`
skills; the `docs/ui-checklist.md` + `docs/review-rubric.md` rubrics) and runs an
independent review on a different model (`/codex:review`) before merge. The CI
`verify` gate (lint / format / typecheck / coverage / build / e2e) plus
`check:bundle` is deterministic and blocks merge.

The gap is the one an autonomous agent exploits under pressure to make a check go
green: **reward hacking**. The verification's own knobs live in editable source —
coverage thresholds in `vite.config.ts`, bundle budgets in `scripts/check-bundle.mjs`,
e2e retries in `playwright.config.ts` — and a test can be deleted, `.skip`-ped, or
stripped of assertions. The only thing standing in the way was prose: "don't weaken
the tests" in `CLAUDE.md`, `docs/review-rubric.md`, and the `fixer` agent's forbidden
list. Prose is the weakest possible defense: the agent that edits the threshold also
passes the check that reads it.

## Decision

Add a deterministic gate the in-loop agent **cannot fool**:
`scripts/check-gate-integrity.mjs`, run as the PR-scoped required CI job
`gate-integrity` and mirrored locally by `npm run check:gate`.

It diffs the branch against its merge base (`<base>...HEAD`) and fails if the
verification itself was weakened:

- **coverage-threshold-lowered** — any of statements/branches/functions/lines in
`vite.config.ts` dropped.
- **coverage-exclude-grown** — the coverage `exclude` array gained entries (drops
files out of the denominator).
- **bundle-budget-raised** — any `*GzipKB` budget in `check-bundle.mjs` rose.
- **e2e-retries-raised** — the CI branch of `retries` in `playwright.config.ts` rose
(a higher retry count masks new flakes).
- **test-file-deleted** / **test-skip-added** / **test-assertions-removed** — a test
file was removed, gained a `.skip`/`.only`/`.todo`/`xit`/`xdescribe` marker, or lost
`expect()` calls. Only *in-place* modifications are diffed; a new file is new
coverage, and a rename-that-guts shows up as a separate delete (still caught).
- **gate-machinery-modified** — any edit to the gate script itself,
`.github/workflows/**`, or `fixer.md`'s forbidden list. This is the **self-guard**:
an agent cannot quietly delete a check or drop the CI job, because doing so trips the
gate.
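
The **coverage-threshold-lowered** rule above can be illustrated with a minimal sketch. It is not the actual `scripts/check-gate-integrity.mjs`: the function names `extractThresholds` and `loweredThresholds`, the regex, and the choice to treat a removed threshold as lowered are all assumptions made for illustration. The sample numbers are the 60/45/55/60 thresholds mentioned in `docs/review-rubric.md`.

```javascript
// Hypothetical sketch: names, regex and the "removed counts as lowered" rule
// are assumptions, not the real gate script.
const METRICS = ["statements", "branches", "functions", "lines"];

// Pull `metric: <number>` pairs out of a config file's text.
function extractThresholds(source) {
  const found = {};
  for (const metric of METRICS) {
    const pattern = new RegExp(`\\b${metric}\\s*:\\s*(\\d+(?:\\.\\d+)?)`);
    const match = source.match(pattern);
    if (match) found[metric] = Number(match[1]);
  }
  return found;
}

// Report every metric whose value dropped (or vanished) between base and head.
function loweredThresholds(baseSource, headSource) {
  const base = extractThresholds(baseSource);
  const head = extractThresholds(headSource);
  return METRICS.filter(
    (m) => m in base && (!(m in head) || head[m] < base[m]),
  ).map((m) => ({ metric: m, from: base[m], to: head[m] ?? null }));
}

// Base has the documented thresholds; head lowers branches and raises functions.
const baseConfig = "thresholds: { statements: 60, branches: 45, functions: 55, lines: 60 }";
const headConfig = "thresholds: { statements: 60, branches: 40, functions: 58, lines: 60 }";

// Only the lowered metric is reported; the raised one is the "good direction".
console.log(loweredThresholds(baseConfig, headConfig));
```

A plain regex like this would also match an unrelated `lines:` key elsewhere in a real config, so a production check would probably need to scope the search to the thresholds object.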

The "good direction" — raising a threshold, tightening a budget, adding tests — is
always allowed without ceremony. A *deliberate* weakening is allowed only when a
maintainer attaches the visible, auditable **`gate-change`** label to the PR
(`GATE_HAS_OVERRIDE=true`); the job then passes but prints what it allowed, for the
audit trail. The reviewer's remaining job is to judge whether a labelled change is
justified.
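
The override behaviour described above might look roughly like the following. This is a sketch under stated assumptions: only the `GATE_HAS_OVERRIDE` variable and the rule names come from this ADR, while `decideGate`, the shape of a violation object, and the message wording are hypothetical.

```javascript
// Hypothetical sketch of the override decision. Only GATE_HAS_OVERRIDE and
// the rule names come from the ADR; everything else is assumed.
function decideGate(violations, env) {
  // The CI job sets the variable from a `contains(...)` expression, which
  // reaches the script as the string "true" or "false".
  const hasOverride = env.GATE_HAS_OVERRIDE === "true";
  if (violations.length === 0) {
    return { exitCode: 0, lines: ["gate-integrity: no weakening detected"] };
  }
  const listed = violations.map((v) => `  - ${v.rule}: ${v.detail}`);
  if (hasOverride) {
    // Pass, but still print what was allowed so the audit trail is visible.
    return { exitCode: 0, lines: ["gate-integrity: allowed by gate-change label", ...listed] };
  }
  return { exitCode: 1, lines: ["gate-integrity: verification weakened", ...listed] };
}

const sample = [{ rule: "e2e-retries-raised", detail: "retries 1 -> 3" }];
const blocked = decideGate(sample, { GATE_HAS_OVERRIDE: "false" });
const allowed = decideGate(sample, { GATE_HAS_OVERRIDE: "true" });
console.log(blocked.exitCode, allowed.exitCode);
```

Keeping the decision in a pure function that takes the environment as an argument makes the pass/fail logic testable without setting real environment variables.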

This mirrors the "a second check the agent does not control" pattern: the gate is the
mechanical backstop for the prose rule, and it guards its own machinery so it cannot
be edited away.

## Consequences

- The PR that *introduces or changes* the gate machinery (including this one) trips
`gate-machinery-modified` by design and must carry the `gate-change` label. That is
the intended, visible audit point, not a bug.
- The skip/assertion heuristics are deliberately dumb (regex counts), so they can
produce false positives, e.g. a test file that contains skip markers as string
fixtures. Diffing only in-place modifications with a non-null base limits this, and
the `gate-change` label is the escape hatch; the gate is a backstop, not a proof system.
- The job is PR-scoped (`if: github.event_name == 'pull_request'`) because the label
context only exists on a PR, and a required status check already blocks every merge
to `main`. It is therefore **not** added to `release.needs` (a skipped dependency
would skip `release`).
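
As an illustration of the "regex counts" heuristic mentioned in the consequences above, a minimal sketch could compare marker counts between the base and head text of one test file. The patterns and the `checkTestFile` helper are assumptions, not the real script; only the marker list (`.skip`/`.only`/`.todo`/`xit`/`xdescribe`), the `expect()` count, and the rule names come from this ADR.

```javascript
// Hypothetical sketch: the regexes and helper names are assumptions.
const SKIP_MARKER = /\.(?:skip|only|todo)\b|\b(?:xit|xdescribe)\s*\(/g;
const EXPECT_CALL = /\bexpect\s*\(/g;

// Count non-overlapping matches of a global regex in a string.
function count(text, pattern) {
  return (text.match(pattern) || []).length;
}

// Compare one test file's base and head text. A null base means the file is
// new, which is new coverage and is never flagged.
function checkTestFile(baseText, headText) {
  const findings = [];
  if (baseText === null) return findings;
  if (count(headText, SKIP_MARKER) > count(baseText, SKIP_MARKER)) {
    findings.push("test-skip-added");
  }
  if (count(headText, EXPECT_CALL) < count(baseText, EXPECT_CALL)) {
    findings.push("test-assertions-removed");
  }
  return findings;
}

const baseTest = 'it("adds", () => { expect(add(1, 2)).toBe(3); expect(add(0, 0)).toBe(0); });';
const headTest = 'it.skip("adds", () => { expect(add(1, 2)).toBe(3); });';
console.log(checkTestFile(baseTest, headTest));
```

Because it only compares counts, a file whose skip markers sit inside string fixtures is flagged only when an in-place edit changes those counts, which is the false-positive case the `gate-change` label exists for.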

## Excluded (deliberately, for "simplicity first")

The fuller autonomous-loop machinery — an on-disk `STATUS.md` + `loop_state.json`
state protocol, container `--network none` isolation, per-token cost accounting — is
out of scope. dikw-web's loop is interactive Claude Code with worktree isolation
already available for background jobs and no destructive operations; that machinery
would add complexity without proportional value here.
6 changes: 5 additions & 1 deletion docs/review-rubric.md
@@ -19,7 +19,11 @@ Each item is pass/fail against the **diff under review**.
orphaned imports/vars are removed.
- [ ] **Goal-driven / TDD.** Behavior changes land test-first (see `docs/tdd.md`).
Coverage thresholds in `vite.config.ts` (60/45/55/60) are **not lowered** to
- pass — tests are added/repaired instead.
+ pass — tests are added/repaired instead. This is now machine-enforced by the
+ `gate-integrity` CI job (`npm run check:gate`): lowering a threshold, raising a
+ bundle budget, raising e2e retries, or deleting/disabling a test fails the PR
+ unless a maintainer attaches the `gate-change` label. The reviewer's job here is
+ to judge whether a labelled `gate-change` is actually justified.

## Repo-specific traps (these don't trip generic reviewers)

2 changes: 2 additions & 0 deletions docs/tdd.md
@@ -217,6 +217,8 @@ Locale and theme regressions should be caught at the browser boundary:

Initial thresholds are intentionally modest: statements/lines 60%, functions 55%, branches 45%. Raise thresholds only when the suite gains durable behavior coverage. Do not lower thresholds to merge a feature; add or repair tests instead.

"Do not lower thresholds" is no longer just a rule of discipline — the `gate-integrity` CI job (`npm run check:gate`, `scripts/check-gate-integrity.mjs`) enforces it mechanically. It diffs the PR against its merge base and fails if the verification itself was weakened: a lowered coverage threshold, a grown coverage `exclude` list, a raised bundle budget, raised e2e retries, a deleted/disabled test, removed assertions, or any edit to the gate/CI machinery. A deliberate, reviewed change is allowed only when a maintainer attaches the visible `gate-change` label to the PR. See `docs/adr/0005-delivery-loop-hardening.md`.

## Real Core Smoke Testing

Mocked E2E is the default gate. Manual smoke against a real local `dikw-core` remains useful before demos, but it should not block normal TDD work because local data and providers vary.
3 changes: 2 additions & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "dikw-web",
- "version": "0.8.6",
+ "version": "0.8.7",
"private": true,
"type": "module",
"engines": {
@@ -27,6 +27,7 @@
"live:smoke": "node scripts/live-core/smoke.mjs",
"live:verify": "node scripts/live-core/run.mjs",
"check:bundle": "node scripts/check-bundle.mjs",
"check:gate": "node scripts/check-gate-integrity.mjs",
"verify": "npm run lint && npm run format:check && npm run typecheck && npm run test:coverage && npm run build && npm run test:e2e"
},
"dependencies": {