diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c327eb5b..44f1d7519 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,115 @@ All notable changes to GBrain will be documented in this file. +## [0.36.1.0] - 2026-05-17 + +**The brain learns how you tend to be wrong, then argues against your blind spots on every advice call.** + +Hindsight-inspired calibration loop. Three new cycle phases, eight expansions, one big admin tab. gbrain stops being a smart database and starts being a mirror that knows your track record. The brain extracts gradeable claims from your prose, grades them against reality, aggregates your record into a calibration profile, and surfaces it everywhere advice gets given — `gbrain think --with-calibration` rewrites questions against your known biases, contradictions point at the bias pattern that produced them, real-time nudges fire when you commit a high-conviction take in a domain where you've been wrong before. + +### What you can now do + +**Run `gbrain calibration`** and see your own track record narrated in conversational voice: "You called early-stage tactics well — 8 of 10 held up. Geography is your blind spot — 4 of 6 missed." Not "Brier 0.18." Not "conviction-bucket 0.8-0.9." Friend-not-doctor by design, voice-gated against academic slop. + +**Run `gbrain think --with-calibration "should we hire fast in NY?"`** and the answer surfaces both your prior AND the counter-prior from your hedged-domain self. The bias filter applies to question framing, not evidence interpretation (D22 placement: AFTER retrieval, BEFORE the question). Off by default — flip it when you trust the profile. + +**Open the admin SPA Calibration tab** at `gbrain serve --http`'s `/admin` and see your Brier-trend sparkline, per-domain accuracy bars, pattern statements, and "you committed to these and never revisited" abandoned-threads card. Server-rendered SVG, zero new chart library dep, follows the existing Linear-calm-clarity admin design. + +**Commit a high-conviction take in a known-biased domain** and a conversational stderr nudge fires: "Hey — you just bet pretty hard on this NY tech thing (0.85). Last year you made three similar geographic calls; two of them missed. Maybe dial it back to a 0.65 hunch?" 14-day cooldown per pattern so you don't get re-nudged on the same call. + +**Run `gbrain calibration --undo-wave v0.36.1.0`** to fully reverse the wave's mutations. Every new row carries `wave_version='v0.36.1.0'`; the undo command reverts auto-applied take resolutions, deletes calibration profiles, purges nudge logs, optionally scrubs gstack-coupled learnings. Dry-run mode shows what would change before you commit. + +### Validated by published benchmarks + +**First published benchmark for AI memory systems that reason about user track records.** Sibling repo [gbrain-evals](https://github.com/garrytan/gbrain-evals) ships two new categories specifically gating this wave: + +- **cat14 — advice quality A/B.** `think --with-calibration` vs baseline on 8 hand-authored probes covering 6 calibration scenarios (relevant bias, confidence boost, empty profile, irrelevant bias, multi-bias, voice stress). First live run: **75% calibrated wins / 0% baseline wins / 25% tie**, voice gate **100%**, force-fit prevention **100%**. +- **cat15 — propose_takes F1.** Extract-takes prompt against 48 hand-labeled claims across 8 synthetic pages covering 5 prose genres. First live run: **0.952 F1 training / 0.922 F1 holdout**, train-holdout gap **0.03** (no overfitting). The tuned prompt back-ports verbatim into this wave's `EXTRACT_TAKES_PROMPT`, replacing the v1 stub. + +The iteration log at `eval/data/cat14-calibration/iteration-log.md` documents three same-day prompt variants where the gate caught two regressions before either could ship. Full benchmark report: [2026-05-18 BrainBench Cat 14 + Cat 15 Calibration](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-05-18-brainbench-cat14-cat15-calibration.md). No prior published baseline exists in the personal-AI calibration-loop space; Hindsight introduced the concept as a skills demo without quantified evaluation. cat14 + cat15 stake out the category. + +Reproduction: ~$0.15 per full eval run (cat14 + cat15), under 5 minutes wallclock on Apple Silicon. + +### Itemized changes + +#### Three new cycle phases + +- **`propose_takes`** — LLM scans markdown prose, proposes gradeable claims to a review queue. Idempotency cache on `(source_id, page_slug, content_hash, prompt_version)` — unchanged pages never re-spend tokens. F2 fence-dedup: the LLM sees existing canonical takes and dedupes. v0.36.1.0 ships the tuned `v0.36.1.0-tuned-cat15` prompt (replaces the v1 stub) — validated at **0.922 F1 holdout** by the cat15 eval against the hand-labeled corpus at `test/fixtures/calibration/`. +- **`grade_takes`** — walks unresolved takes older than 6 months, retrieves evidence, asks a judge model. Auto-resolve DISABLED by default (D17 — earn trust first); operator opts in via `cycle.grade_takes.auto_resolve.enabled: true`. Conservative threshold: confidence >= 0.95 single-model OR >= 0.85 with 3/3 ensemble unanimous. Ensemble tiebreaker (E2) reuses v0.27.x cross-modal-eval substrate; fires on borderline single-model verdicts (0.6-0.95 band). +- **`calibration_profile`** — aggregates resolved takes into 2-4 narrative pattern statements + active bias tags. Voice-gated via shared `gateVoice()` (D24 — five surfaces, one function). Cold-brain branch: skips when <5 resolved takes. `grade_completion` field surfaces partial-grade state to the dashboard "60% graded" badge. + +#### Eight expansions + +- **E1** Anti-bias prompt rewrite at think time (cathedral moment — see "What you can now do"). +- **E2** Multi-judge ensemble grading (3-model parallel via Promise.allSettled, 3/3 unanimous required for auto-apply). +- **E3** Calibration-aware contradictions — v0.32.6 probe + bias-tag join; outputs flag "this contradiction fits your over-confident-geography pattern." +- **E4** Outcome-driven gstack-learnings coupling — incorrect auto-resolutions write to `~/.gstack/projects//learnings.jsonl` so plan-ceo-review / ship / investigate skills pull calibration data. Config-gated (off for external users until gstack-API stabilizes). +- **E5** Brier-trend forecasting on new takes — inline blurb at write time. Pure math over existing scorecards, no LLM. +- **E6** Admin SPA Calibration tab with 4 server-rendered SVG charts (Brier trend, per-domain accuracy, abandoned threads, clickable pattern drill-downs). +- **E7** Real-time pattern surfacing on sync — conversational nudges with 14-day cooldown via `take_nudge_log`. +- **E8** Team-brain calibration sharing via mounts — asymmetric publish-opt-in (D15) + D18 cross-brain query semantics (4 rules, pinned by `test/cross-brain-calibration.test.ts`). + +#### Schema (6 new migrations) + +- v67 `calibration_profiles` — per-(source, holder) row with scorecard + narrative + bias tags + voice-gate audit. +- v68 `take_proposals` — propose_takes queue with composite idempotency unique key. +- v69 `take_grade_cache` — grade_takes verdict cache (composite PK on take + prompt_version + judge + evidence_signature). +- v70 `take_nudge_log` — E7 cooldown state (polymorphic FK: take_id OR proposal_id). +- v71 `takes_resolved_at_idx` — partial index for Brier-trend aggregation (CONCURRENTLY on Postgres, plain on PGLite). +- v72 `think_ab_results` — D19 A/B harness data for `gbrain think --ab`. + +Every row carries `wave_version='v0.36.1.0'` so `--undo-wave` can reverse precisely. + +#### Architecture + +- `BaseCyclePhase` abstract class (D21) enforces source-scope threading, budget metering, error envelope, and progress-reporter integration at the type level. Forgetting `sourceScopeOpts(ctx)` becomes a compile error — closes the v0.34.1 leak class structurally for every new phase. +- Voice gate (D11 + D24): single `gateVoice()` function, mode parameter drives surface-specific rubrics. 2 regenerations then hand-written template fallback. Audit columns persisted on `calibration_profiles` rows so the operator can spot voice-rubric drift. +- Cross-brain query semantics (D18): 4 rules, hermetically tested. Subagent prohibition gates on `ctx.viaSubagent && !allowedSlugPrefixes` — closes the OAuth-token-to-cross-brain-leak escalation surface. + +#### New surfaces + +- CLI: `gbrain calibration`, `gbrain calibration --regenerate`, `gbrain calibration --undo-wave [--dry-run] [--scrub-gstack]`, `gbrain calibration ab-report [--days N]`, `gbrain takes revisit `, `gbrain takes nudge --reset `, `gbrain think --with-calibration`. +- MCP op: `get_calibration_profile` (scope: read). +- HTTP routes: `/admin/api/calibration/profile`, `/admin/api/calibration/charts/:type`, `/admin/api/calibration/pattern/:id`. +- Doctor checks (4): `abandoned_threads`, `calibration_freshness`, `grade_confidence_drift` (CDX-11 mitigation), `voice_gate_health`. + +#### Privacy + +- `test/fixtures/calibration/` ships a synthetic corpus (5 representative pages across 5 genres) for extract-takes prompt tuning. Every page is anonymized per the CLAUDE.md placeholder list — no real names, no real funds, no real deals. +- `scripts/check-synthetic-corpus-privacy.sh` (CDX-14 mitigation) is wired into `bun run verify`. Greps for explicit dollar amounts + verifies every non-essay fixture references at least one canonical placeholder. Fail-fast on accidental leakage. +- E7 nudges route to STDERR only in v0.36.1.0; multi-channel routing (webhook, admin SPA toast) is a v0.37+ follow-up. + +#### Tests + +- 290+ new tests across 13 new test files. All hermetic (no real LLM, no PGLite contention). +- IRON RULE regression suite in `test/regressions/v0.36.1.0-iron-rule.test.ts` pins R1-R5: think baseline unchanged, contradictions output unchanged, takes resolution flow independent of grade_takes, search/list_pages/get_page unchanged, search modes unchanged. + +#### Plan + review trail + +- `/plan-ceo-review` cleared SCOPE_EXPANSION mode with 19 decisions + 8 accepted expansions. +- `/plan-eng-review` cleared with 9 findings + 21 implementation tasks across 5 lanes. +- `/plan-design-review` cleared with mockup variant-B (Linear calm clarity) approved. +- Codex outside-voice surfaced 18 findings; 5 spec bugs auto-fixed, 6 strategic tensions resolved via D17/D18/D19. +- 30 decisions D1-D30 resolved with explicit user input. Plan persisted at `~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md`. + +#### Design system + +- `DESIGN.md` at the repo root formalizes the de facto admin tokens that landed v0.26.0. Calibration target for future `/plan-design-review` and `/design-review`. +- `--text-muted` bumped from #555 to #777 globally for WCAG AA contrast (was 4.0 — below the 4.5 floor; now ~5.5). + +## To take advantage of v0.36.1.0 + +`gbrain upgrade` is all you need for the schema. Calibration data only accumulates as you use the brain. + +1. Run `gbrain upgrade`. Six migrations v67-v72 land. +2. Build up resolved takes. The calibration profile needs >= 5 resolved takes before it generates anything. Resolve takes manually via `gbrain takes resolve N --quality correct|incorrect|partial` OR wait for grade_takes auto-grading (opt in via `gbrain config set cycle.grade_takes.auto_resolve.enabled true`). +3. Run a dream cycle: `gbrain dream` (or wait for autopilot). The new phases run in this order: propose_takes → grade_takes → calibration_profile. +4. Read the profile: `gbrain calibration`. +5. Try anti-bias think: `gbrain think --with-calibration ""`. +6. Open the admin SPA: `gbrain serve --http` → /admin → Calibration tab. +7. To roll back the entire wave: `gbrain calibration --undo-wave v0.36.1.0 --dry-run` first to see what would change, then drop `--dry-run`. +8. If anything looks wrong, file an issue: https://github.com/garrytan/gbrain/issues with `gbrain doctor` output and which step broke. + ## [0.36.0.0] - 2026-05-17 **Skillpacks are scaffolding now, not amber. Scaffold once, own the files, fork freely.** @@ -44,6 +153,7 @@ Major file/test surface changes: - Extended: `src/core/skillpack/bundle.ts` (frontmatter `sources:` validation, `enumerateScaffoldEntries`), `src/core/repo-root.ts` (`cwd_walk_up` tier) - New docs: `docs/guides/skillpacks-as-scaffolding.md` (model + workflow) - `src/core/skillpack/installer.ts` and `test/skillpack-install.test.ts` survive for now — the v0.32 `gbrain skillpack diff` informational command still uses `diffSkill` from there. Slated for deletion in v0.37 cleanup. + ## [0.35.8.0] - 2026-05-17 **Phantom unprefixed entity pages drain automatically on the next autopilot cycle. Your `alice.md` residue gets folded into `people/alice-example.md` with embeddings + strikethrough state preserved, no operator action needed.** @@ -134,6 +244,7 @@ The original /plan-eng-review proposed reusing `writeFactsToFence` for the migra - output of `gbrain doctor` - contents of `~/.gbrain/audit/phantoms-*.jsonl` - which step looks wrong + ## [0.35.7.0] - 2026-05-17 **The contradiction probe grew up. Typed claims over time, regressions detected automatically, founder scorecards as a one-liner.** diff --git a/CLAUDE.md b/CLAUDE.md index 503c23f9a..6a52bcade 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -286,6 +286,133 @@ gbrain-evals consumes: `gbrain/engine`, `gbrain/types`, `gbrain/operations`, `gbrain/extract`. Removing any of these is a breaking change for the gbrain-evals consumer. +## v0.36.1.0 Hindsight calibration wave (key files cluster) + +The wave that taught gbrain to know how the user tends to be wrong + use +that knowledge at every advice surface. Six-migration schema (v67-v72), +three new cycle phases, eight expansions, one admin tab. Plan persisted +at `~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md`. +Convention skill at `skills/conventions/calibration.md` has the agent- +facing rules. + +- `src/core/cycle/base-phase.ts` — abstract `BaseCyclePhase` class. + Enforces `sourceScopeOpts(ctx)` threading at the type level; closes + the v0.34.1 source-isolation leak class structurally for every new + phase. Inherits source-scope, budget meter, error envelope, progress + reporter. propose_takes / grade_takes / calibration_profile all + extend it. +- `src/core/cycle/propose-takes.ts` — LLM scans markdown prose, + proposes gradeable claims to `take_proposals` queue. Idempotency + cache on `(source_id, page_slug, content_hash, prompt_version)` + composite unique index. F2 fence-dedup: existing canonical takes + passed to extractor as context. v0.36.1.0 ships a stub prompt; tuned + prompt arrives via the T19 synthetic corpus build. +- `src/core/cycle/grade-takes.ts` — walks unresolved takes older than + 6 months, retrieves evidence, asks judge model, caches verdict. + Auto-resolve DISABLED by default (D17). Conservative thresholds: + >=0.95 single OR >=0.85 ensemble 3/3 unanimous. T5 ensemble + (`aggregateEnsemble`) reuses v0.27.x cross-modal substrate; fires on + borderline 0.6-0.95 band. Writes to `take_grade_cache`. +- `src/core/cycle/calibration-profile.ts` — aggregates resolved takes + into 2-4 narrative pattern statements + active bias tags. Voice-gated + via `gateVoice()`. Cold-brain skip when <5 resolved. Writes to + `calibration_profiles` with audit columns (`voice_gate_passed`, + `voice_gate_attempts`, `grade_completion`). +- `src/core/calibration/voice-gate.ts` — single `gateVoice()` function + (D24), mode parameter (`pattern_statement` | `nudge` | + `forecast_blurb` | `dashboard_caption` | `morning_pulse`). 2 regens + then hand-written template fallback from + `src/core/calibration/templates.ts`. Haiku judge with mode-specific + rubrics; all rubrics structurally forbid clinical/preachy voice. +- `src/core/calibration/cross-brain.ts` — D18 4-rule contract for + cross-brain calibration reads. Local-first → mount-fallback (only + with `canReadMountsForCtx(ctx)` true) → cross-brain attribution via + `source_brain_id` + `from_mount` → subagent prohibition closes the + OAuth-token-to-cross-brain-leak surface. All 4 rules pinned in + `test/cross-brain-calibration.test.ts`. +- `src/core/calibration/nudge.ts` — E7 real-time pattern surfacing. + `evaluateAndFireNudge(opts)` is the full pipeline: threshold check + (conviction > 0.7, holder match, slug-derived domain hint matches + active bias tag), cooldown probe (14d via take_nudge_log), fire + + log. STDERR-only output for v0.36.1.0; multi-channel deferred. +- `src/core/calibration/take-forecast.ts` — E5 Brier-trend at write + time. Pure math over existing `TakesScorecard`; no LLM. Returns + `predicted_brier`, `bucket_n`, `overall_brier`. Insufficient-data + branch at `MIN_BUCKET_N = 5`. `batchForecast` memoizes per + (holder, domain) tuple. +- `src/core/calibration/gstack-coupling.ts` — E4 outcome-driven + learnings coupling. `writeIncorrectResolution(opts)` shells out to + `gstack-learnings-log` binary. Config gate + (`cycle.grade_takes.write_gstack_learnings`, default false for + external users). Namespace prefix `gbrain:calibration:v0.36.1.0:` so + `--undo-wave` can scrub. +- `src/core/calibration/svg-renderer.ts` — D23 server-rendered SVG for + the admin SPA Calibration tab. Pure functions: data → SVG string. + Inlines design tokens; XSS-safe via `escapeXml()`. Four chart + renderers: `renderBrierTrend`, `renderDomainBars`, + `renderAbandonedThreadsCard`, `renderPatternStatementsCard`. SPA + renders via `` wrapper behind `requireAdmin`. +- `src/core/calibration/undo-wave.ts` — D18 CDX-3 resolution. `undoWave` + reverses the wave's mutations: unsets `takes.resolved_*` for + wave-applied resolutions (cross-checks resolved_by so manual writes + persist), deletes calibration_profiles, purges nudge logs, marks + grade-cache rows applied=false. `--dry-run` shows counts without + writing. Idempotent on wave_version match. +- `src/core/calibration/think-ab.ts` — D19 A/B harness. `runAbTrial` + calls thinkRunner twice (baseline + with-calibration), records + preference to `think_ab_results`. `buildAbReport` aggregates over + 30-day window; flags `calibration_net_negative` when n>=20 + win + rate < 45% on decisive trials. +- `src/core/calibration/recall-footer.ts` — formatter for the morning + pulse calibration block. Cold-brain branch when <5 resolved. v0.36 + ship state: opt-in via the wiring layer; auto-on in v0.37+. +- `src/core/eval-contradictions/calibration-join.ts` — E3 cross- + reference. `tagFindingWithCalibration(finding, profile)` returns + bias-tag context for contradictions that match active patterns. + Returns null when profile missing (R2 regression — output + byte-identical to v0.32.6). +- `src/core/think/prompt.ts` extension — E1 anti-bias rewrite. + `withCalibration` option on `buildThinkSystemPrompt` adds anti-bias + rules. New `buildCalibrationBlock()` emits the `` XML. + `buildThinkUserMessage` has TWO shapes: default (question first) for + R1 regression, with-calibration (retrieval → calibration → question + per D22) when opt-in. Wired into `runThink` via + `opts.withCalibration` + `opts.calibrationHolder`. +- `src/commands/calibration.ts` — CLI: `gbrain calibration` (read + + print), `--regenerate`, `--undo-wave ` (T17), `ab-report` (T18). + MCP op `get_calibration_profile` (scope: read) backs the same data + path. Source-scoped via `sourceScopeOpts(ctx)`. +- `src/commands/serve-http.ts` extension — three new admin routes: + `/admin/api/calibration/profile`, `/admin/api/calibration/charts/:type` + (image/svg+xml; type in {brier-trend, domain-bars, + pattern-statements, abandoned-threads}), and + `/admin/api/calibration/pattern/:id` (TD3 drill-down). +- `src/commands/takes.ts` extension — `gbrain takes revisit ` + (TD4 / D30) opens $EDITOR on the source page with a + `` cursor marker. +- `src/commands/doctor.ts` extension — 4 new checks: `abandoned_threads`, + `calibration_freshness`, `grade_confidence_drift` (CDX-11 mitigation + surface; math arrives v0.37+), `voice_gate_health`. +- `admin/src/pages/Calibration.tsx` — Calibration tab. Single-column + Linear-calm-clarity layout matching the approved variant-B mockup. + `` wrapper handles `dangerouslySetInnerHTML` for the + server-rendered SVG. +- `admin/src/index.css` extension — `--text-muted: #555 → #777` (TD2, + WCAG AA contrast bump from 4.0 to ~5.5 on the #0a0a0f bg). +- `test/fixtures/calibration/extract-takes-corpus/` — synthetic prompt- + tuning corpus. v0.36.1.0 ships 5 representative pages; full 50-page + + 10-page holdout generated by `gbrain calibration build-corpus` + (v0.37+ subcommand). All anonymized per CLAUDE.md placeholder list. +- `scripts/check-synthetic-corpus-privacy.sh` — CDX-14 mitigation. CI + guard in `bun run verify`. Greps for explicit dollar amounts + + verifies non-essay fixtures reference at least one placeholder name. +- `test/regressions/v0.36.1.0-iron-rule.test.ts` — R1-R5 regression + inventory test file. Pins all 5 IRON-RULE regressions in one place + for future bisects. +- `DESIGN.md` — repo-root design system. Formalizes the de facto admin + tokens that landed v0.26.0. Calibration target for future + `/plan-design-review` and `/design-review`. + ## Thin-client routing (v0.31.1, Issue #734) `gbrain init --mcp-only` (v0.29.2) sets up a thin-client install: no local diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 000000000..a2a61e7c2 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,148 @@ +# DESIGN.md + +The design system source of truth for gbrain. Born from the de facto tokens +that landed in `admin/src/index.css` during the v0.26.0 admin SPA work and +formalized during the v0.36.1.0 Hindsight calibration wave's design review. + +This doc is the calibration target for `/plan-design-review` and `/design-review`. +When a question is "does this UI fit the system?", the answer is here. + +## Voice + +GBrain talks like a smart friend who knows your past, not a clinical scoring +system. Every user-facing string passes through this filter: + +- Second person, contractions allowed. +- Grounded in concrete data the user can verify ("2 of 3 missed" beats + "Brier 0.31"). +- Never preachy. Never "we recommend." Never "according to your data." +- Short. Under 25 words for narrative; under one line for status. +- Numbers grounded in real outcomes, never abstract metrics without + translation. + +Five surfaces use this voice (v0.36.1.0+): +`pattern_statement`, `nudge`, `forecast_blurb`, `dashboard_caption`, +`morning_pulse`. All five pass through `gateVoice()` in +`src/core/calibration/voice-gate.ts` with mode-specific rubrics. A Haiku +judge rejects academic-sounding candidates; up to 2 regens; then fall +back to a hand-written template from `src/core/calibration/templates.ts`. + +## Color tokens + +CSS variables in `admin/src/index.css`. SVG renderer inlines literals +matching these tokens (`src/core/calibration/svg-renderer.ts`). + +| Token | Value | Use | +|--------------------|-----------|-------------------------------------------| +| `--bg-primary` | `#0a0a0f` | Page background | +| `--bg-secondary` | `#14141f` | Sidebar, cards | +| `--bg-tertiary` | `#1e1e2e` | Subtle surfaces, borders | +| `--text-primary` | `#e0e0e0` | Body text | +| `--text-secondary` | `#888` | Headings, labels | +| `--text-muted` | `#777` | Tertiary text — TD2 bumped from #555 for WCAG AA contrast (~5.5:1) | +| `--accent` | `#3b82f6` | Active states, links, primary CTAs | +| `--success` | `#22c55e` | Healthy / ok status | +| `--warning` | `#f59e0b` | Doctor warnings | +| `--error` | `#ef4444` | Failures, destructive confirmations | + +Dark theme is the only theme. No light mode toggle planned — admin is an +operator tool, not a marketing surface. Users live in the terminal with a +dark theme already. + +WCAG contrast: +- Body text (#e0e0e0 on #0a0a0f) → ~14:1, AAA +- Muted text (#777 on #0a0a0f) → ~5.5:1, AA (was 4.0 / fail before TD2) +- Accent links (#3b82f6 on #0a0a0f) → ~5.7:1, AA + +## Typography + +| Variable | Value | Use | +|--------------------|-----------------------------|---------------------------------| +| `--font-sans` | `Inter, system-ui, sans-serif` | UI text, headings, body | +| `--font-mono` | `JetBrains Mono, monospace` | Numbers, slugs, code, terminal-ish data | + +Type scale (de facto, not formalized yet): +- 18px: sidebar logo / page title +- 14px: body +- 13px: nav items +- 12px: chart captions, secondary labels +- 11px: tertiary labels in dense charts + +Numbers in tables and metrics use JetBrains Mono so column alignment is +mechanical. Avoid mixing Inter and JetBrains Mono in the same line. + +## Spacing scale + +4 / 8 / 16 / 24 / 32px. Linear-app-style density: 24-32px between major +sections, 16px between row groups, 8px within a row. The Calibration tab +(approved variant-B mockup) is the canonical example. + +## Layout + +- Sidebar 200px on the left. Active item gets a 3px left-border in `--accent`. +- Main content area uses the remaining width. +- Max content width: 720px for text-heavy pages (Calibration), 960px for + data tables (Request Log). +- No 3-column feature grids. No icons in colored circles. No decorative blobs. +- Cards earn their existence — heading + content works without a card frame + in most cases. + +## Charts + +Server-rendered SVG via `src/core/calibration/svg-renderer.ts`. Pure +functions: data → SVG string. No DOM, no React component, no chart library. + +XSS posture: server-side `escapeXml()` on every caller-controlled string. +Numeric inputs `.toFixed()`-coerced. Admin SPA renders via +`` wrapper with `dangerouslySetInnerHTML`. Endpoint gated by +`requireAdmin` middleware. + +Why server-rendered SVG (per D23): +- Chart logic stays close to the data math. +- Zero new client-side chart-library dep. +- SVG is accessible (text labels), scalable, copy-paste-friendly to PR + descriptions and docs. +- Sets the precedent for future admin charts (contradictions trend, takes + scorecard, etc.). + +Four chart renderers in v0.36.1.0: +- `renderBrierTrend({ series })` — sparkline + baseline reference at 0.25 +- `renderDomainBars({ bars })` — horizontal accuracy bars +- `renderAbandonedThreadsCard(threads)` — text rows + "revisit now" links +- `renderPatternStatementsCard(statements)` — clickable drill-down anchors + +## Interaction patterns + +- Keyboard navigation is REQUIRED for all CLI interaction surfaces. The + propose-queue review uses J/K/space/u/q shortcuts (gmail-style). +- Loading states: "Loading...". Don't show spinners on sub-200ms operations. +- Empty states ARE features: warmth + primary action + context. Cold-brain + Calibration page tells the user EXACTLY how to build a profile, not + "no data available." +- Error states: name what failed + name the next step. Never "an error + occurred — please try again." + +## What's NOT here yet (v0.37+ roadmap) + +- Type scale formalization (current values are de facto, not enforced) +- Animation tokens (admin SPA has zero animations on purpose; v0.37 may + add subtle progress / loading transitions) +- Print stylesheet +- Light mode (NOT planned — see "Dark theme is the only theme" above) +- Component library extraction (the React components live inline in admin/src/pages/; + no `