Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
b861249
schema: v0.36.0.0 Hindsight calibration tables (migrations v67-v71)
garrytan May 17, 2026
af6fe9d
core: BaseCyclePhase abstract class enforces source-scope + budget co…
garrytan May 17, 2026
0fdcc54
cycle: propose_takes phase + take_proposals queue write path (T3)
garrytan May 17, 2026
b3e4fa5
cycle: grade_takes phase + take_grade_cache verdict pipeline (T4)
garrytan May 17, 2026
fd9a4ae
cycle: grade_takes ensemble tiebreaker for borderline verdicts (T5 / E2)
garrytan May 17, 2026
9f0bdca
cycle: calibration_profile phase + shared voice gate across surfaces …
garrytan May 17, 2026
3224227
cli: gbrain calibration + get_calibration_profile MCP op (T7)
garrytan May 17, 2026
0eb125c
think: --with-calibration + anti-bias prompt rewrite (T8 / E1, D22)
garrytan May 17, 2026
ce1576b
contradictions: calibration-profile join (T9 / E3)
garrytan May 17, 2026
c3bb182
calibration: Brier-trend forecast at write time (T10 / E5)
garrytan May 17, 2026
08f59b0
calibration: gstack-learnings coupling on incorrect resolutions (T11 …
garrytan May 17, 2026
8ae71a4
doctor: 4 calibration checks — abandoned/freshness/drift/voice (T12)
garrytan May 17, 2026
8087a1f
calibration: E7 nudge + 14-day cooldown (T13 / D16 F3)
garrytan May 17, 2026
efa0313
calibration: E8 team-brain sharing + D18 cross-brain query semantics …
garrytan May 17, 2026
344b4b8
admin: E6 Calibration tab + D23 server-rendered SVG + TD2 contrast bu…
garrytan May 17, 2026
5a8b11a
recall: calibration footer formatter for morning pulse (T16)
garrytan May 17, 2026
7eddb4b
calibration: --undo-wave reversal command (T17 / D18 CDX-3)
garrytan May 17, 2026
c45b432
calibration: A/B harness for think + ab-report (T18 / D19 CDX-18)
garrytan May 17, 2026
c4e4c67
calibration: pattern drill-down route + revisit-now CLI (TD3 / D29 + …
garrytan May 18, 2026
d3e5377
docs: DESIGN.md — formalize de facto design tokens (TD1)
garrytan May 18, 2026
c122306
calibration: synthetic corpus scaffold + privacy CI guard (T19 + T20)
garrytan May 18, 2026
30f9be7
test: R1-R5 IRON RULE regression inventory (T21)
garrytan May 18, 2026
2cb9587
docs: v0.36.0.0 CHANGELOG + CLAUDE.md anchors + calibration conventio…
garrytan May 18, 2026
80b2c58
cycle: wire propose_takes / grade_takes / calibration_profile into ru…
garrytan May 18, 2026
63ee1e8
merge: master into garrytan/asuncion, bump to v0.36.1.0
garrytan May 18, 2026
62f194c
merge: master v0.35.8.0 (phantom-page redirect) into garrytan/asuncion
garrytan May 18, 2026
e1140b8
merge: master v0.36.0.0 (skillpack-scaffold) into garrytan/asuncion
garrytan May 18, 2026
69a71c9
calibration: expand synthetic corpus + add hand-labeled ground-truth …
garrytan May 18, 2026
04dbab4
calibration: tune propose_takes prompt against synthetic corpus (cat1…
garrytan May 18, 2026
c8968ed
docs: link cat14/cat15 benchmark report from CHANGELOG + README
garrytan May 18, 2026
32162da
docs: fix benchmark-report links — gbrain-evals uses main not master
garrytan May 19, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
111 changes: 111 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,115 @@

All notable changes to GBrain will be documented in this file.

## [0.36.1.0] - 2026-05-17

**The brain learns how you tend to be wrong, then argues against your blind spots on every advice call.**

Hindsight-inspired calibration loop. Three new cycle phases, eight expansions, one big admin tab. gbrain stops being a smart database and starts being a mirror that knows your track record. The brain extracts gradeable claims from your prose, grades them against reality, aggregates your record into a calibration profile, and surfaces it everywhere advice gets given — `gbrain think --with-calibration` rewrites questions against your known biases, contradictions point at the bias pattern that produced them, real-time nudges fire when you commit a high-conviction take in a domain where you've been wrong before.

### What you can now do

**Run `gbrain calibration`** and see your own track record narrated in conversational voice: "You called early-stage tactics well — 8 of 10 held up. Geography is your blind spot — 4 of 6 missed." Not "Brier 0.18." Not "conviction-bucket 0.8-0.9." Friend-not-doctor by design, voice-gated against academic slop.

**Run `gbrain think --with-calibration "should we hire fast in NY?"`** and the answer surfaces both your prior AND the counter-prior from your hedged-domain self. The bias filter applies to question framing, not evidence interpretation (D22 placement: AFTER retrieval, BEFORE the question). Off by default — flip it when you trust the profile.

**Open the admin SPA Calibration tab** at `gbrain serve --http`'s `/admin` and see your Brier-trend sparkline, per-domain accuracy bars, pattern statements, and "you committed to these and never revisited" abandoned-threads card. Server-rendered SVG, zero new chart library dep, follows the existing Linear-calm-clarity admin design.

**Commit a high-conviction take in a known-biased domain** and a conversational stderr nudge fires: "Hey — you just bet pretty hard on this NY tech thing (0.85). Last year you made three similar geographic calls; two of them missed. Maybe dial it back to a 0.65 hunch?" 14-day cooldown per pattern so you don't get re-nudged on the same call.

**Run `gbrain calibration --undo-wave v0.36.1.0`** to fully reverse the wave's mutations. Every new row carries `wave_version='v0.36.1.0'`; the undo command reverts auto-applied take resolutions, deletes calibration profiles, purges nudge logs, optionally scrubs gstack-coupled learnings. Dry-run mode shows what would change before you commit.

### Validated by published benchmarks

**First published benchmark for AI memory systems that reason about user track records.** Sibling repo [gbrain-evals](https://github.com/garrytan/gbrain-evals) ships two new categories specifically gating this wave:

- **cat14 — advice quality A/B.** `think --with-calibration` vs baseline on 8 hand-authored probes covering 6 calibration scenarios (relevant bias, confidence boost, empty profile, irrelevant bias, multi-bias, voice stress). First live run: **75% calibrated wins / 0% baseline wins / 25% tie**, voice gate **100%**, force-fit prevention **100%**.
- **cat15 — propose_takes F1.** Extract-takes prompt against 48 hand-labeled claims across 8 synthetic pages covering 5 prose genres. First live run: **0.952 F1 training / 0.922 F1 holdout**, train-holdout gap **0.03** (no overfitting). The tuned prompt back-ports verbatim into this wave's `EXTRACT_TAKES_PROMPT`, replacing the v1 stub.

The iteration log at `eval/data/cat14-calibration/iteration-log.md` documents three same-day prompt variants where the gate caught two regressions before either could ship. Full benchmark report: [2026-05-18 BrainBench Cat 14 + Cat 15 Calibration](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-05-18-brainbench-cat14-cat15-calibration.md). No prior published baseline exists in the personal-AI calibration-loop space; Hindsight introduced the concept as a skills demo without quantified evaluation. cat14 + cat15 stake out the category.

Reproduction: ~$0.15 per full eval run (cat14 + cat15), under 5 minutes wallclock on Apple Silicon.

### Itemized changes

#### Three new cycle phases

- **`propose_takes`** — LLM scans markdown prose, proposes gradeable claims to a review queue. Idempotency cache on `(source_id, page_slug, content_hash, prompt_version)` — unchanged pages never re-spend tokens. F2 fence-dedup: the LLM sees existing canonical takes and dedupes. v0.36.1.0 ships the tuned `v0.36.1.0-tuned-cat15` prompt (replaces the v1 stub) — validated at **0.922 F1 holdout** by the cat15 eval against the hand-labeled corpus at `test/fixtures/calibration/`.
- **`grade_takes`** — walks unresolved takes older than 6 months, retrieves evidence, asks a judge model. Auto-resolve DISABLED by default (D17 — earn trust first); operator opts in via `cycle.grade_takes.auto_resolve.enabled: true`. Conservative threshold: confidence >= 0.95 single-model OR >= 0.85 with 3/3 ensemble unanimous. Ensemble tiebreaker (E2) reuses v0.27.x cross-modal-eval substrate; fires on borderline single-model verdicts (0.6-0.95 band).
- **`calibration_profile`** — aggregates resolved takes into 2-4 narrative pattern statements + active bias tags. Voice-gated via shared `gateVoice()` (D24 — five surfaces, one function). Cold-brain branch: skips when <5 resolved takes. `grade_completion` field surfaces partial-grade state to the dashboard "60% graded" badge.

#### Eight expansions

- **E1** Anti-bias prompt rewrite at think time (cathedral moment — see "What you can now do").
- **E2** Multi-judge ensemble grading (3-model parallel via Promise.allSettled, 3/3 unanimous required for auto-apply).
- **E3** Calibration-aware contradictions — v0.32.6 probe + bias-tag join; outputs flag "this contradiction fits your over-confident-geography pattern."
- **E4** Outcome-driven gstack-learnings coupling — incorrect auto-resolutions write to `~/.gstack/projects/<slug>/learnings.jsonl` so plan-ceo-review / ship / investigate skills pull calibration data. Config-gated (off for external users until gstack-API stabilizes).
- **E5** Brier-trend forecasting on new takes — inline blurb at write time. Pure math over existing scorecards, no LLM.
- **E6** Admin SPA Calibration tab with 4 server-rendered SVG charts (Brier trend, per-domain accuracy, abandoned threads, clickable pattern drill-downs).
- **E7** Real-time pattern surfacing on sync — conversational nudges with 14-day cooldown via `take_nudge_log`.
- **E8** Team-brain calibration sharing via mounts — asymmetric publish-opt-in (D15) + D18 cross-brain query semantics (4 rules, pinned by `test/cross-brain-calibration.test.ts`).

#### Schema (6 new migrations)

- v67 `calibration_profiles` — per-(source, holder) row with scorecard + narrative + bias tags + voice-gate audit.
- v68 `take_proposals` — propose_takes queue with composite idempotency unique key.
- v69 `take_grade_cache` — grade_takes verdict cache (composite PK on take + prompt_version + judge + evidence_signature).
- v70 `take_nudge_log` — E7 cooldown state (polymorphic FK: take_id OR proposal_id).
- v71 `takes_resolved_at_idx` — partial index for Brier-trend aggregation (CONCURRENTLY on Postgres, plain on PGLite).
- v72 `think_ab_results` — D19 A/B harness data for `gbrain think --ab`.

Every row carries `wave_version='v0.36.1.0'` so `--undo-wave` can reverse precisely.

#### Architecture

- `BaseCyclePhase` abstract class (D21) enforces source-scope threading, budget metering, error envelope, and progress-reporter integration at the type level. Forgetting `sourceScopeOpts(ctx)` becomes a compile error — closes the v0.34.1 leak class structurally for every new phase.
- Voice gate (D11 + D24): single `gateVoice()` function, mode parameter drives surface-specific rubrics. 2 regenerations then hand-written template fallback. Audit columns persisted on `calibration_profiles` rows so the operator can spot voice-rubric drift.
- Cross-brain query semantics (D18): 4 rules, hermetically tested. Subagent prohibition gates on `ctx.viaSubagent && !allowedSlugPrefixes` — closes the OAuth-token-to-cross-brain-leak escalation surface.

#### New surfaces

- CLI: `gbrain calibration`, `gbrain calibration --regenerate`, `gbrain calibration --undo-wave <version> [--dry-run] [--scrub-gstack]`, `gbrain calibration ab-report [--days N]`, `gbrain takes revisit <slug>`, `gbrain takes nudge --reset <id>`, `gbrain think --with-calibration`.
- MCP op: `get_calibration_profile` (scope: read).
- HTTP routes: `/admin/api/calibration/profile`, `/admin/api/calibration/charts/:type`, `/admin/api/calibration/pattern/:id`.
- Doctor checks (4): `abandoned_threads`, `calibration_freshness`, `grade_confidence_drift` (CDX-11 mitigation), `voice_gate_health`.

#### Privacy

- `test/fixtures/calibration/` ships a synthetic corpus (5 representative pages across 5 genres) for extract-takes prompt tuning. Every page is anonymized per the CLAUDE.md placeholder list — no real names, no real funds, no real deals.
- `scripts/check-synthetic-corpus-privacy.sh` (CDX-14 mitigation) is wired into `bun run verify`. Greps for explicit dollar amounts + verifies every non-essay fixture references at least one canonical placeholder. Fail-fast on accidental leakage.
- E7 nudges route to STDERR only in v0.36.1.0; multi-channel routing (webhook, admin SPA toast) is a v0.37+ follow-up.

#### Tests

- 290+ new tests across 13 new test files. All hermetic (no real LLM, no PGLite contention).
- IRON RULE regression suite in `test/regressions/v0.36.1.0-iron-rule.test.ts` pins R1-R5: think baseline unchanged, contradictions output unchanged, takes resolution flow independent of grade_takes, search/list_pages/get_page unchanged, search modes unchanged.

#### Plan + review trail

- `/plan-ceo-review` cleared SCOPE_EXPANSION mode with 19 decisions + 8 accepted expansions.
- `/plan-eng-review` cleared with 9 findings + 21 implementation tasks across 5 lanes.
- `/plan-design-review` cleared with mockup variant-B (Linear calm clarity) approved.
- Codex outside-voice surfaced 18 findings; 5 spec bugs auto-fixed, 6 strategic tensions resolved via D17/D18/D19.
- 30 decisions D1-D30 resolved with explicit user input. Plan persisted at `~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md`.

#### Design system

- `DESIGN.md` at the repo root formalizes the de facto admin tokens that landed v0.26.0. Calibration target for future `/plan-design-review` and `/design-review`.
- `--text-muted` bumped from #555 to #777 globally for WCAG AA contrast (was 4.0 — below the 4.5 floor; now ~5.5).

## To take advantage of v0.36.1.0

`gbrain upgrade` is all you need for the schema. Calibration data only accumulates as you use the brain.

1. Run `gbrain upgrade`. Six migrations v67-v72 land.
2. Build up resolved takes. The calibration profile needs >= 5 resolved takes before it generates anything. Resolve takes manually via `gbrain takes resolve N --quality correct|incorrect|partial` OR wait for grade_takes auto-grading (opt in via `gbrain config set cycle.grade_takes.auto_resolve.enabled true`).
3. Run a dream cycle: `gbrain dream` (or wait for autopilot). The new phases run in this order: propose_takes → grade_takes → calibration_profile.
4. Read the profile: `gbrain calibration`.
5. Try anti-bias think: `gbrain think --with-calibration "<your question>"`.
6. Open the admin SPA: `gbrain serve --http` → /admin → Calibration tab.
7. To roll back the entire wave: `gbrain calibration --undo-wave v0.36.1.0 --dry-run` first to see what would change, then drop `--dry-run`.
8. If anything looks wrong, file an issue: https://github.com/garrytan/gbrain/issues with `gbrain doctor` output and which step broke.

## [0.36.0.0] - 2026-05-17

**Skillpacks are scaffolding now, not amber. Scaffold once, own the files, fork freely.**
Expand Down Expand Up @@ -44,6 +153,7 @@ Major file/test surface changes:
- Extended: `src/core/skillpack/bundle.ts` (frontmatter `sources:` validation, `enumerateScaffoldEntries`), `src/core/repo-root.ts` (`cwd_walk_up` tier)
- New docs: `docs/guides/skillpacks-as-scaffolding.md` (model + workflow)
- `src/core/skillpack/installer.ts` and `test/skillpack-install.test.ts` survive for now — the v0.32 `gbrain skillpack diff` informational command still uses `diffSkill` from there. Slated for deletion in v0.37 cleanup.

## [0.35.8.0] - 2026-05-17

**Phantom unprefixed entity pages drain automatically on the next autopilot cycle. Your `alice.md` residue gets folded into `people/alice-example.md` with embeddings + strikethrough state preserved, no operator action needed.**
Expand Down Expand Up @@ -134,6 +244,7 @@ The original /plan-eng-review proposed reusing `writeFactsToFence` for the migra
- output of `gbrain doctor`
- contents of `~/.gbrain/audit/phantoms-*.jsonl`
- which step looks wrong

## [0.35.7.0] - 2026-05-17

**The contradiction probe grew up. Typed claims over time, regressions detected automatically, founder scorecards as a one-liner.**
Expand Down
Loading
Loading