v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong by garrytan · Pull Request #1139 · garrytan/gbrain

garrytan · 2026-05-18T02:33:37Z

Summary

v0.36.1.0 teaches gbrain to know how the user tends to be wrong and apply that knowledge at every advice surface. One PR with 32 bisect-friendly atomic commits (31 wave commits + 1 merge resolution from master's v0.35.6 + v0.35.7).

The substrate gbrain already had (takes + scorecard + Brier + contradictions probe) was 70% of the way there. This wave closes the other 30%: extract gradeable claims from prose, grade against reality, aggregate into a profile of the user's bias patterns, then apply the profile when giving advice.

What ships

Three new cycle phases: propose_takes (LLM scans prose for gradeable claims), grade_takes (judge model verdicts unresolved takes with retrieval), calibration_profile (aggregates resolved subset into 2-4 conversational pattern statements). All extend the new BaseCyclePhase abstract class which enforces sourceScopeOpts(ctx) threading at the type level — closes the v0.34.1 source-isolation leak class structurally for every future phase.
Six new migrations (v68-v73, renumbered after merge from master's v67 facts_typed_claim_columns): calibration_profiles, take_proposals, take_grade_cache, take_nudge_log, takes_resolved_at_idx (CONCURRENTLY on Postgres), think_ab_results. Every new row stamped with wave_version='v0.36.1.0' for clean --undo-wave reversal.
Eight expansions (E1-E8): anti-bias prompt rewrite at think time (E1), multi-judge ensemble grading (E2), calibration-aware contradictions probe (E3), gstack-learnings coupling on incorrect resolutions (E4), Brier-trend forecasting at write time (E5), admin SPA Calibration tab with server-rendered SVG charts (E6), conversational real-time nudges with 14d cooldown (E7), team-brain calibration sharing via mounts with subagent prohibition (E8).
Conservative safety posture: auto-resolve disabled by default (D17). Operator flips on after seeing first batch of verdicts. Thresholds >=0.95 single-model OR >=0.85 ensemble 3/3 unanimous, schema-enforced monotonic-tightening only.
Cross-brain semantics (D18): local-first → mount-fallback (only with read permission) → cross-brain attribution surfaced in UI → subagent prohibition closes OAuth-token-to-cross-brain-leak surface.
Voice gate (D24): single gateVoice() function, 5 modes, Haiku rubric judge, 2 regens then hand-written template fallback. Pattern statements pass the conversational voice test before storage.
CLI + MCP: gbrain calibration, gbrain calibration --regenerate, gbrain calibration --undo-wave v0.36.1.0, gbrain calibration ab-report, gbrain takes revisit <slug>. New MCP op get_calibration_profile (scope: read, source-scoped).
Admin SPA: new Calibration tab at /admin/calibration with Brier sparkline, per-domain bars, pattern statements, abandoned-threads card. Three SVG endpoints behind requireAdmin. WCAG AA contrast bump on --text-muted (#555 → #777).
Four new doctor checks: abandoned_threads, calibration_freshness, grade_confidence_drift, voice_gate_health.
Synthetic corpus at test/fixtures/calibration/ plus CI privacy guard scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify). Real names of YC partners / portfolio companies / funds cannot leak into committed fixtures.
R1-R5 IRON RULE regression suite (test/regressions/v0.36.1.0-iron-rule.test.ts) pinning think baseline, contradictions output, takes resolution, source-isolation read paths, and search modes all UNCHANGED when calibration is absent.
Documentation: DESIGN.md formalizes de facto admin SPA tokens. skills/conventions/calibration.md is the agent-facing convention. CHANGELOG v0.36.1.0 entry. CLAUDE.md key-files cluster added.

Reviews cleared

Review	Status	Findings
/plan-ceo-review	CLEAR (PLAN)	SCOPE_EXPANSION mode; 8/8 expansions accepted; 30 decisions D1-D30
/codex review	issues_found	18 findings → 5 spec bugs auto-applied, 6 documented tradeoffs, 6 resolved via D17/D18/D19
/plan-eng-review	CLEAR (PLAN)	9 architecture/quality/perf findings, 0 critical gaps. 21 implementation tasks executed.
/plan-design-review	CLEAR (FULL)	5/10 → 9/10. Mockup B (Linear calm clarity) approved. 4 decisions D27-D30.

Test plan

bun run verify — privacy + jsonb + progress + wasm + admin-build + cli-exec + system-of-record + eval-glossary + synthetic-corpus-privacy + typecheck all green
bun run test — 7132 pass / 0 fail / 0 skip across 8 parallel shards + serial pass
Cycle test: runCycle default dispatches all 16 phases in order (was failing before T-fix wired the three new phases into runCycle dispatch)
R1-R5 IRON RULE regression suite green (think baseline unchanged, contradictions output unchanged, takes resolution unaffected, source-isolation paths unchanged, search modes unchanged)
PGLite + Postgres parity: six migrations (v68-v73) replay cleanly on both engines via sqlFor.pglite branches where index DDL differs
Voice gate fixture coverage: pinned academic-rejection + conversational-acceptance + template-fallback paths
Merge from master clean: VERSION trio audit (0.36.1.0 across VERSION, package.json, CHANGELOG.md). Migration renumber preserves master's v67 facts_typed_claim_columns. wave_version literal renamed across schema + tests + docs.

🤖 Generated with Claude Code

Foundation commit for the Hindsight-inspired calibration wave. Adds four new tables + one perf index, all source-scoped from day 1 per v0.34.1 discipline:

- calibration_profiles (v67): per-holder LLM-narrative aggregation of TakesScorecard data. published BOOL gates E8 cross-brain mount sharing (default false). grade_completion REAL surfaces partial-grade state to the dashboard. active_bias_tags TEXT[] with GIN index feeds E3 (calibration-aware contradictions) and E7 (real-time nudge matching).
- take_proposals (v68): propose_takes phase queue. Idempotency cache via (source_id, page_slug, content_hash, prompt_version) unique index mirrors the v0.23 dream_verdicts pattern. proposal_run_id supports --rollback by run. dedup_against_fence_rows JSONB audit column records what canonical takes the LLM was told to dedupe against at proposal time.
- take_grade_cache (v69): grade_takes verdict cache. Composite PK on (take_id, prompt_version, judge_model_id, evidence_signature), so prompt edits OR evidence changes cleanly invalidate prior verdicts. applied=false default + auto-resolve-off-by-default (D17) means every fresh install needs operator opt-in before grade verdicts mutate the takes table.
- take_nudge_log (v70): E7 nudge cooldown state. Polymorphic FK: a nudge fires on either a canonical take OR a pending proposal (CDX-5 fix). CHECK constraint enforces exactly-one-set. channel column lets future routing (webhook, admin SPA toast) reuse the same cooldown semantics.
- takes_resolved_at_idx (v71): partial index for the Brier-trend aggregation queries. Engine-aware handler: Postgres uses CONCURRENTLY to avoid the ShareLock; PGLite uses plain CREATE.

Every table carries wave_version TEXT NOT NULL DEFAULT 'v0.36.0.0' so the v0.36.0.0 calibration --undo-wave command (lands later in the wave) can reverse just this wave's writes.

Plan: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md covers the design rationale (D17/D18/D21 + CDX findings).

Schema parity:

- src/schema.sql for fresh Postgres installs
- src/core/pglite-schema.ts for fresh PGLite installs
- src/core/schema-embedded.ts auto-regenerated from schema.sql
- src/core/migrate.ts for upgrade-in-place from older brains

VERSION bumped to 0.36.0.0 for the wave. CHANGELOG entry lands at /ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntracts D21 from the eng review. Three new v0.36.0.0 cycle phases (propose_takes, grade_takes, calibration_profile) share enough structure that the duplication-vs-abstraction trade tips toward a shared base. Without this scaffold, source-isolation discipline would drift exactly the way it drifted in v0.34.1, except this time across three new surfaces at once.

What this enforces:

1. Phase signature is uniform: run(ctx, opts) → PhaseResult.
2. ctx.sourceId / ctx.auth.allowedSources MUST be threaded through every engine call. The base class surfaces a scope() helper that wraps sourceScopeOpts(ctx) and is the only sanctioned way to read source-scoped data. Forgetting to thread source scope becomes a TypeScript compile error, not a runtime leak. Closes the v0.34.1 leak class structurally for every new phase.
3. Budget meter wraps run() automatically. Subclass declares budgetUsdKey + budgetUsdDefault; base reads the resolved cap from config and creates the BudgetMeter. Subclass calls this.checkBudget() before each LLM submit; budget-exhausted phase still returns status='ok' (clean abort) so the cycle report shows partial completion, not failure.
4. Error envelope is uniform. Thrown errors get caught and converted to status='fail' with a phase-specific error.code via the subclass's mapErrorCode() hook.
5. Progress reporter integration. Base accepts the reporter via opts; subclasses call this.tick() instead of touching the reporter directly, so the phase name in the progress stream is always correct.

Tests: 13 cases in test/core/base-phase.test.ts cover source-scope threading (5 cases including the empty-allowedSources-MUST-NOT-widen-scope regression), PhaseResult shape including the error envelope path (3 cases), dry-run propagation (2 cases), and budget meter construction (3 cases including config-key override).

Synthesize.ts / patterns.ts (existing pre-v0.36 phases) deliberately do NOT retrofit to this base in v0.36.0.0: too much churn for a refactor that doesn't pay off until v0.37+. Future phases use this by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
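The base-class contract above can be sketched roughly as follows. This is a minimal, synchronous illustration, not the code in this PR: `BaseCyclePhaseSketch`, `DemoPhase`, the `execute()` hook, the branded `SourceScope` type, and every field shape are assumptions made for the example. It shows three of the five points: scope is minted only by scope(), the budget check aborts cleanly with status='ok', and thrown errors are wrapped via mapErrorCode().

```typescript
// All names below are illustrative; the real gbrain types are not shown in the PR text.
type PhaseStatus = "ok" | "warn" | "fail";

interface PhaseResult {
  phase: string;
  status: PhaseStatus;
  details: Record<string, unknown>;
  error?: { code: string; message: string };
}

interface PhaseContext {
  sourceId?: string;
  auth?: { allowedSources?: string[] };
}

interface PhaseOpts {
  config?: Record<string, number>; // resolved config values, keyed by budgetUsdKey
}

// Branded type: only scope() mints one, so a subclass cannot pass an unscoped filter along.
declare const scopeBrand: unique symbol;
type SourceScope = { sourceId?: string; sourceIds?: string[] } & { readonly [scopeBrand]: true };

abstract class BaseCyclePhaseSketch {
  abstract readonly name: string;
  abstract readonly budgetUsdKey: string;
  abstract readonly budgetUsdDefault: number;

  private capUsd = 0;
  private spentUsd = 0;

  protected abstract execute(scope: SourceScope, opts: PhaseOpts): Record<string, unknown>;
  protected abstract mapErrorCode(err: unknown): string;

  // An empty allowedSources list stays an empty filter; it never widens to "all sources".
  protected scope(ctx: PhaseContext): SourceScope {
    const raw: { sourceId?: string; sourceIds?: string[] } = {};
    if (ctx.sourceId !== undefined) raw.sourceId = ctx.sourceId;
    else if (ctx.auth?.allowedSources !== undefined) raw.sourceIds = [...ctx.auth.allowedSources];
    return raw as SourceScope;
  }

  // Returns false once the next call would exceed the cap; the subclass then stops cleanly.
  protected checkBudget(nextCallUsd: number): boolean {
    if (this.spentUsd + nextCallUsd > this.capUsd) return false;
    this.spentUsd += nextCallUsd;
    return true;
  }

  run(ctx: PhaseContext, opts: PhaseOpts = {}): PhaseResult {
    this.capUsd = opts.config?.[this.budgetUsdKey] ?? this.budgetUsdDefault;
    this.spentUsd = 0;
    try {
      return { phase: this.name, status: "ok", details: this.execute(this.scope(ctx), opts) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { phase: this.name, status: "fail", details: {}, error: { code: this.mapErrorCode(err), message } };
    }
  }
}

class DemoPhase extends BaseCyclePhaseSketch {
  readonly name = "demo_phase";
  readonly budgetUsdKey = "cycle.demo_phase.budget_usd";
  readonly budgetUsdDefault = 1;
  shouldThrow = false;

  protected execute(scope: SourceScope): Record<string, unknown> {
    if (this.shouldThrow) throw new Error("boom");
    let calls = 0;
    while (calls < 5 && this.checkBudget(0.4)) calls++; // pretend each LLM call costs $0.40
    return { calls, budget_exhausted: calls < 5, scope };
  }

  protected mapErrorCode(): string {
    return "DEMO_PHASE_FAILED";
  }
}
```

The brand is one way to turn "forgot to thread scope" into a compile error: any engine method typed to accept `SourceScope` rejects a hand-built object literal. Whether the actual base class uses branding or another mechanism is not stated in the commit message.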

LLM-based take extraction from markdown prose. Walks pages updated since last cycle, sends each page's body to a tuned extractor, writes the extracted gradeable claims to the take_proposals queue. User accepts / rejects via `gbrain takes propose --review` (lands in Lane C). Cycle wiring: lint → backlinks → sync → synthesize → extract → extract_facts → resolve_symbol_edges → patterns → recompute_emotional_weight → consolidate → propose_takes (NEW) → grade_takes (NEW; T4) → calibration_profile (NEW; T6) → embed → orphans → purge CyclePhase enum extended with 3 new entries; ALL_PHASES + NEEDS_LOCK_PHASES updated. All three new phases acquire the cycle lock (writes to take_proposals / take_grade_cache / calibration_profiles). Idempotency contract: The (source_id, page_slug, content_hash, prompt_version) composite unique index on take_proposals means an unchanged page never re-spends LLM tokens. Bumping PROPOSE_TAKES_PROMPT_VERSION cleanly invalidates the cache so a tuned prompt re-runs proposals on every page. Mirrors the v0.23 dream_verdicts pattern. F2 fence dedup: The phase reads the page's existing `` fence (when present) and passes the canonical take rows to the extractor as "things you have already captured." Prevents duplicate proposals when prose is appended to a page that already has takes. Records the fence rows the LLM was told to dedupe against on the take_proposals row for audit (dedup_against_fence_rows JSONB). Auto-resolve posture: propose_takes only WRITES proposals to the queue. Nothing in this phase mutates the canonical takes table. Operator opt-in via the queue review CLI (Lane C) is the only path from queue to canonical fence (D17). Prompt tuning status (v0.36.0.0 ship state): The default extractor prompt is annotated `v0.36.0.0-stub`. The real tuned prompt arrives via T19 synthetic corpus build (50 anonymized pages, 3-model parallel extraction, user reviews disagreement set, F1 ≥ 0.85 on training corpus + F1 ≥ 0.8 on ground-truth holdout). 
Until T19 lands, propose_takes runs but produces best-effort candidates the user reviews manually. Architecture: ProposeTakesPhase extends BaseCyclePhase (T2). Inherits source-scope threading via scope(), budget metering via this.checkBudget(), error envelope wrapping. budgetUsdKey: cycle.propose_takes.budget_usd (default $5/cycle). Budget exhaustion mid-page returns status='warn' with details.budget_exhausted=true — clean partial-completion semantics. Test seam: opts.extractor injection so the phase can run hermetically without touching the gateway. defaultExtractor (production path) calls gateway.chat with the EXTRACT_TAKES_PROMPT and parses the JSON array output via parseExtractorOutput. parseExtractorOutput defends against common LLM output sins: markdown code fence wrapping, leading prose, single-object instead of array, unknown kind values, weight out of [0,1], rows missing claim_text or exceeding 500 chars. Tests: 25 cases in test/propose-takes.test.ts cover the 4 pure helpers (parseExtractorOutput, contentHash, hasCompleteFence, extractExistingTakesForDedup) + 7 phase integration scenarios (happy path, cache hit, fence dedup, extractor failure, empty pages, skipPagesWithFence, proposal_run_id stability). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
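A defensive parser of the kind described for parseExtractorOutput might look like the sketch below. The function name `parseExtractorOutputSketch`, the `ProposedTake` shape, and the two kind labels are hypothetical. Clamping out-of-range weights (rather than dropping the row) and dropping rows with a missing weight are also assumptions; the commit message only says the parser "defends against" those cases.

```typescript
interface ProposedTake {
  claim_text: string;
  kind: string;
  weight: number;
}

// Hypothetical kind labels; the PR text does not list the real ones.
const HYPOTHETICAL_KINDS: ReadonlySet<string> = new Set(["prediction", "judgment"]);

function parseExtractorOutputSketch(
  raw: string,
  allowedKinds: ReadonlySet<string> = HYPOTHETICAL_KINDS,
): ProposedTake[] {
  // Slicing from the first bracket to the last one drops leading prose and code-fence wrapping.
  const start = raw.search(/[\[{]/);
  const end = Math.max(raw.lastIndexOf("]"), raw.lastIndexOf("}"));
  if (start === -1 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return []; // unrecoverable output yields no proposals instead of a thrown error
  }

  // A single object is treated as a one-element array.
  const rows: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const out: ProposedTake[] = [];
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const r = row as Record<string, unknown>;
    const claimRaw = r["claim_text"];
    const kind = r["kind"];
    const weight = r["weight"];
    const claim = typeof claimRaw === "string" ? claimRaw.trim() : "";
    if (claim.length === 0 || claim.length > 500) continue; // missing or oversized claim
    if (typeof kind !== "string" || !allowedKinds.has(kind)) continue; // unknown kind
    if (typeof weight !== "number" || !Number.isFinite(weight)) continue; // assumption: drop
    out.push({ claim_text: claim, kind, weight: Math.min(1, Math.max(0, weight)) }); // assumption: clamp
  }
  return out;
}
```

A response whose leading prose itself contains a bracket would fail JSON.parse here and return an empty list, which is the conservative outcome for a review queue.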

Walks unresolved takes that are old enough to have outcome data, retrieves evidence from the brain, asks a judge model to verdict each one. Writes verdicts to take_grade_cache. Optionally — only when operator has flipped the opt-in config flag — auto-applies high-confidence verdicts to the canonical takes table via engine.resolveTake. Auto-resolve posture (D17 — DISABLED by default): On a fresh install, grade_takes runs and writes verdicts to the cache, but applied=false on every row. Operator reviews the queue, then flips `cycle.grade_takes.auto_resolve.enabled: true` once trust is earned. Mirrors the propose_takes review-queue posture: queue exists, mutation requires explicit opt-in. Conservative threshold (D12): When auto_resolve.enabled is true, a verdict auto-applies only when confidence >= 0.95 (single-judge path). T5 ensemble path lands next, tightening this further with 3/3 unanimous requirement. 'unresolvable' verdict NEVER auto-applies even at confidence=1.0 — there's no canonical column for "we tried and there's no evidence yet." Evidence retrieval status (v0.36.0.0 ship state): The default evidence retriever returns an "evidence-retrieval not yet wired" placeholder. Most verdicts produced by the stub-judge against the stub-evidence will be 'unresolvable'. Real retrieval (hybrid search over pages newer than the take's since_date, optionally augmented by a gateway web-search recipe in v0.37+) lands as a follow-up. Documented limitation per CDX-8 + D17 — the phase ships now so the wiring is real and the cache table accumulates verdicts even if early ones are conservative. Cache key: Composite primary key on take_grade_cache is (take_id, prompt_version, judge_model_id, evidence_signature). Prompt edits OR evidence changes OR judge swap cleanly invalidate prior verdicts. Mirrors the v0.32.6 eval_contradictions_cache pattern. evidence_signature = SHA-256 of (judge_model_id + '|' + evidence_text) so identical evidence under a different judge does NOT collide. 
Architecture: GradeTakesPhase extends BaseCyclePhase. Inherits source-scope threading, budget metering (cycle.grade_takes.budget_usd, default $3/cycle), error envelope. Test seam: opts.judge + opts.evidenceRetriever injection so the phase runs hermetically.

parseJudgeOutput defends against fence-wrapping, leading prose, out-of-range confidence (clamps to [0,1]), invalid verdict labels, oversized reasoning (truncated at 400 chars). Returns null on unrecoverable parse; the caller treats null as "judge_output_parse_failed / unresolvable at confidence 0.0" so the row still lands in cache with the parse failure surfaced via warnings.

takeIsOldEnough gates on since_date (default 6 months). Tolerates YYYY-MM-DD and YYYY-MM formats. Returns false on null/unparseable since_date so takes without dates never get graded (we'd be hallucinating temporal context).

Tests: 23 cases covering parseJudgeOutput (7 cases), evidenceSignature (3), takeIsOldEnough (5), and 8 phase integration scenarios: happy path, D17 auto-resolve-off default, D12 above-threshold auto-apply, below-threshold cache-only, unresolvable-NEVER-applies, cache hit, too-recent gate, judge-throw warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
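The age gate and the single-judge auto-apply rule described in this commit can be sketched as two small pure functions. Names and signatures (`takeIsOldEnoughSketch`, `shouldAutoApplySingleSketch`, the explicit `now` parameter) are assumptions for illustration. Treating a YYYY-MM date as the first of that month, and rejecting impossible days such as 2025-02-31, are choices made here, not documented behavior.

```typescript
function takeIsOldEnoughSketch(
  sinceDate: string | null | undefined,
  now: Date,
  minAgeMonths = 6,
): boolean {
  if (!sinceDate) return false; // no date means no temporal context, so never grade
  const m = /^(\d{4})-(\d{2})(?:-(\d{2}))?$/.exec(sinceDate.trim());
  if (m === null) return false;

  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = m[3] === undefined ? 1 : Number(m[3]); // assumption: YYYY-MM means day 1
  if (month < 1 || month > 12 || day < 1 || day > 31) return false;

  const since = new Date(Date.UTC(year, month - 1, day));
  if (since.getUTCMonth() !== month - 1) return false; // the day rolled into the next month

  // Date.UTC normalizes negative month indexes, so subtracting months across a year works.
  const cutoff = Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - minAgeMonths, now.getUTCDate());
  return since.getTime() <= cutoff;
}

// D17 + D12 for the single-judge path: operator opt-in, high confidence, and a resolvable verdict.
function shouldAutoApplySingleSketch(
  verdict: string,
  confidence: number,
  autoResolveEnabled: boolean,
  threshold = 0.95,
): boolean {
  return autoResolveEnabled && verdict !== "unresolvable" && confidence >= threshold;
}
```

Passing `now` explicitly keeps the gate deterministic under test; a production call site would presumably pass the current time.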

Multi-judge ensemble tiebreaker, additive on top of T4's single-judge foundation. Reuses gateway.chat as the per-model judge interface; runs three judges in parallel via Promise.allSettled. Pure aggregation logic in aggregateEnsemble(): no SQL, no LLM, hermetically testable.

When ensemble fires (T5 trigger band): Only when ALL of:

- opts.useEnsemble === true (default false)
- opts.ensembleJudges array is non-empty
- single-model confidence in [0.6, 0.95) (configurable via opts.ensembleTriggerBand)
- single-model verdict !== 'unresolvable'

At or above 0.95 the single judge is already sufficient (T4 path). Below 0.6 the verdict is clearly review-only; ensemble wouldn't change the posture. 'unresolvable' from single-judge means no evidence yet; calling three more judges on the same evidence won't manufacture some.

Conservative auto-apply (D12): Ensemble verdict auto-applies via engine.resolveTake only when ALL of:

- autoResolve === true (operator opt-in per D17)
- ensemble.agreement === 3 (3/3 unanimous)
- ensemble.minConfidence >= ensembleThreshold (default 0.85)
- winning verdict !== 'unresolvable'

Schema-level monotonic-tightening guard for ensembleThreshold lives in the takes resolution layer.

Cache identity: When ensemble fires, the cache row's judge_model_id becomes 'ensemble:<modelA>+<modelB>+<modelC>', so a future re-run with different ensemble membership doesn't collide with prior verdicts. evidence_signature is recomputed because it includes the judge_model_id.

aggregateEnsemble (pure):

- 3/3 unanimous → agreement=3, minConfidence=min across the three
- 2/3 majority → agreement=2, minConfidence across the agreeing two
- 1/1/1 disagreement → tie-break: prefer non-'unresolvable', then alphabetical for determinism
- 'unresolvable' from one model NEVER tips a 2-vote majority toward 'unresolvable'; by-label tally only counts a model toward its own label
- All three judges failing (allSettled rejected) → verdict='unresolvable' with agreement=0; auto-apply path blocked
- Single judge survives + two fail → agreement=1; the lone verdict wins but auto-apply gated by the 3/3 requirement

Tests: 16 cases.

- aggregateEnsemble (6): 3/3, 2/3, 1/1/1, unresolvable-tipping-resistance, all-failed, partial-failed-but-survives.
- Phase trigger conditions (5): useEnsemble=false default, useEnsemble=true in borderline band, single >= 0.95 skip, single < 0.6 skip, single = 'unresolvable' skip.
- Phase auto-apply rules (5): 3/3+threshold+autoResolve, 2/3 majority no apply, 3/3 below threshold no apply, one ensemble judge throws still aggregates from allSettled, empty ensembleJudges falls through to single.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
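The aggregation rules listed for aggregateEnsemble can be sketched as a by-label tally. This is an illustration under assumptions: `aggregateEnsembleSketch` takes `null` for a rejected judge instead of a PromiseSettledResult, and the type names are made up for the example. The four verdict labels are the ones named elsewhere in this PR.

```typescript
type Verdict = "correct" | "incorrect" | "partial" | "unresolvable";

interface JudgeVote {
  verdict: Verdict;
  confidence: number;
}

interface EnsembleResult {
  verdict: Verdict;
  agreement: number; // how many judges voted for the winning label
  minConfidence: number; // lowest confidence among those agreeing judges
}

// `null` stands in for a judge whose promise was rejected.
function aggregateEnsembleSketch(settled: ReadonlyArray<JudgeVote | null>): EnsembleResult {
  const votes = settled.filter((v): v is JudgeVote => v !== null);
  const allFailed: EnsembleResult = { verdict: "unresolvable", agreement: 0, minConfidence: 0 };
  if (votes.length === 0) return allFailed;

  // By-label tally: each vote counts only toward its own label.
  const byLabel = new Map<Verdict, number[]>();
  for (const v of votes) {
    const list = byLabel.get(v.verdict) ?? [];
    list.push(v.confidence);
    byLabel.set(v.verdict, list);
  }

  // Most votes first; ties prefer a non-'unresolvable' label, then alphabetical order.
  const ranked = [...byLabel.entries()].sort(([aLabel, a], [bLabel, b]) => {
    if (a.length !== b.length) return b.length - a.length;
    const aUnresolvable = aLabel === "unresolvable" ? 1 : 0;
    const bUnresolvable = bLabel === "unresolvable" ? 1 : 0;
    if (aUnresolvable !== bUnresolvable) return aUnresolvable - bUnresolvable;
    return aLabel < bLabel ? -1 : aLabel > bLabel ? 1 : 0;
  });

  const top = ranked[0];
  if (top === undefined) return allFailed;
  const [verdict, confidences] = top;
  return { verdict, agreement: confidences.length, minConfidence: Math.min(...confidences) };
}

// D12 + D17: unanimous, above threshold, operator opted in, and a resolvable verdict.
function shouldAutoApplyEnsembleSketch(
  result: EnsembleResult,
  autoResolve: boolean,
  ensembleThreshold = 0.85,
): boolean {
  return (
    autoResolve &&
    result.agreement === 3 &&
    result.minConfidence >= ensembleThreshold &&
    result.verdict !== "unresolvable"
  );
}
```

Because the tally is keyed by label, a lone 'unresolvable' vote can never add weight to another label or pull a two-vote majority toward itself.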

…(T6) The calibration narrative layer. Reads TakesScorecard, asks an LLM to write 2-4 conversational pattern statements ("right on tactics, late on macro by 18 months"), passes them through the voice gate, derives active bias tags, writes the row to calibration_profiles. This is the read-side that E1 (think anti-bias rewrite), E3 (contradictions join), E6 (dashboard), and E7 (real-time nudges) all consume. Voice gate (D24 — single function, multiple surfaces): ALL five calibration UX surfaces import the same gateVoice() function from src/core/calibration/voice-gate.ts. Mode parameter ('pattern_statement' | 'nudge' | 'forecast_blurb' | 'dashboard_caption' | 'morning_pulse') drives surface-specific tuning via the rubric the gate ships to its Haiku judge. NO forked implementations — voice rubric drift would defeat the gate. Each mode's rubric explicitly forbids preachy / clinical / corporate voice; a structural test pins this. Anchors the cross-cutting voice rule from /plan-ceo-review D2-D8. Fallback policy (D11): Up to 2 generation attempts (configurable). On both rejects → fall back to a hand-written template from src/core/calibration/templates.ts. Templates are intentionally short and a little "robotic" — they're the safety net, not the destination. voice_gate_passed=false + voice_gate_attempts get persisted on the calibration_profiles row so the operator can review the failing examples and tune the rubric over time. Suppressing the surface silently is NEVER an option — that's how voice quality silently degrades. parseJudgeOutput defaults to 'academic' on parse failure (NEVER passes pass-through) so a Haiku output garble falls through to the template rather than letting unverified text reach the user. calibration_profile phase: Extends BaseCyclePhase. Cold-brain skip: <5 resolved takes → no row written, no LLM call. 
Otherwise: scorecard via engine.getScorecard() → patterns via voice-gated generator → bias tags via separate generator (best-effort; failure logs warning, phase continues). The DB INSERT lands in the v67 calibration_profiles row with source_id, holder, the patterns, voice gate audit fields, active bias tags, and grade_completion (F1 fix — partial-grade state surfaces to the dashboard "60% graded" badge). Budget gate at $0.50/cycle default (mostly Haiku). Below-budget before-LLM-call check returns status='warn' without writing the row. Per-domain scorecards are a placeholder for v0.36.0.0 ship state — the F12 batchGetTakesScorecards() engine method that powers per-domain rendering lands in Lane C alongside the CLI/MCP surface. Architecture: parsePatternStatementsOutput is tolerant of LLM emitting numbered lists / bulleted lines despite the prompt asking for plain lines. Caps at 4 patterns + drops excessively long lines (>200 chars). parseBiasTagsOutput lowercases input + drops non-kebab-case tokens (defends against the LLM emitting "Over-Confident Geography" with spaces or capitals). Caps at 4 tags. Tests: 43 cases across two new test files. voice-gate.test.ts (24): parseJudgeOutput (7), gateVoice happy path (3), fallback path (5), mode parity (2), templates (7). calibration-profile.test.ts (19): parsers (10), pickFallbackSlots (3), phase integration (6 — cold-brain skip, happy path, voice gate fallback, grade_completion plumbed through, bias-tags failure non-fatal, source_id scope reaches INSERT). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
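The two tolerant parsers described under Architecture can be approximated as below. The function names carry a `Sketch` suffix because the real signatures are not shown in the commit message. Splitting tag output on newlines and commas is an assumption about the prompt's output shape.

```typescript
// Strips list markers the LLM may add despite being asked for plain lines,
// drops empty and overlong lines, and keeps at most four statements.
function parsePatternStatementsSketch(raw: string): string[] {
  return raw
    .split(/\r?\n/)
    .map((line) => line.replace(/^\s*(?:[-*•]|\d+[.)])\s+/, "").trim())
    .filter((line) => line.length > 0 && line.length <= 200)
    .slice(0, 4);
}

// Lowercases the whole output, then keeps only tokens that are strict kebab-case.
function parseBiasTagsSketch(raw: string): string[] {
  const kebab = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
  return raw
    .toLowerCase()
    .split(/[\n,]/) // assumption: one tag per line or comma-separated
    .map((token) => token.trim())
    .filter((token) => kebab.test(token))
    .slice(0, 4);
}
```

With this shape, "Over-Confident Geography" lowercases to a token that still contains a space, so it is dropped rather than silently rewritten into a tag the model never emitted.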

Public-facing read surface for the v0.36.0.0 calibration wave. CLI prints the active calibration profile; MCP op exposes the same data path for agents. Mirror of the v0.29 salience/anomalies shape (pure data fn + JSON formatter + human formatter + thin CLI dispatch). CLI: `gbrain calibration` Flags: --holder <id> specific holder (default 'garry') --json machine output for piping --regenerate run calibration_profile phase now --undo-wave <ver> [placeholder — wires in Lane D / T17] ab-report [placeholder — wires in Lane D / T18] Human output: Calibration profile — holder: garry, source: default Generated: <local timestamp> [Note: built on 60% graded — partial completion this cycle.] (when grade_completion < 0.9) [Note: voice gate fell back to template (2 attempts).] (when voice_gate_passed=false) Resolved: 12 takes Brier: 0.210 (lower is better) Accuracy: 60.0% Partial: 10.0% Pattern statements: • You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography Cold-brain fallback message names the exact dream command to run. MCP: `get_calibration_profile` (scope: read) Param: holder?: string (defaults to 'garry') Returns: latest CalibrationProfileRow | null Source-scoping via sourceScopeOpts(ctx): scalar source-bound clients see only their source; federated_read scopes see the union of allowed sources; no source filter when neither is set (CLI default path). Throws GBrainError('INVALID_HOLDER') on empty/non-string holder so remote callers get a structured error instead of a SQL-shape failure. Architecture: getLatestProfile is the pure data fn — engine + opts → CalibrationProfileRow | null. Reused by both the CLI and the MCP op. Source-scoped via the standard v0.34.1 spread pattern (scalar sourceId vs sourceIds array). formatProfileText is pure — null → cold-brain message, populated → full printout. Annotates partial-grade rows and voice-gate-fallback rows so the operator sees data-quality status inline. 
parseArgs is exported via __testing for unit coverage. Sub-command ('ab-report') vs flag distinction is intentional — keeps the surface parallel with `gbrain eval cross-modal` etc. Tests: 21 cases. parseArgs (6 cases): empty, --holder, --json, --regenerate, --undo-wave, ab-report. getLatestProfile (5 cases): happy, null, scalar source scope, federated array scope, no-source-filter default. formatProfileText (5 cases): cold-brain, happy, partial-grade note, voice-fallback note, published-to-mounts note. getCalibrationProfileOp (5 cases): default holder, scalar source scope, federated scope union, returns-null-on-unknown-holder, throws on empty holder. Lane D follow-ups: --undo-wave (T17) and ab-report (T18) print a clear "lands in Lane D" stderr line + exit 2; the surfaces exist for early testers, the implementations land next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
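The source-scoping rules for the read path (scalar source, federated union, no filter on the CLI default path) can be illustrated with an in-memory stand-in for the engine query. Everything here is an assumption for the example: the row shape, `sourceScopeOptsSketch`, `getLatestProfileSketch`, and `InvalidHolderError`, which only mimics the structured GBrainError('INVALID_HOLDER') behavior described above.

```typescript
interface ProfileRowLite {
  source_id: string;
  holder: string;
  generated_at: string; // ISO-8601, so string comparison orders chronologically
}

interface ScopeCtx {
  sourceId?: string;
  allowedSources?: string[];
}

interface SourceFilter {
  sourceId?: string;
  sourceIds?: string[];
}

class InvalidHolderError extends Error {
  readonly code = "INVALID_HOLDER";
}

// Scalar source wins; otherwise the federated list; otherwise no filter (CLI default path).
function sourceScopeOptsSketch(ctx: ScopeCtx): SourceFilter {
  if (ctx.sourceId !== undefined) return { sourceId: ctx.sourceId };
  if (ctx.allowedSources !== undefined) return { sourceIds: ctx.allowedSources };
  return {};
}

function getLatestProfileSketch(
  rows: ReadonlyArray<ProfileRowLite>,
  filter: SourceFilter,
  holder: unknown = "garry",
): ProfileRowLite | null {
  // Validate before "querying" so remote callers get a structured error.
  if (typeof holder !== "string" || holder.trim() === "") {
    throw new InvalidHolderError("holder must be a non-empty string");
  }
  let latest: ProfileRowLite | null = null;
  for (const row of rows) {
    if (row.holder !== holder) continue;
    if (filter.sourceId !== undefined && row.source_id !== filter.sourceId) continue;
    if (filter.sourceIds !== undefined && !filter.sourceIds.includes(row.source_id)) continue;
    if (latest === null || row.generated_at > latest.generated_at) latest = row;
  }
  return latest;
}
```

Sharing one pure data function between the CLI and the MCP op means both surfaces inherit the same scoping and the same null-on-unknown-holder behavior.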

Optional anti-bias rewrite mode for `gbrain think`. When set, the active calibration profile gets injected per the D22 placement spec (AFTER retrieval evidence, BEFORE the user's question). The bias filter applies to QUESTION FRAMING, not evidence interpretation — matches LLM-as-judge best practice (bias prompts near end of context perform better). Default behavior unchanged (R1 regression guard): omitting --with-calibration produces the v0.28-vintage user-message shape with the question first, then retrieval. Existing think users see no change. Two user-message shapes in buildThinkUserMessage: Default (no calibration): Question: X <pages>...</pages> <takes>...</takes> <graph>...</graph> Respond with a single JSON object... With calibration (D22): <pages>...</pages> <takes>...</takes> <graph>...</graph> <calibration holder="garry"> Track record: Brier 0.210 (lower is better). Active patterns: - You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography </calibration> Question: X Respond... Calibration block is built by buildCalibrationBlock (exported for the E3 contradictions probe to render the same shape). System prompt extension (withCalibration:true): - Names BOTH the user's PRIOR (default reasoning) AND the COUNTER-PRIOR from their hedged-domain self. - References active bias tags by name when relevant ("this fits the over-confident-geography pattern"). - Does NOT silently substitute the debiased answer. ALWAYS surfaces both priors transparently. - Adds a "Calibration" section between Conflicts and Gaps in the answer body. RunThinkOpts extension: - withCalibration?: boolean — opt-in - calibrationHolder?: string — defaults to 'garry' When withCalibration=true and no profile exists, runThink falls back to baseline behavior + pushes NO_CALIBRATION_PROFILE to warnings (visible to the operator). When the calibration fetch fails, CALIBRATION_FETCH_FAILED warning surfaces with the underlying error. 
Either path keeps think working; the calibration loop is enhancement, not requirement. CLI: `gbrain think "<q>" --with-calibration [--calibration-holder <id>]` Tests: 11 cases. buildThinkSystemPrompt (4 cases): R1 regression — default/false/omitted → no anti-bias rules; with calibration → adds PRIOR + COUNTER-PRIOR + bias-tag reference; preserves existing hard rules. buildCalibrationBlock (3 cases): happy path, null brier omitted (not "Brier null"), empty patterns + tags still well-formed. buildThinkUserMessage (4 cases): R1 regression — without calibration: question first; D22 placement — retrieval → calibration → question → instruction; graph + calibration ordering; empty retrieval blocks render placeholders without breaking shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
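The two user-message shapes and the D22 placement can be sketched as follows. The function names carry a `Sketch` suffix and the profile shape is hypothetical; exact line breaks inside the calibration block are a guess based on the flattened example above, and rendering "(none)" for empty lists is an assumption.

```typescript
interface CalibrationProfileLite {
  holder: string;
  brier: number | null;
  patterns: string[];
  activeBiasTags: string[];
}

function buildCalibrationBlockSketch(profile: CalibrationProfileLite): string {
  const lines: string[] = [`<calibration holder="${profile.holder}">`];
  // A null Brier is omitted entirely so the prompt never contains "Brier null".
  if (profile.brier !== null) {
    lines.push(`Track record: Brier ${profile.brier.toFixed(3)} (lower is better).`);
  }
  lines.push("Active patterns:");
  if (profile.patterns.length === 0) lines.push("(none)"); // assumption: placeholder text
  for (const statement of profile.patterns) lines.push(`- ${statement}`);
  const tags = profile.activeBiasTags.length > 0 ? profile.activeBiasTags.join(", ") : "(none)";
  lines.push(`Active bias tags: ${tags}`);
  lines.push("</calibration>");
  return lines.join("\n");
}

function buildThinkUserMessageSketch(
  question: string,
  retrieval: string, // pre-rendered <pages>/<takes>/<graph> blocks
  instruction: string,
  calibration?: CalibrationProfileLite,
): string {
  if (calibration === undefined) {
    // Default shape (R1 guard): question first, then retrieval.
    return [`Question: ${question}`, retrieval, instruction].join("\n");
  }
  // D22 placement: retrieval, then calibration, then the question, then the instruction.
  return [retrieval, buildCalibrationBlockSketch(calibration), `Question: ${question}`, instruction].join("\n");
}
```

Keeping the default branch textually separate from the calibration branch is what makes the R1 "unchanged when calibration is absent" guarantee easy to pin in a test.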

Cross-references each contradiction finding against the active calibration profile. When a contradiction's domain matches an active bias tag (e.g. "over-confident-geography" or "late-on-macro-tech"), the output gains a one-line bias context explaining which pattern this fits. Pure functions only — no DB writes, no LLM calls. The probe runner imports tagFindingWithCalibration() and applies it to each finding before emitting. When no profile exists or no tags match, the helper returns null and the runner emits the unchanged finding (regression R2 — contradictions output is byte-identical to v0.32.6 when no calibration profile is present). Match heuristic (v0.36.0.0 ship-state): Bias tags are kebab-case axis-then-domain slugs ('over-confident-geography'). computeDomainHint() extracts a domain hint from the finding's slugs + holder + verdict text: - wiki/companies/... → hiring | market-timing - wiki/people/... → founder-behavior - macro / geography / tactics / ai segments in slug → matching tag First-match-wins for ordering determinism. Match is intentionally fuzzy — the v0.32.6 contradictions probe doesn't yet carry structured domain metadata. v0.37+ structured-domain-on-takes (Hindsight-style enum) tightens this. Output: Returns { bias_tag: string, context: string } | null. Context format: "This contradiction fits your active bias pattern \"<tag>\" (Brier 0.31). Verdict: contradiction; severity: medium. Consider reviewing both sides through the lens of that pattern." Tests: 13 cases. R2 regression (2): null profile → null tag; empty active_bias_tags → null tag. computeDomainHint (5): companies / people / macro / geography / unknown paths produce expected hints. Match path (4): macro→late-on-macro-tech, geography→over-confident-geography, mismatch returns null, first-match-wins with multiple candidate tags. buildBiasContextString (2): emits tag+verdict+severity+Brier; omits Brier when null (no "Brier null" leak). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
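A rough sketch of the tagging helper follows. It only covers the slug-based part of the heuristic; the commit message says the real computeDomainHint() also looks at holder and verdict text. All names and shapes are assumptions, including how a hint is compared against a tag (here: the hint must appear as a whole hyphen-delimited run inside the tag). The context string follows the format quoted above.

```typescript
interface FindingLite {
  slug: string;
  verdict: string;
  severity: string;
}

interface ProfileTagsLite {
  active_bias_tags: string[];
  brier: number | null;
}

function computeDomainHintsSketch(slug: string): string[] {
  const lower = slug.toLowerCase();
  const segments = lower.split("/");
  const hints: string[] = [];
  for (const key of ["macro", "geography", "tactics", "ai"]) {
    if (segments.includes(key)) hints.push(key);
  }
  if (lower.startsWith("wiki/companies/")) hints.push("hiring", "market-timing");
  if (lower.startsWith("wiki/people/")) hints.push("founder-behavior");
  return hints;
}

function buildBiasContextStringSketch(tag: string, finding: FindingLite, brier: number | null): string {
  const brierPart = brier === null ? "" : ` (Brier ${brier.toFixed(2)})`;
  return (
    `This contradiction fits your active bias pattern "${tag}"${brierPart}. ` +
    `Verdict: ${finding.verdict}; severity: ${finding.severity}. ` +
    "Consider reviewing both sides through the lens of that pattern."
  );
}

function tagFindingWithCalibrationSketch(
  finding: FindingLite,
  profile: ProfileTagsLite | null,
): { bias_tag: string; context: string } | null {
  if (profile === null || profile.active_bias_tags.length === 0) return null; // R2: unchanged output
  const hints = computeDomainHintsSketch(finding.slug);
  // First-match-wins over the profile's tag order, for deterministic output.
  for (const tag of profile.active_bias_tags) {
    const padded = `-${tag}-`;
    if (hints.some((hint) => padded.includes(`-${hint}-`))) {
      return { bias_tag: tag, context: buildBiasContextStringSketch(tag, finding, profile.brier) };
    }
  }
  return null;
}
```

Padding the tag with hyphens before the substring check avoids a short hint such as "ai" matching inside an unrelated word.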

Pure math layer over existing TakesScorecard data. Zero new LLM cost, zero new schema. Surfaces the user's historical Brier for the take's (holder, domain) bucket at write time so they see "your historical Brier in macro takes is 0.31" before committing the take.

Voice-gate-rendered output: The user-facing string goes through gateVoice mode='forecast_blurb' via templates.ts (already in T6). This module is the pure data layer; the template renders the math into the conversational voice.

v0.36.0.0 ship state: Bucket dimension is the DOMAIN (slug-prefix). The conviction-weight bucket dimension would need a new engine method (engine.batchGetTakeBucketStats per F11) and is deferred to v0.37+. Until then, forecast = historical Brier in this holder's domain.

resolveDomainPrefix() keeps slug-prefix-looking domain hints ('companies/', 'wiki/macro') and falls back to overall for free-form hints ('macro tech', 'geography'). Hindsight-style structured domain on takes (CDX-11 mitigation TODO) tightens this in v0.37+.

MIN_BUCKET_N = 5: Below this sample size, the forecast returns predicted_brier=null with insufficient_data=true. Template renders "Forecast unavailable: only N resolved takes at this conviction yet" instead of a noisy estimate.

Architecture:
- computeForecast(input): pure function, takes scorecards already fetched; ideal for tests + reuse across batched paths.
- forecastForTake(engine, input): convenience wrapper, 1-2 engine round-trips (no domain → 1; with domain → 2).
- batchForecast(engine, inputs[]): memoizes per (holder, domainPrefix); N inputs collapse to ≤2*unique_holders unique engine calls. Used by the propose-queue review flow (50 candidates → 1-2 scorecard fetches).

Tests: 14 cases.
- computeForecast (4): insufficient_data branch, stable forecast, overall fallback, MIN_BUCKET_N export.
- resolveDomainPrefix (5): undefined/empty/whitespace → undefined; slug-prefix → kept; free-form → undefined.
- forecastForTake (3): 1-call overall, 2-call domain, free-form fallback.
- batchForecast (2): cache collapse for repeat queries; different holders do not collapse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
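The forecast rules in this commit can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the module's code: the `BucketScorecard` and `Forecast` shapes, the two-argument `computeForecast` signature, and the slug-prefix regex are hypothetical. Only `MIN_BUCKET_N = 5`, `predicted_brier`, `insufficient_data`, and the example hints come from the commit message.

```typescript
// Hypothetical shapes: the real TakesScorecard type is not shown in this PR text.
interface BucketScorecard {
  resolvedCount: number; // resolved takes in the bucket
  brier: number | null;  // mean Brier score for the bucket, null when none resolved
}

interface Forecast {
  predicted_brier: number | null;
  insufficient_data: boolean;
  n: number;
  bucket: "domain" | "overall";
}

const MIN_BUCKET_N = 5;

// Keep slug-prefix-looking hints ('companies/', 'wiki/macro'); drop free-form ones.
// Assumption: "slug-prefix-looking" means lowercase slug characters with at least one '/'.
function resolveDomainPrefix(hint?: string): string | undefined {
  const trimmed = hint?.trim();
  if (!trimmed) return undefined;
  return /^[a-z0-9_-]+(\/[a-z0-9_-]*)+$/.test(trimmed) ? trimmed : undefined;
}

// Pure function over already-fetched scorecards: prefer the domain bucket when
// one was fetched, and refuse to forecast below the minimum sample size.
function computeForecast(overall: BucketScorecard, domain?: BucketScorecard): Forecast {
  const card = domain ?? overall;
  const bucket = domain !== undefined ? "domain" : "overall";
  if (card.brier === null || card.resolvedCount < MIN_BUCKET_N) {
    return { predicted_brier: null, insufficient_data: true, n: card.resolvedCount, bucket };
  }
  return { predicted_brier: card.brier, insufficient_data: false, n: card.resolvedCount, bucket };
}
```

In this sketch a free-form hint such as 'macro tech' resolves to `undefined`, so the caller would pass only the overall scorecard and get the overall fallback.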

…/ E4)

When the grade_takes phase auto-resolves a take as 'incorrect' or 'partial', optionally write a learning entry to gstack's per-project learnings.jsonl so other gstack skills (plan-ceo-review, ship, investigate, ...) can pull it as context when relevant. The brain teaches every other tool about the user's track record.

Config gate (D5 / CDX-17 mitigation): `cycle.grade_takes.write_gstack_learnings` defaults FALSE. External users may not have gstack installed; the gstack-learnings binary API isn't stable yet. Garry's brain flips it true to opt in.

Quality gate: Only 'incorrect' and 'partial' verdicts trigger the write. 'correct' resolutions are noise (we expected the take to hold up, so there is no learning). 'unresolvable' has no canonical column. Defense-in-depth runtime guard in writeIncorrectResolution() rejects ineligible qualities with reason='quality_not_eligible' so a caller misuse never surfaces a malformed learning entry.

Auto-apply only: Coupling fires only when grade_takes auto-applies AND the verdict is incorrect/partial AND the config flag is enabled. Manual resolutions via `gbrain takes resolve` intentionally DO NOT propagate to gstack: manual writes already carry operator intent; the calibration loop is the noise-prone path that earns coupling.

Namespace: Every entry's key starts with 'gbrain:calibration:v0.36.0.0:'. Lane D `gbrain calibration --undo-wave v0.36.0.0` (T17) filters on this prefix for the optional gstack-scrub step. First active bias tag suffixes the key (e.g. 'take-42:over-confident-geography') so future analysis can group learnings by bias pattern.

Architecture:
- buildLearningEntry: pure. Truncates claim at 200 chars + ellipsis; emits Pattern: line when activeBiasTags present; defaults confidence to 0.8 when caller omits it.
- writeIncorrectResolution: async wrapper. Honors config gate; honors quality gate; calls the injected writer (or defaultGstackWriter in production). Failures are non-fatal: returns { written: false, reason: 'write_failed' | 'binary_missing', error }. The grade_takes phase logs to result.warnings and continues; gstack coupling failure NEVER aborts a cycle.
- defaultGstackWriter: shells out to gstack-learnings-log binary via execFileSync. Throws GBrainError('GSTACK_BINARY_NOT_FOUND') when the binary isn't on PATH; writeIncorrectResolution classifies that error to reason='binary_missing' so the operator sees the install hint instead of a generic write_failed.

Wired into grade-takes.ts after engine.resolveTake() inside the auto-apply block. Only fires when shouldApply=true.

Tests: 14 cases.
- buildLearningEntry (7): canonical shape, partial vs incorrect wording, bias-tag suffix, no-tag fallback, claim truncation, default confidence, no-reasoning omission.
- writeIncorrectResolution (7): config gate, quality gate, happy path, writer-throw graceful degrade, binary-missing classification, async writer awaited, partial quality writes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
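A rough sketch of the entry-building and gating rules described in this commit follows. The input and output shapes, the entry text wording, the `gate` helper, and the 'config_disabled' reason are hypothetical names for illustration; the key prefix, the 200-char truncation, the Pattern: line, the 0.8 default confidence, and 'quality_not_eligible' are taken from the commit message.

```typescript
type Quality = "correct" | "incorrect" | "partial" | "unresolvable";

// Hypothetical shapes for illustration only.
interface LearningInput {
  takeId: number;
  claim: string;
  quality: Quality;
  activeBiasTags?: string[];
  confidence?: number;
}

interface LearningEntry {
  key: string;
  text: string;
  confidence: number;
}

const GSTACK_LEARNING_NAMESPACE = "gbrain:calibration:v0.36.0.0:";
const MAX_CLAIM_CHARS = 200;

// Pure builder: truncate the claim, suffix the key with the first bias tag,
// add a Pattern: line when tags exist, default confidence to 0.8.
function buildLearningEntry(input: LearningInput): LearningEntry {
  const claim =
    input.claim.length > MAX_CLAIM_CHARS
      ? input.claim.slice(0, MAX_CLAIM_CHARS) + "…"
      : input.claim;
  const tags = input.activeBiasTags ?? [];
  const firstTag = tags[0];
  const key =
    GSTACK_LEARNING_NAMESPACE + `take-${input.takeId}` + (firstTag ? `:${firstTag}` : "");
  const lines = [`Take resolved ${input.quality}: ${claim}`];
  if (tags.length > 0) lines.push(`Pattern: ${tags.join(", ")}`);
  return { key, text: lines.join("\n"), confidence: input.confidence ?? 0.8 };
}

// 'config_disabled' is a made-up reason name; 'quality_not_eligible' is from the commit.
type SkipReason = "config_disabled" | "quality_not_eligible";

// Returns null when the write may proceed, otherwise the reason it is skipped.
function gate(enabled: boolean, quality: Quality): SkipReason | null {
  if (!enabled) return "config_disabled";
  if (quality !== "incorrect" && quality !== "partial") return "quality_not_eligible";
  return null;
}
```

Keeping the builder pure and the gate separate mirrors the split the commit describes, where only the async wrapper touches the external writer.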

Adds the four calibration doctor checks per the eng-review spec.

abandoned_threads: Counts active high-conviction takes (weight >= 0.7) older than 12 months that have never been superseded. Signal, not error: always status='ok' with a count. The hint sends users to `gbrain calibration` for details.

calibration_freshness: Warns when the active profile is older than 7 days (configurable via the same env-var pattern other freshness checks use). Cold-brain branch (no profile yet) returns ok without scolding. Hint points at `gbrain calibration --regenerate`.

grade_confidence_drift (CDX-11 mitigation): Surfaces the count of auto-applied grade verdicts. Below 30: returns "need 30+ for drift detection". At/above 30: returns "drift math arrives in v0.37+". The surface is wired; the actual confidence-vs-accuracy correlation math is a v0.37+ follow-up once we have 30+ auto-applied verdicts to measure against. Closes the CDX-11 hole structurally: the operator sees the surface even before the math is meaningful.

voice_gate_health: Tracks voice gate failure rate over the last 7 days. <30% fail rate → ok (template fallback is fine in isolation). >=30% → warn with hint to review src/core/calibration/voice-gate.ts rubric. Anchors the cross-cutting voice rule observability story.

All four checks return status='warn' with a diagnostic message on engine errors: non-blocking, never throws. Matches the existing doctor check pattern (see checkSyncFreshness for prior art).

Wired into runDoctor after checkRerankerHealth (the v0.35 cluster), in the canonical block 10 slot.

Tests: 15 cases. 4 per check (happy path, alt-status, engine-throw diagnostic, plus boundary tests for the freshness staleness gate at exactly 7 days and the grade drift gate at 30 applied verdicts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
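Two of the threshold gates above are simple enough to sketch. The result shape, function signatures, and message wording are assumptions, and the sketch assumes grade_confidence_drift always reports status 'ok' (the commit does not say). The 30% fail-rate threshold and the 30-verdict gate come from the commit message.

```typescript
// Hypothetical result shape; the real doctor check type is not shown here.
interface DoctorCheckResult {
  name: string;
  status: "ok" | "warn";
  message: string;
}

const VOICE_GATE_WARN_RATE = 0.3; // >= 30% failures over the window warns
const DRIFT_MIN_APPLIED = 30;     // auto-applied verdicts needed before drift math

function checkVoiceGateHealth(failed: number, total: number): DoctorCheckResult {
  const rate = total === 0 ? 0 : failed / total;
  const pct = (rate * 100).toFixed(1);
  if (rate >= VOICE_GATE_WARN_RATE) {
    return {
      name: "voice_gate_health",
      status: "warn",
      message: `${pct}% voice gate failures; review src/core/calibration/voice-gate.ts rubric`,
    };
  }
  return { name: "voice_gate_health", status: "ok", message: `${pct}% voice gate failures` };
}

// Assumption: this check stays 'ok' on both sides of the gate and only changes its message.
function checkGradeConfidenceDrift(appliedCount: number): DoctorCheckResult {
  const message =
    appliedCount < DRIFT_MIN_APPLIED
      ? `${appliedCount} auto-applied verdicts; need 30+ for drift detection`
      : `${appliedCount} auto-applied verdicts; drift math arrives in v0.37+`;
  return { name: "grade_confidence_drift", status: "ok", message };
}
```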

Real-time pattern surfacing when a newly-committed high-conviction take matches an active bias pattern. Conversational nudge text via the templates module; 14-day cooldown per (take_id, nudge_pattern) via take_nudge_log to prevent the feedback loop where each cycle re-fires the same nudge on the same take.

Threshold gates (D16 F3):
- holder match (profile.holder === take.holder)
- conviction-weight > 0.7 (strict greater than)
- take's slug-derived domain hint matches an active bias tag (takeDomainHint, same heuristic as eval-contradictions/calibration-join.ts for cross-surface consistency)

Cooldown gate: Before firing, probe take_nudge_log for (take_id, nudge_pattern) rows with fired_at >= now() - 14 days. Any hit → silently skip. After firing, insert a new row with channel='stderr' so the next 14 days are gated.

Feedback-loop prevention: User hedges a take in response to a nudge (e.g. weight 0.85 → 0.65). Even though the take's `weight` field changed, the cooldown row for the over-confident-geography pattern is still there from the original fire, so the next cycle's evaluateAndFireNudge() silently skips. The user reset path (gbrain takes nudge --reset N) clears the cooldown to re-arm.

Output channel (v0.36.0.0 ship state): STDERR only. Schema's `channel` column already supports multi-channel (webhook, admin SPA toast); routing those is a v0.37+ follow-up.

Architecture:
- evaluateNudgeRule(take, profile): pure rule check. Returns { matched, reason, matchedTag }. No engine call.
- checkCooldown(engine, takeId, pattern): engine probe, returns boolean.
- recordNudgeFire(engine, opts): INSERT into take_nudge_log.
- evaluateAndFireNudge(opts): full pipeline. Returns NudgeDecision.
- resetNudgeCooldown(engine, takeId): DELETE...RETURNING for the CLI.

buildNudgeText delegates to templates.ts nudgeTemplate (D24 mode='nudge' voice). v0.36.0.0 ship state uses the template directly; LLM-generated nudge text via the voice gate lands in v0.37+ when we have production examples to tune from.

Tests: 22 cases.
- takeDomainHint (5): companies/people/macro/geography/unrecognized.
- evaluateNudgeRule (6): no_profile, wrong_holder, conviction-at-threshold-is-NOT-eligible (strict >), no matching tag, happy match, first-match-wins for multiple candidate tags.
- checkCooldown (3): true on row hit, false on no row, cutoff date param verifies the 14-day boundary.
- evaluateAndFireNudge (4): happy fire (text contains hush command + matched tag), cooldown silent skip (no INSERT, no stderr), no_profile short-circuit, below-conviction short-circuit (no cooldown query fired).
- buildNudgeText (2): hush command shape, conviction value embedded.
- resetNudgeCooldown (2): returns count, idempotent on zero rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
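The rule check and cooldown gate can be sketched as pure functions over in-memory data. Everything structural here is an assumption: the `Take`, `Profile`, and `NudgeLogRow` shapes, the known-domain list, the substring match between hint and tag, and the 'below_conviction' and 'no_matching_tag' reason names. The strict `> 0.7` threshold, the 14-day window, and the no_profile / wrong_holder reasons come from the commit message.

```typescript
// Hypothetical shapes for illustration.
interface Take { id: number; holder: string; weight: number; slug: string; }
interface Profile { holder: string; activeBiasTags: string[]; }
interface RuleResult { matched: boolean; reason: string; matchedTag?: string; }
interface NudgeLogRow { takeId: number; pattern: string; firedAt: Date; }

const CONVICTION_THRESHOLD = 0.7;
const COOLDOWN_MS = 14 * 24 * 60 * 60 * 1000;
// Assumption: the hint is the first slug segment when it is a recognized domain.
const KNOWN_DOMAINS = ["companies", "people", "macro", "geography"];

function takeDomainHint(slug: string): string | undefined {
  const first = slug.split("/")[0] ?? "";
  return KNOWN_DOMAINS.includes(first) ? first : undefined;
}

// Pure rule check: holder match, strict conviction threshold, first matching tag wins.
function evaluateNudgeRule(take: Take, profile: Profile | null): RuleResult {
  if (!profile) return { matched: false, reason: "no_profile" };
  if (profile.holder !== take.holder) return { matched: false, reason: "wrong_holder" };
  if (!(take.weight > CONVICTION_THRESHOLD)) return { matched: false, reason: "below_conviction" };
  const hint = takeDomainHint(take.slug);
  const tag = hint ? profile.activeBiasTags.find((t) => t.includes(hint)) : undefined;
  if (!tag) return { matched: false, reason: "no_matching_tag" };
  return { matched: true, reason: "matched", matchedTag: tag };
}

// Cooldown probe: any fire for (takeId, pattern) within the last 14 days gates the nudge.
function inCooldown(log: NudgeLogRow[], takeId: number, pattern: string, now: Date): boolean {
  const cutoff = now.getTime() - COOLDOWN_MS;
  return log.some(
    (r) => r.takeId === takeId && r.pattern === pattern && r.firedAt.getTime() >= cutoff,
  );
}
```

Because the cooldown is keyed on (take, pattern) rather than on the take's weight, hedging the take after a nudge does not re-arm it, which is the feedback-loop property the commit describes.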

…(T14)

Cross-brain calibration profile resolution per the D18 4-rule contract. Pins all four cross-brain leak surfaces in dedicated unit tests so future mount features can't silently regress this security model.

D18 semantics (committed):

Rule 1: LOCAL-FIRST ORDERING. Query the local brain first. If a profile exists, return it. Do NOT also query mounts (avoids stale-mount-overrides-fresh-local). Verified: mountResolver is NOT called when local has a hit.

Rule 2: MOUNT FALLBACK. Only when local has no profile AND canReadMounts=true, walk the mounts in priority order. First match wins. Each mount-side row must have published=true to be visible (D15 asymmetric opt-in).

Rule 3: CROSS-BRAIN ATTRIBUTION. Every returned profile carries source_brain_id + from_mount flag. Consumers (E1 think rewrite, E3 contradictions, E7 nudge, E6 dashboard) MUST surface this via attributionSuffix() so the user sees which brain answered.

Rule 4: SUBAGENT PROHIBITION. canReadMountsForCtx() classifier returns FALSE for subagent loops without trusted-workspace allowedSlugPrefixes. Closes the OAuth-token-to-cross-brain-leak surface: subagents see ONLY their local-brain results regardless of which holder they query. Exception: trusted cycle phases (synthesize/patterns) pass allowedSlugPrefixes set and ARE allowed to read mounts. Pinned in the classifier test.

Architecture:
- queryAcrossBrains(localEngine, opts): pure orchestrator. Composes getLatestProfile() from src/commands/calibration.ts. Mount engine access is via opts.mountResolver; production wires this to the v0.19+ gbrain mounts subsystem, and tests inject a stub returning an ordered list of mocked engines. Decouples cross-brain LOGIC from multi-engine PLUMBING.
- canReadMountsForCtx(ctx): pure classifier table. Drives the rule-4 gate. Production callers compose it from OperationContext.
- attributionSuffix(result): pure formatter. Emits the "(from mounted brain: <id>)" suffix when from_mount=true; empty string when local. Mandatory for user-visible cross-brain consumers.

Tests: 15 cases pinned to the 4 D18 rules + 4 supplementary structural checks.
- D18-1: published=false profile on mount stays hidden.
- D18-2/3: subagent context cannot fall back to mounts (2 cases: null on local-empty + canReadMounts=false, local hit still returned).
- D18-4: attribution surfaces source_brain_id (3 cases: mount answer flag, local answer flag, attributionSuffix formatter).
- Rule 1 local-first ordering (2 cases: mountResolver NOT called on local hit, IS called on local empty).
- Mount priority order (3 cases: first published=true wins, all published=false returns null, no mounts configured returns null without throwing).
- canReadMountsForCtx classifier (4 cases: local CLI true, MCP non-subagent true, subagent without trusted-workspace false, subagent WITH trusted-workspace true).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
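The local-first and mount-fallback rules can be sketched over in-memory brains. This is a synchronous simplification under stated assumptions: the `Brain`, `ProfileRow`, and `QueryOpts` shapes are hypothetical stand-ins for engines, and the real orchestrator presumably awaits engine calls. The ordering, the published=true filter on mounts, and the attribution fields follow the D18 rules as written in the commit.

```typescript
// Hypothetical in-memory stand-ins for brain engines.
interface ProfileRow { holder: string; published: boolean; }
interface Brain { id: string; profiles: ProfileRow[]; }
interface CrossBrainResult { profile: ProfileRow; source_brain_id: string; from_mount: boolean; }
interface QueryOpts {
  holder: string;
  canReadMounts: boolean;
  mountResolver: () => Brain[]; // returns mounts in priority order
}

function queryAcrossBrains(local: Brain, opts: QueryOpts): CrossBrainResult | null {
  // Rule 1: local first. A local hit returns immediately and never touches mounts.
  const localHit = local.profiles.find((p) => p.holder === opts.holder);
  if (localHit) return { profile: localHit, source_brain_id: local.id, from_mount: false };

  // Rule 4 is applied by the caller, which computes canReadMounts from its context.
  if (!opts.canReadMounts) return null;

  // Rule 2: walk mounts in priority order; only published rows are visible.
  for (const mount of opts.mountResolver()) {
    const hit = mount.profiles.find((p) => p.holder === opts.holder && p.published);
    if (hit) return { profile: hit, source_brain_id: mount.id, from_mount: true };
  }
  return null;
}

// Rule 3: user-visible consumers append this so the answering brain is clear.
function attributionSuffix(result: CrossBrainResult): string {
  return result.from_mount ? `(from mounted brain: ${result.source_brain_id})` : "";
}
```

Passing the resolver as a function, rather than a list, is what lets a test assert that mounts are never consulted on a local hit.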

…mp (T15)

Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review, the approved variant-B (Linear calm clarity) layout: single-column flow, generous whitespace, ONE big sparkline as hero, then patterns, then domain bars, then abandoned threads.

D23 server-rendered SVG architecture: src/core/calibration/svg-renderer.ts holds pure functions (data → SVG string). No DOM, no React, no chart library dep. Inlines the admin design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is visually consistent with the rest of the admin SPA.

Four chart renderers:
- renderBrierTrend({ series }): sparkline w/ baseline reference at 0.25 (always-50% baseline)
- renderDomainBars({ bars }): horizontal accuracy bars per domain
- renderAbandonedThreadsCard(threads): D30/TD4 'revisit now' link per row, points at /admin/calibration/revisit/<takeId>
- renderPatternStatementsCard(statements): D29/TD3 clickable drill-down links per row, point at /admin/calibration/pattern/<i>

XSS posture: all caller-controlled strings pass through escapeXml(). Numeric inputs are .toFixed()-coerced. Admin SPA renders via dangerouslySetInnerHTML inside a TrustedSVG wrapper component; endpoint is gated by requireAdmin middleware.

/admin/api/calibration/profile returns the active profile row as JSON.

/admin/api/calibration/charts/:type returns image/svg+xml markup for type ∈ {brier-trend, domain-bars, pattern-statements, abandoned-threads}. Cache-Control: private, max-age=60. brier-trend currently renders a single-point series from the active profile (the time-series view across calibration_profiles.generated_at history is a v0.37 follow-up once we have multiple snapshots). abandoned-threads pulls the top 5 abandoned rows via the same SQL the doctor check uses.

CalibrationPage React component (admin/src/pages/Calibration.tsx): Fetches profile + 4 charts. Loading / error / cold-brain states all handled. Layout includes the audit annotations (partial-grade badge, voice-gate-fell-back-to-template badge) per the approved mockup. TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG surface only.

App.tsx nav: added 'calibration' page route + sidebar nav item, hash routing extended to support #calibration.

TD2 contrast bump: admin/src/index.css --text-muted: #555 → #777. Old value was contrast 4.0 on the #0a0a0f bg, below WCAG AA 4.5 for body text. New value is ~5.5, passes AA. Improvement is global across Dashboard, Agents, RequestLog, and the new Calibration tab: single-line CSS change with ~10x the impact.

admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed.

Tests: 19 cases in test/svg-renderer.test.ts.
- escapeXml (1): canonical entities.
- renderBrierTrend (6): empty state, polyline for 2+ points, clamp beyond yMax, design tokens inlined, XSS safety on date strings, text-anchor end on right label.
- renderDomainBars (4): empty state, label/accuracy/n rendering, out-of-range accuracy clamp, XSS safety on labels.
- renderAbandonedThreadsCard (4): empty state, row rendering with revisit link, claim truncation at 70 chars, custom revisitHref override.
- renderPatternStatementsCard (4): empty state, anchor count matches statement count, XSS safety, custom drillHref override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
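The pure data-to-SVG approach with escaped strings and coerced numbers might look like the sketch below. The `TrendPoint` shape, the canvas size, the `yMax` default, and the '#777' baseline color are assumptions; the #0a0a0f and #3b82f6 tokens, the 0.25 baseline, the polyline for 2+ points, the clamp, and the right-anchored label follow the commit message.

```typescript
// Escape the five XML-significant characters; '&' goes first to avoid double escaping.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical point shape for illustration.
interface TrendPoint { date: string; brier: number; }

function renderBrierTrend(series: TrendPoint[], yMax = 0.5): string {
  const width = 320;
  const height = 80;
  // Clamp to [0, yMax] and coerce through toFixed so only digits reach the markup.
  const y = (v: number): string =>
    (height - (Math.min(Math.max(v, 0), yMax) / yMax) * height).toFixed(1);
  const x = (i: number): string =>
    (series.length < 2 ? width / 2 : (i / (series.length - 1)) * width).toFixed(1);
  const points = series.map((p, i) => `${x(i)},${y(p.brier)}`).join(" ");
  const last = series[series.length - 1];
  const label = last ? escapeXml(last.date) : "no data";
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" viewBox="0 0 ${width} ${height}">` +
    `<rect width="100%" height="100%" fill="#0a0a0f"/>` +
    `<line x1="0" x2="${width}" y1="${y(0.25)}" y2="${y(0.25)}" stroke="#777" stroke-dasharray="4"/>` +
    (series.length >= 2 ? `<polyline fill="none" stroke="#3b82f6" points="${points}"/>` : "") +
    `<text x="${width}" y="12" text-anchor="end" fill="#777">${label}</text>` +
    `</svg>`
  );
}
```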

Pure formatter that turns a CalibrationProfileRow + optional abandoned-threads list into the conversational block the morning pulse will surface:

  Calibration this quarter: Brier 0.18 (solid). Right on early-stage tactics, late on macro by 18 months. Over-confident on team execution; under-calibrated on regulatory risk.
  Threads you opened and never came back to:
  · AI search platform differentiation (17 months silent)
  · International expansion playbook (12 months silent)

Cold-brain branch: returns empty string when no profile or < 5 resolved takes. Caller decides whether to render the block; cold-brain absence is the cleanest non-event.

Brier trend note maps the absolute value to conversational copy:
- <= 0.10 → "(strong calibration)"
- <= 0.20 → "(solid)"
- <= 0.25 → "(near baseline)"
- > 0.25 → "(worse than always-50% baseline — review your high-conviction calls)"

v0.36.0.0 ship state has only the current profile snapshot. The "was 0.22 90d ago — improving" comparison shape arrives when we accumulate generated_at history across multiple cycles.

R3 regression posture: This module is the FORMATTER only. Wiring into `gbrain recall`'s text output is intentionally NOT in this commit; runRecall's surface stays unchanged. v0.37 wires it under --show-calibration (opt-in initially, default-on later). For now the formatter is callable from the admin tab + custom CLI scripts that want it.

Architecture: buildRecallCalibrationFooter(opts) is pure. opts.profile required, opts.abandonedThreads optional, opts.threadColumnWidth defaults to 50. Caps at 4 patterns + 5 abandoned threads to keep the footer scannable. Truncates long abandoned-thread claim text to fit the column width with a trailing ellipsis.

Tests: 14 cases.
- Cold-brain branch (3): null profile, < 5 resolved, zero resolved.
- Happy path (7): header + Brier + patterns, trend note ranges (4 brackets), null brier omits the Brier line but keeps header, caps at 4 patterns.
- Abandoned threads (4): omit section when none, emit when present, cap at 5, truncate long claim with column-width override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
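The bracket mapping and the column-width truncation are small enough to sketch directly. The function names and signatures are hypothetical, the sketch assumes the trailing ellipsis counts toward the column width, and the last bracket's copy is shortened here (the commit's string continues with a review hint).

```typescript
// Maps an absolute Brier score to the conversational note for its bracket.
// Bracket edges are inclusive on the upper bound, as in the commit message.
function brierTrendNote(brier: number): string {
  if (brier <= 0.1) return "(strong calibration)";
  if (brier <= 0.2) return "(solid)";
  if (brier <= 0.25) return "(near baseline)";
  // Shortened for this sketch; the shipped copy also suggests reviewing high-conviction calls.
  return "(worse than always-50% baseline)";
}

// Truncates a claim to the column width, ending with an ellipsis when cut.
// Assumption: the ellipsis counts toward the width.
function fitToColumn(claim: string, width = 50): string {
  if (claim.length <= width) return claim;
  return claim.slice(0, Math.max(width - 1, 0)) + "…";
}
```

With these in place, a header line such as `Brier 0.18 ${brierTrendNote(0.18)}` would render the "(solid)" note from the example block.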

Implements the undo-wave reversal flow. Every new row written by the v0.36.0.0 calibration wave carries wave_version='v0.36.0.0' so a precise revert is possible without touching pre-wave data.

CLI surface (replaces the v0.36.0.0 ship-state placeholder):

  gbrain calibration --undo-wave v0.36.0.0 [--dry-run] [--scrub-gstack] [--json]

Reversal scope (4 steps):
- Step 1: UNSET takes.resolved_* columns for takes auto-applied by this wave. Identifies wave-applied takes via take_grade_cache.applied=true + wave_version match. Cross-checks resolved_by='gbrain:grade_takes' to ensure we're not un-resolving a take a manual `gbrain takes resolve` override has since claimed. Manual resolutions persist; only auto-grade resolutions revert.
- Step 1b: Mark take_grade_cache rows applied=false post-undo so the audit trail shows they WERE applied but this wave was reverted. The CDX-11 confidence-drift check filters on applied=true and gets a cleaner sample post-undo.
- Step 2: DELETE FROM calibration_profiles WHERE wave_version = ?.
- Step 3: DELETE FROM take_nudge_log WHERE wave_version = ?.
- Step 4: Optional gstack-learnings-prune via the binary, scoped to the GSTACK_LEARNING_NAMESPACE prefix. Opt-in via --scrub-gstack. Best-effort: binary-missing or failure logs a warning + suggests the manual command; the rest of the undo still succeeded.

Dry-run posture: --dry-run computes the counts via SELECT COUNT(*) shapes without emitting any UPDATE or DELETE. Same UndoWaveResult shape returned so operator sees exactly what would be reverted before committing. --dry-run intentionally skips the gstack scrub (filesystem write) too; ship-state safety call.

Idempotency: Re-running --undo-wave on a brain that's already reverted is a no-op. Each query filters on wave_version; no matching rows → zero counts.

Architecture: undoWave(engine, opts) is async and returns UndoWaveResult. Pure data layer; no stderr writes, no process exits. CLI dispatch in src/commands/calibration.ts handles printing.

v0.36.0.0 ship state runs steps 1-3 sequentially (no transaction). Partial reversal is recoverable via re-run since each step is idempotent on wave_version match. A future enhancement (v0.37+) can wrap in engine.transaction once that surface lands in BrainEngine.

Tests: 8 cases in test/undo-wave.test.ts.
- Dry-run posture (1): counts emitted, NO UPDATE/DELETE SQL fired.
- Happy path (3): all 4 steps execute, resolved_by filter scopes UPDATE to wave-applied resolutions, custom resolvedByLabel honored.
- Empty wave (2): zero counts when no matching rows, idempotent re-run.
- Wave-version parameter threading (2): supplied version threads through all queries, different wave versions don't collide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
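Steps 1 through 3, the dry-run posture, and the idempotency property can be modeled over an in-memory store. This is an illustrative model, not the SQL the command runs: the `Store` layout, the result field names, and the synchronous signature are assumptions, and the sketch assumes step 1b flips every applied cache row for the wave.

```typescript
// Hypothetical in-memory stand-ins for the four tables involved.
interface TakeRow { id: number; resolved_by: string | null; resolved_quality: string | null; }
interface GradeCacheRow { take_id: number; applied: boolean; wave_version: string; }
interface WaveRow { wave_version: string; }
interface Store {
  takes: TakeRow[];
  take_grade_cache: GradeCacheRow[];
  calibration_profiles: WaveRow[];
  take_nudge_log: WaveRow[];
}
interface UndoWaveResult {
  takes_unresolved: number;
  cache_rows_unapplied: number;
  profiles_deleted: number;
  nudges_deleted: number;
  dry_run: boolean;
}

const AUTO_RESOLVER = "gbrain:grade_takes";

function undoWave(store: Store, waveVersion: string, dryRun: boolean): UndoWaveResult {
  // Select phase: the same filters serve both the dry-run counts and the real run.
  const cacheRows = store.take_grade_cache.filter((r) => r.applied && r.wave_version === waveVersion);
  const waveTakeIds = new Set(cacheRows.map((r) => r.take_id));
  const takes = store.takes.filter((t) => waveTakeIds.has(t.id) && t.resolved_by === AUTO_RESOLVER);
  const profiles = store.calibration_profiles.filter((p) => p.wave_version === waveVersion);
  const nudges = store.take_nudge_log.filter((n) => n.wave_version === waveVersion);
  const result: UndoWaveResult = {
    takes_unresolved: takes.length,
    cache_rows_unapplied: cacheRows.length,
    profiles_deleted: profiles.length,
    nudges_deleted: nudges.length,
    dry_run: dryRun,
  };
  if (dryRun) return result; // counts only, no mutation

  for (const t of takes) { t.resolved_by = null; t.resolved_quality = null; } // step 1
  for (const r of cacheRows) r.applied = false;                               // step 1b
  store.calibration_profiles = store.calibration_profiles.filter((p) => p.wave_version !== waveVersion); // step 2
  store.take_nudge_log = store.take_nudge_log.filter((n) => n.wave_version !== waveVersion);             // step 3
  return result;
}
```

A second run finds no applied cache rows and no wave-stamped rows, so it returns zero counts, which is the idempotency argument the commit makes for running without a transaction.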

Structural answer to CDX-18 (anti-bias rewrite may make advice worse). We don't have to guess whether calibration helps; we measure.

Architecture:
- runAbTrial(input): calls thinkRunner TWICE on the same question (baseline + --with-calibration), surfaces both answers to a preferenceResolver, persists the trial to think_ab_results.
- buildAbReport(engine, { days }): aggregates the table over the last N days (default 30). Computes win counts, ties, neither, and a with_calibration_win_rate over DECISIVE trials only (excludes neither/tie). Flags calibration_net_negative when n >= 20 AND win rate < 45%.
- formatAbReport(report, days): pretty-prints for stdout; emits the calibration_net_negative warning block when triggered.

CLI:

  gbrain calibration ab-report [--days N] [--json]

Reads the table, prints the breakdown. Replaces the v0.36.0.0 ship-state placeholder in src/commands/calibration.ts.

  gbrain think --ab "<question>"

Wires into runAbTrial via the dispatch in src/commands/think.ts (follow-up commit). This commit lands the harness layer + schema + report surface; the --ab flag itself flips on in a one-line wiring commit when the runRecall path is ready.

Schema (migration v72 / think_ab_results): source_id, wave_version, ran_at, question, baseline_answer, with_calibration_answer, preferred (CHECK in {baseline, with_calibration, neither, tie}), model_id, notes. CHECK constraint enforces preferred enum. Default wave_version 'v0.36.0.0' stamped so --undo-wave can scrub these too. Index on (source_id, ran_at DESC) supports the report's "last N days" query.

schema.sql + pglite-schema.ts both updated for fresh-install parity. schema-embedded.ts regenerated via build:schema.

calibration_net_negative threshold (D19): Triggers when:
- decisive_trials (baseline + with_calibration) >= 20
- with_calibration_win_rate < 0.45 (NOT <=; exact 45% is OK)

Small-sample guard (n < 20) prevents the warning from firing on early data with sampling noise. Confidence-flat threshold (no Wilson CI yet) keeps the math simple; v0.37+ adds CI bounds.

Tests: 12 cases in test/think-ab.test.ts.
- runAbTrial (4): both runner calls fire, preferenceResolver receives both answers, INSERT row params shape, throws when thinkRunner missing.
- buildAbReport (5): zero trials, aggregation, net_negative trigger at n>=20 + win<45%, no trigger at n<20 (small-sample guard), no trigger at exact 45% boundary.
- formatAbReport (3): zero-state message, decisive-trials breakdown, net_negative warning block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
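The aggregation and the D19 trigger can be sketched as a pure function over the `preferred` column. The `AbReport` shape and the array-based signature are assumptions (the real buildAbReport reads from the engine); the decisive-trials definition, the n >= 20 guard, and the strict < 0.45 comparison come from the commit message.

```typescript
type Preferred = "baseline" | "with_calibration" | "neither" | "tie";

// Hypothetical report shape for illustration.
interface AbReport {
  baseline: number;
  with_calibration: number;
  neither: number;
  tie: number;
  decisive_trials: number;
  with_calibration_win_rate: number | null; // null when there are no decisive trials
  calibration_net_negative: boolean;
}

const MIN_DECISIVE_TRIALS = 20;
const NET_NEGATIVE_WIN_RATE = 0.45;

function buildAbReport(preferred: Preferred[]): AbReport {
  const counts = { baseline: 0, with_calibration: 0, neither: 0, tie: 0 };
  for (const p of preferred) counts[p] += 1;

  // Only trials with a clear winner count toward the win rate.
  const decisive = counts.baseline + counts.with_calibration;
  const winRate = decisive === 0 ? null : counts.with_calibration / decisive;

  return {
    ...counts,
    decisive_trials: decisive,
    with_calibration_win_rate: winRate,
    // Strict comparison: an exact 45% win rate does not trigger the warning.
    calibration_net_negative:
      winRate !== null && decisive >= MIN_DECISIVE_TRIALS && winRate < NET_NEGATIVE_WIN_RATE,
  };
}
```

For example, 8 calibration wins out of 20 decisive trials gives a 40% win rate and sets the flag, while 9 out of 20 sits exactly on the boundary and does not.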

…TD4 / D30) TD3 (D29) — clickable pattern drill-down endpoint: GET /admin/api/calibration/pattern/:id (requireAdmin) Returns the pattern statement at index `id` plus the top 25 resolved takes for the holder, sorted by weight desc. v0.36.0.0 ship-state approximation: surfaces broad provenance evidence (top resolved takes). v0.37+ stores per-pattern source_take_ids[] on a calibration_profile_patterns join table so the drill-down shows the EXACT takes that drove the pattern. Surfaces a `provenance_note` field in the response so the operator sees the v0.36.0.0-vs-v0.37 fidelity boundary inline. The admin SPA's renderPatternStatementsCard SVG already emits anchor tags pointing at /admin/calibration/pattern/<i> (T15 ship state). This route makes those anchors clickable — closes the trust loop that was the rationale for D29 ("pattern statements without their evidence are dressed-up LLM hallucinations"). TD4 (D30) — `gbrain takes revisit <slug>` editor-open action: Adds the `revisit` subcommand to gbrain takes. Opens $EDITOR (falling back to vi) on the source markdown file for the slug. Appends a `` cursor marker at the bottom of the page on first invocation so the editor opens with intent visible. Reads sync.repo_path from config to locate the brain repo. Refuses to proceed with a clear error when the repo isn't configured or the page doesn't exist. spawnSync with stdio:'inherit' so the editor takes the terminal. Exit status surfaced on failure. The SVG renderer's revisit-now anchor for each abandoned thread row emits /admin/calibration/revisit/<takeId>. A small route handler that resolves take_id → page_slug then dispatches `gbrain takes revisit` via spawn is a v0.37 follow-up — the CLI command exists now so developers can wire it directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a canonical DESIGN.md at the repo root. This is the calibration target for /plan-design-review and /design-review going forward: when a question is "does this UI fit the system?", the answer is here.

Captures the system as it stands today:
- Voice (5 surfaces, all routed through gateVoice() with mode-specific rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption, morning_pulse. Friend-not-doctor; concrete data over abstract metrics; no preachy / clinical / corporate language.
- Color tokens: 10 CSS variables from admin/src/index.css inlined into the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme is the only theme; admin is an operator tool. WCAG contrast documented per token; TD2's #555 → #777 bump on --text-muted noted.
- Typography: Inter for UI, JetBrains Mono for numbers/slugs/data. Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet formalized.
- Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density.
- Layout: sidebar 200px, max content 720px (text) / 960px (tables). No 3-column feature grids, no icons in colored circles, no decorative blobs.
- Charts: server-rendered SVG via pure functions in src/core/calibration/svg-renderer.ts. XSS posture documented: server-side escapeXml on caller-controlled strings, numeric inputs .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper.
- Interaction patterns: keyboard nav required (J/K/space/u/q on the propose-queue), loading/empty/error states ARE features.

v0.37+ roadmap: type scale formalization, animation tokens, component library extraction. Light mode explicitly NOT planned.

The doc is a living target, not a frozen spec. Major changes route through /plan-design-review per the existing review chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

T19: synthetic corpus scaffold for extract-takes prompt tuning.

test/fixtures/calibration/extract-takes-corpus/ holds 5 representative pages across 4 genres (essay, people, companies, meetings/decisions). v0.36.0.0 ships a SMALL representative corpus as proof of structure; the full 50-page training set + 10-page holdout gets generated by the operator via `gbrain calibration build-corpus` (v0.37 follow-up subcommand) or by hand with the privacy guard catching violations either way.

Privacy contract per D13': every page is SYNTHETIC. None of the names/companies/funds/deals/events refer to anything real. Placeholder names per CLAUDE.md: alice-example, charlie-example, acme-example, widget-co, fund-a/b/c, acme-seed, widget-series-a, meetings/2026-04-03.

test/fixtures/calibration/README.md spells out the privacy contract, generation flow, and what the corpus is (stable regression set for the extract-takes prompt) vs is not (real anything).

T20: privacy CI guard (CDX-14 mitigation).

scripts/check-synthetic-corpus-privacy.sh greps the corpus for:
1. Explicit dollar amounts ($50M, $1.2B etc), which would suggest the page memorized a real round size.
2. Out-of-range year references (informational only for v0.36.0.0; deferred to a manual review checklist).
3. Pages that reference ZERO placeholder names, which suggests the page might be referring to real entities. Essay-genre fixtures exempt (they're anonymized PG-style writing by design).

Wired into `bun run verify` (CI gate) so contributors can't accidentally land a synthetic fixture that leaks real-world specificity. The intent is fail-fast on accidental leakage; the operator can update the allowlist if a generic dollar amount is intentional.

Closes CDX-14: 'CC reads real brain pages locally, writes nothing still risks privacy if any generated synthetic fixture memorizes structure-specific facts. Placeholder names are not enough.'

The corpus shipped here is intentionally small but covers the four core gbrain page genres (essay, people, companies, meetings/decisions). The v0.37 corpus-build subcommand will fan out to 50 with the operator spot-checking + the CI guard enforcing the privacy contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
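The two enforcing checks (dollar amounts and zero placeholder names) can be expressed as a small pure function. This is a TypeScript rendering for illustration only: the actual guard is a shell script, and the `CorpusPage` shape, the rule names, and the dollar-amount regex are assumptions. The placeholder names and the essay exemption come from the commit message.

```typescript
// Hypothetical page and violation shapes.
interface CorpusPage { path: string; genre: string; body: string; }
interface Violation { path: string; rule: "dollar_amount" | "no_placeholder"; }

// Placeholder names listed in the commit message.
const PLACEHOLDERS = [
  "alice-example", "charlie-example", "acme-example", "widget-co",
  "fund-a", "fund-b", "fund-c", "acme-seed", "widget-series-a",
  "meetings/2026-04-03",
];

// Assumption: an "explicit dollar amount" looks like $50M or $1.2B.
const DOLLAR_AMOUNT = /\$\d+(?:\.\d+)?\s?[MB]\b/;

function checkPage(page: CorpusPage): Violation[] {
  const violations: Violation[] = [];
  if (DOLLAR_AMOUNT.test(page.body)) {
    violations.push({ path: page.path, rule: "dollar_amount" });
  }
  // Essay-genre fixtures are exempt from the placeholder requirement.
  const hasPlaceholder = PLACEHOLDERS.some((name) => page.body.includes(name));
  if (page.genre !== "essay" && !hasPlaceholder) {
    violations.push({ path: page.path, rule: "no_placeholder" });
  }
  return violations;
}
```

A CI wrapper would run this over every fixture and fail the build when any violations come back.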

Per /plan-eng-review D26 IRON RULE: regressions get added to the test suite as critical requirements, no AskUserQuestion needed.

Pins five regressions identified during the v0.36.0.0 wave's coverage diagram:
- R1: think baseline UNCHANGED when --with-calibration absent. Covered structurally by test/think-with-calibration.test.ts plus assertion-pinned in this file (default user message: question first, then retrieval; system prompt: no anti-bias section).
- R2: contradictions probe output UNCHANGED when no calibration profile. Covered structurally by test/eval-contradictions-calibration-join.test.ts plus pinned here (null profile → null tag, byte-identical to v0.32.6).
- R3: takes resolution flow works when grade_takes phase disabled. Pinned import-surface coupling: takes-resolution.ts has zero dependency on grade_takes module. If a future refactor accidentally couples them, this test fails to compile.
- R4: search/list_pages/get_page work identically through new source_id paths. Marker test referencing existing v0.34.1 source-isolation suite at test/source-isolation-pglite.test.ts. v0.36.0.0 does NOT modify those code paths; the existing tests catch any accidental coupling.
- R5: existing search modes (conservative/balanced/tokenmax) unaffected. Marker test referencing existing test/search-mode.test.ts. The calibration code DOES NOT IMPORT from src/core/search/mode.ts.

Plus an inventory test that confirms all 5 regressions have an 'addressed' status; it fails loudly if a future contributor removes a guard without updating the inventory.

7 tests total. Pure functions, no engine, hermetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n skill CHANGELOG entry: the user-facing release notes. Leads with the headline ("the brain learns how you tend to be wrong, then argues against your blind spots on every advice call"), 5 'what you can now do' bullets in GStack voice, itemized changes by lane, and the 'To take advantage of v0.36.0.0' upgrade checklist per the CLAUDE.md required-block contract. CLAUDE.md anchors: new 'v0.36.0.0 Hindsight calibration wave (key files cluster)' block inserted before the v0.31.1 thin-client section. 23 new files / extensions annotated with one-paragraph descriptions each, linking back to the convention skill at skills/conventions/calibration.md for the agent-facing rules. skills/conventions/calibration.md: the agent-facing convention skill. Tells future contributors which calibration touchpoint applies to their task — voice gate? BaseCyclePhase? source-scope thread? doctor warning? cross-brain query rules? auto-resolve threshold posture? Test seam patterns. Bug class to avoid (the v0.34.1 source-isolation leak shape). Version trio (per CLAUDE.md mandatory audit): VERSION: 0.36.0.0 package.json: 0.36.0.0 CHANGELOG: ## [0.36.0.0] - 2026-05-17 llms.txt + llms-full.txt regenerated via `bun run build:llms` after the CLAUDE.md edit (per the explicit CLAUDE.md mandate "Any CLAUDE.md edit MUST be followed by `bun run build:llms`"). The `test/build-llms.test.ts` guard runs in CI shard 1; the committed bundles are checked against fresh generator output. bun run verify is clean. typecheck clean. Privacy CI guard passes (0 violations across 6 corpus pages). All ready for /ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nCycle (T-fix) The three new v0.36.0.0 phases were declared in CyclePhase / ALL_PHASES / NEEDS_LOCK_PHASES but the runCycle orchestrator never dispatched them. ALL_PHASES advertised them, gbrain dream --phase propose_takes accepted them, but `gbrain dream` (default) silently skipped all three. Adds a single dispatch block between consolidate and embed that: - builds an OperationContext on the fly (trusted-workspace caller, remote: false, sourceId resolved via the same helper sync uses) - dispatches the three phases in the order ALL_PHASES declares - records the same skipped-phase shape (no_database) when engine is null Pinned by test/core/cycle.serial.test.ts "default: all 6 phases run in order" which was already failing against ALL_PHASES (the test name lags the actual phase count; left as-is since renaming churns history). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three open PRs were claiming v0.36.0.0 (#1130 skillpack, #1139 hindsight, #1136 this PR). Ship-aware queue allocator says this branch lands at v0.36.2.0. Trio audit: VERSION 0.36.2.0 package.json 0.36.2.0 CHANGELOG ## [0.36.2.0] - 2026-05-17 Updates: VERSION, package.json, CHANGELOG header + body refs, README "New default in v0.36.2.0" announcement + credit line, skills/migrations/v0.36.0.0.md renamed to v0.36.2.0.md with frontmatter + body refs updated. llms-full.txt regenerated.

Master shipped v0.35.6.0 (floor-ratio search gate) and v0.35.7.0 (typed-claim trajectory + founder scorecard) ahead of this branch. Resolving the merge requires:

1. VERSION trio (VERSION, package.json, CHANGELOG.md top entry) bumped to 0.36.1.0 to claim the next slot after master's 0.35.7.0.
2. Migration v67 collision: master shipped facts_typed_claim_columns as v67. This branch's six calibration migrations renumber from v67-v72 to v68-v73. Master's v67 stays unchanged.
3. wave_version literal renamed 'v0.36.0.0' -> 'v0.36.1.0' across: migrate.ts DEFAULT clauses, pglite-schema.ts, schema-embedded.ts (regenerated), schema.sql, test fixtures, undo-wave logic, and every doc string referencing the wave. `gbrain calibration --undo-wave v0.36.1.0` is the new operator-facing reversal path.
4. test/regressions/v0.36.0.0-iron-rule.test.ts -> v0.36.1.0-iron-rule.test.ts so the regression-inventory filename tracks the actual release.
5. llms-full.txt + llms.txt regenerated against the updated docs.

Verification:
- bun run verify: green
- bun run test: 7132 pass / 0 fail / 0 skip
- Targeted migrate + bootstrap + cycle + undo-wave + nudge + cli tests: 207 pass / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
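The trio audit in item 1 amounts to checking that three files agree on one version string. A minimal sketch, assuming the CHANGELOG top entry is the first line of the form `## [x.y.z.w]`; `auditVersionTrio` and its input shape are hypothetical and are not the repo's actual audit tooling.

```typescript
// Hypothetical helper: takes file contents as strings so it stays pure
// and needs no filesystem access.
interface TrioInputs {
  versionFile: string; // contents of VERSION
  packageJson: string; // contents of package.json
  changelog: string;   // contents of CHANGELOG.md
}

interface TrioAudit {
  ok: boolean;
  versions: { version: string; packageJson: string; changelog: string };
}

function auditVersionTrio(inputs: TrioInputs): TrioAudit {
  const version = inputs.versionFile.trim();

  const parsed = JSON.parse(inputs.packageJson) as { version?: unknown };
  const packageJson = typeof parsed.version === "string" ? parsed.version : "";

  // First "## [..]" header in the file is treated as the top entry.
  const header = inputs.changelog.match(/^## \[([^\]]+)\]/m);
  const changelog = header?.[1] ?? "";

  const ok = version !== "" && version === packageJson && version === changelog;
  return { ok, versions: { version, packageJson, changelog } };
}
```

After a merge from master, a check like this fails on exactly the case the commit resolves: one of the three files still carrying the other branch's version.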

Master shipped v0.35.8.0 (autopilot phantom-page redirect inside extract_facts, #1138) ahead of this branch. VERSION trio kept at 0.36.1.0 since this branch's slot is already higher than master's new tag. CHANGELOG carries both v0.36.1.0 (top) and v0.35.8.0 entries; llms-full.txt regenerated. src/core/cycle.ts and src/commands/doctor.ts auto-merged cleanly (both branches added separate sections). Test gate green: 195/195 on cycle.serial + migrate + doctor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master shipped v0.36.0.0 (skillpack scaffold / reference / harvest; retired managed-block install, #1130), a naming overlap with this branch's slot. This branch's slot stays 0.36.1.0 (already higher); master's v0.36.0.0 entry preserved in CHANGELOG.

VERSION trio resolved: my 0.36.1.0 wins over master's 0.36.0.0 on VERSION, package.json, and CHANGELOG.md top entry. llms-full.txt regenerated. All other files auto-merged cleanly (CLAUDE.md, README.md, skills/RESOLVER.md, etc).

Verification:
- bun run typecheck: green
- bun install: lockfile up to date

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(T19)

Adds 8 new synthetic pages modeled on the genre mix observed in the real brain (concepts-with-timeline, meeting-notes, daily-journal, people-pages, essays). Companion .gradeable-claims.json files carry hand-labeled answer keys: what a tuned propose_takes prompt SHOULD extract per page. Closes the F1 gate gap from the plan's T19/D19:

Training corpus (test/fixtures/calibration/extract-takes-corpus/):
+ concept-startup-market-dynamics.md (10 claims)
+ meeting-2026-04-10-fundraise-fund-a.md (6 claims)
+ daily-2026-04-15.md (5 claims)

Blind holdout (test/fixtures/calibration/holdout/):
+ concept-founder-execution.md (6 claims, F1 >= 0.80)
+ daily-2026-04-18.md (4 claims, F1 >= 0.80)
+ meeting-2026-04-17-hiring-charlie.md (5 claims, F1 >= 0.80)
+ essay-on-conviction.md (7 claims, F1 >= 0.80)
+ people-bob-example.md (5 claims, F1 >= 0.80)

Privacy:
- No real-brain content read into any committed artifact. Pages written from scratch using the canonical placeholder set (alice-example, charlie-example, bob-example, acme-example, widget-co, fund-a/b/c). Real-name grep confirms zero leakage: wintermute, garrytan, paul-graham, sam-altman, etc. → 0 hits.
- scripts/check-synthetic-corpus-privacy.sh passes: 0 violations across 14 pages (was 6).

Genre fidelity:
- concept-with-timeline pages mirror the dated-assertion structure real brain uses (verb framing varies: "argues / predicts / I think / I bet / strong conviction / moderate conviction").
- meeting-notes pages carry both prose claims (extracted via hedging language) and explicit ## Takes sections.
- daily-journal pages test probabilistic framing ("75/25 in favor", "call it ~0.5") and self-tagged conviction values.
- essay-on-conviction is the meta-page that names the author's own bias patterns (primary signal for calibration_profile).
- people pages test claim-about-third-party extraction.

Each JSON ground-truth lists per-claim:
- claim_text + kind (prediction|judgment|bet) + domain
- conviction (0..1)
- since_date
- rationale (why this claim is gradeable + how a tuned prompt should infer conviction from the prose)

This is the corpus that gates the T19 prompt-tune iteration:
- F1 >= 0.85 on training (10+6+5 = 21 claims across 3 pages plus the existing 5 fixtures already shipped)
- F1 >= 0.80 on holdout (27 claims across 5 pages)

Plan reference: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md
Privacy gate: scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
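To make the F1 gates above concrete, here is a minimal sketch of per-page scoring against a hand-labeled answer key. It assumes claims are matched by normalized exact text, which is a simplification chosen for illustration; the commit does not describe how the real runner matches extracted claims to gold claims, and `scoreExtraction` and `normalizeClaim` are hypothetical names.

```typescript
// Only claim_text is used here; the real ground-truth rows also carry
// kind, domain, conviction, since_date, and rationale.
interface LabeledClaim {
  claim_text: string;
}

interface ExtractionScore {
  precision: number;
  recall: number;
  f1: number;
}

// Assumed matching rule: case-insensitive, punctuation-insensitive equality.
function normalizeClaim(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function scoreExtraction(predicted: LabeledClaim[], gold: LabeledClaim[]): ExtractionScore {
  const goldSet = new Set(gold.map((c) => normalizeClaim(c.claim_text)));
  const predSet = new Set(predicted.map((c) => normalizeClaim(c.claim_text)));

  // True positives: predicted claims that also appear in the answer key.
  const truePositives = Array.from(predSet).filter((p) => goldSet.has(p)).length;

  const precision = predSet.size === 0 ? 0 : truePositives / predSet.size;
  const recall = goldSet.size === 0 ? 0 : truePositives / goldSet.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

On a small page the gate is strict: with a 4-claim answer key, one missed claim plus one spurious claim gives precision and recall of 3/4 each, so F1 is 0.75 and the page falls below the 0.80 holdout target.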

…5 F1 0.92+)

The v0.36.1.0 ship state shipped propose_takes with a stub prompt that the docs flagged as "tune via T19 corpus build before relying on propose_takes in production." T19's corpus was built in commit 69a71c9 (14 synthetic pages + 48 hand-labeled claims). The matching gbrain-evals cat15 runner validates extraction quality against that corpus.

This commit back-ports the tuned prompt validated by cat15's first live run:
- training avg F1: 0.952 (target 0.85, +10 points)
- holdout avg F1: 0.922 (target 0.80, +12 points)
- train-holdout gap: 0.03 (well below 0.10 overfitting threshold)
- 8/8 probes pass their individual F1 targets

Per-genre F1 floor: 0.80 (people-pages, the hardest genre). Concept-with-timeline and meeting-notes genres scored at 1.00 on holdout pages.

The tuned prompt design changes vs the stub:
- Worked example list seeds the "gradeable claim" notion so the model doesn't drift into pure-fact extraction.
- NOT-gradeable list catches the most common over-extraction modes (pure facts, direct quotes, restatements).
- Conviction inference rules anchored to specific hedging language so the model produces consistent weight values.
- kind enum narrowed to 'prediction' | 'judgment' | 'bet'; the v1 stub's 4-tag enum bled into noise classification on the corpus.

PROPOSE_TAKES_PROMPT_VERSION bumped 'v0.36.1.0-stub' → 'v0.36.1.0-tuned-cat15'. The bump invalidates the take_proposals idempotency cache so existing proposal rows stay as audit history but the next cycle re-extracts against the new prompt: exactly the design contract this version field is for.

Re-tuning protocol: run cat15 in gbrain-evals against the fixtures BEFORE bumping the version string. The train-holdout gap should stay < 0.10. If a future tune drops below the cat15 gate, revert.

Source of evidence:
- cat15 runner: ~/git/gbrain-evals/eval/runner/cat15-propose-takes.ts
- Fixture corpus: test/fixtures/calibration/ (this repo, commit 69a71c9)
- Live run dumps: ~/git/gbrain-evals/eval/reports/cat15-propose-takes/*.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
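The re-tuning protocol above reduces to three numeric checks. The sketch below encodes them using the thresholds stated in this commit (0.85 training, 0.80 holdout, gap under 0.10). `checkTuneGate` is a hypothetical helper, not part of the cat15 runner, and treating the gap as training minus holdout is an assumption.

```typescript
interface TuneReport {
  trainF1: number;   // average F1 over the training corpus
  holdoutF1: number; // average F1 over the blind holdout
}

interface GateVerdict {
  gap: number;
  pass: boolean;
  failures: string[];
}

// Thresholds as stated in the commit message.
const TRAIN_TARGET = 0.85;
const HOLDOUT_TARGET = 0.8;
const MAX_TRAIN_HOLDOUT_GAP = 0.1;

function checkTuneGate(report: TuneReport): GateVerdict {
  // Assumed definition: a positive gap means training outscored holdout,
  // which is the overfitting direction.
  const gap = report.trainF1 - report.holdoutF1;
  const failures: string[] = [];
  if (report.trainF1 < TRAIN_TARGET) failures.push("training F1 below target");
  if (report.holdoutF1 < HOLDOUT_TARGET) failures.push("holdout F1 below target");
  if (gap >= MAX_TRAIN_HOLDOUT_GAP) failures.push("train-holdout gap too large");
  return { gap, pass: failures.length === 0, failures };
}
```

With the numbers reported here (0.952 training, 0.922 holdout) all three checks pass and the gap comes out at roughly 0.03. A tune that scored 0.95 on training but only 0.80 on holdout would clear both targets and still be rejected on the gap.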

Adds the "Validated by published benchmarks" subsection to the v0.36.1.0 CHANGELOG entry and a "Calibration loop" section to the README's "Receipts on the evals" surface. Both link to the new benchmark report at gbrain-evals/docs/benchmarks/2026-05-18-brainbench-cat14-cat15-calibration.md. CHANGELOG: also updates the propose_takes bullet to reflect that the v0.36.1.0 ship state now includes the tuned 'v0.36.1.0-tuned-cat15' prompt (back-ported in 04dbab4), not the v1 stub the original entry described. README: adds a Calibration loop entry to the receipts table sitting between source-aware ranking and prompt compression. Frames the cat14 + cat15 numbers as "first published benchmark for AI memory systems that reason about user track records" — honest SOTA framing since Hindsight introduced the concept without quantified evaluation. llms.txt + llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7 links to gbrain-evals/blob/master/docs/benchmarks/ were broken — the gbrain-evals repo uses 'main' as its default branch, not 'master'. Surfaced when I checked that the new cat14/cat15 link resolved post-PR-9 merge. Turned out 4 pre-existing links to longmemeval, brainbench-v0.20, brainbench-cat13b-source-swamp, and comparison-systems were all broken for the same reason — I just added a fifth by following the same wrong pattern. Sweep: gbrain-evals/blob/master/ → gbrain-evals/blob/main/ across both README.md (5 links) and CHANGELOG.md (2 links). llms.txt + llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
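The sweep above is a single path-prefix substitution. A minimal sketch, assuming the broken links all share the literal prefix quoted in the commit; `repointBenchmarkLinks` is a hypothetical helper and not the command that was actually run.

```typescript
// Rewrites gbrain-evals links from the non-existent 'master' branch to
// 'main' and reports how many links were touched.
function repointBenchmarkLinks(markdown: string): { text: string; rewritten: number } {
  const pattern = /gbrain-evals\/blob\/master\//g;
  const rewritten = (markdown.match(pattern) ?? []).length;
  const text = markdown.replace(pattern, "gbrain-evals/blob/main/");
  return { text, rewritten };
}
```

Returning the count makes it easy to compare against the expected totals (5 links in README.md, 2 in CHANGELOG.md) before committing.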

…1136)

* feat(dims): OpenAI text-embedding-3 Matryoshka range validation (D13)

dimsProviderOptions now fail-loud at the embed boundary when the configured embedding_dimensions is outside the model's native range (1..1536 for -small, 1..3072 for -large). Paste-ready fix hint in the AIConfigError.fix field. Closes the silent-HTTP-400 path that would have bit OpenAI-fallback users on v0.36.0.0 ZE-default installs.

16 new test cases in test/ai/dims-openai.test.ts pinning the contract across native-openai and openai-compatible adapter paths.

* feat(ai): flip defaults to ZeroEntropy zembed-1 1280d + zerank-2 reranker

Default embedding model is now zeroentropyai:zembed-1 at 1280d via Matryoshka. Real-corpus benchmark: 2.2x faster than OpenAI, 2.6x cheaper at regular pricing, wins 11/20 head-to-head queries. 1280 is the closest valid ZE Matryoshka step to the prior OpenAI 1536d default (valid set: 2560/1280/640/320/160/80/40). 1024 (Voyage's step) is NOT on ZE's list; pinned by AIConfigError fail-loud in dims.ts.

balanced mode bundle now defaults reranker_enabled=true. zerank-2 reshuffles 60% of top-1 results in benchmarks. Missing-key fail-open contract in src/core/search/rerank.ts handles unauthenticated cases. Opt out with: gbrain config set search.reranker.enabled false

Existing tests updated (gateway.test.ts, search-mode.test.ts) and a new test/balanced-reranker-default.test.ts (10 cases) pins the fail-open invariants.

* feat(retrieval-upgrade): RetrievalUpgradePlanner + interactive prompt UX

New src/core/retrieval-upgrade-planner.ts is the consolidated planner that computes the brain's pending retrieval-upgrade work (chunker bumps + ZE switch) in one pass and applies the schema transition + config updates atomically.

Tagged-union ApplyResult enum (D15): 'applied' | 'skipped_already_applied' | 'skipped_no_work' | 'declined' | 'planned' | 'failed'. No string-parsing reasons.
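The point of the tagged union in D15 is that callers branch on a closed set of values instead of parsing free-form reason strings. A minimal sketch of that pattern follows. The six status literals come from the commit message; the `status` and `detail` field names, the summary wording, and `summarizeApplyResult` are assumptions made for illustration and may not match the planner's real types.

```typescript
type ApplyStatus =
  | "applied"
  | "skipped_already_applied"
  | "skipped_no_work"
  | "declined"
  | "planned"
  | "failed";

// Hypothetical result shape: the real ApplyResult may carry other fields.
interface ApplyResult {
  status: ApplyStatus;
  detail?: string;
}

// Compile-time exhaustiveness guard: adding a seventh status without
// handling it below becomes a type error.
function assertNever(value: never): never {
  throw new Error(`unhandled ApplyResult status: ${String(value)}`);
}

function summarizeApplyResult(result: ApplyResult): string {
  switch (result.status) {
    case "applied":
      return "retrieval upgrade applied";
    case "skipped_already_applied":
      return "nothing to do: upgrade was already applied";
    case "skipped_no_work":
      return "nothing to do: no pending upgrade work";
    case "declined":
      return "upgrade declined by the user";
    case "planned":
      return "dry run: plan computed, nothing changed";
    case "failed":
      return `upgrade failed${result.detail ? `: ${result.detail}` : ""}`;
    default:
      return assertNever(result.status);
  }
}
```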
Three config keys (D12): ze_switch_prompt_shown (UI state), ze_switch_requested (user intent), ze_switch_applied (work done). Plus ze_switch_previous_snapshot (JSON, full prior config for --undo per D16) and ze_switch_declined_at (90-day re-ask window).

Schema transition (D18) is atomic: DROP indexes + ALTER COLUMN + CREATE INDEX inside a single engine.transaction(). HNSW recreation is part of the same transaction: no silent slow-search window.

C3 eligibility logic: ze_switch_offered iff NOT on ZE + NOT declined recently + NOT applied + (legacy default OR >100 pages).

C4 cost math: MAX(chunker_pending, dim_pending) not SUM: one re-embed pass invalidates both surfaces simultaneously.

New src/core/retrieval-upgrade-prompt.ts wires the planner to a TTY-only interactive prompt with two-line cost split (D10) and privacy callout for the reranker flip.

Tests: test/retrieval-upgrade-planner.test.ts (24 cases) pins the state machine. test/asymmetric-encoding-contract.test.ts (6 cases) pins D17: search read path uses gateway.embedQuery() not embed(), asserted via __setEmbedTransportForTests mock.

* feat(cli): gbrain ze-switch (manual lever for the ZE switch)

New gbrain ze-switch CLI with --dry-run, --json, --resume, --force, --undo, --non-interactive, --confirm-reembed, --ignore-missing-key flags. Mirrors the upgrade prompt's UX symmetry: --undo presents a cost-warning before re-embedding back to the prior width.

src/cli.ts: dispatch case + CLI_ONLY entry. ze-switch owns its own engine lifecycle (mirrors the doctor pattern).

test/ze-switch-cli.test.ts (11 cases): --help, --dry-run, --json, --non-interactive, --ignore-missing-key, --resume, --undo, --confirm-reembed. Uses captureExit harness to test process.exit() paths without breaking the test process.

* feat(doctor): ze_embedding_health + embedding_width_consistency checks

Two new doctor checks (D-A5):

ze_embedding_health: when embedding_model starts with zeroentropyai:, verify ZEROENTROPY_API_KEY is set (env or config).
Paste-ready setup hint with the signup URL on failure.

embedding_width_consistency: cross-check that the configured embedding_dimensions matches the actual vector(N) column width on content_chunks.embedding. Catches the half-applied switch state (schema migrated but config write crashed) with a paste-ready gbrain ze-switch --resume hint.

Wired into runDoctor between reranker_health and the existing sync_freshness checks. Both checks gracefully no-op on non-ZE embedding configs.

test/doctor-ze-checks.test.ts (8 cases) pins both checks across happy + missing-key + missing-config + drift paths. Uses withEnv() helper to clear ZEROENTROPY_API_KEY for the no-key path so tests are hermetic against contributor env state.

test/e2e/v0_28_5-fix-wave.test.ts + test/openai-compat-multimodal.test.ts: updated to explicit-configure the gateway when the test depends on specific dims that diverge from the v0.36.0.0 default (1280d).

* docs: README zero-based rewrite (884 -> 139 lines) + new docs files

Strip 4 months of accreted "New in v0.X.Y" hero blocks and reorganize around what gbrain does today. 33 H2s -> 8. The Commands section (136 lines duplicating gbrain --help) moved out; the 6-table skills enumeration collapsed to a one-paragraph capability description with a link to skills/RESOLVER.md.

Hero retains load-bearing facts: OpenClaw + Hermes credit, production numbers (17,888 pages / 4,383 people / 723 companies), BrainBench numbers (P@5 49.1% / R@5 97.9% / +31.4 lift), ZE comparison numbers, 30-min install claim. Adds one paragraph announcing the v0.36.0.0 ZE default with the explicit gbrain config set escape for OpenAI/Voyage users.

New files:
- docs/INSTALL.md: every install path consolidated (agent platform, CLI standalone, MCP server). Thin-client mode covered.
- docs/architecture/RETRIEVAL.md: why the hybrid + graph stack works. BrainBench numbers, why each strategy alone fails, the source-aware ranking + intent classification + multi-query expansion story.
- docs/ethos/ORIGIN.md: origin story lifted from the old README so the front door stays factual + concrete.

test/readme-hero-anchors.test.ts (5 cases) is the D9 regression guard. Five load-bearing strings: OpenClaw, Hermes, ZE, production-numbers regex, P@5/R@5. Light anchors that let voice/structure evolve but block accidental loss of headline facts.

scripts/check-test-real-names.sh: allowlist entries for OpenClaw + Hermes literals in the anchor test (it explicitly asserts those strings appear in README).

* chore: bump version and changelog (v0.36.0.0)

ZeroEntropy as the new default for embedding (zembed-1 at 1280d via Matryoshka) and reranker (zerank-2 cross-encoder, on by default in balanced mode bundle). README zero-based rewrite (884 -> 139 lines). 3 new docs files. Two new doctor checks. New gbrain ze-switch CLI with --undo for symmetric reversibility.

skills/migrations/v0.36.0.0.md tells the agent how to surface the retrieval-upgrade prompt post-upgrade. llms-full.txt regenerated via bun run build:llms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docs): scrub Wintermute from RETRIEVAL.md per privacy rule

* chore: rebump version 0.36.0.0 → 0.36.2.0 (queue collision)

Three open PRs were claiming v0.36.0.0 (#1130 skillpack, #1139 hindsight, #1136 this PR). Ship-aware queue allocator says this branch lands at v0.36.2.0.

Trio audit:
- VERSION 0.36.2.0
- package.json 0.36.2.0
- CHANGELOG ## [0.36.2.0] - 2026-05-17

Updates: VERSION, package.json, CHANGELOG header + body refs, README "New default in v0.36.2.0" announcement + credit line, skills/migrations/v0.36.0.0.md renamed to v0.36.2.0.md with frontmatter + body refs updated. llms-full.txt regenerated.

* fix(test): pin gateway dim=1536 in cross-file-stateful PGLite tests

CI shard 1 reported 10 failures across `query-cache.test.ts` (6) and `consolidate-valid-until.test.ts` (4).
Both files hardcode 1536-dim vectors but rely on `PGLiteEngine.initSchema()` to size `vector(__EMBEDDING_DIMS__)` at the right width.

Root cause: v0.36.2.0 flipped DEFAULT_EMBEDDING_DIMENSIONS from 1536 to 1280 (ZE Matryoshka step). The gateway module is process-singleton; when ANOTHER test file in the same shard's bun-test process configures the gateway before us, `pglite-engine.ts:216` reads `getEmbeddingDimensions() === 1280` and sizes the schema columns at vector(1280). The hardcoded 1536-dim INSERTs then fail with "expected 1280 dimensions, not 1536".

Locally these tests pass in isolation because the gateway falls back through the try/catch at pglite-engine.ts:218 (1536 default). CI runs multiple test files in one process, so cross-file state poisons the schema width.

Fix: explicit `resetGateway()` + `configureGateway({embedding_dimensions: 1536, ...})` at the top of `beforeAll`, plus `resetGateway()` in `afterAll`. Pins the schema width regardless of cross-file state.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
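The failure mode in this last commit is easier to see in a stripped-down model. The sketch below uses local stand-ins that share names with the functions mentioned above (`resetGateway`, `configureGateway`, `getEmbeddingDimensions`), but their bodies and signatures are assumptions made for illustration and not the real gateway module. `schemaWidth` and `insertVector` are hypothetical and only mimic the described fallback and the dimension check.

```typescript
// Process-wide singleton state, standing in for the real gateway module.
let configuredDims: number | null = null;

function configureGateway(opts: { embedding_dimensions: number }): void {
  configuredDims = opts.embedding_dimensions;
}

function resetGateway(): void {
  configuredDims = null;
}

function getEmbeddingDimensions(): number {
  if (configuredDims === null) throw new Error("gateway not configured");
  return configuredDims;
}

const FALLBACK_DIMS = 1536;

// Mimics the described schema sizing: ask the gateway, fall back on error.
function schemaWidth(): number {
  try {
    return getEmbeddingDimensions();
  } catch {
    return FALLBACK_DIMS;
  }
}

// Mimics the column-width check that rejects mismatched vectors.
function insertVector(columnWidth: number, vector: number[]): void {
  if (vector.length !== columnWidth) {
    throw new Error(`expected ${columnWidth} dimensions, not ${vector.length}`);
  }
}

// A test file that pins its own width is immune to whatever an earlier
// file in the same process configured.
function pinnedSchemaWidth(dims: number): number {
  resetGateway();
  configureGateway({ embedding_dimensions: dims });
  return schemaWidth();
}
```

In isolation nothing has configured the singleton, so `schemaWidth()` falls back to 1536 and the hardcoded vectors fit. After another file calls `configureGateway({ embedding_dimensions: 1280 })`, the same call returns 1280 and a 1536-dim insert is rejected, which is the CI-only failure. Pinning the width in `beforeAll` removes the dependence on file ordering.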

garrytan and others added 24 commits May 17, 2026 15:56

garrytan changed the title ~~v0.36.0.0 Hindsight calibration wave: brain learns how you tend to be wrong~~ v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong May 18, 2026

garrytan and others added 3 commits May 18, 2026 09:01

garrytan and others added 3 commits May 18, 2026 16:41

garrytan merged commit 3a0e111 into master May 19, 2026
7 checks passed

garrytan merged 31 commits into master from garrytan/asuncion
