v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong #1139
Merged
Foundation commit for the Hindsight-inspired calibration wave. Adds four new
tables + one perf index, all source-scoped from day 1 per v0.34.1 discipline:
- calibration_profiles (v67): per-holder LLM-narrative aggregation of
  TakesScorecard data. published BOOL gates E8 cross-brain mount sharing
  (default false). grade_completion REAL surfaces partial-grade state to the
  dashboard. active_bias_tags TEXT[] with GIN index feeds E3
  (calibration-aware contradictions) and E7 (real-time nudge matching).
- take_proposals (v68): propose_takes phase queue. Idempotency cache via
  (source_id, page_slug, content_hash, prompt_version) unique index mirrors
  the v0.23 dream_verdicts pattern. proposal_run_id supports --rollback by
  run. dedup_against_fence_rows JSONB audit column records what canonical
  takes the LLM was told to dedupe against at proposal time.
- take_grade_cache (v69): grade_takes verdict cache. Composite PK on
  (take_id, prompt_version, judge_model_id, evidence_signature): prompt edits
  OR evidence changes cleanly invalidate prior verdicts. applied=false
  default + auto-resolve-off-by-default (D17) means every fresh install needs
  operator opt-in before grade verdicts mutate the takes table.
- take_nudge_log (v70): E7 nudge cooldown state. Polymorphic FK: a nudge
  fires on either a canonical take OR a pending proposal (CDX-5 fix). CHECK
  constraint enforces exactly-one-set. channel column lets future routing
  (webhook, admin SPA toast) reuse the same cooldown semantics.
- takes_resolved_at_idx (v71): partial index for the Brier-trend aggregation
  queries. Engine-aware handler: Postgres uses CONCURRENTLY to avoid the
  ShareLock; PGLite uses plain CREATE.
Every table carries wave_version TEXT NOT NULL DEFAULT 'v0.36.0.0' so the
v0.36.0.0 calibration --undo-wave command (lands later in the wave) can
reverse just this wave's writes.
Plan: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md
covers the design rationale (D17/D18/D21 + CDX findings).
Schema parity:
- src/schema.sql for fresh Postgres installs
- src/core/pglite-schema.ts for fresh PGLite installs
- src/core/schema-embedded.ts auto-regenerated from schema.sql
- src/core/migrate.ts for upgrade-in-place from older brains
VERSION bumped to 0.36.0.0 for the wave. CHANGELOG entry lands at /ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntracts
D21 from the eng review. Three new v0.36.0.0 cycle phases (propose_takes,
grade_takes, calibration_profile) share enough structure that the
duplication-vs-abstraction trade tips toward a shared base. Without this
scaffold, source-isolation discipline would drift exactly the way it drifted
in v0.34.1, except this time across three new surfaces at once.
What this enforces:
1. Phase signature is uniform: run(ctx, opts) → PhaseResult.
2. ctx.sourceId / ctx.auth.allowedSources MUST be threaded through every
   engine call. The base class surfaces a scope() helper that wraps
   sourceScopeOpts(ctx) and is the only sanctioned way to read source-scoped
   data. Forgetting to thread source scope becomes a TypeScript compile
   error, not a runtime leak. Closes the v0.34.1 leak class structurally for
   every new phase.
3. Budget meter wraps run() automatically. Subclass declares budgetUsdKey +
   budgetUsdDefault; base reads the resolved cap from config and creates the
   BudgetMeter. Subclass calls this.checkBudget() before each LLM submit;
   budget-exhausted phase still returns status='ok' (clean abort) so the
   cycle report shows partial completion, not failure.
4. Error envelope is uniform. Thrown errors get caught and converted to
   status='fail' with a phase-specific error.code via the subclass's
   mapErrorCode() hook.
5. Progress reporter integration. Base accepts the reporter via opts;
   subclasses call this.tick() instead of touching the reporter directly, so
   the phase name in the progress stream is always correct.
Tests: 13 cases in test/core/base-phase.test.ts cover source-scope threading
(5 cases including the empty-allowedSources-MUST-NOT-widen-scope regression),
PhaseResult shape including the error envelope path (3 cases), dry-run
propagation (2 cases), and budget meter construction (3 cases including
config-key override).
Synthesize.ts / patterns.ts (existing pre-v0.36 phases) deliberately do NOT retrofit to this base in v0.36.0.0 — too much churn for a refactor that doesn't pay off until v0.37+. Future phases use this by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLM-based take extraction from markdown prose. Walks pages updated since
last cycle, sends each page's body to a tuned extractor, writes the
extracted gradeable claims to the take_proposals queue. User accepts /
rejects via `gbrain takes propose --review` (lands in Lane C).
Cycle wiring:
lint → backlinks → sync → synthesize → extract → extract_facts →
resolve_symbol_edges → patterns → recompute_emotional_weight →
consolidate → propose_takes (NEW) → grade_takes (NEW; T4) →
calibration_profile (NEW; T6) → embed → orphans → purge
CyclePhase enum extended with 3 new entries; ALL_PHASES + NEEDS_LOCK_PHASES
updated. All three new phases acquire the cycle lock (writes to
take_proposals / take_grade_cache / calibration_profiles).
Idempotency contract:
The (source_id, page_slug, content_hash, prompt_version) composite unique
index on take_proposals means an unchanged page never re-spends LLM
tokens. Bumping PROPOSE_TAKES_PROMPT_VERSION cleanly invalidates the
cache so a tuned prompt re-runs proposals on every page. Mirrors the
v0.23 dream_verdicts pattern.
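The idempotency contract above can be sketched as a lookup keyed on the four-column tuple. This is an in-memory illustration only, not the phase's code: `ProposalCacheKey`, `cacheKeyOf`, and `needsExtraction` are hypothetical names, and a `Set` stands in for the unique index on take_proposals.

```typescript
// Hypothetical shape of the idempotency tuple. Field names mirror the
// unique-index columns named in the commit message.
interface ProposalCacheKey {
  sourceId: string;
  pageSlug: string;
  contentHash: string;
  promptVersion: string;
}

// Serialize the tuple. JSON.stringify of an array avoids ambiguity when a
// field itself contains a separator character.
function cacheKeyOf(k: ProposalCacheKey): string {
  return JSON.stringify([k.sourceId, k.pageSlug, k.contentHash, k.promptVersion]);
}

// Returns true when the extractor must run (no prior row for this tuple) and
// records the tuple, so a second call with the same inputs is a cache hit.
function needsExtraction(seen: Set<string>, k: ProposalCacheKey): boolean {
  const key = cacheKeyOf(k);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

Because prompt_version is part of the key, bumping it makes every tuple new again, which is the "cleanly invalidates the cache" behavior described above.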
F2 fence dedup:
The phase reads the page's existing `<!-- gbrain:takes:begin -->` fence
(when present) and passes the canonical take rows to the extractor as
"things you have already captured." Prevents duplicate proposals when
prose is appended to a page that already has takes. Records the fence
rows the LLM was told to dedupe against on the take_proposals row for
audit (dedup_against_fence_rows JSONB).
Auto-resolve posture:
propose_takes only WRITES proposals to the queue. Nothing in this phase
mutates the canonical takes table. Operator opt-in via the queue review
CLI (Lane C) is the only path from queue to canonical fence (D17).
Prompt tuning status (v0.36.0.0 ship state):
The default extractor prompt is annotated `v0.36.0.0-stub`. The real
tuned prompt arrives via T19 synthetic corpus build (50 anonymized
pages, 3-model parallel extraction, user reviews disagreement set,
F1 ≥ 0.85 on training corpus + F1 ≥ 0.8 on ground-truth holdout).
Until T19 lands, propose_takes runs but produces best-effort candidates
the user reviews manually.
Architecture:
ProposeTakesPhase extends BaseCyclePhase (T2). Inherits source-scope
threading via scope(), budget metering via this.checkBudget(), error
envelope wrapping. budgetUsdKey: cycle.propose_takes.budget_usd
(default $5/cycle). Budget exhaustion mid-page returns status='warn'
with details.budget_exhausted=true — clean partial-completion semantics.
Test seam: opts.extractor injection so the phase can run hermetically
without touching the gateway. defaultExtractor (production path) calls
gateway.chat with the EXTRACT_TAKES_PROMPT and parses the JSON array
output via parseExtractorOutput.
parseExtractorOutput defends against common LLM output sins: markdown
code fence wrapping, leading prose, single-object instead of array,
unknown kind values, weight out of [0,1], rows missing claim_text or
exceeding 500 chars.
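A minimal sketch of the defensive parsing listed above, under stated assumptions: the row shape, the function name `parseExtractorOutputSketch`, and the caller-supplied `allowedKinds` set are illustrative, and clamping (rather than dropping) out-of-range weights is a guess, since the commit message does not say which the real parser does.

```typescript
// Hypothetical row shape; the real take_proposals columns may differ.
interface ProposedTake {
  claim_text: string;
  kind: string;
  weight: number;
}

const MAX_CLAIM_CHARS = 500;

function parseExtractorOutputSketch(
  raw: string,
  allowedKinds: ReadonlySet<string>,
): ProposedTake[] {
  // Slice from the first '[' or '{' to the last ']' or '}' so that a
  // markdown code fence or leading prose around the JSON is ignored.
  const start = raw.search(/[\[{]/);
  const end = Math.max(raw.lastIndexOf("]"), raw.lastIndexOf("}"));
  if (start < 0 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return [];
  }
  // A single object is treated as a one-element array.
  const rows: unknown[] = Array.isArray(parsed) ? parsed : [parsed];

  const out: ProposedTake[] = [];
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const r = row as Record<string, unknown>;
    const claim = typeof r.claim_text === "string" ? r.claim_text.trim() : "";
    if (claim.length === 0 || claim.length > MAX_CLAIM_CHARS) continue;
    if (typeof r.kind !== "string" || !allowedKinds.has(r.kind)) continue;
    if (typeof r.weight !== "number" || !Number.isFinite(r.weight)) continue;
    // Assumption: out-of-range weights are clamped into [0, 1].
    const weight = Math.min(1, Math.max(0, r.weight));
    out.push({ claim_text: claim, kind: r.kind, weight });
  }
  return out;
}
```

The slice-then-parse approach is deliberately crude: prose that itself contains brackets would defeat it, so a production parser likely needs something stricter.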
Tests: 25 cases in test/propose-takes.test.ts cover the 4 pure helpers
(parseExtractorOutput, contentHash, hasCompleteFence,
extractExistingTakesForDedup) + 7 phase integration scenarios (happy path,
cache hit, fence dedup, extractor failure, empty pages, skipPagesWithFence,
proposal_run_id stability).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks unresolved takes that are old enough to have outcome data, retrieves
evidence from the brain, asks a judge model to verdict each one. Writes
verdicts to take_grade_cache. Optionally (only when operator has flipped the
opt-in config flag) auto-applies high-confidence verdicts to the canonical
takes table via engine.resolveTake.
Auto-resolve posture (D17, DISABLED by default):
On a fresh install, grade_takes runs and writes verdicts to the cache, but
applied=false on every row. Operator reviews the queue, then flips
`cycle.grade_takes.auto_resolve.enabled: true` once trust is earned. Mirrors
the propose_takes review-queue posture: queue exists, mutation requires
explicit opt-in.
Conservative threshold (D12):
When auto_resolve.enabled is true, a verdict auto-applies only when
confidence >= 0.95 (single-judge path). T5 ensemble path lands next,
tightening this further with 3/3 unanimous requirement. 'unresolvable'
verdict NEVER auto-applies even at confidence=1.0: there's no canonical
column for "we tried and there's no evidence yet."
Evidence retrieval status (v0.36.0.0 ship state):
The default evidence retriever returns an "evidence-retrieval not yet wired"
placeholder. Most verdicts produced by the stub-judge against the
stub-evidence will be 'unresolvable'. Real retrieval (hybrid search over
pages newer than the take's since_date, optionally augmented by a gateway
web-search recipe in v0.37+) lands as a follow-up. Documented limitation per
CDX-8 + D17: the phase ships now so the wiring is real and the cache table
accumulates verdicts even if early ones are conservative.
Cache key:
Composite primary key on take_grade_cache is (take_id, prompt_version,
judge_model_id, evidence_signature). Prompt edits OR evidence changes OR
judge swap cleanly invalidate prior verdicts. Mirrors the v0.32.6
eval_contradictions_cache pattern. evidence_signature = SHA-256 of
(judge_model_id + '|' + evidence_text) so identical evidence under a
different judge does NOT collide.
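The evidence_signature formula above can be illustrated with Node's built-in crypto module. This is a sketch, not the shipped helper: the function name `evidenceSignatureSketch` and the hex digest encoding are assumptions; only the hashed string (judge_model_id + '|' + evidence_text) comes from the commit message.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over "<judge_model_id>|<evidence_text>". Hex encoding is an
// assumption; the commit message does not name the digest encoding.
function evidenceSignatureSketch(judgeModelId: string, evidenceText: string): string {
  return createHash("sha256")
    .update(`${judgeModelId}|${evidenceText}`, "utf8")
    .digest("hex");
}
```

Because the judge id is inside the hashed string, the same evidence graded by a different judge yields a different signature, so the two verdicts occupy separate cache rows.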
Architecture:
GradeTakesPhase extends BaseCyclePhase. Inherits source-scope threading,
budget metering (cycle.grade_takes.budget_usd, default $3/cycle), error
envelope. Test seam: opts.judge + opts.evidenceRetriever injection so the
phase runs hermetically.
parseJudgeOutput defends against fence-wrapping, leading prose, out-of-range
confidence (clamps to [0,1]), invalid verdict labels, oversized reasoning
(truncated at 400 chars). Returns null on unrecoverable parse; caller treats
null as "judge_output_parse_failed / unresolvable at confidence 0.0" so the
row still lands in cache with the parse failure surfaced via warnings.
takeIsOldEnough gates on since_date (default 6 months). Tolerates YYYY-MM-DD
and YYYY-MM formats. Returns false on null/unparseable since_date so takes
without dates never get graded (we'd be hallucinating temporal context).
Tests: 23 cases covering parseJudgeOutput (7 cases), evidenceSignature (3),
takeIsOldEnough (5), and 8 phase integration scenarios: happy path, D17
auto-resolve-off default, D12 above-threshold auto-apply, below-threshold
cache-only, unresolvable-NEVER-applies, cache hit, too-recent gate,
judge-throw warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-judge ensemble tiebreaker, additive on top of T4's single-judge
foundation. Reuses gateway.chat as the per-model judge interface; runs
three judges in parallel via Promise.allSettled. Pure aggregation logic
in aggregateEnsemble() — no SQL, no LLM, hermetically testable.
When ensemble fires (T5 trigger band):
Only when ALL of:
- opts.useEnsemble === true (default false)
- opts.ensembleJudges array is non-empty
- single-model confidence in [0.6, 0.95) (configurable via
opts.ensembleTriggerBand)
- single-model verdict !== 'unresolvable'
At or above 0.95 the single judge is already sufficient (T4 path). Below 0.6
the verdict is clearly review-only — ensemble wouldn't change the
posture. 'unresolvable' from single-judge means no evidence yet; calling
three more judges on the same evidence won't manufacture some.
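The four trigger conditions above reduce to one pure predicate. The sketch below is illustrative: the option names come from the commit message, but the interface shapes, the tuple type for the band, and the use of model-id strings for ensembleJudges are assumptions.

```typescript
interface SingleVerdict {
  verdict: string;
  confidence: number;
}

interface EnsembleOpts {
  useEnsemble?: boolean;
  // Assumption: judges are identified by model-id strings.
  ensembleJudges?: readonly string[];
  // Assumption: half-open band [low, high), matching "[0.6, 0.95)".
  ensembleTriggerBand?: readonly [number, number];
}

const DEFAULT_TRIGGER_BAND: readonly [number, number] = [0.6, 0.95];

function shouldFireEnsemble(single: SingleVerdict, opts: EnsembleOpts): boolean {
  if (opts.useEnsemble !== true) return false;
  if (!opts.ensembleJudges || opts.ensembleJudges.length === 0) return false;
  if (single.verdict === "unresolvable") return false;
  const [low, high] = opts.ensembleTriggerBand ?? DEFAULT_TRIGGER_BAND;
  // Inclusive lower bound, exclusive upper bound.
  return single.confidence >= low && single.confidence < high;
}
```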
Conservative auto-apply (D12):
Ensemble verdict auto-applies via engine.resolveTake only when ALL of:
- autoResolve === true (operator opt-in per D17)
- ensemble.agreement === 3 (3/3 unanimous)
- ensemble.minConfidence >= ensembleThreshold (default 0.85)
- winning verdict !== 'unresolvable'
Schema-level monotonic-tightening guard for ensembleThreshold lives in
the takes resolution layer.
Cache identity:
When ensemble fires, the cache row's judge_model_id becomes
'ensemble:<modelA>+<modelB>+<modelC>' — a future re-run with different
ensemble membership doesn't collide with prior verdicts. evidence_signature
is recomputed because it includes the judge_model_id.
aggregateEnsemble (pure):
- 3/3 unanimous → agreement=3, minConfidence=min across the three
- 2/3 majority → agreement=2, minConfidence across the agreeing two
- 1/1/1 disagreement → tie-break: prefer non-'unresolvable', then
alphabetical for determinism
- 'unresolvable' from one model NEVER tips a 2-vote majority toward
'unresolvable' — by-label tally only counts a model toward its own
label
- All three judges failing (allSettled rejected) → verdict='unresolvable'
with agreement=0; auto-apply path blocked
- Single judge survives + two fail → agreement=1; the lone verdict wins
but auto-apply gated by the 3/3 requirement
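One way to read the aggregation rules above is as a by-label tally followed by a deterministic sort. The sketch below follows that reading; it is not the shipped aggregateEnsemble(). The verdict label set, the result shape, and minConfidence=0 for the all-failed case are assumptions.

```typescript
// Assumption: these four labels are the full verdict vocabulary.
type Verdict = "correct" | "incorrect" | "partial" | "unresolvable";

interface JudgeVerdict {
  verdict: Verdict;
  confidence: number;
}

interface EnsembleResult {
  verdict: Verdict;
  agreement: number;
  minConfidence: number;
}

function aggregateEnsembleSketch(
  settled: PromiseSettledResult<JudgeVerdict>[],
): EnsembleResult {
  // Tally confidences by label; rejected judges contribute nothing.
  const byLabel = new Map<Verdict, number[]>();
  for (const s of settled) {
    if (s.status !== "fulfilled") continue;
    const confs = byLabel.get(s.value.verdict) ?? [];
    confs.push(s.value.confidence);
    byLabel.set(s.value.verdict, confs);
  }
  // Every judge failed: nothing to aggregate, auto-apply stays blocked.
  if (byLabel.size === 0) {
    return { verdict: "unresolvable", agreement: 0, minConfidence: 0 };
  }
  // Most votes first; on a tie prefer non-'unresolvable', then alphabetical.
  const ranked = [...byLabel.entries()].sort(([a, ac], [b, bc]) => {
    if (bc.length !== ac.length) return bc.length - ac.length;
    const au = a === "unresolvable" ? 1 : 0;
    const bu = b === "unresolvable" ? 1 : 0;
    if (au !== bu) return au - bu;
    return a < b ? -1 : a > b ? 1 : 0;
  });
  const [verdict, confs] = ranked[0]!;
  return { verdict, agreement: confs.length, minConfidence: Math.min(...confs) };
}
```

The input type matches what Promise.allSettled returns, so the function can be tested with hand-built settled results and no LLM or SQL in the loop.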
Tests: 16 cases.
aggregateEnsemble (6): 3/3, 2/3, 1/1/1, unresolvable-tipping-resistance,
all-failed, partial-failed-but-survives.
Phase trigger conditions (5): useEnsemble=false default, useEnsemble=true
in borderline band, single >= 0.95 skip, single < 0.6 skip, single =
'unresolvable' skip.
Phase auto-apply rules (5): 3/3+threshold+autoResolve, 2/3 majority no
apply, 3/3 below threshold no apply, one ensemble judge throws still
aggregates from allSettled, empty ensembleJudges falls through to
single.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(T6)
The calibration narrative layer. Reads TakesScorecard, asks an LLM to
write 2-4 conversational pattern statements ("right on tactics, late on
macro by 18 months"), passes them through the voice gate, derives active
bias tags, writes the row to calibration_profiles. This is the read-side
that E1 (think anti-bias rewrite), E3 (contradictions join), E6
(dashboard), and E7 (real-time nudges) all consume.
Voice gate (D24 — single function, multiple surfaces):
ALL five calibration UX surfaces import the same gateVoice() function
from src/core/calibration/voice-gate.ts. Mode parameter
('pattern_statement' | 'nudge' | 'forecast_blurb' | 'dashboard_caption'
| 'morning_pulse') drives surface-specific tuning via the rubric the
gate ships to its Haiku judge. NO forked implementations — voice
rubric drift would defeat the gate.
Each mode's rubric explicitly forbids preachy / clinical / corporate
voice; a structural test pins this. Anchors the cross-cutting voice
rule from /plan-ceo-review D2-D8.
Fallback policy (D11):
Up to 2 generation attempts (configurable). On both rejects → fall back
to a hand-written template from src/core/calibration/templates.ts.
Templates are intentionally short and a little "robotic" — they're the
safety net, not the destination. voice_gate_passed=false +
voice_gate_attempts get persisted on the calibration_profiles row so
the operator can review the failing examples and tune the rubric over
time. Suppressing the surface silently is NEVER an option — that's how
voice quality silently degrades.
parseJudgeOutput defaults to 'academic' on parse failure (NEVER
pass-through), so a garbled Haiku output falls through to the template
rather than letting unverified text reach the user.
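The fallback policy can be sketched as a bounded retry loop with a template at the end. This is an illustration under stated assumptions: `renderWithVoiceGate` and its injected callbacks are hypothetical, and the loop is synchronous for brevity, whereas a gate that calls an LLM judge would be async. The two audit field names are the ones mentioned above.

```typescript
interface GateResult {
  text: string;
  voice_gate_passed: boolean;
  voice_gate_attempts: number;
}

function renderWithVoiceGate(
  generate: (attempt: number) => string, // hypothetical generator callback
  judge: (candidate: string) => boolean, // hypothetical gate: true = passes
  fallbackTemplate: () => string,        // hand-written template fallback
  maxAttempts = 2,
): GateResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(attempt);
    let passed = false;
    try {
      passed = judge(candidate);
    } catch {
      // A judge failure counts as a reject, never as a pass-through.
      passed = false;
    }
    if (passed) {
      return { text: candidate, voice_gate_passed: true, voice_gate_attempts: attempt };
    }
  }
  // All attempts rejected: emit the template and record the failure for audit.
  return {
    text: fallbackTemplate(),
    voice_gate_passed: false,
    voice_gate_attempts: maxAttempts,
  };
}
```

Note that the function always returns text: the surface is never suppressed, which is the point of the policy.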
calibration_profile phase:
Extends BaseCyclePhase. Cold-brain skip: <5 resolved takes → no row
written, no LLM call. Otherwise: scorecard via engine.getScorecard()
→ patterns via voice-gated generator → bias tags via separate
generator (best-effort; failure logs warning, phase continues).
The DB INSERT lands in the v67 calibration_profiles row with
source_id, holder, the patterns, voice gate audit fields, active bias
tags, and grade_completion (F1 fix — partial-grade state surfaces to
the dashboard "60% graded" badge).
Budget gate at $0.50/cycle default (mostly Haiku). If the budget check that
runs before the LLM call fails, the phase returns status='warn' without
writing the row.
Per-domain scorecards are a placeholder for v0.36.0.0 ship state —
the F12 batchGetTakesScorecards() engine method that powers per-domain
rendering lands in Lane C alongside the CLI/MCP surface.
Architecture:
parsePatternStatementsOutput is tolerant of LLM emitting numbered
lists / bulleted lines despite the prompt asking for plain lines.
Caps at 4 patterns + drops excessively long lines (>200 chars).
parseBiasTagsOutput lowercases input + drops non-kebab-case tokens
(defends against the LLM emitting "Over-Confident Geography" with
spaces or capitals). Caps at 4 tags.
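A minimal sketch of both tolerant parsers, assuming one pattern per line and tags separated by newlines or commas (the commit message does not specify the delimiter). Function names carry a Sketch suffix to mark them as illustrative rather than the shipped helpers.

```typescript
const MAX_PATTERNS = 4;
const MAX_PATTERN_CHARS = 200;
const MAX_TAGS = 4;
// Lowercase alphanumeric segments joined by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function parsePatternStatementsSketch(raw: string): string[] {
  return raw
    .split(/\r?\n/)
    // Strip a leading "1.", "2)", "-", "*", or "•" list marker if present.
    .map((line) => line.trim().replace(/^(?:\d+[.)]|[-*•])\s+/, "").trim())
    .filter((line) => line.length > 0 && line.length <= MAX_PATTERN_CHARS)
    .slice(0, MAX_PATTERNS);
}

function parseBiasTagsSketch(raw: string): string[] {
  return raw
    .toLowerCase()
    .split(/[\n,]/) // assumption: newline- or comma-separated tags
    .map((token) => token.trim())
    .filter((token) => KEBAB_CASE.test(token))
    .slice(0, MAX_TAGS);
}
```

Lowercasing before the kebab-case check means "Late-On-Macro-Tech" is normalized and kept, while "Over-Confident Geography" still fails because of the space.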
Tests: 43 cases across two new test files.
voice-gate.test.ts (24): parseJudgeOutput (7), gateVoice happy path
(3), fallback path (5), mode parity (2), templates (7).
calibration-profile.test.ts (19): parsers (10), pickFallbackSlots
(3), phase integration (6 — cold-brain skip, happy path, voice gate
fallback, grade_completion plumbed through, bias-tags failure
non-fatal, source_id scope reaches INSERT).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Public-facing read surface for the v0.36.0.0 calibration wave. CLI prints
the active calibration profile; MCP op exposes the same data path for
agents. Mirror of the v0.29 salience/anomalies shape (pure data fn + JSON
formatter + human formatter + thin CLI dispatch).
CLI: `gbrain calibration`
Flags:
--holder <id> specific holder (default 'garry')
--json machine output for piping
--regenerate run calibration_profile phase now
--undo-wave <ver> [placeholder — wires in Lane D / T17]
ab-report [placeholder — wires in Lane D / T18]
Human output:
Calibration profile — holder: garry, source: default
Generated: <local timestamp>
[Note: built on 60% graded — partial completion this cycle.] (when grade_completion < 0.9)
[Note: voice gate fell back to template (2 attempts).] (when voice_gate_passed=false)
Resolved: 12 takes
Brier: 0.210 (lower is better)
Accuracy: 60.0%
Partial: 10.0%
Pattern statements:
• You called early-stage tactics well — 8 of 10 held up.
Active bias tags: over-confident-geography
Cold-brain fallback message names the exact dream command to run.
MCP: `get_calibration_profile` (scope: read)
Param: holder?: string (defaults to 'garry')
Returns: latest CalibrationProfileRow | null
Source-scoping via sourceScopeOpts(ctx): scalar source-bound clients see
only their source; federated_read scopes see the union of allowed sources;
no source filter when neither is set (CLI default path).
Throws GBrainError('INVALID_HOLDER') on empty/non-string holder so
remote callers get a structured error instead of a SQL-shape failure.
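The three source-scoping cases can be sketched as a small pure function. This is a guess at the shape, not the real sourceScopeOpts: the context and result interfaces are hypothetical, the scalar-before-federated precedence is an assumption, and returning an empty sourceIds array (rather than no filter) for an empty allowedSources list is inferred from the "MUST-NOT-widen-scope" regression test named in the base-phase commit.

```typescript
// Hypothetical context shape based on the field names in the commit message.
interface ScopeCtx {
  sourceId?: string;
  auth?: { allowedSources?: readonly string[] };
}

// Hypothetical result: at most one of the two fields is set.
interface SourceScope {
  sourceId?: string;
  sourceIds?: readonly string[];
}

function sourceScopeOptsSketch(ctx: ScopeCtx): SourceScope {
  // Scalar source-bound client: sees only its own source.
  if (ctx.sourceId !== undefined) return { sourceId: ctx.sourceId };
  // Federated read: union of allowed sources. An empty list stays an empty
  // filter (matches nothing) instead of silently widening to all sources.
  const allowed = ctx.auth?.allowedSources;
  if (allowed !== undefined) return { sourceIds: [...allowed] };
  // Neither set: no source filter (the CLI default path).
  return {};
}
```

The result is meant to be spread into an engine call's options, which is the "spread pattern" the architecture notes refer to.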
Architecture:
getLatestProfile is the pure data fn — engine + opts → CalibrationProfileRow | null.
Reused by both the CLI and the MCP op. Source-scoped via the standard
v0.34.1 spread pattern (scalar sourceId vs sourceIds array).
formatProfileText is pure — null → cold-brain message, populated → full
printout. Annotates partial-grade rows and voice-gate-fallback rows so
the operator sees data-quality status inline.
parseArgs is exported via __testing for unit coverage. Sub-command
('ab-report') vs flag distinction is intentional — keeps the surface
parallel with `gbrain eval cross-modal` etc.
Tests: 21 cases.
parseArgs (6 cases): empty, --holder, --json, --regenerate, --undo-wave, ab-report.
getLatestProfile (5 cases): happy, null, scalar source scope, federated array
scope, no-source-filter default.
formatProfileText (5 cases): cold-brain, happy, partial-grade note, voice-fallback
note, published-to-mounts note.
getCalibrationProfileOp (5 cases): default holder, scalar source scope,
federated scope union, returns-null-on-unknown-holder, throws on empty holder.
Lane D follow-ups: --undo-wave (T17) and ab-report (T18) print a clear
"lands in Lane D" stderr line + exit 2; the surfaces exist for early
testers, the implementations land next.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optional anti-bias rewrite mode for `gbrain think`. When set, the active
calibration profile gets injected per the D22 placement spec (AFTER
retrieval evidence, BEFORE the user's question). The bias filter applies
to QUESTION FRAMING, not evidence interpretation — matches LLM-as-judge
best practice (bias prompts near end of context perform better).
Default behavior unchanged (R1 regression guard): omitting
--with-calibration produces the v0.28-vintage user-message shape with the
question first, then retrieval. Existing think users see no change.
Two user-message shapes in buildThinkUserMessage:
Default (no calibration):
Question: X
<pages>...</pages>
<takes>...</takes>
<graph>...</graph>
Respond with a single JSON object...
With calibration (D22):
<pages>...</pages>
<takes>...</takes>
<graph>...</graph>
<calibration holder="garry">
Track record: Brier 0.210 (lower is better).
Active patterns:
- You called early-stage tactics well — 8 of 10 held up.
Active bias tags: over-confident-geography
</calibration>
Question: X
Respond...
Calibration block is built by buildCalibrationBlock (exported for the
E3 contradictions probe to render the same shape).
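The two message shapes differ only in ordering, which a short sketch makes concrete. The code below is illustrative: the profile interface, parameter lists, and Sketch-suffixed names are assumptions, and the retrieval blocks are passed in as one pre-rendered string.

```typescript
// Hypothetical minimal profile shape for rendering.
interface CalibrationProfileLite {
  holder: string;
  brier: number | null;
  patterns: string[];
  activeBiasTags: string[];
}

function buildCalibrationBlockSketch(p: CalibrationProfileLite): string {
  const lines = [`<calibration holder="${p.holder}">`];
  // Omit the track-record line entirely when Brier is null.
  if (p.brier !== null) {
    lines.push(`Track record: Brier ${p.brier.toFixed(3)} (lower is better).`);
  }
  lines.push("Active patterns:");
  for (const statement of p.patterns) lines.push(`- ${statement}`);
  lines.push(`Active bias tags: ${p.activeBiasTags.join(", ")}`);
  lines.push("</calibration>");
  return lines.join("\n");
}

function buildThinkUserMessageSketch(
  question: string,
  retrieval: string, // pre-rendered <pages>/<takes>/<graph> blocks
  instruction: string,
  profile?: CalibrationProfileLite,
): string {
  const q = `Question: ${question}`;
  const parts = profile
    ? [retrieval, buildCalibrationBlockSketch(profile), q, instruction] // D22
    : [q, retrieval, instruction]; // default shape, question first
  return parts.join("\n");
}
```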
System prompt extension (withCalibration:true):
- Names BOTH the user's PRIOR (default reasoning) AND the COUNTER-PRIOR
from their hedged-domain self.
- References active bias tags by name when relevant ("this fits the
over-confident-geography pattern").
- Does NOT silently substitute the debiased answer. ALWAYS surfaces
both priors transparently.
- Adds a "Calibration" section between Conflicts and Gaps in the
answer body.
RunThinkOpts extension:
- withCalibration?: boolean — opt-in
- calibrationHolder?: string — defaults to 'garry'
When withCalibration=true and no profile exists, runThink falls back to
baseline behavior + pushes NO_CALIBRATION_PROFILE to warnings (visible
to the operator). When the calibration fetch fails, CALIBRATION_FETCH_FAILED
warning surfaces with the underlying error. Either path keeps think working;
the calibration loop is enhancement, not requirement.
CLI: `gbrain think "<q>" --with-calibration [--calibration-holder <id>]`
Tests: 11 cases.
buildThinkSystemPrompt (4 cases): R1 regression — default/false/omitted
→ no anti-bias rules; with calibration → adds PRIOR + COUNTER-PRIOR +
bias-tag reference; preserves existing hard rules.
buildCalibrationBlock (3 cases): happy path, null brier omitted (not
"Brier null"), empty patterns + tags still well-formed.
buildThinkUserMessage (4 cases): R1 regression — without calibration:
question first; D22 placement — retrieval → calibration → question →
instruction; graph + calibration ordering; empty retrieval blocks render
placeholders without breaking shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-references each contradiction finding against the active calibration
profile. When a contradiction's domain matches an active bias tag (e.g.
"over-confident-geography" or "late-on-macro-tech"), the output gains a
one-line bias context explaining which pattern this fits.
Pure functions only — no DB writes, no LLM calls. The probe runner imports
tagFindingWithCalibration() and applies it to each finding before emitting.
When no profile exists or no tags match, the helper returns null and the
runner emits the unchanged finding (regression R2 — contradictions output
is byte-identical to v0.32.6 when no calibration profile is present).
Match heuristic (v0.36.0.0 ship-state):
Bias tags are kebab-case axis-then-domain slugs ('over-confident-geography').
computeDomainHint() extracts a domain hint from the finding's slugs +
holder + verdict text:
- wiki/companies/... → hiring | market-timing
- wiki/people/... → founder-behavior
- macro / geography / tactics / ai segments in slug → matching tag
First-match-wins for ordering determinism.
Match is intentionally fuzzy — the v0.32.6 contradictions probe doesn't
yet carry structured domain metadata. v0.37+ structured-domain-on-takes
(Hindsight-style enum) tightens this.
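A rough sketch of the slug-based heuristic, under several assumptions: it looks only at the slug (the real computeDomainHint also reads holder and verdict text), it checks rules in the order listed above, and it treats a tag as matching when the hint appears as a whole hyphen-delimited run inside it. All names here are illustrative.

```typescript
// Returns candidate domain hints for a slug, in priority order.
function computeDomainHintsSketch(slug: string): string[] {
  if (slug.startsWith("wiki/companies/")) return ["hiring", "market-timing"];
  if (slug.startsWith("wiki/people/")) return ["founder-behavior"];
  const segments = slug.toLowerCase().split(/[\/\-_]/);
  for (const word of ["macro", "geography", "tactics", "ai"]) {
    if (segments.includes(word)) return [word];
  }
  return [];
}

// First match wins: hints are tried in order, tags in profile order.
function matchBiasTagSketch(
  slug: string,
  activeBiasTags: readonly string[],
): string | null {
  for (const hint of computeDomainHintsSketch(slug)) {
    for (const tag of activeBiasTags) {
      // Pad with hyphens so "ai" cannot match inside an unrelated word.
      if (`-${tag}-`.includes(`-${hint}-`)) return tag;
    }
  }
  return null;
}
```

Returning null for "no hint" and "no matching tag" alike is what lets the runner emit the finding unchanged, preserving the R2 byte-identical guarantee.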
Output:
Returns { bias_tag: string, context: string } | null.
Context format: "This contradiction fits your active bias pattern
\"<tag>\" (Brier 0.31). Verdict: contradiction; severity: medium.
Consider reviewing both sides through the lens of that pattern."
Tests: 13 cases.
R2 regression (2): null profile → null tag; empty active_bias_tags → null tag.
computeDomainHint (5): companies / people / macro / geography / unknown
paths produce expected hints.
Match path (4): macro→late-on-macro-tech, geography→over-confident-geography,
mismatch returns null, first-match-wins with multiple candidate tags.
buildBiasContextString (2): emits tag+verdict+severity+Brier; omits
Brier when null (no "Brier null" leak).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure math layer over existing TakesScorecard data. Zero new LLM cost, zero
new schema. Surfaces the user's historical Brier for the take's
(holder, domain) bucket at write time so they see "your historical Brier
in macro takes is 0.31" before committing the take.
Voice-gate-rendered output:
The user-facing string goes through gateVoice mode='forecast_blurb' via
templates.ts (already in T6). This module is the pure data layer; the
template renders the math into the conversational voice.
v0.36.0.0 ship state:
Bucket dimension is the DOMAIN (slug-prefix). The conviction-weight
bucket dimension would need a new engine method
(engine.batchGetTakeBucketStats per F11) — deferred to v0.37+. Until
then, forecast = historical Brier in this holder's domain.
resolveDomainPrefix() keeps slug-prefix-looking domain hints
('companies/', 'wiki/macro') and falls back to overall for free-form
hints ('macro tech', 'geography'). Hindsight-style structured domain
on takes (CDX-11 mitigation TODO) tightens this in v0.37+.
MIN_BUCKET_N = 5:
Below this sample size, the forecast returns predicted_brier=null with
insufficient_data=true. Template renders "Forecast unavailable: only N
resolved takes at this conviction yet" instead of a noisy estimate.
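The sample-size gate can be sketched as a pure function over already-fetched scorecards. This is an illustration, not the shipped computeForecast: the scorecard and forecast shapes are hypothetical (only predicted_brier and insufficient_data are named above), and the choice not to fall back to the overall bucket when a domain bucket is too small is an assumption.

```typescript
const MIN_BUCKET_N_SKETCH = 5;

// Hypothetical slice of a scorecard: resolved-take count and Brier score.
interface ScorecardLite {
  resolved: number;
  brier: number | null;
}

interface ForecastSketch {
  predicted_brier: number | null;
  insufficient_data: boolean;
  n: number;
  bucket: "domain" | "overall";
}

function computeForecastSketch(
  overall: ScorecardLite,
  domain?: ScorecardLite,
): ForecastSketch {
  // Use the domain bucket when one was resolved, else the overall scorecard.
  const bucket = domain !== undefined ? "domain" : "overall";
  const card = domain ?? overall;
  if (card.resolved < MIN_BUCKET_N_SKETCH || card.brier === null) {
    // Too few resolved takes: report no estimate rather than a noisy one.
    return { predicted_brier: null, insufficient_data: true, n: card.resolved, bucket };
  }
  return { predicted_brier: card.brier, insufficient_data: false, n: card.resolved, bucket };
}
```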
Architecture:
computeForecast(input) — pure function, takes scorecards already
fetched; ideal for tests + reuse across batched paths.
forecastForTake(engine, input) — convenience wrapper, 1-2 engine
round-trips (no domain → 1; with domain → 2).
batchForecast(engine, inputs[]) — memoizes per (holder, domainPrefix);
N inputs collapse to ≤2*unique_holders unique engine calls. Used by
the propose-queue review flow (50 candidates → 1-2 scorecard fetches).
Tests: 14 cases.
computeForecast (4): insufficient_data branch, stable forecast,
overall fallback, MIN_BUCKET_N export.
resolveDomainPrefix (5): undefined/empty/whitespace → undefined;
slug-prefix → kept; free-form → undefined.
forecastForTake (3): 1-call overall, 2-call domain, free-form fallback.
batchForecast (2): cache collapse for repeat queries; different holders
do not collapse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/ E4)
When the grade_takes phase auto-resolves a take as 'incorrect' or 'partial',
optionally write a learning entry to gstack's per-project learnings.jsonl
so other gstack skills (plan-ceo-review, ship, investigate, ...) can pull
it as context when relevant. The brain teaches every other tool about
the user's track record.
Config gate (D5 / CDX-17 mitigation):
`cycle.grade_takes.write_gstack_learnings` defaults FALSE. External
users may not have gstack installed; the gstack-learnings binary API
isn't stable yet. Garry's brain flips it true to opt in.
Quality gate:
Only 'incorrect' and 'partial' verdicts trigger the write. 'correct'
resolutions are noise (we expected the take to hold up — no learning).
'unresolvable' has no canonical column. Defense-in-depth runtime guard
in writeIncorrectResolution() rejects ineligible qualities with
reason='quality_not_eligible' so a caller misuse never surfaces a
malformed learning entry.
Auto-apply only:
Coupling fires only when all three hold: grade_takes auto-applies the
verdict, the verdict is incorrect/partial, and the config flag is enabled.
Manual resolutions via `gbrain takes resolve` intentionally DO NOT
propagate to gstack: manual writes already carry operator intent; the
calibration loop is the noise-prone path that earns coupling.
Namespace:
Every entry's key starts with 'gbrain:calibration:v0.36.0.0:'. Lane D
`gbrain calibration --undo-wave v0.36.0.0` (T17) filters on this prefix
for the optional gstack-scrub step. First active bias tag suffixes the
key (e.g. 'take-42:over-confident-geography') so future analysis can
group learnings by bias pattern.
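The quality gate and key namespace can be sketched as two small pure helpers. Names and signatures are hypothetical; the prefix string and the 'take-42:over-confident-geography' example come from the commit message, while leaving the key without a suffix when no bias tag is active is an assumption.

```typescript
const LEARNING_KEY_PREFIX = "gbrain:calibration:v0.36.0.0:";

// Assumption: these four labels are the full resolution vocabulary.
type ResolutionQuality = "correct" | "incorrect" | "partial" | "unresolvable";

// Only misses and partial misses carry a learning worth writing.
function isEligibleQuality(quality: ResolutionQuality): boolean {
  return quality === "incorrect" || quality === "partial";
}

// Namespaced key, suffixed with the first active bias tag when one exists.
function buildLearningKeySketch(
  takeId: number,
  activeBiasTags: readonly string[],
): string {
  const base = `${LEARNING_KEY_PREFIX}take-${takeId}`;
  return activeBiasTags.length > 0 ? `${base}:${activeBiasTags[0]}` : base;
}

// Prefix filter of the kind an undo step could use to find this wave's entries.
function belongsToWave(key: string): boolean {
  return key.startsWith(LEARNING_KEY_PREFIX);
}
```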
Architecture:
buildLearningEntry — pure. Truncates claim at 200 chars + ellipsis;
emits Pattern: line when activeBiasTags present; defaults confidence
to 0.8 when caller omits it.
writeIncorrectResolution — async wrapper. Honors config gate; honors
quality gate; calls the injected writer (or defaultGstackWriter in
production). Failures are non-fatal: returns
{ written: false, reason: 'write_failed' | 'binary_missing', error }.
The grade_takes phase logs to result.warnings and continues — gstack
coupling failure NEVER aborts a cycle.
defaultGstackWriter — shells out to gstack-learnings-log binary via
execFileSync. Throws GBrainError('GSTACK_BINARY_NOT_FOUND') when the
binary isn't on PATH; writeIncorrectResolution classifies that error
to reason='binary_missing' so the operator sees the install hint
instead of a generic write_failed.
Wired into grade-takes.ts after engine.resolveTake() inside the
auto-apply block. Only fires when shouldApply=true.
Tests: 14 cases.
buildLearningEntry (7): canonical shape, partial vs incorrect wording,
bias-tag suffix, no-tag fallback, claim truncation, default confidence,
no-reasoning omission.
writeIncorrectResolution (7): config gate, quality gate, happy path,
writer-throw graceful degrade, binary-missing classification, async
writer awaited, partial quality writes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the four calibration doctor checks per the eng-review spec.
abandoned_threads:
Counts active high-conviction takes (weight >= 0.7) older than 12 months
that have never been superseded. Signal, not error: always status='ok' with
a count. The hint sends users to `gbrain calibration` for details.
calibration_freshness:
Warns when the active profile is older than 7 days (configurable via the
same env-var pattern other freshness checks use). Cold-brain branch (no
profile yet) returns ok without scolding. Hint points at
`gbrain calibration --regenerate`.
grade_confidence_drift (CDX-11 mitigation):
Surfaces the count of auto-applied grade verdicts. Below 30: returns "need
30+ for drift detection". At/above 30: returns "drift math arrives in
v0.37+". The surface is wired; the actual confidence-vs-accuracy correlation
math is a v0.37+ follow-up once we have 30+ auto-applied verdicts to measure
against. Closes the CDX-11 hole structurally: the operator sees the surface
even before the math is meaningful.
voice_gate_health:
Tracks voice gate failure rate over the last 7 days. <30% fail rate → ok
(template fallback is fine in isolation). >=30% → warn with hint to review
src/core/calibration/voice-gate.ts rubric. Anchors the cross-cutting voice
rule observability story.
All four checks return status='warn' with a diagnostic message on engine
errors: non-blocking, never throws. Matches the existing doctor check
pattern (see checkSyncFreshness for prior art).
Wired into runDoctor after checkRerankerHealth (the v0.35 cluster), in the
canonical block 10 slot.
Tests: 15 cases. 4 per check (happy path, alt-status, engine-throw
diagnostic, plus boundary tests for the freshness staleness gate at exactly
7 days and the grade drift gate at 30 applied verdicts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-time pattern surfacing when a newly-committed high-conviction take
matches an active bias pattern. Conversational nudge text via the
templates module; 14-day cooldown per (take_id, nudge_pattern) via
take_nudge_log to prevent the feedback loop where each cycle re-fires
the same nudge on the same take.
Threshold gates (D16 F3):
- holder match (profile.holder === take.holder)
- conviction-weight > 0.7 (strict greater than)
- take's slug-derived domain hint matches an active bias tag
(takeDomainHint — same heuristic as eval-contradictions/calibration-join.ts
for cross-surface consistency)
Cooldown gate:
Before firing, probe take_nudge_log for (take_id, nudge_pattern) rows
with fired_at >= now() - 14 days. Any hit → silently skip. After firing,
insert a new row with channel='stderr' so the next 14 days are gated.
Feedback-loop prevention:
User hedges a take in response to a nudge (e.g. weight 0.85 → 0.65).
Even though the take's `weight` field changed, the cooldown row for
the over-confident-geography pattern is still there from the original
fire — so the next cycle's evaluateAndFireNudge() silently skips. The
user reset path (gbrain takes nudge --reset N) clears the cooldown to
re-arm.
Output channel (v0.36.0.0 ship state):
STDERR only. Schema's `channel` column already supports multi-channel
(webhook, admin SPA toast); routing those is a v0.37+ follow-up.
Architecture:
evaluateNudgeRule(take, profile) — pure rule check. Returns
{ matched, reason, matchedTag }. No engine call.
checkCooldown(engine, takeId, pattern) — engine probe, returns boolean.
recordNudgeFire(engine, opts) — INSERT into take_nudge_log.
evaluateAndFireNudge(opts) — full pipeline. Returns NudgeDecision.
resetNudgeCooldown(engine, takeId) — DELETE...RETURNING for the CLI.
buildNudgeText delegates to templates.ts nudgeTemplate (D24 mode='nudge'
voice). v0.36.0.0 ship state uses the template directly; LLM-generated
nudge text via the voice gate lands in v0.37+ when we have production
examples to tune from.
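As a rough illustration of the threshold and cooldown gates described above, the pure parts can be sketched like this. The type shapes, the `below_conviction` / `no_matching_tag` / `matched` reason strings, and the `domainHint` helper are assumptions made for the example (the real takeDomainHint heuristic is not shown in this commit message); only `no_profile`, `wrong_holder`, the strict `> 0.7` gate, and the 14-day window come from the text.

```typescript
// Hypothetical shapes for illustration only.
interface Take { id: number; holder: string; weight: number; slug: string; }
interface Profile { holder: string; activeBiasTags: string[]; }
interface RuleResult { matched: boolean; reason: string; matchedTag?: string; }

const CONVICTION_THRESHOLD = 0.7; // strict greater-than, per the commit text
const COOLDOWN_DAYS = 14;

// Assumed stand-in for takeDomainHint: the first slug segment is the domain.
function domainHint(slug: string): string | null {
  const first = slug.split("/")[0];
  return first ? first : null;
}

// Pure rule check: no engine call, mirrors the three threshold gates.
function evaluateNudgeRule(take: Take, profile: Profile | null): RuleResult {
  if (!profile) return { matched: false, reason: "no_profile" };
  if (profile.holder !== take.holder) return { matched: false, reason: "wrong_holder" };
  if (!(take.weight > CONVICTION_THRESHOLD)) return { matched: false, reason: "below_conviction" };
  const hint = domainHint(take.slug);
  // Assumption: a tag matches when it contains the domain hint; first match wins.
  const tag = hint ? profile.activeBiasTags.find((t) => t.includes(hint)) : undefined;
  if (!tag) return { matched: false, reason: "no_matching_tag" };
  return { matched: true, reason: "matched", matchedTag: tag };
}

// Cooldown predicate over already-fetched fired_at timestamps
// for one (take_id, nudge_pattern) pair.
function isCoolingDown(firedAt: Date[], now: Date): boolean {
  const cutoff = now.getTime() - COOLDOWN_DAYS * 24 * 60 * 60 * 1000;
  return firedAt.some((d) => d.getTime() >= cutoff);
}
```

Because the cooldown is keyed on the take and pattern rather than on the take's current weight, hedging a take from 0.85 to 0.65 does not clear the earlier row, which is the feedback-loop property the commit describes.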
Tests: 22 cases.
takeDomainHint (5): companies/people/macro/geography/unrecognized.
evaluateNudgeRule (6): no_profile, wrong_holder, conviction-at-threshold-
is-NOT-eligible (strict >), no matching tag, happy match,
first-match-wins for multiple candidate tags.
checkCooldown (3): true on row hit, false on no row, cutoff date param
verifies the 14-day boundary.
evaluateAndFireNudge (4): happy fire (text contains hush command +
matched tag), cooldown silent skip (no INSERT, no stderr), no_profile
short-circuit, below-conviction short-circuit (no cooldown query fired).
buildNudgeText (2): hush command shape, conviction value embedded.
resetNudgeCooldown (2): returns count, idempotent on zero rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(T14)
Cross-brain calibration profile resolution per the D18 4-rule contract.
Pins all four cross-brain leak surfaces in dedicated unit tests so future
mount features can't silently regress this security model.
D18 semantics (committed):
Rule 1 — LOCAL-FIRST ORDERING.
Query the local brain first. If a profile exists, return it. Do NOT
also query mounts (avoids stale-mount-overrides-fresh-local).
Verified: mountResolver is NOT called when local has a hit.
Rule 2 — MOUNT FALLBACK.
Only when local has no profile AND canReadMounts=true, walk the
mounts in priority order. First match wins. Each mount-side row
must have published=true to be visible (D15 asymmetric opt-in).
Rule 3 — CROSS-BRAIN ATTRIBUTION.
Every returned profile carries source_brain_id + from_mount flag.
Consumers (E1 think rewrite, E3 contradictions, E7 nudge, E6
dashboard) MUST surface this via attributionSuffix() so the user
sees which brain answered.
Rule 4 — SUBAGENT PROHIBITION.
canReadMountsForCtx() classifier returns FALSE for subagent loops
without trusted-workspace allowedSlugPrefixes. Closes the
OAuth-token-to-cross-brain-leak surface — subagents see ONLY their
local-brain results regardless of which holder they query.
Exception: trusted cycle phases (synthesize/patterns) pass
allowedSlugPrefixes set and ARE allowed to read mounts. Pinned in
the classifier test.
Architecture:
queryAcrossBrains(localEngine, opts) — pure orchestrator. Composes
getLatestProfile() from src/commands/calibration.ts. Mount engine
access is via opts.mountResolver — production wires this to the
v0.19+ gbrain mounts subsystem; tests inject a stub returning an
ordered list of mocked engines. Decouples cross-brain LOGIC from
multi-engine PLUMBING.
canReadMountsForCtx(ctx) — pure classifier table. Drives the rule-4
gate. Production callers compose it from OperationContext.
attributionSuffix(result) — pure formatter. Emits the "(from mounted
brain: <id>)" suffix when from_mount=true; empty string when local.
Mandatory for user-visible cross-brain consumers.
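The local-first and mount-fallback rules can be illustrated with a small synchronous sketch. It is not the real queryAcrossBrains: the `BrainSource` interface, the field names on `QueryOpts`, and the synchronous `latestProfile` lookup are assumptions for the example, and the production code presumably awaits engine calls.

```typescript
// Hypothetical shapes; the real engine and profile row types are not shown here.
interface ProfileRow { holder: string; published: boolean; }
interface BrainSource {
  brainId: string;
  latestProfile(holder: string): ProfileRow | null;
}
interface CrossBrainResult {
  profile: ProfileRow;
  source_brain_id: string;
  from_mount: boolean;
}
interface QueryOpts {
  holder: string;
  canReadMounts: boolean;             // rule 4: false for untrusted subagents
  mountResolver: () => BrainSource[]; // mounts in priority order
}

function queryAcrossBrains(local: BrainSource, opts: QueryOpts): CrossBrainResult | null {
  // Rule 1: local-first. A local hit returns without touching the mounts.
  const localHit = local.latestProfile(opts.holder);
  if (localHit) {
    return { profile: localHit, source_brain_id: local.brainId, from_mount: false };
  }
  // Rule 4: callers without mount access only ever see local results.
  if (!opts.canReadMounts) return null;
  // Rule 2: walk mounts in priority order; only published rows are visible.
  for (const mount of opts.mountResolver()) {
    const row = mount.latestProfile(opts.holder);
    if (row && row.published) {
      // Rule 3: attribution travels with the result.
      return { profile: row, source_brain_id: mount.brainId, from_mount: true };
    }
  }
  return null;
}

// Pure formatter: empty for local answers, a visible suffix for mounted ones.
function attributionSuffix(result: CrossBrainResult): string {
  return result.from_mount ? `(from mounted brain: ${result.source_brain_id})` : "";
}
```

Injecting `mountResolver` as a function is what lets a test assert that it is never called on a local hit.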
Tests: 15 cases pinned to the 4 D18 rules + 4 supplementary structural
checks.
D18-1: published=false profile on mount stays hidden.
D18-2/3: subagent context cannot fall back to mounts (2 cases — null
on local-empty + canReadMounts=false, local hit still returned).
D18-4: attribution surfaces source_brain_id (3 cases — mount answer
flag, local answer flag, attributionSuffix formatter).
Rule 1 local-first ordering (2 cases — mountResolver NOT called on
local hit, IS called on local empty).
Mount priority order (3 cases — first published=true wins, all
published=false returns null, no mounts configured returns null
without throwing).
canReadMountsForCtx classifier (4 cases — local CLI true, MCP
non-subagent true, subagent without trusted-workspace false,
subagent WITH trusted-workspace true).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mp (T15)
Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review,
the approved variant-B (Linear calm clarity) layout: single-column flow,
generous whitespace, ONE big sparkline as hero, then patterns, then
domain bars, then abandoned threads.
D23 server-rendered SVG architecture:
src/core/calibration/svg-renderer.ts — pure functions. data → SVG
string. No DOM, no React, no chart library dep. Inlines the admin
design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is
visually consistent with the rest of the admin SPA.
Four chart renderers:
- renderBrierTrend({ series }) — sparkline w/ baseline reference
at 0.25 (always-50% baseline)
- renderDomainBars({ bars }) — horizontal accuracy bars per domain
- renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link
per row, points at /admin/calibration/revisit/<takeId>
- renderPatternStatementsCard(statements) — D29/TD3 clickable
drill-down links per row, point at /admin/calibration/pattern/<i>
XSS posture: all caller-controlled strings pass through escapeXml().
Numeric inputs are .toFixed()-coerced. Admin SPA renders via
dangerouslySetInnerHTML inside a TrustedSVG wrapper component;
endpoint is gated by requireAdmin middleware.
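A minimal sketch of the escaping posture described above, assuming a conventional five-entity `escapeXml` (the actual implementation in svg-renderer.ts is not shown in this commit message). `renderBarLabel` is a hypothetical helper that combines the two rules: strings are escaped, numbers are clamped and `.toFixed()`-coerced before interpolation.

```typescript
// Assumed implementation: escapes the five predefined XML entities.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeXml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: caller-controlled label is escaped, the numeric
// accuracy is clamped to [0, 1] and coerced to a fixed-format string.
function renderBarLabel(label: string, accuracy: number): string {
  const clamped = Math.min(1, Math.max(0, accuracy));
  const pct = (clamped * 100).toFixed(0);
  return `<text>${escapeXml(label)}: ${pct}%</text>`;
}
```

With this shape, a label such as `<script>` reaches the SVG as inert text, and an out-of-range accuracy cannot inject arbitrary markup through the number slot.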
/admin/api/calibration/profile — returns the active profile row as JSON.
/admin/api/calibration/charts/:type — returns image/svg+xml markup
for type ∈ {brier-trend, domain-bars, pattern-statements,
abandoned-threads}. Cache-Control: private, max-age=60.
brier-trend currently renders a single-point series from the active
profile (the time-series view across calibration_profiles.generated_at
history is a v0.37 follow-up once we have multiple snapshots).
abandoned-threads pulls the top 5 abandoned rows via the same SQL the
doctor check uses.
CalibrationPage React component (admin/src/pages/Calibration.tsx):
Fetches profile + 4 charts. Loading / error / cold-brain states all
handled. Layout includes the audit annotations (partial-grade badge,
voice-gate-fell-back-to-template badge) per the approved mockup.
TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG
surface only.
App.tsx nav: added 'calibration' page route + sidebar nav item, hash
routing extended to support #calibration.
TD2 contrast bump:
admin/src/index.css --text-muted: #555 → #777. Old value was contrast
4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is
~5.5, passes AA. Improvement is global across Dashboard, Agents,
RequestLog, and the new Calibration tab — single-line CSS change with
~10x the impact.
admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed.
Tests: 19 cases in test/svg-renderer.test.ts.
escapeXml (1): canonical entities.
renderBrierTrend (6): empty state, polyline for 2+ points, clamp
beyond yMax, design tokens inlined, XSS safety on date strings,
text-anchor end on right label.
renderDomainBars (4): empty state, label/accuracy/n rendering,
out-of-range accuracy clamp, XSS safety on labels.
renderAbandonedThreadsCard (4): empty state, row rendering with
revisit link, claim truncation at 70 chars, custom revisitHref override.
renderPatternStatementsCard (4): empty state, anchor count matches
statement count, XSS safety, custom drillHref override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatter that turns a CalibrationProfileRow + optional abandoned-
threads list into the conversational block the morning pulse will surface:
Calibration this quarter:
Brier 0.18 (solid).
Right on early-stage tactics, late on macro by 18 months.
Over-confident on team execution; under-calibrated on regulatory risk.
Threads you opened and never came back to:
· AI search platform differentiation (17 months silent)
· International expansion playbook (12 months silent)
Cold-brain branch: returns empty string when no profile or < 5 resolved
takes. Caller decides whether to render the block; cold-brain absence
is the cleanest non-event.
Brier trend note maps the absolute value to conversational copy:
<= 0.10 → "(strong calibration)"
<= 0.20 → "(solid)"
<= 0.25 → "(near baseline)"
> 0.25 → "(worse than always-50% baseline — review your high-conviction calls)"
v0.36.0.0 ship state has only the current profile snapshot. The
"was 0.22 90d ago — improving" comparison shape arrives when we
accumulate generated_at history across multiple cycles.
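The bracket boundaries above, and the reason 0.25 is the "always-50%" baseline, can be sketched as follows. `brierScore` and `brierBracket` are hypothetical names, and the bracket labels stand in for the conversational copy listed above; the thresholds themselves come from the commit text.

```typescript
// Hypothetical names; thresholds are the ones listed in the commit message.
type BrierBracket = "strong" | "solid" | "near_baseline" | "worse_than_baseline";

interface ResolvedForecast { forecast: number; outcome: 0 | 1; }

// Brier score: mean squared difference between forecast probability and outcome.
function brierScore(resolved: ResolvedForecast[]): number {
  if (resolved.length === 0) throw new Error("need at least one resolved forecast");
  const total = resolved.reduce((sum, r) => sum + (r.forecast - r.outcome) ** 2, 0);
  return total / resolved.length;
}

// Maps an absolute Brier value to one of the four brackets.
function brierBracket(brier: number): BrierBracket {
  if (brier <= 0.10) return "strong";
  if (brier <= 0.20) return "solid";
  if (brier <= 0.25) return "near_baseline";
  return "worse_than_baseline";
}
```

A forecaster who always says 50% scores (0.5 - outcome)^2 = 0.25 on every take regardless of the outcome, which is why anything above 0.25 is flagged as worse than the baseline.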
R3 regression posture:
This module is the FORMATTER only. Wiring into `gbrain recall`'s text
output is intentionally NOT in this commit — runRecall's surface
stays unchanged. v0.37 wires it under --show-calibration (opt-in
initially, default-on later). For now the formatter is callable from
the admin tab + custom CLI scripts that want it.
Architecture:
buildRecallCalibrationFooter(opts) — pure. opts.profile required,
opts.abandonedThreads optional, opts.threadColumnWidth defaults to 50.
Caps at 4 patterns + 5 abandoned threads to keep the footer scannable.
Truncates long abandoned-thread claim text to fit the column width with
a trailing ellipsis.
Tests: 14 cases.
Cold-brain branch (3): null profile, < 5 resolved, zero resolved.
Happy path (7): header + Brier + patterns, trend note ranges (4
brackets), null brier omits the Brier line but keeps header, caps at
4 patterns.
Abandoned threads (4): omit section when none, emit when present,
cap at 5, truncate long claim with column-width override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the undo-wave reversal flow. Every new row written by the
v0.36.0.0 calibration wave carries wave_version='v0.36.0.0' so a precise
revert is possible without touching pre-wave data.
CLI surface (replaces the v0.36.0.0 ship-state placeholder):
gbrain calibration --undo-wave v0.36.0.0 [--dry-run] [--scrub-gstack] [--json]
Reversal scope (4 steps):
Step 1: UNSET takes.resolved_* columns for takes auto-applied by this
wave. Identifies wave-applied takes via take_grade_cache.applied=true +
wave_version match. Cross-checks resolved_by='gbrain:grade_takes' to
ensure we're not un-resolving a take a manual `gbrain takes resolve`
override has since claimed. Manual resolutions persist; only auto-grade
resolutions revert.
Step 1b: Mark take_grade_cache rows applied=false post-undo so the audit
trail shows they WERE applied but this wave was reverted. The CDX-11
confidence-drift check filters on applied=true and gets a cleaner sample
post-undo.
Step 2: DELETE FROM calibration_profiles WHERE wave_version = ?.
Step 3: DELETE FROM take_nudge_log WHERE wave_version = ?.
Step 4: Optional gstack-learnings-prune via the binary, scoped to the
GSTACK_LEARNING_NAMESPACE prefix. Opt-in via --scrub-gstack.
Best-effort: binary-missing or failure logs a warning + suggests the
manual command; the rest of the undo still succeeded.
Dry-run posture:
--dry-run computes the counts via SELECT COUNT(*) shapes without
emitting any UPDATE or DELETE. Same UndoWaveResult shape returned so
operator sees exactly what would be reverted before committing.
--dry-run intentionally skips the gstack scrub (filesystem write) too;
ship-state safety call.
Idempotency:
Re-running --undo-wave on a brain that's already reverted is a no-op.
Each query filters on wave_version; no matching rows → zero counts.
Architecture:
undoWave(engine, opts): async, returns UndoWaveResult. Pure data layer;
no stderr writes, no process exits. CLI dispatch in
src/commands/calibration.ts handles printing.
v0.36.0.0 ship state runs steps 1-3 sequentially (no transaction).
Partial reversal is recoverable via re-run since each step is idempotent
on wave_version match. A future enhancement (v0.37+) can wrap in
engine.transaction once that surface lands in BrainEngine.
Tests: 8 cases in test/undo-wave.test.ts.
Dry-run posture (1): counts emitted, NO UPDATE/DELETE SQL fired.
Happy path (3): all 4 steps execute, resolved_by filter scopes UPDATE to
wave-applied resolutions, custom resolvedByLabel honored.
Empty wave (2): zero counts when no matching rows, idempotent re-run.
Wave-version parameter threading (2): supplied version threads through
all queries, different wave versions don't collide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Structural answer to CDX-18 (anti-bias rewrite may make advice worse).
We don't have to guess whether calibration helps — we measure.
Architecture:
runAbTrial(input) — calls thinkRunner TWICE on the same question
(baseline + --with-calibration), surfaces both answers to a
preferenceResolver, persists the trial to think_ab_results.
buildAbReport(engine, { days }) — aggregates the table over the last
N days (default 30). Computes win counts, ties, neither, and a
with_calibration_win_rate over DECISIVE trials only (excludes
neither/tie). Flags calibration_net_negative when n >= 20 AND win
rate < 45%.
formatAbReport(report, days) — pretty-prints for stdout; emits the
calibration_net_negative warning block when triggered.
CLI:
gbrain calibration ab-report [--days N] [--json]
Reads the table, prints the breakdown. Replaces the v0.36.0.0
ship-state placeholder in src/commands/calibration.ts.
gbrain think --ab "<question>"
Wires into runAbTrial via the dispatch in src/commands/think.ts —
follow-up commit. This commit lands the harness layer + schema +
report surface; the --ab flag itself flips on in a one-line wiring
commit when the runRecall path is ready.
Schema (migration v72 / think_ab_results):
source_id, wave_version, ran_at, question, baseline_answer,
with_calibration_answer, preferred (CHECK in {baseline,
with_calibration, neither, tie}), model_id, notes.
CHECK constraint enforces preferred enum. Default wave_version
'v0.36.0.0' stamped so --undo-wave can scrub these too.
Index on (source_id, ran_at DESC) supports the report's
"last N days" query.
schema.sql + pglite-schema.ts both updated for fresh-install parity.
schema-embedded.ts regenerated via build:schema.
calibration_net_negative threshold (D19):
Triggers when:
- decisive_trials (baseline + with_calibration) >= 20
- with_calibration_win_rate < 0.45 (NOT <= — exact 45% is OK)
Small-sample guard (n < 20) prevents the warning from firing on
early data with sampling noise. Confidence-flat threshold (no Wilson
CI yet) keeps the math simple; v0.37+ adds CI bounds.
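The D19 threshold above can be sketched as a pure aggregation over the `preferred` values. This is an illustration under assumptions, not the real buildAbReport: `summarizeTrials` and the `AbSummary` field names are hypothetical, and returning a null win rate when there are no decisive trials is a choice made for the example.

```typescript
type Preferred = "baseline" | "with_calibration" | "neither" | "tie";

// Hypothetical report shape for illustration.
interface AbSummary {
  baselineWins: number;
  withCalibrationWins: number;
  ties: number;
  neither: number;
  decisiveTrials: number;
  withCalibrationWinRate: number | null; // null when no decisive trials exist
  calibrationNetNegative: boolean;
}

const MIN_DECISIVE_TRIALS = 20; // small-sample guard
const NET_NEGATIVE_RATE = 0.45; // strict less-than: exactly 45% does not trigger

function summarizeTrials(prefs: Preferred[]): AbSummary {
  const count = (p: Preferred) => prefs.filter((x) => x === p).length;
  const baselineWins = count("baseline");
  const withCalibrationWins = count("with_calibration");
  // Decisive trials exclude both "neither" and "tie".
  const decisiveTrials = baselineWins + withCalibrationWins;
  const withCalibrationWinRate =
    decisiveTrials === 0 ? null : withCalibrationWins / decisiveTrials;
  const calibrationNetNegative =
    withCalibrationWinRate !== null &&
    decisiveTrials >= MIN_DECISIVE_TRIALS &&
    withCalibrationWinRate < NET_NEGATIVE_RATE;
  return {
    baselineWins,
    withCalibrationWins,
    ties: count("tie"),
    neither: count("neither"),
    decisiveTrials,
    withCalibrationWinRate,
    calibrationNetNegative,
  };
}
```

For example, 8 calibrated wins against 12 baseline wins gives a 40% rate over 20 decisive trials and trips the flag, while 9 against 11 sits exactly on 45% and does not.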
Tests: 12 cases in test/think-ab.test.ts.
runAbTrial (4): both runner calls fire, preferenceResolver receives
both answers, INSERT row params shape, throws when thinkRunner
missing.
buildAbReport (5): zero trials, aggregation, net_negative trigger at
n>=20 + win<45%, no trigger at n<20 (small-sample guard), no
trigger at exact 45% boundary.
formatAbReport (3): zero-state message, decisive-trials breakdown,
net_negative warning block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TD4 / D30)
TD3 (D29) — clickable pattern drill-down endpoint:
GET /admin/api/calibration/pattern/:id (requireAdmin)
Returns the pattern statement at index `id` plus the top 25 resolved
takes for the holder, sorted by weight desc. v0.36.0.0 ship-state
approximation: surfaces broad provenance evidence (top resolved
takes). v0.37+ stores per-pattern source_take_ids[] on a
calibration_profile_patterns join table so the drill-down shows the
EXACT takes that drove the pattern.
Surfaces a `provenance_note` field in the response so the operator
sees the v0.36.0.0-vs-v0.37 fidelity boundary inline.
The admin SPA's renderPatternStatementsCard SVG already emits anchor
tags pointing at /admin/calibration/pattern/<i> (T15 ship state).
This route makes those anchors clickable — closes the trust loop that
was the rationale for D29 ("pattern statements without their evidence
are dressed-up LLM hallucinations").
TD4 (D30) — `gbrain takes revisit <slug>` editor-open action:
Adds the `revisit` subcommand to gbrain takes. Opens $EDITOR (falling
back to vi) on the source markdown file for the slug. Appends a
`<!-- gbrain:revisit -->` cursor marker at the bottom of the page on
first invocation so the editor opens with intent visible.
Reads sync.repo_path from config to locate the brain repo. Refuses to
proceed with a clear error when the repo isn't configured or the page
doesn't exist.
spawnSync with stdio:'inherit' so the editor takes the terminal. Exit
status surfaced on failure.
The SVG renderer's revisit-now anchor for each abandoned thread row
emits /admin/calibration/revisit/<takeId>. A small route handler that
resolves take_id → page_slug then dispatches `gbrain takes revisit`
via spawn is a v0.37 follow-up — the CLI command exists now so
developers can wire it directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a
canonical DESIGN.md at the repo root. This is the calibration target for
/plan-design-review and /design-review going forward: when a question is
"does this UI fit the system?", the answer is here.
Captures the system as it stands today:
Voice (5 surfaces, all routed through gateVoice() with mode-specific
rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption,
morning_pulse. Friend-not-doctor; concrete data over abstract metrics;
no preachy / clinical / corporate language.
Color tokens: 10 CSS variables from admin/src/index.css inlined into the
SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme is the
only theme (admin is an operator tool). WCAG contrast documented per
token; TD2's #555 → #777 bump on --text-muted noted.
Typography: Inter for UI, JetBrains Mono for numbers/slugs/data. Type
scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet
formalized.
Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density.
Layout: sidebar 200px, max content 720px (text) / 960px (tables). No
3-column feature grids, no icons in colored circles, no decorative blobs.
Charts: server-rendered SVG via pure functions in
src/core/calibration/svg-renderer.ts. XSS posture documented:
server-side escapeXml on caller-controlled strings, numeric inputs
.toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper.
Interaction patterns: keyboard nav required (J/K/space/u/q on the
propose-queue), loading/empty/error states ARE features.
v0.37+ roadmap: type scale formalization, animation tokens, component
library extraction. Light mode explicitly NOT planned.
The doc is a living target, not a frozen spec. Major changes route
through /plan-design-review per the existing review chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T19 — synthetic corpus scaffold for extract-takes prompt tuning.
test/fixtures/calibration/extract-takes-corpus/ — 5 representative
pages across 4 genres (essay, people, companies, meetings/decisions).
v0.36.0.0 ships a SMALL representative corpus as proof of structure;
the full 50-page training set + 10-page holdout gets generated by the
operator via `gbrain calibration build-corpus` (v0.37 follow-up
subcommand) or by hand with the privacy guard catching violations
either way.
Privacy contract per D13': every page is SYNTHETIC. None of the
names/companies/funds/deals/events refer to anything real. Placeholder
names per CLAUDE.md: alice-example, charlie-example, acme-example,
widget-co, fund-a/b/c, acme-seed, widget-series-a, meetings/2026-04-03.
test/fixtures/calibration/README.md spells out the privacy contract,
generation flow, and what the corpus is (stable regression set for
the extract-takes prompt) vs is not (real anything).
T20 — privacy CI guard (CDX-14 mitigation).
scripts/check-synthetic-corpus-privacy.sh greps the corpus for:
1. Explicit dollar amounts ($50M, $1.2B etc) — would suggest the
page memorized a real round size.
2. Out-of-range year references (informational only for v0.36.0.0;
deferred to a manual review checklist).
3. Pages that reference ZERO placeholder names — suggests the page
might be referring to real entities. Essay-genre fixtures
exempt (they're anonymized PG-style writing by design).
Wired into `bun run verify` (CI gate) so contributors can't accidentally
land a synthetic fixture that leaks real-world specificity. The intent
is fail-fast on accidental leakage; the operator can update the
allowlist if a generic dollar amount is intentional.
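The two enforced checks (dollar amounts and the zero-placeholder rule) can be illustrated in TypeScript. The real guard is the shell script named above and its exact patterns are not shown here, so the regex, the `Violation` shape, the `checkPage` name, and passing the genre as a parameter are all assumptions for this sketch; the placeholder names are the ones listed in the commit text.

```typescript
// Hypothetical result shape for illustration.
interface Violation {
  page: string;
  kind: "dollar_amount" | "no_placeholder";
  detail: string;
}

// Placeholder names taken from the commit message.
const PLACEHOLDERS = [
  "alice-example", "charlie-example", "acme-example",
  "widget-co", "fund-a", "fund-b", "fund-c",
];

// Assumed pattern: a dollar sign followed by digits, e.g. $50M or $1.2B.
const DOLLAR_AMOUNT = /\$\d+(?:\.\d+)?[KMB]?/g;

function checkPage(page: string, genre: string, text: string): Violation[] {
  const violations: Violation[] = [];
  // Check 1: explicit dollar amounts suggest a memorized real round size.
  for (const hit of text.match(DOLLAR_AMOUNT) ?? []) {
    violations.push({ page, kind: "dollar_amount", detail: hit });
  }
  // Check 3: pages with zero placeholder names may refer to real entities.
  // Essay-genre fixtures are exempt by design.
  const hasPlaceholder = PLACEHOLDERS.some((name) => text.includes(name));
  if (genre !== "essay" && !hasPlaceholder) {
    violations.push({ page, kind: "no_placeholder", detail: "no placeholder names found" });
  }
  return violations;
}
```

A CI wrapper would run this over every fixture and exit non-zero when any violation is returned, which gives the fail-fast behavior the commit describes.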
Closes CDX-14: 'CC reads real brain pages locally, writes nothing
still risks privacy if any generated synthetic fixture memorizes
structure-specific facts. Placeholder names are not enough.'
The corpus shipped here is intentionally small but covers the four
core gbrain page genres (essay, people, companies, meetings/decisions).
The v0.37 corpus-build subcommand will fan out to 50 with the operator
spot-checking + the CI guard enforcing the privacy contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per /plan-eng-review D26 IRON RULE: regressions get added to the test
suite as critical requirements, no AskUserQuestion needed. Pins five
regressions identified during the v0.36.0.0 wave's coverage diagram:
R1: think baseline UNCHANGED when --with-calibration absent.
Covered structurally by test/think-with-calibration.test.ts plus
assertion-pinned in this file (default user message: question
first, then retrieval; system prompt: no anti-bias section).
R2: contradictions probe output UNCHANGED when no calibration profile.
Covered structurally by test/eval-contradictions-calibration-join.test.ts
plus pinned here (null profile → null tag, byte-identical to v0.32.6).
R3: takes resolution flow works when grade_takes phase disabled.
Pinned import-surface coupling: takes-resolution.ts has zero
dependency on grade_takes module. If a future refactor accidentally
couples them, this test fails to compile.
R4: search/list_pages/get_page work identically through new source_id paths.
Marker test referencing existing v0.34.1 source-isolation suite at
test/source-isolation-pglite.test.ts. v0.36.0.0 does NOT modify
those code paths; the existing tests catch any accidental coupling.
R5: existing search modes (conservative/balanced/tokenmax) unaffected.
Marker test referencing existing test/search-mode.test.ts. The
calibration code DOES NOT IMPORT from src/core/search/mode.ts.
Plus an inventory test that confirms all 5 regressions have an
'addressed' status — fail-loud if a future contributor removes a
guard without updating the inventory.
7 tests total. Pure functions, no engine, hermetic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n skill
CHANGELOG entry: the user-facing release notes. Leads with the headline
("the brain learns how you tend to be wrong, then argues against your
blind spots on every advice call"), 5 'what you can now do' bullets in
GStack voice, itemized changes by lane, and the 'To take advantage of
v0.36.0.0' upgrade checklist per the CLAUDE.md required-block contract.
CLAUDE.md anchors: new 'v0.36.0.0 Hindsight calibration wave (key files
cluster)' block inserted before the v0.31.1 thin-client section. 23 new
files / extensions annotated with one-paragraph descriptions each,
linking back to the convention skill at skills/conventions/calibration.md
for the agent-facing rules.
skills/conventions/calibration.md: the agent-facing convention skill.
Tells future contributors which calibration touchpoint applies to
their task — voice gate? BaseCyclePhase? source-scope thread? doctor
warning? cross-brain query rules? auto-resolve threshold posture? Test
seam patterns. Bug class to avoid (the v0.34.1 source-isolation leak
shape).
Version trio (per CLAUDE.md mandatory audit):
VERSION: 0.36.0.0
package.json: 0.36.0.0
CHANGELOG: ## [0.36.0.0] - 2026-05-17
llms.txt + llms-full.txt regenerated via `bun run build:llms` after
the CLAUDE.md edit (per the explicit CLAUDE.md mandate "Any CLAUDE.md
edit MUST be followed by `bun run build:llms`"). The `test/build-llms.test.ts`
guard runs in CI shard 1; the committed bundles are checked against
fresh generator output.
bun run verify is clean. typecheck clean. Privacy CI guard passes
(0 violations across 6 corpus pages). All ready for /ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nCycle (T-fix)
The three new v0.36.0.0 phases were declared in CyclePhase / ALL_PHASES /
NEEDS_LOCK_PHASES but the runCycle orchestrator never dispatched them.
ALL_PHASES advertised them, gbrain dream --phase propose_takes accepted
them, but `gbrain dream` (default) silently skipped all three.
Adds a single dispatch block between consolidate and embed that:
- builds an OperationContext on the fly (trusted-workspace caller,
remote: false, sourceId resolved via the same helper sync uses)
- dispatches the three phases in the order ALL_PHASES declares
- records the same skipped-phase shape (no_database) when engine is null
Pinned by test/core/cycle.serial.test.ts "default: all 6 phases run in
order" which was already failing against ALL_PHASES (the test name lags
the actual phase count; left as-is since renaming churns history).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on May 18, 2026
Three open PRs were claiming v0.36.0.0 (#1130 skillpack, #1139 hindsight,
#1136 this PR). Ship-aware queue allocator says this branch lands at
v0.36.2.0.
Trio audit:
VERSION 0.36.2.0
package.json 0.36.2.0
CHANGELOG ## [0.36.2.0] - 2026-05-17
Updates: VERSION, package.json, CHANGELOG header + body refs, README
"New default in v0.36.2.0" announcement + credit line,
skills/migrations/v0.36.0.0.md renamed to v0.36.2.0.md with frontmatter +
body refs updated. llms-full.txt regenerated.
Master shipped v0.35.6.0 (floor-ratio search gate) and v0.35.7.0
(typed-claim trajectory + founder scorecard) ahead of this branch.
Resolving the merge requires:
1. VERSION trio (VERSION, package.json, CHANGELOG.md top entry) bumped to
0.36.1.0 to claim the next slot after master's 0.35.7.0.
2. Migration v67 collision: master shipped facts_typed_claim_columns as
v67. This branch's six calibration migrations renumber from v67-v72 to
v68-v73. Master's v67 stays unchanged.
3. wave_version literal renamed 'v0.36.0.0' -> 'v0.36.1.0' across:
migrate.ts DEFAULT clauses, pglite-schema.ts, schema-embedded.ts
(regenerated), schema.sql, test fixtures, undo-wave logic, and every doc
string referencing the wave. `gbrain calibration --undo-wave v0.36.1.0`
is the new operator-facing reversal path.
4. test/regressions/v0.36.0.0-iron-rule.test.ts ->
v0.36.1.0-iron-rule.test.ts so the regression-inventory filename tracks
the actual release.
5. llms-full.txt + llms.txt regenerated against the updated docs.
Verification:
- bun run verify: green
- bun run test: 7132 pass / 0 fail / 0 skip
- Targeted migrate + bootstrap + cycle + undo-wave + nudge + cli tests:
207 pass / 0 fail
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.35.8.0 (autopilot phantom-page redirect inside extract_facts, #1138) ahead of this branch. VERSION trio kept at 0.36.1.0 since this branch's slot is already higher than master's new tag. CHANGELOG carries both v0.36.1.0 (top) and v0.35.8.0 entries; llms-full.txt regenerated. src/core/cycle.ts and src/commands/doctor.ts auto-merged cleanly (both branches added separate sections). Test gate green: 195/195 on cycle.serial + migrate + doctor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.36.0.0 (skillpack scaffold / reference / harvest;
retired managed-block install, #1130), a naming overlap with this
branch's slot. This branch's slot stays 0.36.1.0 (already higher);
master's v0.36.0.0 entry preserved in CHANGELOG.
VERSION trio resolved: my 0.36.1.0 wins over master's 0.36.0.0 on
VERSION, package.json, and CHANGELOG.md top entry. llms-full.txt
regenerated. All other files auto-merged cleanly (CLAUDE.md, README.md,
skills/RESOLVER.md, etc).
Verification:
- bun run typecheck: green
- bun install: lockfile up to date
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(T19)
Adds 8 new synthetic pages modeled on the genre mix observed in the
real brain (concepts-with-timeline, meeting-notes, daily-journal,
people-pages, essays). Companion .gradeable-claims.json files carry
hand-labeled answer keys — what a tuned propose_takes prompt SHOULD
extract per page. Closes the F1 gate gap from the plan's T19/D19:
Training corpus (test/fixtures/calibration/extract-takes-corpus/):
+ concept-startup-market-dynamics.md (10 claims)
+ meeting-2026-04-10-fundraise-fund-a.md (6 claims)
+ daily-2026-04-15.md (5 claims)
Blind holdout (test/fixtures/calibration/holdout/):
+ concept-founder-execution.md (6 claims, F1 >= 0.80)
+ daily-2026-04-18.md (4 claims, F1 >= 0.80)
+ meeting-2026-04-17-hiring-charlie.md (5 claims, F1 >= 0.80)
+ essay-on-conviction.md (7 claims, F1 >= 0.80)
+ people-bob-example.md (5 claims, F1 >= 0.80)
Privacy:
- No real-brain content read into any committed artifact. Pages
written from scratch using the canonical placeholder set
(alice-example, charlie-example, bob-example, acme-example,
widget-co, fund-a/b/c). Real-name grep confirms zero leakage:
wintermute, garrytan, paul-graham, sam-altman, etc. → 0 hits.
- scripts/check-synthetic-corpus-privacy.sh passes: 0 violations
across 14 pages (was 6).
Genre fidelity:
- concept-with-timeline pages mirror the dated-assertion structure
real brain uses (verb framing varies: "argues / predicts / I
think / I bet / strong conviction / moderate conviction").
- meeting-notes pages carry both prose claims (extracted via
hedging language) and explicit ## Takes sections.
- daily-journal pages test probabilistic framing ("75/25 in favor",
"call it ~0.5") and self-tagged conviction values.
- essay-on-conviction is the meta-page that names the author's
own bias patterns — primary signal for calibration_profile.
- people pages test claim-about-third-party extraction.
Each JSON ground-truth lists per-claim:
- claim_text + kind (prediction|judgment|bet) + domain
- conviction (0..1)
- since_date
- rationale (why this claim is gradeable + how a tuned prompt
should infer conviction from the prose)
This is the corpus that gates the T19 prompt-tune iteration:
- F1 >= 0.85 on training (10+6+5 = 21 claims across 3 pages
plus the existing 5 fixtures already shipped)
- F1 >= 0.80 on holdout (27 claims across 5 pages)
Plan reference: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md
Privacy gate: scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5 F1 0.92+)

The v0.36.1.0 ship state shipped propose_takes with a stub prompt that the docs flagged as "tune via T19 corpus build before relying on propose_takes in production." T19's corpus was built in commit 69a71c9 (14 synthetic pages + 48 hand-labeled claims). The matching gbrain-evals cat15 runner validates extraction quality against that corpus.

This commit back-ports the tuned prompt validated by cat15's first live run:
  training avg F1: 0.952 (target 0.85, +10 points)
  holdout avg F1: 0.922 (target 0.80, +12 points)
  train-holdout gap: 0.03 (well below 0.10 overfitting threshold)
  8/8 probes pass their individual F1 targets

Per-genre F1 floor: 0.80 (people-pages, the hardest genre). Concept-with-timeline and meeting-notes genres scored at 1.00 on holdout pages.

The tuned prompt design changes vs the stub:
- Worked example list seeds the "gradeable claim" notion so the model doesn't drift into pure-fact extraction.
- NOT-gradeable list catches the most common over-extraction modes (pure facts, direct quotes, restatements).
- Conviction inference rules anchored to specific hedging language so the model produces consistent weight values.
- kind enum narrowed to 'prediction' | 'judgment' | 'bet'; the v1 stub's 4-tag enum bled into noise classification on the corpus.

PROPOSE_TAKES_PROMPT_VERSION bumped 'v0.36.1.0-stub' → 'v0.36.1.0-tuned-cat15'. The bump invalidates the take_proposals idempotency cache, so existing proposal rows stay as audit history but the next cycle re-extracts against the new prompt. That is exactly the design contract this version field is for.

Re-tuning protocol: run cat15 in gbrain-evals against the fixtures BEFORE bumping the version string. The train-holdout gap should stay < 0.10. If a future tune drops below the cat15 gate, revert.
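The re-tuning protocol above can be sketched as a single check. The targets (0.85, 0.80), the 0.10 gap limit, and the reported averages come from the message above; the `RunSummary` shape and the `passesRetuneGate` name are hypothetical, and the actual cat15 runner may well apply its gates per probe rather than on averages.

```typescript
// Hypothetical summary of one eval run (averages only).
interface RunSummary {
  trainingAvgF1: number;
  holdoutAvgF1: number;
}

const TRAINING_TARGET = 0.85;
const HOLDOUT_TARGET = 0.8;
const MAX_TRAIN_HOLDOUT_GAP = 0.1;

interface GateResult {
  ok: boolean;
  gap: number;
  reasons: string[];
}

// Decide whether a re-tuned prompt is safe to ship (ok) or should be reverted.
function passesRetuneGate(run: RunSummary): GateResult {
  const gap = run.trainingAvgF1 - run.holdoutAvgF1;
  const reasons: string[] = [];
  if (run.trainingAvgF1 < TRAINING_TARGET) reasons.push("training F1 below target");
  if (run.holdoutAvgF1 < HOLDOUT_TARGET) reasons.push("holdout F1 below target");
  if (gap >= MAX_TRAIN_HOLDOUT_GAP) reasons.push("train-holdout gap suggests overfitting");
  return { ok: reasons.length === 0, gap, reasons };
}

// Numbers reported for the first live run in the message above.
const firstRun = passesRetuneGate({ trainingAvgF1: 0.952, holdoutAvgF1: 0.922 });
console.log(firstRun.ok, firstRun.gap.toFixed(2));

// An invented regressed run, shown only to exercise the gap check.
const regressed = passesRetuneGate({ trainingAvgF1: 0.95, holdoutAvgF1: 0.82 });
console.log(regressed.ok, regressed.reasons);
```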
Source of evidence:
- cat15 runner: ~/git/gbrain-evals/eval/runner/cat15-propose-takes.ts
- Fixture corpus: test/fixtures/calibration/ (this repo, commit 69a71c9)
- Live run dumps: ~/git/gbrain-evals/eval/reports/cat15-propose-takes/*.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the "Validated by published benchmarks" subsection to the v0.36.1.0 CHANGELOG entry and a "Calibration loop" section to the README's "Receipts on the evals" surface. Both link to the new benchmark report at gbrain-evals/docs/benchmarks/2026-05-18-brainbench-cat14-cat15-calibration.md.

CHANGELOG: also updates the propose_takes bullet to reflect that the v0.36.1.0 ship state now includes the tuned 'v0.36.1.0-tuned-cat15' prompt (back-ported in 04dbab4), not the v1 stub the original entry described.

README: adds a Calibration loop entry to the receipts table sitting between source-aware ranking and prompt compression. Frames the cat14 + cat15 numbers as "first published benchmark for AI memory systems that reason about user track records". This is honest SOTA framing, since Hindsight introduced the concept without quantified evaluation.

llms.txt + llms-full.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 links to gbrain-evals/blob/master/docs/benchmarks/ were broken: the gbrain-evals repo uses 'main' as its default branch, not 'master'. Surfaced when I checked that the new cat14/cat15 link resolved post-PR-9 merge. Turned out 4 pre-existing links to longmemeval, brainbench-v0.20, brainbench-cat13b-source-swamp, and comparison-systems were all broken for the same reason; I just added a fifth by following the same wrong pattern.

Sweep: gbrain-evals/blob/master/ → gbrain-evals/blob/main/ across both README.md (5 links) and CHANGELOG.md (2 links).

llms.txt + llms-full.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 19, 2026
…1136)

* feat(dims): OpenAI text-embedding-3 Matryoshka range validation (D13)

dimsProviderOptions now fail-loud at the embed boundary when the configured embedding_dimensions is outside the model's native range (1..1536 for -small, 1..3072 for -large). Paste-ready fix hint in the AIConfigError.fix field. Closes the silent-HTTP-400 path that would have bit OpenAI-fallback users on v0.36.0.0 ZE-default installs.

16 new test cases in test/ai/dims-openai.test.ts pinning the contract across native-openai and openai-compatible adapter paths.

* feat(ai): flip defaults to ZeroEntropy zembed-1 1280d + zerank-2 reranker

Default embedding model is now zeroentropyai:zembed-1 at 1280d via Matryoshka. Real-corpus benchmark: 2.2x faster than OpenAI, 2.6x cheaper at regular pricing, wins 11/20 head-to-head queries. 1280 is the closest valid ZE Matryoshka step to the prior OpenAI 1536d default (valid set: 2560/1280/640/320/160/80/40). 1024 (Voyage's step) is NOT on ZE's list; this is pinned by AIConfigError fail-loud in dims.ts.

balanced mode bundle now defaults reranker_enabled=true. zerank-2 reshuffles 60% of top-1 results in benchmarks. Missing-key fail-open contract in src/core/search/rerank.ts handles unauthenticated cases. Opt out with: gbrain config set search.reranker.enabled false

Existing tests updated (gateway.test.ts, search-mode.test.ts) and a new test/balanced-reranker-default.test.ts (10 cases) pins the fail-open invariants.

* feat(retrieval-upgrade): RetrievalUpgradePlanner + interactive prompt UX

New src/core/retrieval-upgrade-planner.ts is the consolidated planner that computes the brain's pending retrieval-upgrade work (chunker bumps + ZE switch) in one pass and applies the schema transition + config updates atomically.

Tagged-union ApplyResult enum (D15): 'applied' | 'skipped_already_applied' | 'skipped_no_work' | 'declined' | 'planned' | 'failed'. No string-parsing reasons.
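To illustrate why a tagged union removes the need to parse reason strings, here is a minimal sketch. The six status tags are the ones listed above; the `status` discriminant name, the payload fields (`pagesReembedded`, `error`), and the `describeResult` helper are assumptions made for the example, not the planner's real types.

```typescript
// Hypothetical tagged union: each variant is identified by its `status` tag,
// and only the variants that need extra data carry it.
type ApplyResult =
  | { status: "applied"; pagesReembedded: number }
  | { status: "skipped_already_applied" }
  | { status: "skipped_no_work" }
  | { status: "declined" }
  | { status: "planned" }
  | { status: "failed"; error: string };

// Callers switch on the tag; the compiler checks that every variant is handled.
function describeResult(r: ApplyResult): string {
  switch (r.status) {
    case "applied":
      return `applied (${r.pagesReembedded} pages re-embedded)`;
    case "skipped_already_applied":
      return "skipped: already applied";
    case "skipped_no_work":
      return "skipped: nothing to do";
    case "declined":
      return "declined by user";
    case "planned":
      return "planned (dry run)";
    case "failed":
      return `failed: ${r.error}`;
    default: {
      // If a new variant is added without a case, this assignment stops compiling.
      const unreachable: never = r;
      throw new Error(`unhandled result: ${JSON.stringify(unreachable)}`);
    }
  }
}

console.log(describeResult({ status: "applied", pagesReembedded: 120 }));
console.log(describeResult({ status: "failed", error: "boom" }));
```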
Three config keys (D12): ze_switch_prompt_shown (UI state), ze_switch_requested (user intent), ze_switch_applied (work done). Plus ze_switch_previous_snapshot (JSON, full prior config for --undo per D16) and ze_switch_declined_at (90-day re-ask window).

Schema transition (D18) is atomic: DROP indexes + ALTER COLUMN + CREATE INDEX inside a single engine.transaction(). HNSW recreation is part of the same transaction, so there is no silent slow-search window.

C3 eligibility logic: ze_switch_offered iff NOT on ZE + NOT declined recently + NOT applied + (legacy default OR >100 pages).
C4 cost math: MAX(chunker_pending, dim_pending) not SUM, because one re-embed pass invalidates both surfaces simultaneously.

New src/core/retrieval-upgrade-prompt.ts wires the planner to a TTY-only interactive prompt with two-line cost split (D10) and privacy callout for the reranker flip.

Tests: test/retrieval-upgrade-planner.test.ts (24 cases) pins the state machine. test/asymmetric-encoding-contract.test.ts (6 cases) pins D17: search read path uses gateway.embedQuery() not embed(), asserted via __setEmbedTransportForTests mock.

* feat(cli): gbrain ze-switch - manual lever for the ZE switch

New gbrain ze-switch CLI with --dry-run, --json, --resume, --force, --undo, --non-interactive, --confirm-reembed, --ignore-missing-key flags. Mirrors the upgrade prompt's UX symmetry: --undo presents a cost-warning before re-embedding back to the prior width.

src/cli.ts: dispatch case + CLI_ONLY entry. ze-switch owns its own engine lifecycle (mirrors the doctor pattern).

test/ze-switch-cli.test.ts (11 cases): --help, --dry-run, --json, --non-interactive, --ignore-missing-key, --resume, --undo, --confirm-reembed. Uses captureExit harness to test process.exit() paths without breaking the test process.

* feat(doctor): ze_embedding_health + embedding_width_consistency checks

Two new doctor checks (D-A5):

ze_embedding_health: when embedding_model starts with zeroentropyai:, verify ZEROENTROPY_API_KEY is set (env or config).
Paste-ready setup hint with the signup URL on failure.

embedding_width_consistency: cross-check that the configured embedding_dimensions matches the actual vector(N) column width on content_chunks.embedding. Catches the half-applied switch state (schema migrated but config write crashed) with a paste-ready gbrain ze-switch --resume hint.

Wired into runDoctor between reranker_health and the existing sync_freshness checks. Both checks gracefully no-op on non-ZE embedding configs.

test/doctor-ze-checks.test.ts (8 cases) pins both checks across happy + missing-key + missing-config + drift paths. Uses withEnv() helper to clear ZEROENTROPY_API_KEY for the no-key path so tests are hermetic against contributor env state.

test/e2e/v0_28_5-fix-wave.test.ts + test/openai-compat-multimodal.test.ts: updated to explicit-configure the gateway when the test depends on specific dims that diverge from the v0.36.0.0 default (1280d).

* docs: README zero-based rewrite (884 -> 139 lines) + new docs files

Strip 4 months of accreted "New in v0.X.Y" hero blocks and reorganize around what gbrain does today. 33 H2s -> 8. The Commands section (136 lines duplicating gbrain --help) moved out; the 6-table skills enumeration collapsed to a one-paragraph capability description with a link to skills/RESOLVER.md.

Hero retains load-bearing facts: OpenClaw + Hermes credit, production numbers (17,888 pages / 4,383 people / 723 companies), BrainBench numbers (P@5 49.1% / R@5 97.9% / +31.4 lift), ZE comparison numbers, 30-min install claim. Adds one paragraph announcing the v0.36.0.0 ZE default with the explicit gbrain config set escape for OpenAI/Voyage users.

New files:
- docs/INSTALL.md: every install path consolidated (agent platform, CLI standalone, MCP server). Thin-client mode covered.
- docs/architecture/RETRIEVAL.md: why the hybrid + graph stack works. BrainBench numbers, why each strategy alone fails, the source-aware ranking + intent classification + multi-query expansion story.
- docs/ethos/ORIGIN.md: origin story lifted from the old README so the front door stays factual + concrete.

test/readme-hero-anchors.test.ts (5 cases) is the D9 regression guard. Five load-bearing strings: OpenClaw, Hermes, ZE, production-numbers regex, P@5/R@5. Light anchors that let voice/structure evolve but block accidental loss of headline facts.

scripts/check-test-real-names.sh: allowlist entries for OpenClaw + Hermes literals in the anchor test (it explicitly asserts those strings appear in README).

* chore: bump version and changelog (v0.36.0.0)

ZeroEntropy as the new default for embedding (zembed-1 at 1280d via Matryoshka) and reranker (zerank-2 cross-encoder, on by default in balanced mode bundle). README zero-based rewrite (884 -> 139 lines). 3 new docs files. Two new doctor checks. New gbrain ze-switch CLI with --undo for symmetric reversibility.

skills/migrations/v0.36.0.0.md tells the agent how to surface the retrieval-upgrade prompt post-upgrade. llms-full.txt regenerated via bun run build:llms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docs): scrub Wintermute from RETRIEVAL.md per privacy rule

* chore: rebump version 0.36.0.0 → 0.36.2.0 (queue collision)

Three open PRs were claiming v0.36.0.0 (#1130 skillpack, #1139 hindsight, #1136 this PR). Ship-aware queue allocator says this branch lands at v0.36.2.0.

Trio audit:
  VERSION 0.36.2.0
  package.json 0.36.2.0
  CHANGELOG ## [0.36.2.0] - 2026-05-17

Updates: VERSION, package.json, CHANGELOG header + body refs, README "New default in v0.36.2.0" announcement + credit line, skills/migrations/v0.36.0.0.md renamed to v0.36.2.0.md with frontmatter + body refs updated. llms-full.txt regenerated.

* fix(test): pin gateway dim=1536 in cross-file-stateful PGLite tests

CI shard 1 reported 10 failures across `query-cache.test.ts` (6) and `consolidate-valid-until.test.ts` (4).
Both files hardcode 1536-dim vectors but rely on `PGLiteEngine.initSchema()` to size `vector(__EMBEDDING_DIMS__)` at the right width.

Root cause: v0.36.2.0 flipped DEFAULT_EMBEDDING_DIMENSIONS from 1536 to 1280 (ZE Matryoshka step). The gateway module is process-singleton; when ANOTHER test file in the same shard's bun-test process configures the gateway before us, `pglite-engine.ts:216` reads `getEmbeddingDimensions() === 1280` and sizes the schema columns at vector(1280). The hardcoded 1536-dim INSERTs then fail with "expected 1280 dimensions, not 1536".

Locally these tests pass in isolation because the gateway falls back through the try/catch at pglite-engine.ts:218 (1536 default). CI runs multiple test files in one process, so cross-file state poisons the schema width.

Fix: explicit `resetGateway()` + `configureGateway({embedding_dimensions: 1536, ...})` at the top of `beforeAll`, plus `resetGateway()` in `afterAll`. Pins the schema width regardless of cross-file state.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
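The cross-file poisoning described in the fix(test) commit above, and the pinning fix, can be reproduced with a toy process-wide singleton. This is not the gbrain gateway: `configureGateway`, `resetGateway`, and `getEmbeddingDimensions` below are simplified stand-ins that only borrow the names from the message, and `schemaWidth` is a hypothetical substitute for the schema-sizing logic with its try/catch fallback.

```typescript
// Toy module-level state standing in for a process-singleton gateway.
let configuredDims: number | undefined;
const FALLBACK_DIMS = 1536; // fallback width mentioned in the message above

function configureGateway(opts: { embedding_dimensions: number }): void {
  configuredDims = opts.embedding_dimensions;
}

function resetGateway(): void {
  configuredDims = undefined;
}

function getEmbeddingDimensions(): number {
  if (configuredDims === undefined) throw new Error("gateway not configured");
  return configuredDims;
}

// Hypothetical schema sizing: use the gateway width, else fall back.
function schemaWidth(): number {
  try {
    return getEmbeddingDimensions();
  } catch {
    return FALLBACK_DIMS;
  }
}

// 1. Run in isolation: nothing configured, so the fallback applies.
const isolated = schemaWidth();

// 2. Another test file in the same process configured the gateway first.
configureGateway({ embedding_dimensions: 1280 });
const poisoned = schemaWidth();

// 3. The fix: reset and pin the width explicitly before sizing the schema,
//    then reset again afterwards so later files start clean.
resetGateway();
configureGateway({ embedding_dimensions: 1536 });
const pinned = schemaWidth();
resetGateway();

console.log({ isolated, poisoned, pinned });
```

The point of the sketch is that the width a test sees depends on who touched the singleton last unless the test pins it itself.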
Summary
v0.36.1.0 teaches gbrain to know how the user tends to be wrong and apply that knowledge at every advice surface. One PR with 32 bisect-friendly atomic commits (31 wave commits + 1 merge resolution from master's v0.35.6 + v0.35.7).
The substrate gbrain already had (takes + scorecard + Brier + contradictions probe) was 70% of the way there. This wave closes the other 30%: extract gradeable claims from prose, grade against reality, aggregate into a profile of the user's bias patterns, then apply the profile when giving advice.
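For readers unfamiliar with the Brier score mentioned above: it is the mean squared difference between a stated probability and the 0/1 outcome, so lower is better and always answering 0.5 scores 0.25. A minimal sketch, assuming each resolved take carries a conviction in 0..1 and a boolean outcome; the `ResolvedTake` shape and the `brierScore` name are hypothetical and not gbrain's scorecard code.

```typescript
// Hypothetical minimal view of a resolved take.
interface ResolvedTake {
  conviction: number; // stated probability in 0..1
  cameTrue: boolean;  // how reality resolved
}

// Mean of (conviction - outcome)^2 over all resolved takes.
function brierScore(takes: ResolvedTake[]): number {
  if (takes.length === 0) throw new Error("no resolved takes to score");
  let sum = 0;
  for (const t of takes) {
    const outcome = t.cameTrue ? 1 : 0;
    sum += (t.conviction - outcome) ** 2;
  }
  return sum / takes.length;
}

// Invented example: two confident takes (one wrong) and one mild take.
const sample: ResolvedTake[] = [
  { conviction: 0.9, cameTrue: true },
  { conviction: 0.8, cameTrue: false },
  { conviction: 0.6, cameTrue: true },
];
console.log(brierScore(sample).toFixed(2));
```

The confidently wrong take contributes 0.64 of the 0.81 total here, which is the kind of pattern a calibration profile is meant to surface.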
What ships
- `propose_takes` (LLM scans prose for gradeable claims), `grade_takes` (judge model verdicts unresolved takes with retrieval), `calibration_profile` (aggregates resolved subset into 2-4 conversational pattern statements). All extend the new `BaseCyclePhase` abstract class, which enforces `sourceScopeOpts(ctx)` threading at the type level; this closes the v0.34.1 source-isolation leak class structurally for every future phase.
- `calibration_profiles`, `take_proposals`, `take_grade_cache`, `take_nudge_log`, `takes_resolved_at_idx` (CONCURRENTLY on Postgres), `think_ab_results`. Every new row stamped with `wave_version='v0.36.1.0'` for clean `--undo-wave` reversal.
- `>=0.95` single-model OR `>=0.85` ensemble 3/3 unanimous, schema-enforced monotonic-tightening only.
- `gateVoice()` function, 5 modes, Haiku rubric judge, 2 regens then hand-written template fallback. Pattern statements pass the conversational voice test before storage.
- `gbrain calibration`, `gbrain calibration --regenerate`, `gbrain calibration --undo-wave v0.36.1.0`, `gbrain calibration ab-report`, `gbrain takes revisit <slug>`. New MCP op `get_calibration_profile` (scope: read, source-scoped).
- `/admin/calibration` with Brier sparkline, per-domain bars, pattern statements, abandoned-threads card. Three SVG endpoints behind `requireAdmin`. WCAG AA contrast bump on `--text-muted` (#555 → #777).
- `abandoned_threads`, `calibration_freshness`, `grade_confidence_drift`, `voice_gate_health`.
- `test/fixtures/calibration/` plus CI privacy guard `scripts/check-synthetic-corpus-privacy.sh` (wired into `bun run verify`). Real names of YC partners / portfolio companies / funds cannot leak into committed fixtures.
- (`test/regressions/v0.36.1.0-iron-rule.test.ts`) pinning think baseline, contradictions output, takes resolution, source-isolation read paths, and search modes all UNCHANGED when calibration is absent.
- `DESIGN.md` formalizes de facto admin SPA tokens. `skills/conventions/calibration.md` is the agent-facing convention. CHANGELOG v0.36.1.0 entry. CLAUDE.md key-files cluster added.
Reviews cleared
Test plan
- `bun run verify`: privacy + jsonb + progress + wasm + admin-build + cli-exec + system-of-record + eval-glossary + synthetic-corpus-privacy + typecheck all green
- `bun run test`: 7132 pass / 0 fail / 0 skip across 8 parallel shards + serial pass
- `runCycle` default dispatches all 16 phases in order (was failing before T-fix wired the three new phases into runCycle dispatch)
- `sqlFor.pglite` branches where index DDL differs

🤖 Generated with Claude Code