Merge upstream GBrain v0.35.1.1 while preserving Eva OpenClaw defaults #102
Merged
…ity routing (garrytan#881) * feat(v0.33): add SearchOpts.types multi-type filter to searchHybrid Push the page-type filter into SQL via AND p.type = ANY(\$N::text[]) in both engines' searchKeyword + searchVector + searchKeywordChunks paths. Primary consumer is the upcoming gbrain whoknows command (filters to ['person','company']); the limit budget then goes to typed candidates instead of being eaten by note/transcript/article pages. Future entity-only search in v0.34+ reuses the parameter for free. AND-applies alongside the existing single-value type filter (callers can use either or both). HybridSearchOpts threads opts.types into the underlying searchOpts so hybridSearch callers get the SQL-level filter without any post-filter waste. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33): whoknows core ranking function + 10 locked unit tests Implements ENG-D1's locked spec: score = log(1 + raw_match) × max(0.1, exp(-days/180)) × (0.5 + 0.5 × salience). raw_match comes from hybridSearch's RRF + source-boost-adjusted score; salience and recency boosts in hybridSearch are intentionally disabled so the formula applies on a clean signal. rankCandidates() is the pure function the eval grades against; findExperts() is the public entrypoint that wires hybrid search + batch salience/effective_date fetches; runWhoknows() is the CLI. Test/whoknows.test.ts covers the 10 ENG-D3 cases (zero results, negative recency floor, NaN salience neutral default, NaN match zeros gracefully, type preservation, --explain factor breakdown, top-K limit clamping, recency-floor extreme-days safety, alphabetical tie-break determinism, public-surface contract). Plus four sanity asserts (higher-match outranks, more-recent outranks, higher-salience outranks, all-zero candidate appears with score 0). Plus one factor decomposition assertion that pins the exact formula numerically. Plus a composite-key safety case (Codex F1). 22 expect calls across 16 tests. All passing. 
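The locked ENG-D1 formula above can be sketched as a pure function. This is a minimal illustration rather than the code in the commit: the `Candidate` field names, the clamping of negative inputs, and the choice of 0.5 as the "neutral" salience for NaN are all assumptions; only the formula itself and the alphabetical tie-break come from the commit message.

```typescript
// Hypothetical candidate shape; the real types in the whoknows command may differ.
interface Candidate {
  slug: string;
  rawMatch: number; // hybrid-search score (RRF + source boost)
  days: number;     // days since the page's effective date
  salience: number; // expected in [0, 1]
}

interface Scored {
  slug: string;
  score: number;
}

// score = log(1 + raw_match) * max(0.1, exp(-days/180)) * (0.5 + 0.5 * salience)
function scoreCandidate(c: Candidate): number {
  // Assumption: a NaN or negative match contributes zero.
  const match = Number.isFinite(c.rawMatch) ? Math.max(0, c.rawMatch) : 0;
  // Assumption: NaN salience falls back to a neutral 0.5.
  const salience = Number.isFinite(c.salience)
    ? Math.min(1, Math.max(0, c.salience))
    : 0.5;
  // Assumption: negative or NaN day counts are treated as "today".
  const days = Number.isFinite(c.days) ? Math.max(0, c.days) : 0;
  const recency = Math.max(0.1, Math.exp(-days / 180)); // floor at 0.1
  return Math.log1p(match) * recency * (0.5 + 0.5 * salience);
}

// Sort by score descending, break ties alphabetically by slug, keep the top `limit`.
function rankCandidates(cands: Candidate[], limit: number): Scored[] {
  return cands
    .map((c) => ({ slug: c.slug, score: scoreCandidate(c) }))
    .sort((a, b) => b.score - a.score || a.slug.localeCompare(b.slug))
    .slice(0, Math.max(0, limit));
}
```

Keeping the scorer free of I/O is what lets an eval grade the ranking in isolation from search and database fetches.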
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33): register find_experts MCP op + gbrain whoknows CLI Wires both surfaces per ENG-D5: MCP op = find_experts (matches find_anomalies naming convention; agent-facing); CLI command = gbrain whoknows (memorable, user-facing). One findExperts() core function backs both paths. The op is scope:'read', localOnly:false — accessible over HTTP MCP to read-scoped OAuth clients like the salience/anomalies family. Op handler validates non-empty topic and dispatches to the same findExperts() pure function the CLI uses. CLI dispatch in src/cli.ts:case 'whoknows' calls runWhoknows; thin- client routing happens inside runWhoknows via isThinClient(cfg) — remote MCP installs route through the v0.31.1 routing seam to callRemoteTool('find_experts', ...). FIND_EXPERTS_DESCRIPTION in operations-descriptions.ts mirrors the v0.29 redirect-hint style: leads with what the tool does, lists explicit user-intent triggers ("who should I talk to about X", "who knows about Y"), notes the type-filter behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33): gbrain eval whoknows — two-layer eval gate (ENG-D2) Implements the locked spec: Layer 1 hand-labeled fixture (>=80% top-3 hit rate) is the primary ship-blocking gate; Layer 2 eval_candidates replay (>=0.4 mean set-Jaccard@3) is the regression gate that auto-skips when < 20 replay-eligible rows exist (CONTRIBUTOR_MODE sparseness fallback). Dispatch lands as `gbrain eval whoknows <fixture.jsonl>` sub-subcommand in src/commands/eval.ts (mirrors v0.25.0 export/prune/replay and v0.27.x cross-modal pattern). Exits 0/1/2 for pass/fail/usage so CI gates can consume. JSON output (--json) ships schema_version: 1 for stable consumer contract (mirrors v0.25.0 eval-replay.ts). Human output groups by layer + emits a per-miss diagnostic table so failures are self-debugging. 
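The two gate metrics named above, top-3 hit rate and set-Jaccard@3, can be sketched as follows. The function names match those the unit tests mention, but the signatures and the empty-versus-empty convention (returning 1, my reading of "vacuous-stable") are assumptions.

```typescript
// Jaccard similarity between the top-k prefixes of two ranked slug lists.
function jaccardAtK(a: string[], b: string[], k: number): number {
  const sa = new Set(a.slice(0, k)); // Set construction dedups repeated slugs
  const sb = new Set(b.slice(0, k));
  if (sa.size === 0 && sb.size === 0) return 1; // assumption: two empty lists count as stable
  const inter = Array.from(sa).filter((x) => sb.has(x)).length;
  return inter / (sa.size + sb.size - inter);
}

// True when any expected slug appears within the first k actual results.
function topKHit(actual: string[], expected: string[], k: number): boolean {
  const top = new Set(actual.slice(0, k));
  return expected.some((slug) => top.has(slug));
}

// Thresholds as stated in the commit message.
const HIT_RATE = 0.8;
const REGRESSION = 0.4;

// Fraction of queries that hit, compared against the ship-blocking threshold.
function passesQualityGate(hits: boolean[]): boolean {
  if (hits.length === 0) return false; // assumption: an empty fixture cannot pass
  return hits.filter(Boolean).length / hits.length >= HIT_RATE;
}
```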
Unit tests pin: - jaccardAtK math (7 cases — identical, disjoint, partial, k cutoff, empty-empty vacuous-stable, empty-vs-non-empty, Set dedup) - topKHit (7 cases — position 1, 3, 4, miss, multi-expected, empty actual, empty expected) - readFixture (6 cases — well-formed, comments/blanks, missing file, malformed JSON, missing required fields, non-string filter) - Locked thresholds (HIT_RATE=0.8, REGRESSION=0.4, MIN_REPLAY_ROWS=20) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33): gbrain doctor adds whoknows_health check Per CEO-D7 (substrate-conditional v0.33 doctor check, but the fixture-presence sub-check ships in week 1 regardless — it's the "did you do the assignment?" signal). When the eval fixture is missing, empty, or undersized (< 5 rows), doctor warns with the exact path the user should populate. The check is intentionally lightweight: it does NOT run the eval itself or measure hit-rate regression. That's the job of `gbrain eval whoknows`, called from CI/ship time. This check is the cheap always-runs signal that surfaces in `gbrain doctor` and on the ship review dashboard. 5 unit cases pin the four-status behavior (missing/empty/undersized/ ok) plus the comment-and-blank-line filtering so users can comment out queries during iteration without breaking the row count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33): synthetic whoknows eval fixture + E2E quality gate test test/fixtures/whoknows-eval.jsonl ships as a 10-query placeholder demonstrating the schema. Comments document the assignment for end users: they replace these with their own real queries before shipping their gbrain install. The placeholder uses obviously- example slugs (wiki/people/example-alice, etc.) so nobody mistakes it for production data. 
test/e2e/whoknows.test.ts seeds a synthetic PGLite brain that matches the placeholder fixture, then runs findExperts on every fixture query and asserts >=80% top-3 hit rate per ENG-D2 quality gate. Also exercises the typeFilter (concept-decoy pages filtered out), empty-result graceful return, --explain factor breakdown, and top-K limit honoring. Basis-vector embeddings (no API key) follow the existing pattern from test/e2e/search-quality.test.ts. 5 test cases, 23 expect calls, all passing against PGLite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v0.33): VERSION bump + CHANGELOG + CLAUDE.md + llms regen Bumps VERSION 0.31.11 → 0.33.0 and package.json to match. CHANGELOG entry leads with the headline use ("ask gbrain who knows about X") and the locked ENG-D1 ranking formula. "Numbers that matter" replaced with a "what ships on which eval outcome" table — honest about the eval-gated trajectory rather than fabricating benchmarks before the release has been graded against a real brain. CLAUDE.md Key Files annotations added for src/commands/whoknows.ts, src/commands/eval-whoknows.ts, and test/fixtures/whoknows-eval.jsonl. src/core/search/hybrid.ts entry extended with the new types parameter documentation (push the type filter to SQL, no post-filter waste, AND-applies alongside the existing single-value type field). bun run build:llms ran the chaser; llms.txt + llms-full.txt regenerated to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.33): unit-test gap fill — engine typeFilter + find_experts op Two new files filling the gaps Garry called out: test/search-types-filter.test.ts — engine-level coverage on PGLite for the new SearchOpts.types filter. Asserts the SQL-clause behavior directly so a regression in the AND p.type = ANY(...) emission gets caught here with a tight assertion rather than as part of a longer findExperts pipeline. 
9 cases across searchKeyword + searchVector + chunk-grain documentation. Documents the pre-existing PGLite parity gap (single-value `type` field is Postgres-only; `types` is the v0.33 multi-type filter that BOTH engines honor). test/find-experts-op.test.ts — MCP-op contract test for find_experts. Pins: - Registered in the operations array + operationsByName - scope: 'read', localOnly false (HTTP-MCP accessible per ENG-D5) - Documented params (topic / limit / explain) with correct types - cliHints.name === 'whoknows' (CLI surface bridge) - Non-trivial description that references the use case - Handler rejects empty / whitespace / missing topic with invalid_params - Handler returns array shape on valid topic - Handler honors limit param 11 op-contract cases + 9 engine-clause cases. All passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version to v0.33.1.0 Garry asked for v0.33.1 instead of v0.33.0 (queue collision with unrelated 0.33.0 work). 4-digit format: 0.33.1.0. CHANGELOG header and "To take advantage of" block updated. llms.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(v0.33.1.1): cliHints.positional on find_experts so CLI accepts <topic> Without `cliHints.positional: ['topic']`, the op-dispatch path in src/cli.ts couldn't parse `gbrain whoknows "ai agents"` and threw `invalid_params: topic is required`. Found while testing the v0.33.1.0 build against a real brain. The op handler validates topic; the CLI just needed to know the positional shape so the dispatcher could hand it through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.33.1.2): real-brain whoknows-eval fixture from VC intro network Replaces the synthetic 10-row placeholder with 10 real expertise-routing queries mined from Garry's actual brain via thin-client connection to Wintermute (v0.32.2). 
Source: reference/vc-intro-network ("Who Takes Intros from Garry") + adjacent routing context. All 15 unique expected person slugs verified against ~/git/brain/people/<slug>.md source markdown: people/amit-kumar Accel partner, 102 YC deals people/diana-hu YC GP people/elad-gil Angel, top-rated people/eric-vishria Benchmark, healthtech people/gokul-rajaram Angel, 57 YC deals people/joff-redfern Menlo Ventures, ex-CPO Atlassian people/jon-xu YC GP people/kristina-shen Chemistry, healthtech people/lachy-groom Angel, 43 YC deals people/lee-edwards Quiet Capital, 52 YC deals people/nick-shalek Ribbit Capital, fintech people/nina-achadian Index Ventures, 69 YC deals (note: slug uses 'achadian' not 'achadjian') people/parul-singh 645 Ventures people/rebecca-kaden USV people/trae-stephens Founders Fund, defense/deep-tech Eval cannot run yet against Wintermute thin-client: server is v0.32.2, find_experts MCP op was added in v0.33. Once Wintermute upgrades the eval will run end-to-end via the v0.31.1 thin-client routing seam. Local eval works once the brain is indexed with find_experts available. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.33.1.3): wire thin-client routing into eval-whoknows `gbrain eval whoknows` now works against a thin-client install. When isThinClient(cfg), each fixture query routes through the remote find_experts MCP op via callRemoteTool — same v0.31.1 routing seam runWhoknows already uses. Local mode unchanged: findExperts(engine, ...) called directly. Server prerequisite: the brain must be v0.33+ for find_experts to be registered. Wintermute (currently v0.32.2) gets it on next upgrade and then the eval runs end-to-end with zero client-side changes. 
Mechanics: - `WhoknowsFn` callable abstraction so the gates are impl-agnostic - runEvalWhoknows(engine: BrainEngine | null, args) — null engine allowed in thin-client mode - Regression gate auto-skips in thin-client mode (no DB access to eval_candidates; quality gate alone gates ship) - cli.ts adds a thin-client bypass before connectEngine for `gbrain eval whoknows`, matching the longmemeval/cross-modal no-DB pattern E2E test updated to use an inline synthetic fixture (the shipped fixture is real-brain data now, doesn't match the seeded test brain). Sanity-check the shipped fixture parses cleanly in a separate case. Tests: 25 unit cases (+2 for null-engine signature contract) + 6 E2E cases. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
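The `WhoknowsFn` abstraction described in the thin-client commit above might look roughly like this. Only the name `WhoknowsFn` appears in the commit message; the signature, `ExpertHit`, `FixtureRow`, `pickWhoknowsFn`, and `topKHitRate` are hypothetical stand-ins, and the real code presumably wraps `findExperts(engine, ...)` and `callRemoteTool('find_experts', ...)` behind the two callables.

```typescript
// Hypothetical result and fixture shapes, for illustration only.
interface ExpertHit {
  slug: string;
  score: number;
}
interface FixtureRow {
  query: string;
  expected: string[];
}

// One callable type so the gate logic does not care where results come from.
type WhoknowsFn = (topic: string, limit: number) => Promise<ExpertHit[]>;

// Choose the remote implementation for thin clients, the local one otherwise.
function pickWhoknowsFn(
  isThinClient: boolean,
  local: WhoknowsFn,
  remote: WhoknowsFn,
): WhoknowsFn {
  return isThinClient ? remote : local;
}

// Run every fixture query through the callable and compute the top-k hit rate.
async function topKHitRate(
  fn: WhoknowsFn,
  fixture: FixtureRow[],
  k = 3,
): Promise<number> {
  if (fixture.length === 0) return 0;
  let hits = 0;
  for (const row of fixture) {
    const top = (await fn(row.query, k)).slice(0, k).map((h) => h.slug);
    if (row.expected.some((slug) => top.includes(slug))) hits++;
  }
  return hits / fixture.length;
}
```

Because the gate only sees a `WhoknowsFn`, a null engine in thin-client mode never reaches the scoring code.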
… rethrow (garrytan#962) * fix: send Voyage output_dimension on embedding requests * fixup: drop voyage-4-nano from flexible-dim set Voyage's hosted /embeddings endpoint accepts `output_dimension` only for the seven flexible-dim models (voyage-4-large, voyage-4, voyage-4-lite, voyage-3-large, voyage-3.5, voyage-3.5-lite, voyage-code-3). voyage-4-nano is an open-weight variant Voyage lists separately as fixed 1024-dim — the hosted API rejects the parameter for it. The recipe docstring previously claimed "all v4 variants" have flexible dims, which is what led to nano being added to the allowlist in the first place. Tighten the comment to name the hosted trio explicitly and call out nano-as-open-weight. Convert the test case at test/ai/gateway.test.ts from a positive assertion (voyage-4-nano returns { dimensions: 512 }) to a negative regression pin (voyage-4-nano returns undefined), so a future contributor can't silently re-add nano without breaking this test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: Voyage OOM-cap rethrow + flexible-dim runtime validation (Codex P3 follow-ups) Two follow-ups from Codex's adversarial review of PR garrytan#962, both Voyage-adjacent correctness fixes that the original PR scope had filed as TODOs. 1. gateway.ts:619 Voyage OOM cap was theatrical ------------------------------------------------- voyageCompatFetch's inbound response rewriter is wrapped in a try/catch that falls back to the original response on parse failure — correct for "Voyage returned JSON I can't reshape, let the SDK handle it." But the per-embedding Layer 2 OOM cap at line 619 threw a bare `new Error(...)`, which the same catch silently swallowed. Net result: an oversized base64 response (Layer 1 skipped because no Content-Length header) returned through to the AI SDK and could OOM the worker on JSON.parse. 
Fix: introduce `VoyageResponseTooLargeError`, throw it at both cap sites (Content-Length Layer 1 at line 595 and per-embedding Layer 2 at line 619), and rethrow it from the inbound try/catch via `if (err instanceof VoyageResponseTooLargeError) throw err`. Pre-existing fall-back-on-parse-error behavior for other thrown errors is preserved. Regression-pinned by 2 new behavioral tests (mock fetch returns oversized Content-Length / oversized base64; embed() throws with the expected message) and a structural assertion in test/voyage-response-cap.test.ts that the `instanceof VoyageResponseTooLargeError ⇒ throw` line stays put. 2. Voyage flexible-dim runtime validation + doctor check ------------------------------------------------------- A brain configured for a Voyage flexible-dim model (voyage-4-large, voyage-3-large, voyage-3.5, voyage-3.5-lite, voyage-4, voyage-4-lite, voyage-code-3) without an explicit `embedding_dimensions` would fall back to DEFAULT_EMBEDDING_DIMENSIONS=1536 — an OpenAI default that Voyage rejects. Voyage's only accepted values are {256, 512, 1024, 2048}. Pre-fix the failure surfaced as an HTTP 400 from Voyage that often got misclassified as a transient network error. Fix: - `dims.ts` exports `VOYAGE_VALID_OUTPUT_DIMS` and `isValidVoyageOutputDim`. - `dimsProviderOptions` throws `AIConfigError` with a paste-ready fix command (`gbrain config set embedding_dimensions ...`) when a Voyage flexible-dim model is configured with an invalid dim value. - `gbrain models doctor` gets a new `embedding_config` probe that runs first (zero tokens) and surfaces the misconfiguration before any chat/expansion probes spend a single token. New probe status `config` + optional `fix` hint rendered in human output. Regression-pinned by 6 new unit tests covering the AIConfigError throw, exact valid-values set, the bypass path for fixed-dim Voyage models, and the fix-hint contents. 
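The flexible-dim validation in item 2 can be sketched as below. `VOYAGE_VALID_OUTPUT_DIMS`, `isValidVoyageOutputDim`, and `AIConfigError` are names from the commit message; their signatures, the `fix` field, and the helper `voyageDimsOption` are assumptions made for illustration. The fix hint keeps a placeholder because the commit message elides the exact value.

```typescript
// Values and model list as stated in the commit message.
const VOYAGE_VALID_OUTPUT_DIMS: readonly number[] = [256, 512, 1024, 2048];
const VOYAGE_FLEXIBLE_DIM_MODELS = new Set<string>([
  "voyage-4-large", "voyage-4", "voyage-4-lite",
  "voyage-3-large", "voyage-3.5", "voyage-3.5-lite", "voyage-code-3",
]);

// Assumed shape: an error that carries a paste-ready fix command.
class AIConfigError extends Error {
  fix: string;
  constructor(message: string, fix: string) {
    super(message);
    this.name = "AIConfigError";
    this.fix = fix;
  }
}

function isValidVoyageOutputDim(dim: number): boolean {
  return VOYAGE_VALID_OUTPUT_DIMS.includes(dim);
}

// Hypothetical helper: returns the provider option for flexible-dim models,
// undefined for fixed-dim models, and throws on an unsupported dimension.
function voyageDimsOption(
  model: string,
  dims: number,
): { dimensions: number } | undefined {
  if (!VOYAGE_FLEXIBLE_DIM_MODELS.has(model)) return undefined; // e.g. voyage-4-nano
  if (!isValidVoyageOutputDim(dims)) {
    throw new AIConfigError(
      `${model} does not accept output_dimension=${dims}; ` +
        `valid values are ${VOYAGE_VALID_OUTPUT_DIMS.join(", ")}`,
      "gbrain config set embedding_dimensions <256|512|1024|2048>",
    );
  }
  return { dimensions: dims };
}
```

Failing at configuration time, before any request is sent, avoids the HTTP 400 that the commit message says was often misread as a transient network error.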
* chore: bump version and changelog (v0.33.1.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.33.1.1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tent weighting (garrytan#897)

* feat(search-lite): token budget + semantic query cache + intent weighting

Adds three additive features to the hybrid search pipeline. All backward-compatible: existing callers see identical behavior unless they opt in to the new options.

## 1. Token Budget Enforcement (src/core/search/token-budget.ts)

Cap the cumulative token cost of returned results so search payloads fit downstream context windows. Greedy top-down walk; preserves caller ordering; no re-rank. char/4 heuristic for token counting (no tokenizer dependency; keeps the bun --compile bundle small).

- SearchOpts.tokenBudget: numeric cap. Default undefined = no-op.
- HybridSearchMeta.token_budget = { budget, used, kept, dropped }
- HTTP query op: pass `token_budget` param.

## 2. Semantic Query Cache (src/core/search/query-cache.ts + migration v52)

Cache search results keyed by query embedding similarity. HNSW lookup: `embedding <=> $1 < 0.08` (cosine similarity >= 0.92). Per-source isolation so multi-source brains don't bleed. Per-row TTL (default 3600s). Best-effort writes; all errors swallowed so the cache never breaks the search hot path.

Migration v52 creates query_cache table with HALFVEC where pgvector >= 0.7; falls back to VECTOR with the resolved config.embedding_dimensions dim.

- New `gbrain cache` CLI: stats / clear --yes / prune.
- Config keys: search.cache.enabled / similarity_threshold / ttl_seconds.
- HybridSearchMeta.cache = { status, similarity?, age_seconds? }

Routed through new `hybridSearchCached(engine, query, opts)` wrapper; the operations.ts query op now uses this wrapper so MCP/CLI calls benefit automatically. Skipped for two-pass walks + non-default embedding columns where cache semantics don't hold.

## 3. Zero-LLM Intent Weighting (src/core/search/intent-weights.ts)

Builds on the existing query-intent classifier (4 intents: entity / temporal / event / general).
New weight-adjustment layer applies subtle per-intent nudges:

- entity → boost keyword RRF + exact slug/title match
- temporal → default recency=on when caller left it unset
- event → boost keyword RRF (rare named entities) + soft recency
- general → no-op (1.0 multipliers everywhere)

All adjustments are SUBTLE (max 1.25x). Caller-explicit options ALWAYS win; intent weighting never silently overrides recency / salience. Default ON; opt out via `opts.intentWeighting = false`.

LLM query expansion (expansion.ts) is still available and opt-in via `opts.expansion = true`; it just isn't the default anymore. HybridSearchMeta.intent now surfaces classifier output for debugging.

## Tests

- test/token-budget.test.ts (10 tests, pure module)
- test/intent-weights.test.ts (13 tests, pure module)
- test/query-cache.test.ts (12 tests, PGLite)
- test/hybrid-search-lite.serial.test.ts (9 tests, PGLite e2e)

Plus 105 pre-existing search tests still pass. `bun run verify` clean.

Co-authored-by: Wintermute <agents@garrytan.com>

* feat(search-mode): MODE_BUNDLES + resolveSearchMode wired into bare hybridSearch

Three named modes (conservative / balanced / tokenmax) that bundle the search-lite knobs from PR garrytan#897 into a single config key.

Mode resolution lives in bare hybridSearch (NOT just the cached wrapper) so eval-replay and eval-longmemeval, which both call bare hybridSearch, test the same mode-affected behavior as production. See [CDX-5+6] in the plan.

The mode bundle supplies DEFAULTS for intentWeighting, tokenBudget, expansion, and searchLimit when the caller leaves those undefined. Per-call SearchOpts and per-key config overrides still win (matches the v0.31.12 model-tier resolution chain at model-config.ts:resolveModel).

knobsHash() exposes a stable SHA-256 of the resolved knob set; the cache contamination hotfix (next commit) consumes it to prevent a tokenmax write from being served to a conservative read.
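A minimal sketch of the resolution chain and the knob hash described above, assuming a flat knob record. `MODE_BUNDLES` and `knobsHash` are names from the commit message; `resolveKnobs` and the `Knobs` shape are hypothetical. The conservative and tokenmax `expansion` / `searchLimit` values and the conservative 4000-token budget follow figures quoted elsewhere in this log, while every other bundle value is a placeholder.

```typescript
import { createHash } from "node:crypto";

type SearchMode = "conservative" | "balanced" | "tokenmax";

interface Knobs {
  intentWeighting: boolean;
  tokenBudget: number | null; // null = no cap
  expansion: boolean;
  searchLimit: number;
}

// Placeholder bundle table; the real one has more knobs and may use other values.
const MODE_BUNDLES: Record<SearchMode, Knobs> = {
  conservative: { intentWeighting: true, tokenBudget: 4000, expansion: false, searchLimit: 10 },
  balanced: { intentWeighting: true, tokenBudget: 8000, expansion: false, searchLimit: 20 },
  tokenmax: { intentWeighting: true, tokenBudget: null, expansion: true, searchLimit: 50 },
};

// Precedence: per-call option, then per-key config override, then mode bundle default.
function resolveKnobs(
  mode: SearchMode,
  configOverrides: Partial<Knobs> = {},
  perCall: Partial<Knobs> = {},
): Knobs {
  const bundle = MODE_BUNDLES[mode];
  const pick = <K extends keyof Knobs>(key: K): Knobs[K] => {
    if (perCall[key] !== undefined) return perCall[key] as Knobs[K];
    if (configOverrides[key] !== undefined) return configOverrides[key] as Knobs[K];
    return bundle[key];
  };
  return {
    intentWeighting: pick("intentWeighting"),
    tokenBudget: pick("tokenBudget"),
    expansion: pick("expansion"),
    searchLimit: pick("searchLimit"),
  };
}

// Stable SHA-256: serialize keys in sorted order so property order never matters.
function knobsHash(knobs: Knobs): string {
  const record = knobs as unknown as Record<string, unknown>;
  const canonical = Object.keys(record).sort().map((key) => [key, record[key]]);
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}
```

Hashing the resolved knobs rather than the mode name means a conservative call carrying a per-call override gets a different hash from a plain conservative call.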
Three new fields on HybridSearchMeta: - mode (resolved mode name) - existing token_budget meta now fires from bare hybridSearch too Bare hybridSearch now applies tokenBudget at all three return paths (no-embedding-provider, keyword-only-fallback, main). Previously only hybridSearchCached enforced budget; eval commands missed it. Tests: 37 unit cases pin the 3x7 bundle table cell-by-cell, the resolution chain semantics, knobs hash determinism + cross-mode separation, and the config-table parser. All 72 search-lite tests pass. Bisect-friendly: this commit ONLY adds mode resolution. The cache-key contamination hotfix [CDX-4] is a separate atomic commit (next). * fix(query-cache): cross-mode contamination hotfix [CDX-4] PR garrytan#897's query_cache keyed rows on sha256(source_id::query_text) only. A tokenmax search (expansion=on, limit=50) populated a row that a subsequent conservative call (no expansion, limit=10) read back, serving the wrong-shape results. This is a real bug in PR garrytan#897 today, regardless of the v0.32.3 mode picker work — Codex caught it in plan review. Fix: - Migration v56 adds query_cache.knobs_hash TEXT column + composite (source_id, knobs_hash, created_at) index. Existing rows have NULL knobs_hash and are excluded from lookups (silently re-populated with the right hash on first hit — no orphan data, no destructive migration). - cacheRowId(query, source, knobsHash) — knobsHash now part of the PK so a tokenmax write and a conservative write for the same (query, source) land in distinct rows. - SemanticQueryCache.lookup({knobsHash}) filters WHERE knobs_hash = $. - SemanticQueryCache.store({knobsHash}) writes the resolved hash. - hybridSearchCached threads knobsHash from resolveSearchMode through every cache call. Cache config (enabled/threshold/TTL) now reads from the resolved mode bundle, not directly from the config table. 
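The cache-key change above can be illustrated with an in-memory stand-in for the query_cache table. The `cacheRowId(query, source, knobsHash)` parameter list is from the commit message; the concatenation order inside the hash, the `Map`-backed store, and the `store` / `lookup` helpers are assumptions, and the semantic (embedding-similarity) lookup is left out entirely.

```typescript
import { createHash } from "node:crypto";

// Assumed key layout: the commit only says knobsHash is now part of the row id.
function cacheRowId(query: string, sourceId: string, knobsHash: string): string {
  return createHash("sha256")
    .update(`${sourceId}::${knobsHash}::${query}`)
    .digest("hex");
}

// Hypothetical in-memory stand-in for the query_cache table.
const rows = new Map<string, string[]>();

function store(query: string, sourceId: string, knobsHash: string, results: string[]): void {
  // Same (query, source, knobsHash) overwrites in place; no duplicate rows.
  rows.set(cacheRowId(query, sourceId, knobsHash), results);
}

function lookup(query: string, sourceId: string, knobsHash: string): string[] | undefined {
  return rows.get(cacheRowId(query, sourceId, knobsHash));
}
```

For example, after `store("ai agents", "default", "hash-tokenmax", fiftyResults)`, a call to `lookup("ai agents", "default", "hash-conservative")` misses instead of returning the wrong-shaped result set.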
Tests (test/query-cache-knobs-hash.test.ts, 11 cases): - cacheRowId bifurcates by knobsHash - Tokenmax write does NOT contaminate conservative lookup - Three modes coexist as distinct rows for same query - Legacy NULL-knobs_hash rows are excluded from lookup - Same-mode write updates in place (no duplicate rows) All 58 cache + mode tests pass. Migration v56 applies cleanly on a fresh PGLite brain. Bisect-friendly: this commit is the cache-key hotfix alone. Mode resolution wiring lives in the previous commit. * feat(search-telemetry): in-process rollup writer + search_telemetry table Migration v57 creates search_telemetry (date, mode, intent, count, sum_results, sum_tokens, sum_budget_dropped, cache_hit, cache_miss, first_seen, last_seen). PK (date, mode, intent) caps growth at ~4380 rows/year. Sums + counts only — averages derive at read time so concurrent ON CONFLICT writes from multiple gbrain processes accumulate correctly [CDX-17]. In-memory bucket flushed periodically (60s OR 100 calls) + on process beforeExit/SIGINT/SIGTERM with a 2-second cap. The search hot path NEVER waits on this write [D2, CDX-19]. Date-bucketed cache_hit / cache_miss columns make hit rate over --days N derivable [CDX-18]. query_cache.hit_count is a lifetime counter and can't be sliced by window. Wired into bare hybridSearch via emitMeta: every search call sync-bumps a bucket. flush() drains atomically by swapping the map before SQL writes so a record() during flush lands in the new map. readSearchStats(engine, {days}) returns the StatsWindow shape that gbrain search stats consumes (next commit). Tests: 16 unit cases pin record/flush/read semantics including ON-CONFLICT-adds-raw-values, concurrent-flush coalescing, cache hit-rate math, missing-table graceful degradation, and window clamping. 53 migrations apply on a fresh PGLite brain. * feat(config): add unset + listConfigKeys + readLineSafe helper [CDX-7+8+9] CDX-8: gbrain config has no unset path today. 
Required before `gbrain search modes --reset` can clear search.* overrides. - BrainEngine.unsetConfig(key) → returns rows deleted (0|1) - BrainEngine.listConfigKeys(prefix) → exact-literal prefix match with LIKE-escape on user-supplied % / _ / \ characters - PGLiteEngine + PostgresEngine implementations - `gbrain config unset <key>` and `gbrain config unset --pattern <prefix>` sub-subcommands CDX-9: readLine has no EOF detection or timeout. Mode-picker plan calls out "TTY closes mid-prompt → defaults to balanced" but the raw helper hangs forever. New readLineSafe(prompt, defaultValue, timeoutMs=60s): - Returns defaultValue on stdin 'end' event - Returns defaultValue on timeout - Returns defaultValue on empty Enter - Non-TTY stdin returns defaultValue immediately (e2e safe) - Returns trimmed user input otherwise Exported so install picker (next task) can use it. Tests: 9 cases pin unset semantics + prefix matcher edge cases (glob-wildcard escape, sort order, idempotent loop, search.* sweep). All 53 migrations apply on a fresh PGLite brain. * feat(init): install-time mode picker + upgrade banner Install picker (src/commands/init-mode-picker.ts): - Runs as a phase inside `gbrain init` AFTER engine.initSchema() so DB config writes work [CDX-7]. - Idempotent: skipped on re-init if search.mode is already set. - Smart auto-suggestion via recommendModeFor() reads models.tier.subagent / models.default / OPENAI_API_KEY: * Opus default/subagent → tokenmax (quality ceiling) * Haiku subagent → conservative (4K budget keeps cost down) * No OpenAI key → conservative (no LLM expansion possible) * Sonnet / unknown → balanced (safe default) - TTY shows menu via readLineSafe (60s timeout, defaults on EOF/empty). - Non-TTY auto-selects + emits operator hint: [gbrain] search mode: X (auto-selected — reason) [gbrain] To change: gbrain config set search.mode <...> - --json mode emits structured `{phase: 'search_mode_picker', ...}` event. 
- Wired into both initPGLite and initPostgres flows. Upgrade banner (src/commands/upgrade.ts): - One-shot stderr banner in runPostUpgrade. - State persisted via config key `search.mode_upgrade_notice_shown=true` — fires at most once per install. - Copy corrected per [CDX-1+2+3]: production query op STILL defaults expand=true and limit=20. The banner reframes from "behavior is regressing" to "named modes available + here's how to preserve exact current shape." Tests (test/init-mode-picker.test.ts, 16 cases): - recommendModeFor heuristic for all 4 input shapes - parseModeInput accepts numeric/named/case-insensitive, rejects garbage - runModePicker non-TTY auto-selects + writes config - Idempotent + --force re-prompt + JSON output - Opus → tokenmax, Haiku → conservative real wiring through engine * feat(cli): gbrain search modes/stats/tune command Three sub-subcommands mirroring the gbrain models (v0.31.12) shape: gbrain search modes [--json] Read-only routing dashboard. Shows the three mode bundles, the active mode, and the source of every resolved knob: cache_enabled = true [override: search.cache.enabled] tokenBudget = 4000 [mode: conservative] Plus knob descriptions for legibility. gbrain search modes --reset [--source <mode>] Clears every search.* override (NOT search.mode itself). Preserves the upgrade-notice state key. --source <mode> is a dry-run that lists what --reset would change without writing — the paved path [CDX-8] flagged as missing. gbrain search stats [--days N] [--json] Observability. Reads the search_telemetry rollup over the window (clamps to [1, 365]). Prints cache hit rate, mode mix, intent mix, budget drops, avg results/tokens. JSON output includes _meta.metric_glossary block per [CDX-25]. gbrain search tune [--apply] [--json] Recommendation engine. 
5 rules cover the bug class: - Insufficient data → "no_recommendations" status - Conservative + high budget-drop rate → suggest balanced - High cache hit rate (>85%) → suggest similarity threshold bump - Tokenmax + Haiku subagent → suggest balanced (cost mismatch) - Cache disabled but stats show usage → suggest re-enabling --apply mutates config via setConfig / unsetConfig with a paste-ready revert command printed at the end. Registered in src/cli.ts dispatch table. 17 unit cases pin: - Dashboard report shape + per-knob source attribution - --reset preserves search.mode + notice key - --source dry-run never writes - stats reads telemetry rollup; --days clamps - tune recommendation rules fire on real telemetry data - --apply mutates config - --help + unknown subcommand exit codes * feat(eval): metric glossary module + auto-gen METRIC_GLOSSARY.md + CI guard Single source of truth at src/core/eval/metric-glossary.ts. Every entry carries 3 fields: - industry_term (canonical IR/NLP literature name, preserved verbatim) - eli10 (plain-English a 16-year-old can follow) - range (numeric range + interpretation) Covers 4 metric families: - Retrieval: P@k, R@k, MRR, nDCG@k - Stability: Jaccard@k, top-1 stability - Statistical: p-value (paired bootstrap + Bonferroni), 95% CI - Operational: cache hit rate, avg results/tokens, cost per query, p99 latency Public surface: - getMetricGloss(metric) → full entry or null - eli10For(metric) → plain-English string or null - buildMetricGlossaryMeta(metrics[]) → {metric → eli10} record for JSON `_meta.metric_glossary` blocks per [CDX-25]. ONE block per response, NOT sibling `_gloss` fields on every metric. - renderMetricGlossaryMarkdown() → deterministic Markdown for the doc Auto-generation: scripts/generate-metric-glossary.ts emits docs/eval/METRIC_GLOSSARY.md. Deterministic (same input → same bytes) so the CI guard can diff. CI guard: scripts/check-eval-glossary-fresh.sh regenerates into a temp file and diffs against the committed doc. 
Out-of-date doc fails the build. Wired into `bun run verify` (and therefore `bun run test:full`). Tests (test/metric-glossary.test.ts, 18 cases): - Every documented metric is present - Every entry has all 3 required fields - Accessors return null on unknown metrics (no throw) - buildMetricGlossaryMeta silently drops unknown metrics - renderer output is deterministic across calls - Renderer groups metrics into 4 sections docs/eval/METRIC_GLOSSARY.md: 5491 bytes, 124 lines, fresh. * feat(doctor): search_mode + eval_drift checks + drift-watch module src/core/eval/drift-watch.ts — curated retrieval watch-list [CDX-6]. Five patterns covering the surface that actually affects retrieval quality: - src/core/search/ (search pipeline) - src/core/embedding.ts (embedding shape) - src/core/chunkers/ (chunk granularity) - src/core/ai/recipes/anthropic.ts + openai.ts (expansion + embed routing) - src/core/operations.ts (the query op definition) Adding to the list is a deliberate act — requires a CHANGELOG line so coverage grows on purpose, not by accident. Pure functions: - matchesWatchPattern(path) — trailing-slash = prefix, bare = equality - filesDriftedSince(repoRoot, sha?) — git diff --name-only wrapper - watchedFilesDrifted(repoRoot, sha?) — composite src/commands/doctor.ts — two new checks. checkSearchMode [CDX-20]: status stays 'ok' (never warns, never docks health score). Hint in message field. Three branches: - unset → "search.mode is unset (using balanced fallback). Run `gbrain search modes` to see what is running and pick a mode." - mode + no overrides → "Mode: X (no per-key overrides — mode bundle is canonical)." - mode + overrides → "Mode: X with N per-key override(s) (k1, k2, …). To consolidate to the pure mode bundle: gbrain search modes --reset" Upgrade-notice state key (search.mode_upgrade_notice_shown) is excluded from the override roster — it's not a knob. checkEvalDrift [CDX-6]: surfaces uncommitted changes to retrieval-watched files. 
Always 'ok'; operator-facing reminder. Names up to 3 drifted files in the message + paste-ready re-eval command. Both helpers exported (was: file-private) so tests can pin behavior without walking the full runDoctor pipeline. Tests: 12 drift-watch cases + 7 doctor-check cases. Pin watch-list shape, prefix-vs-equality matcher semantics, missing-repo graceful failure, and all three search_mode branches. * feat(eval): --mode flag on longmemeval/replay + run-all + compare Per-mode --mode flag plumbed into: - gbrain eval longmemeval --mode <conservative|balanced|tokenmax> Sets search.mode in the benchmark brain's config table; config is in PRESERVE_TABLES so resetTables doesn't wipe it between questions. Mode surfaces in the per-question NDJSON row. - gbrain eval replay --mode <m> + --compare-limit N --compare-limit forces a constant K across modes [CDX-13]; without it, Jaccard@k against the captured baseline measures K-drift, not quality. Mode is set once before the replay loop. - NOT cross-modal per [CDX-11]: cross-modal scores OUTPUT against TASK; it doesn't retrieve. Adding --mode there is theater. New: gbrain eval run-all orchestrator (src/commands/eval-run-all.ts): - Sweeps every requested mode × suite combination - Sequential default per D9; --parallel N opt-in (clamped to mode count) - Cost guard with split caps [CDX-15+16]: --budget-usd-retrieval N (default $5) --budget-usd-answer N (default $20) Non-TTY refuses with exit 2 unless --yes AND explicit --budget-usd-* flags pass. TTY refuses without --yes (defense against agent loops). - estimateRunCost computes per-(suite,mode) breakdown including the expansion-Haiku surcharge for tokenmax. - Audit trail: appends to <repo>/.gbrain-evals/eval-results.jsonl [CDX-23]. Personal brain (~/.gbrain) NEVER touched. - v0.32.3 ships orchestrator + argv + guard + persist hook. 
In-process per-suite invocation is a v0.32.4 follow-up (operator runs the per-suite CLIs with the documented --mode flag for now; each completion calls persistRunRecord to log). New: gbrain eval compare report (src/commands/eval-compare.ts): - Reads eval-results.jsonl, groups by (suite, mode), renders MD or JSON - Most-recent (suite, mode, commit) wins when duplicates exist - JSON output has schema_version=2 + _meta.metric_glossary block per [CDX-25] (ONE block per response, not sibling _gloss fields) - _meta.methodology field names the paired-bootstrap + Bonferroni discipline per [CDX-14] so haters can reproduce - Missing file → friendly hint pointing at `gbrain eval run-all` Wired into eval dispatch table in src/commands/eval.ts. Metric glossary fuzzy fallback: `recall@10` → `recall@k` lookup (the glossary documents the family; report rows carry specific K values). Routes through getMetricGloss for every call site. Tests (42 cases total — all green): - eval-run-all.test.ts (19): argv parser, cost estimate, guard semantics for all 4 (over/under × tty/non-tty) shapes, persist hook NDJSON shape. - eval-compare.test.ts (5): JSON + MD output shapes, glossary integration, missing-file graceful, mode filter, most-recent-wins. - metric-glossary.test.ts (18): unchanged but updated assertions to cover the fuzzy `@N` → `@k` fallback. Pre-existing eval-replay / eval-longmemeval / eval-export / eval-prune tests (42 cases) still pass — --mode + --compare-limit are additive. * docs: methodology + CLAUDE.md/README/RESOLVER + skills/conventions docs/eval/SEARCH_MODE_METHODOLOGY.md — haters-immune 8-section template. 
Documents what the eval measures + does NOT measure, datasets + sizes (LongMemEval n=500, Replay n=200, BrainBench n=1240 docs / 350 qrels), random seed 42, run procedure verbatim, threats to validity (LongMemEval English+technical skew, char/4 heuristic ~5-10% off, expansion ~97.6% relative lift on this corpus), per-question raw outputs, pre-registered expectations (tokenmax wins R@10 by 5-15pp, conservative wins cost by 5-15x, balanced lands within 3pp), re-run cadence anchored to the src/core/eval/drift-watch.ts watch-list. Statistical-significance section pins paired bootstrap with 10,000 resamples + Bonferroni correction across 3 modes × 4 metrics [CDX-14]. CLAUDE.md gets two new sections: ## Search Mode (3-mode table + resolution chain + [CDX-4] cache contamination fix note + CLI commands) and ## Eval discipline (single-source-of-truth glossary, methodology doc, eval_results in repo NOT personal brain per [CDX-23]). README.md Quick Start gets a paragraph naming the install picker, mode heuristic, and the methodology link. skills/conventions/search-modes.md NEW — convention file consumed by brain-ops + query + signal-detector skills via the existing `> **Convention:**` callout pattern. Routes "what mode" / "tune retrieval" / "compare modes" queries to the right CLI surface. skills/RESOLVER.md gets two new trigger rows pointing at gbrain search * and gbrain eval compare. * chore: regen llms.txt + llms-full.txt for v0.32.3 search-mode docs bun run build:llms — picks up the new CLAUDE.md sections (Search Mode + Eval discipline) and the docs/eval/SEARCH_MODE_METHODOLOGY.md addition. build-llms.test.ts gate now passes. * fix(doctor): wire search_mode + eval_drift checks into runDoctor main flow The v0.32.3 search_mode + eval_drift helpers were inserted into the DB-checks sub-helper at runDbChecks (line 345-355), but runDoctor itself maintains its own check list and only calls the helpers' subset. 
Push the two checks into the main runDoctor path (after the existing sync_freshness check at line 2347) so they actually appear in `gbrain doctor --json` output. Both checks gated on engine !== null. Progress reporter heartbeat fires for each. Both still return status 'ok' per [CDX-20] so health score is preserved. Verified end-to-end on a real Postgres brain: gbrain doctor --json now includes 'search_mode' and 'eval_drift' in the checks array. * fix: claw-test hang — DATABASE_URL leak + telemetry beforeExit deadlock Two root causes for the hang, both fixed. 1. DATABASE_URL leak in claw-test scripted harness The harness inherits the parent process's env via `...process.env` for every phase child (init / import / query / extract / doctor). When the e2e runner sets DATABASE_URL (for OTHER e2e tests), it leaks into claw-test's children. `loadConfig` at src/core/config.ts:143 then flips inferredEngine to 'postgres' for every subsequent phase, breaking the hermetic-PGLite-tempdir contract: phases race against each other on a shared test Postgres while pointing at different brain states. Fix: strip DATABASE_URL + GBRAIN_DATABASE_URL from the child env before forwarding. Re-apply GBRAIN_HOME / GBRAIN_FRICTION_RUN_ID after the merge so a parent's override can't win. The harness is PGLite-only by design. 2. Telemetry beforeExit deadlock v0.32.3's recordSearchTelemetry installed a `process.on('beforeExit', drainOnExit)` hook that wrapped the flush in `Promise.race([flush(), setTimeout(2000)])`. beforeExit fires when the event loop empties, but the hook enqueued NEW async work (the race's setTimeout + pending flush), so the event loop never re-emptied. Short-lived CLI invocations (`gbrain query "the"` finishing in ~100ms) ended up waiting on the DB write indefinitely. The claw-test harness spawns several short-lived gbrain queries. Each one hung after its real work finished. The harness then waited forever on its child subprocess's exit code. 
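The child-env scrub from root cause 1 can be sketched as follows. This is a minimal illustration, not the harness code: `buildChildEnv` and the sample values are hypothetical, and only the variable names (DATABASE_URL, GBRAIN_DATABASE_URL, GBRAIN_HOME, GBRAIN_FRICTION_RUN_ID) come from the commit text.

```typescript
// Hypothetical helper; the real harness code is not shown in this commit message.
type Env = Record<string, string | undefined>;

// Variables that would flip a child phase onto a shared Postgres.
const LEAKY_KEYS = ["DATABASE_URL", "GBRAIN_DATABASE_URL"];

function buildChildEnv(parent: Env, pinned: Record<string, string>): Env {
  // Copy first so the parent's environment object is never mutated.
  const child: Env = { ...parent };
  for (const key of LEAKY_KEYS) {
    delete child[key];
  }
  // Apply the pinned values last so a parent override cannot win.
  return { ...child, ...pinned };
}

// Sample values are made up for illustration.
const childEnv = buildChildEnv(
  { PATH: "/usr/bin", DATABASE_URL: "postgres://shared-test", GBRAIN_HOME: "/home/me/.gbrain" },
  { GBRAIN_HOME: "/tmp/claw-test-home", GBRAIN_FRICTION_RUN_ID: "run-1" },
);
console.log(childEnv);
```

Spreading the pinned values after the scrubbed copy is what makes the "re-apply after the merge" ordering hold.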
Fix: drop the beforeExit + SIGINT + SIGTERM hooks. Per [CDX-19]'s "stats are directional, not exact" contract, losing one unflushed bucket on process exit is acceptable. The unref'd setInterval handles long-running processes (HTTP MCP, autopilot, jobs work). Short-lived CLI invocations exit immediately. Verified: - `gbrain query "the"` on a fresh PGLite brain exits in <1s (was hanging forever). - `bun test test/e2e/claw-test.test.ts` → 3 pass / 0 fail / 3.86s (was hanging at the banner indefinitely). - 85/85 e2e files / 574/574 tests pass including claw-test, with DATABASE_URL set (the configuration that originally repro'd the hang). - 6235/6235 unit tests pass. - Typecheck clean. The two bugs interacted: the DATABASE_URL leak meant queries hit the real Postgres (slow), making the beforeExit deadlock visible. Fixing either alone would have masked the other. Both fixed in this commit. * feat(install-picker): cost anchors in mode prompt + upgrade banner + docs The install picker already asks explicitly (1/2/3 menu, default to the recommendation on Enter). What was missing: a way to reason about the cost tradeoff. Without numbers, "tokenmax" looks free and "conservative" sounds restrictive; with numbers, the operator picks intentionally. Cost anchors added everywhere the user encounters the mode choice: - Install picker MENU_TEXT (gbrain init) - Upgrade banner (gbrain upgrade post-upgrade) - CLAUDE.md ## Search Mode section - README.md Quick Start - docs/eval/SEARCH_MODE_METHODOLOGY.md (with the math) Anchors at Sonnet 4.6 downstream ($3/M input): conservative ~$0.012/query ~$12/mo @ 1K ~$1,200/mo @ 100K balanced ~$0.030/query ~$30/mo @ 1K ~$3,000/mo @ 100K tokenmax ~$0.060/query ~$60/mo @ 1K ~$6,000/mo @ 100K Plus tokenmax's Haiku expansion overhead: ~$1.50 per 1K queries on top. Cache hits roughly halve these on a brain with repeat-query traffic. 
The math is documented in SEARCH_MODE_METHODOLOGY.md so a reviewer can audit each variable (T = ~400 tokens/chunk from the recursive chunker's 300-word target; N = `searchLimit` cap; R = downstream model rate from src/core/anthropic-pricing.ts). Drift away from these numbers requires updating CLAUDE.md + the picker + the methodology doc in lockstep — a regression test pins the picker's anchor strings to enforce this. The framing also names the cost rule honestly: the dominant cost isn't gbrain (semantic cache is free; Haiku expansion is rounding-error). It's the downstream agent reading retrieved chunks back into its context. Operators who don't realize this pick badly. Tests: 5 new regression cases in init-mode-picker.test.ts pin every cost string in MENU_TEXT. Total 21/21 picker tests pass; 6240/6240 unit tests pass; verify gate green. * docs: realistic-scale cost anchor for search modes The per-query cost framing in the picker (~$0.012/$0.030/$0.060) is honest but theoretical — it treats each search as an isolated billable event. Real agent loops amortize a lot of context across turns via Anthropic prompt caching, so the per-query 5x ratio doesn't translate 1:1 into total agent spend. Added a "Realistic-scale anchor" section to SEARCH_MODE_METHODOLOGY.md representing one heavy power-user agent loop running tokenmax: - ~860 turns/mo (~29/day, one active agent) - ~900K tokens/turn (system + tools + history + reasoning + search) - ~$0.85/turn → ~$700/mo total agent spend at tokenmax - ~88% Anthropic prompt-cache hit rate Scaling balanced + conservative DOWN from that anchor: - tokenmax → ~$700/mo, search ~22% of total spend - balanced → ~$620/mo, search ~12% (saves ~$78/mo vs tokenmax) - conservative → ~$575/mo, search ~5% (saves ~$124/mo vs tokenmax) Honest takeaway: at realistic agent-loop scale WITH disciplined prompt caching, mode choice saves 10-20% of total agent spend, not 5x. 
The per-query math kicks back in for setups WITHOUT cache discipline (churn the prompt prefix every turn → search payload becomes a larger fraction). Both framings live in the doc. CLAUDE.md ## Search Mode gets a forward-pointer paragraph naming the "per-query math vs real-world spend" delta so agents reading the section find the methodology footnote. Numbers in the doc are anonymized + scaled away from any specific deployment. No model names, no specific dollar figures from a real production setup — just the per-turn / cache-hit-rate / search-count shape ratios that a thoughtful operator can validate against their own billing dashboard. * feat(picker): mode × model cost matrix (25x corner-to-corner spread) Previous version showed mode costs assuming Sonnet-only downstream. That muted the spread to 5x and made mode choice look minor. Reality: the downstream model tier is the BIGGER cost lever — pairing mode with model is where the 25x spread lives. New 3×3 matrix in the install picker, CLAUDE.md, methodology doc, README: Haiku 4.5 Sonnet 4.6 Opus 4.7 ($1/M input) ($3/M input) ($5/M input) conservative $400/mo $1,200/mo $2,000/mo balanced $1,000/mo $3,000/mo $5,000/mo tokenmax $2,000/mo $6,000/mo $10,000/mo (per-query cost @ 100K queries/mo, full search payload, no cache savings) The methodology doc gets a new "Mode × Model matrix" section above the realistic-scale anchor with concrete right-sizing guidance: - tokenmax + Haiku: wrong direction. Haiku can't filter 50 chunks → noise not signal. Pay Haiku rates, get sub-Haiku quality. - conservative + Opus: wasted Opus. 200K context window starved on retrieval depth. Pay Opus rates, get conservative-shape retrieval. - Natural pairings span ~4x; the matrix corners span 25x. The natural diagonal is where most users should land. 
Realistic-scale anchor refreshed: - tokenmax + Opus: ~$700/mo at 860 turns - balanced + Sonnet: ~$430/mo - conservative + Haiku: ~$170/mo Plus a "mismatched pairings" section showing the math for tokenmax+Haiku and conservative+Opus — both burn budget for no improvement. Regression test updated: pins the 25x framing + the four anchor cells (two corners + two diagonal mids) + the three downstream model rates. 22/22 picker tests pass. 6241/6241 unit tests pass. CI guards green. * docs(picker): rescale cost matrix from 100K → 10K queries/mo (typical single user) Most users running gbrain are single-user installs at ~10K queries/month, not the 100K fleet-scale used in the original matrix. The picker numbers ($400 to $10,000/mo) looked alien to the actual audience. Rescaled to 10K with an explicit linear-scaling callout. New matrix in picker, CLAUDE.md, README, methodology doc: Haiku 4.5 Sonnet 4.6 Opus 4.7 ($1/M) ($3/M) ($5/M) conservative $40/mo $120/mo $200/mo balanced $100/mo $300/mo $500/mo tokenmax $200/mo $600/mo $1,000/mo Still 25x corner-to-corner. Still 4x natural-diagonal spread. But now in numbers a single user picks up and reasons about: "balanced + Sonnet at $300/mo, that's fine" or "tokenmax + Opus at $1,000/mo, that's a deliberate choice for max-quality high-stakes work." Every surface updated: - Install picker MENU_TEXT (with "scales linearly — multiply by 10 for 100K/mo" footnote so heavier users still see their number) - CLAUDE.md ## Search Mode table + scaling prose - README Quick Start - methodology doc Mode × Model matrix section - upgrade banner (post-upgrade notice) Regression test updated: pins the 3 new anchor cells ($40, $300, $1,000) + the 10K/mo volume frame + the linear-scaling callout. 23/23 picker tests pass, 6241/6241 unit tests pass, verify gate green. Methodology doc's existing 1K/10K/100K Monthly cost breakdown tables left intact (they already show the linear scaling explicitly). 
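The matrix arithmetic can be reproduced with a short sketch. The tokens-per-chunk figure and the three model rates are taken from the commit text. The per-mode chunk counts (10 / 25 / 50) are an assumption, back-solved from the per-query anchors at $3/M, and `monthlyCostUsd` is a hypothetical name.

```typescript
// From the commit text: ~400 tokens per chunk, input rates in USD per million tokens.
const TOKENS_PER_CHUNK = 400;
const RATE_USD_PER_MTOK = { haiku: 1, sonnet: 3, opus: 5 } as const;

// ASSUMPTION: chunk caps per mode are back-solved from the per-query anchors
// ($0.012 / $0.030 / $0.060 at $3/M); they are not quoted directly in the source.
const CHUNKS_PER_QUERY = { conservative: 10, balanced: 25, tokenmax: 50 } as const;

type Mode = keyof typeof CHUNKS_PER_QUERY;
type Model = keyof typeof RATE_USD_PER_MTOK;

// Monthly cost = queries x chunks x tokens-per-chunk x rate, with no cache savings.
function monthlyCostUsd(mode: Mode, model: Model, queriesPerMonth: number): number {
  const tokens = queriesPerMonth * CHUNKS_PER_QUERY[mode] * TOKENS_PER_CHUNK;
  return (tokens * RATE_USD_PER_MTOK[model]) / 1_000_000;
}

const modes: Mode[] = ["conservative", "balanced", "tokenmax"];
const models: Model[] = ["haiku", "sonnet", "opus"];
for (const mode of modes) {
  const cells = models.map((model) => `${model}=$${monthlyCostUsd(mode, model, 10_000)}`);
  console.log(`${mode}: ${cells.join(", ")}`);
}
```

Because the formula is linear in query volume, passing 100_000 instead of 10_000 gives the earlier fleet-scale matrix, and the corner-to-corner ratio stays at 25x either way.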
* feat(picker): agent-facing install protocol + tokenmax default + [AGENT] directive DX gap: an agent installing gbrain (OpenClaw, Hermes, Codex, Cursor) ran gbrain init non-TTY, saw 2 stderr lines flash by, and silently auto-applied a default search mode. The operator never saw the cost matrix or the choice. At 25x corner-to-corner cost spread, that's surprise-spend territory. Five surfaces fixed: 1. **Auto-suggest default flipped balanced → tokenmax.** The Sonnet/unknown fallback now recommends tokenmax (preserves v0.31.x retrieval shape: expand=on, generous result set). Haiku subagent → conservative still wins (cost-sensitive signal). No-OpenAI-key → conservative still wins (vector search not possible). Heuristic reordered: Haiku check now fires BEFORE the Opus check, because a Haiku subagent loop signalling cost sensitivity should win over a default-model heuristic. 2. **gbrain init non-TTY output rebuilt.** Previously: 2 stderr lines. Now: the full 3×3 cost matrix + an explicit [AGENT] directive block telling the agent to relay the matrix to its operator before continuing. Includes a pointer to INSTALL_FOR_AGENTS.md Step 3.5 for the full protocol. 3. **gbrain upgrade banner same treatment.** Existing v0.32.3 banner now includes [AGENT] directive at the top so upgrading agents relay the matrix to their operator instead of silently accepting v0.31.x → v0.32.x default-applied behavior. 4. **INSTALL_FOR_AGENTS.md Step 3.5 NEW** with the matrix verbatim, the exact paraphrasable ask-the-user wording, and the gbrain config set commands to run after the operator picks. Plus a paragraph in the Upgrade section pointing back at Step 3.5. 5. **AGENTS.md install checklist** gets a new Step 4 ("STOP — ask the user about search mode") between init and the rest of the flow. The agent's job description now explicitly says: silent acceptance is the wrong default. 
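The reordered heuristic from surface 1 can be sketched as below. The input shape and the name `recommendMode` are hypothetical (the real `recommendModeFor` signature is not shown here), and the Opus branch returning tokenmax is an assumption, since the commit text only fixes the check order.

```typescript
type SearchMode = "conservative" | "balanced" | "tokenmax";

// Hypothetical input shape for illustration only.
interface InstallSignals {
  hasOpenAIKey: boolean;
  defaultModel?: string; // e.g. "sonnet", "opus"
  subagentModel?: string; // e.g. "haiku"
}

function recommendMode(signals: InstallSignals): SearchMode {
  const isFamily = (name: string | undefined, family: string): boolean =>
    (name ?? "").toLowerCase().includes(family);

  // No OpenAI key: vector search is not possible, so stay conservative.
  if (!signals.hasOpenAIKey) return "conservative";
  // A Haiku subagent signals cost sensitivity and is checked before Opus.
  if (isFamily(signals.subagentModel, "haiku")) return "conservative";
  // ASSUMPTION: Opus maps to tokenmax; the commit text does not state the result.
  if (isFamily(signals.defaultModel, "opus")) return "tokenmax";
  // Sonnet or unknown falls back to tokenmax (previously balanced).
  return "tokenmax";
}

console.log(recommendMode({ hasOpenAIKey: true, defaultModel: "opus", subagentModel: "haiku" }));
```

Both conservative branches return the same value, so their relative order does not change the outcome; only the Haiku-before-Opus ordering is observable.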
Tests (24/24 pass): - Updated recommendModeFor heuristic order (Haiku floor > Opus default) - New regression test: non-TTY output contains the matrix corners + [AGENT] directive + INSTALL_FOR_AGENTS.md pointer - withEnv() helper used for OPENAI_API_KEY mutation (test-isolation lint) - Default-recommendation tests updated: Sonnet / unknown → tokenmax Privacy + test-isolation gates clean. 6256/6256 unit tests pass. --------- Co-authored-by: garrytan-agents <agents@garrytan.com> Co-authored-by: Garry Tan <garrytan@gmail.com>
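For the drift-watch commit earlier in this squash, the matcher rule (trailing slash means prefix match, bare path means equality) can be sketched as follows. The pattern list is abbreviated from the commit text, and `matchesWatch` and `filterWatched` are hypothetical names rather than the module's exports.

```typescript
// Abbreviated watch-list; entries copied from the commit text.
const WATCH_PATTERNS: string[] = [
  "src/core/search/",
  "src/core/embedding.ts",
  "src/core/chunkers/",
  "src/core/ai/recipes/anthropic.ts",
  "src/core/operations.ts",
];

// Trailing slash = directory prefix match; anything else = exact path equality.
function matchesWatch(path: string, patterns: string[] = WATCH_PATTERNS): boolean {
  return patterns.some((pattern) =>
    pattern.endsWith("/") ? path.startsWith(pattern) : path === pattern,
  );
}

// Hypothetical composite: keep only the changed paths that are on the watch-list.
function filterWatched(changedPaths: string[]): string[] {
  return changedPaths.filter((path) => matchesWatch(path));
}

console.log(filterWatched(["src/core/search/hybrid.ts", "README.md", "src/core/embedding.ts"]));
```

Equality for bare paths keeps a sibling such as `src/core/embedding.ts.bak` from counting as drift.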
…garrytan#982)

Node's default maxBuffer for execFileSync is 1 MiB. On repos with 60-100K files, `git diff --name-status -M` output easily exceeds this, causing the sync process to die silently with no error in the log.

Observed at /data/brain (99K files, 62K in git ls-files): sync consistently died during the rename-detection phase at ~15% through `buildSyncManifest()`. No stack trace, no error event, just a dead process.

The fix survived 5+ full syncs on the same corpus. 100 MiB is generous but bounded. A 100K-file diff with long paths tops out around 10-20 MiB in practice.

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
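A minimal sketch of the change, assuming the call goes through Node's `execFileSync`. The wrapper `gitNameStatus`, the parser `parseNameStatus`, and the SHA arguments are hypothetical; the real call site inside `buildSyncManifest()` is not shown in this message.

```typescript
import { execFileSync } from "node:child_process";

// 100 MiB, the bound named in the commit; execFileSync defaults to 1 MiB.
const GIT_MAX_BUFFER_BYTES = 100 * 1024 * 1024;

// Hypothetical wrapper around the rename-detecting diff.
function gitNameStatus(repoRoot: string, fromSha: string, toSha: string): string {
  return execFileSync("git", ["diff", "--name-status", "-M", fromSha, toSha], {
    cwd: repoRoot,
    encoding: "utf8",
    maxBuffer: GIT_MAX_BUFFER_BYTES,
  });
}

interface Change {
  status: string;
  path: string;
  previousPath?: string;
}

// Parse name-status output: tab-separated fields, renames carry old and new paths.
function parseNameStatus(output: string): Change[] {
  const changes: Change[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const parts = line.split("\t");
    const status = parts[0] ?? "";
    const first = parts[1] ?? "";
    const second = parts[2];
    if (status.startsWith("R") && second !== undefined) {
      changes.push({ status, path: second, previousPath: first });
    } else {
      changes.push({ status, path: first });
    }
  }
  return changes;
}
```

Typical use would be `parseNameStatus(gitNameStatus(repoRoot, oldSha, newSha))`. Keeping the cap explicit rather than unbounded means a runaway diff still fails instead of exhausting memory.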
* docs(CLAUDE.md): add workflow for fork PRs from garrytan-agents

Fork PRs from non-collaborator accounts don't receive base-repo secrets on pull_request events, so CI jobs needing ANTHROPIC_API_KEY / OPENAI_API_KEY fail with empty-env auth errors. Document the move-branch-to-base-repo workflow as the narrow-scope alternative to adding the account as a collaborator or flipping the repo-wide fork-secret toggle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.33.3.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebump to v0.33.2.1

Per user direction: ship as v0.33.2.1 instead of v0.33.3.1. 0.33.2.x is unclaimed in the queue (PR garrytan#934 holds 0.33.3.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c + W3) (garrytan#934) * feat(v0.34 pre-w0): add code-retrieval eval harness for v0.34 ship gate Captures pre-v0.34 retrieval quality on the gbrain self-corpus before any code-intel work lands, so the v0.34 ship gate (precision@5 +10pp OR answered_rate +15pp on >=15/30 questions) measures real improvement rather than an after-the-fact retuned baseline. * src/eval/code-retrieval/harness.ts -- pure-function metrics (precision@k, recall@k, top-1 stability, gate evaluator) + EvalRunReport types stable across schema_version 1 * src/eval/code-retrieval/questions.json -- 30 questions across callers / callees / definition / references / blast_radius / execution_flow / cluster_membership kinds, expected_files captured against current gbrain layout * src/eval/code-retrieval/strategies.ts -- BaselineStrategy (hybridSearch) + WithCodeIntelStrategy stub (post-W3 fills in code_blast/code_flow/etc.) * src/commands/eval-code-retrieval.ts -- gbrain eval code-retrieval CLI with --baseline / --with-code-intel / --compare subcommands * test/code-retrieval-harness.test.ts -- 26 unit tests across metrics, loader, gate logic; no engine dependency PRE-V0.34 BASELINE WORKFLOW: gbrain eval code-retrieval --baseline --save /tmp/baseline-1.json (run 3x for noise floor) V0.34 SHIP GATE (after W3 lands): gbrain eval code-retrieval --with-code-intel --save /tmp/v034.json gbrain eval code-retrieval --compare /tmp/baseline-1.json /tmp/v034.json Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(v0.34 W0a): source-routing leak across query + two-pass Codex outside-voice review on the v0.34 plan caught two load-bearing sites where sourceId was advertised but never applied — multi-source brains silently cross-contaminated structural retrieval: * operations.ts ~323 — `query` op handler called hybridSearch without threading ctx.sourceId. Multi-source agents querying with a --source flag got cross-source results. 
* two-pass.ts:81 (nearSymbol lookup) and two-pass.ts:131 (unresolved edge resolution) — TwoPassOpts.sourceId was declared and threaded through hybridSearch's expandAnchors call, but the actual SQL ignored it. The walk window crossed source boundaries every time. Fix: * `query` op now reads ctx.sourceId AND accepts a new `source_id` param (with '__all__' as the explicit force-cross-source escape hatch). Per-call param wins over ctx context. * two-pass.ts both lookups join through pages.source_id when opts.sourceId is set; omitted opts.sourceId preserves the legacy cross-source contract for callers who want it. Regression test: test/e2e/source-routing.test.ts seeds two sources with the same `parseMarkdown` symbol + a cross-source caller edge. Pins: - nearSymbol + sourceId='source-a' returns ONLY source-a chunks - nearSymbol + sourceId='source-b' returns ONLY source-b chunks - nearSymbol with no sourceId still crosses sources (contract preserved) - walk_depth=1 unresolved-edge resolution stays in source-a PGLite in-memory, no DATABASE_URL needed. The fix proves out under realistic structural retrieval not just a contrived unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(v0.34 W0b): flip CLI source-scoping default to truly source-scoped Codex outside-voice review (finding #7) caught that the v0.20.0 docstring claim "by default we only match the caller's source_id" contradicted the implementation in code-callers.ts:54 + code-callees.ts:43: allSources: allSources || !sourceId The right side made `allSources` TRUE whenever `--source` was omitted, INVERTING the documented default. Multi-source brains silently cross- contaminated structural retrieval; `gbrain code-callers parseMarkdown` on a brain with two repos returned callers from both even though the docstring promised per-source scoping. Fix: * New canonical helper `resolveDefaultSource(engine)` in sources-ops.ts. 
Contract per eng review D7: - exactly 1 source registered → return its id (single-source brains, the 80% case; --source flag is unnecessary friction there) - 2+ sources → throw SourceResolutionError(multiple_sources_ambiguous) with the list of valid ids - 0 sources → throw SourceResolutionError(no_sources) * code-callers.ts + code-callees.ts now resolve to the default source when both --source AND --all-sources are absent. To get the pre-v0.34 cross-source behavior, callers must pass --all-sources explicitly. * Same hint text on both commands. Pinned by test/e2e/cli-source-scoping-pglite.test.ts. IRON RULE regression R2: docstring promise now holds. Multi-source brain running `gbrain code-callers <symbol>` without --source gets a clear error listing valid source ids instead of silent cross-resolution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W0c): within-file two-pass symbol resolver + edges_backfilled_at watermark Codex's outside-voice review caught that the v0.20.0 graph stores BARE callee tokens (`render`, `find`, `execute`) — not qualified names. Pre-v0.34 recursive blast/flow would alias every same-named function across classes. W0c is the foundation that fixes this: resolve `code_edges_symbol` rows by matching `to_symbol_qualified` against the SAME-FILE chunks' `symbol_name_qualified`, then write the outcome to `edge_metadata`. This commit is the resolver primitive + schema. The cycle-phase wiring that calls it on every quick-cycle tick lands in the next commit. Schema (v51 migration `edges_backfilled_at_v0_34`): * `content_chunks.edges_backfilled_at TIMESTAMPTZ` — resume watermark. Chunks where the column is NULL OR older than EDGE_EXTRACTOR_VERSION_TS get re-walked next tick. SIGINT/OOM/sleep mid-backfill loses at most one batch. * Indexes per D11 from eng review: - `idx_code_edges_symbol_resolver(source_id, to_symbol_qualified)` — composite for the resolver's per-source lookup. 
- `idx_content_chunks_symbol_lookup(page_id, symbol_name_qualified)` WHERE `symbol_name_qualified IS NOT NULL` — file-batched candidate fetch; also reused by W4-5 cluster recompute. - `idx_content_chunks_edges_backfill(edges_backfilled_at)` WHERE `edges_backfilled_at IS NULL` — fast unresumed-row scan. Module (`src/core/chunkers/symbol-resolver.ts`): * `resolveSymbolEdgesIncremental(engine, {sourceId, maxChunks?, onProgress?})` walks stale chunks in 200-chunk batches. For each chunk, loads its unresolved edges, finds same-page candidates by symbol_name_qualified, and writes outcome to `edge_metadata`: - exactly 1 candidate → `{resolved_chunk_id: <id>}` - 2+ candidates → `{ambiguous: true, candidates: [...]}` - 0 candidates → unchanged (cross-file; two-pass.ts handles those) Each batch bumps `edges_backfilled_at = NOW()` for the chunks. * `readEdgeResolution(metadata)` — public helper for downstream code (two-pass.ts, code_blast op, eval-capture) to consume the resolver's output without parsing JSON directly. Returns a tagged union. * `EDGE_EXTRACTOR_VERSION_TS` exported constant — bump when extractor shape changes and the next cycle re-walks all chunks. Tests (5 E2E in test/e2e/symbol-resolver-pglite.test.ts, all PGLite, no DATABASE_URL): unambiguous match, ambiguous multi-match, no match, watermark advance + idempotency, source isolation (no cross-source candidate leak). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W0c): wire resolve_symbol_edges as a new cycle phase W0c's symbol resolver lands as a 12th cycle phase between extract and patterns. The autopilot's quick-cycle path (60s watchdog interval per D2 from eng review) now resolves stale chunks incrementally so agents see resolved edges within ~60s of writes rather than waiting on the slow full-walk path. * CyclePhase + ALL_PHASES + NEEDS_LOCK_PHASES extended with 'resolve_symbol_edges'. 
Position: between extract (which emits new bare-token edges from sync diffs) and patterns (which reads the graph). Acquires the cycle lock because it writes edge_metadata. * CycleReport.totals adds edges_resolved + edges_ambiguous so doctor and autopilot summaries surface the numbers. * runPhaseResolveSymbolEdges walks every registered source via listSources() + resolveSymbolEdgesIncremental(). Per-call cap is BATCH_SIZE*10 = 2000 chunks so a single watchdog tick stays bounded even on a 100K-chunk brain. Subsequent ticks pick up the leftovers via the edges_backfilled_at watermark. * Test count bumped from 11 → 12 phases in cycle.serial.test.ts and cycle.test.ts (both pinned by the regression guards). Existing 28 cycle tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W3): MCP-expose code_callers / code_callees / code_def / code_refs Pre-v0.34 these four code-intelligence commands lived in CLI_ONLY at cli.ts:30 — agents calling gbrain via MCP couldn't reach them and fell through to text search. This commit ships the agent-facing MCP surface for v0.34 against the existing v0.20+ tree-sitter call graph; recursive blast/flow and clusters land in subsequent commits. * `code_callers(symbol, [limit, source_id, all_sources])` — wraps engine.getCallersOf. Reverse view of the A1 call graph. * `code_callees(symbol, [limit, source_id, all_sources])` — wraps engine.getCalleesOf. Forward view. * `code_def(symbol, [limit, lang])` — wraps findCodeDef. Returns definition sites with file/line/snippet. * `code_refs(symbol, [limit, lang])` — wraps findCodeRefs. Returns every reference (comments, strings, imports, call sites). All four are scope:'read', source-scoped by default via ctx.sourceId (W0a contract). Per-call source_id param wins over ctx; pass '__all__' or all_sources=true to force cross-source. 
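The precedence described here (per-call param over ctx, with '__all__' or all_sources=true forcing cross-source) can be sketched as a small pure function. The types and the name `resolveSourceScope` are hypothetical. Letting the cross-source flags win when both are supplied is an assumption, and what an op does with the "unset" case is left open.

```typescript
const ALL_SOURCES = "__all__";

// Hypothetical shapes; only the field names source_id / all_sources / sourceId
// come from the commit text.
interface CallParams {
  source_id?: string;
  all_sources?: boolean;
}
interface OpContext {
  sourceId?: string;
}

type SourceScope =
  | { kind: "all" }
  | { kind: "source"; sourceId: string }
  | { kind: "unset" };

function resolveSourceScope(params: CallParams, ctx: OpContext): SourceScope {
  // ASSUMPTION: an explicit cross-source request beats any specific source id.
  if (params.all_sources === true || params.source_id === ALL_SOURCES) {
    return { kind: "all" };
  }
  // A per-call source_id wins over the context's source.
  if (params.source_id) return { kind: "source", sourceId: params.source_id };
  if (ctx.sourceId) return { kind: "source", sourceId: ctx.sourceId };
  return { kind: "unset" };
}

console.log(resolveSourceScope({ source_id: "source-b" }, { sourceId: "source-a" }));
```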
* operations-descriptions.ts: 4 new constants per the eng review D10 finding — every description carries an inline example response so agents don't burn first-call context discovering shape. Resolver-grade wording ("BEFORE editing any function, run code_callers...") routes plan-mode questions straight to the right op. * SEARCH_DESCRIPTION gains a cross-link clause pointing at the four new ops so agents stop falling through to text search for code-symbol questions. Tests (11 E2E in test/e2e/code-intel-mcp-ops-pglite.test.ts): - All four ops registered + scope:read + description pinned by constant - All four ops have required symbol param - code_callers / code_callees return the documented envelope shape - Source scoping honors ctx.sourceId - all_sources=true / source_id='__all__' force cross-source - code_def returns the def-site snippet Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v0.33.0): agent-readable migration doc for the code-intel foundation skills/migrations/v0.33.0.md gives existing-user upgrade guidance for the v0.33.0 foundation pre-release (this branch's accumulated work toward v0.34 Cathedral III): * Source-routing fix (Codex #2) — query / two-pass now honor sourceId * CLI source-scoping default flipped (Codex #7) — gbrain code-callers defaults to source-scoped, --all-sources is the explicit opt-out * MCP exposure of code-callers / code-callees / code-def / code-refs with resolver-grade descriptions agents auto-route to * Within-file symbol resolver runs as a new `resolve_symbol_edges` cycle phase between extract and patterns * Schema migration v51: edges_backfilled_at watermark + 3 composite/ partial indexes for the resolver hot path * Verification commands the agent runs after `gbrain upgrade` Bumps the existing-user migration ladder so the auto-update agent (SKILLPACK Section 17) discovers + runs the v0.33.0 migration steps. 
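The CLI default-source contract summarized in this migration note (one source: use it; several: error with the valid ids; none: error) can be sketched as below. This is an illustration under stated assumptions: the real `resolveDefaultSource(engine)` reads sources from the engine, while the hypothetical `pickDefaultSource` takes the id list directly, and the error class here only mirrors the two codes named in the commit text.

```typescript
type SourceResolutionCode = "multiple_sources_ambiguous" | "no_sources";

// Mirrors the error codes named in the commit; fields are illustrative.
class SourceResolutionError extends Error {
  code: SourceResolutionCode;
  validIds: string[];

  constructor(code: SourceResolutionCode, validIds: string[] = []) {
    super(validIds.length > 0 ? `${code}: valid ids are ${validIds.join(", ")}` : code);
    this.name = "SourceResolutionError";
    this.code = code;
    this.validIds = validIds;
  }
}

// Hypothetical pure variant of the helper: decides from a list of registered ids.
function pickDefaultSource(sourceIds: string[]): string {
  const [first, ...rest] = sourceIds;
  if (first === undefined) {
    throw new SourceResolutionError("no_sources");
  }
  if (rest.length > 0) {
    // Ambiguous: surface every valid id so the caller can pass --source.
    throw new SourceResolutionError("multiple_sources_ambiguous", sourceIds);
  }
  return first;
}

console.log(pickDefaultSource(["default"]));
```

Throwing on ambiguity, rather than silently picking one source or all of them, is what keeps the docstring promise about per-source scoping true on multi-source brains.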
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(v0.33.0): bump VERSION + package.json + CHANGELOG v0.33.0 ships the v0.34 Cathedral III foundation: MCP exposure of code_callers / code_callees / code_def / code_refs with resolver-grade tool descriptions, plus the source-routing fix + within-file symbol resolver + cycle-phase wiring that v0.34's recursive blast/flow and Leiden clusters will build on. Full release notes in CHANGELOG.md. Trio in lockstep: VERSION: 0.33.0 package.json: 0.33.0 CHANGELOG.md: ## [0.33.0] - 2026-05-11 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.33.0): update dream-cycle phase-order assertions for resolve_symbol_edges E2E test pinned the canonical phase sequence as a regression guard. The v0.33.0 resolve_symbol_edges phase (added between extract and patterns) correctly bumps the count to 12 — caught by the canonical-order test on fresh-Postgres run, fixed by adding the new phase to EXPECTED_PHASES and bumping the version history comment. Both cycle.serial.test.ts and cycle.test.ts were already updated in the W0c cycle-phase commit (6f7dbe1); this third pin lives in test/e2e/dream-cycle-phase-order-pglite.test.ts and was missed. Full E2E suite now: 550 passed / 0 failed / 81 files (real Postgres on port 5435 via Docker pgvector/pgvector:pg16). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(v0.33.3.0): rebump from v0.33.2.0 → v0.33.3.0 User asked to ship as v0.33.3.0 instead of v0.33.2.0. 
Single sweep: * VERSION + package.json bumped to 0.33.3.0 * CHANGELOG header + body rewritten to v0.33.3 * skills/migrations/v0.33.0.md → skills/migrations/v0.33.3.0.md (migration files use the version they ship FROM; renaming aligns with the v0.21.0.md / v0.31.0.md convention in CLAUDE.md) * Schema migration name edges_backfilled_at_v0_33_2 → edges_backfilled_at_v0_33_3 in src/core/migrate.ts (also bumps the in-code identifier so the registry name matches the version) * All v0.33.2 comment references swept to v0.33.3 in cycle.ts, operations.ts, operations-descriptions.ts, eval.ts, symbol-resolver.ts + cycle test phase-history comments * llms.txt + llms-full.txt regenerated Trio verified: VERSION: 0.33.3.0 package.json: 0.33.3.0 CHANGELOG.md: ## [0.33.3.0] - 2026-05-12 bun run verify clean; 90 v0.33.3-touched tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…clusters + eval gate (garrytan#994) * feat(v0.34 pre-w0): add code-retrieval eval harness for v0.34 ship gate Captures pre-v0.34 retrieval quality on the gbrain self-corpus before any code-intel work lands, so the v0.34 ship gate (precision@5 +10pp OR answered_rate +15pp on >=15/30 questions) measures real improvement rather than an after-the-fact retuned baseline. * src/eval/code-retrieval/harness.ts -- pure-function metrics (precision@k, recall@k, top-1 stability, gate evaluator) + EvalRunReport types stable across schema_version 1 * src/eval/code-retrieval/questions.json -- 30 questions across callers / callees / definition / references / blast_radius / execution_flow / cluster_membership kinds, expected_files captured against current gbrain layout * src/eval/code-retrieval/strategies.ts -- BaselineStrategy (hybridSearch) + WithCodeIntelStrategy stub (post-W3 fills in code_blast/code_flow/etc.) * src/commands/eval-code-retrieval.ts -- gbrain eval code-retrieval CLI with --baseline / --with-code-intel / --compare subcommands * test/code-retrieval-harness.test.ts -- 26 unit tests across metrics, loader, gate logic; no engine dependency PRE-V0.34 BASELINE WORKFLOW: gbrain eval code-retrieval --baseline --save /tmp/baseline-1.json (run 3x for noise floor) V0.34 SHIP GATE (after W3 lands): gbrain eval code-retrieval --with-code-intel --save /tmp/v034.json gbrain eval code-retrieval --compare /tmp/baseline-1.json /tmp/v034.json Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(v0.34 W0a): source-routing leak across query + two-pass Codex outside-voice review on the v0.34 plan caught two load-bearing sites where sourceId was advertised but never applied — multi-source brains silently cross-contaminated structural retrieval: * operations.ts ~323 — `query` op handler called hybridSearch without threading ctx.sourceId. Multi-source agents querying with a --source flag got cross-source results. 
* two-pass.ts:81 (nearSymbol lookup) and two-pass.ts:131 (unresolved edge resolution) -- TwoPassOpts.sourceId was declared and threaded through hybridSearch's expandAnchors call, but the actual SQL ignored it. The walk window crossed source boundaries every time.

Fix:
* `query` op now reads ctx.sourceId AND accepts a new `source_id` param (with '__all__' as the explicit force-cross-source escape hatch). The per-call param wins over ctx.
* two-pass.ts: both lookups join through pages.source_id when opts.sourceId is set; omitting opts.sourceId preserves the legacy cross-source contract for callers who want it.

Regression test: test/e2e/source-routing.test.ts seeds two sources with the same `parseMarkdown` symbol + a cross-source caller edge. Pins:
- nearSymbol + sourceId='source-a' returns ONLY source-a chunks
- nearSymbol + sourceId='source-b' returns ONLY source-b chunks
- nearSymbol with no sourceId still crosses sources (contract preserved)
- walk_depth=1 unresolved-edge resolution stays in source-a

PGLite in-memory, no DATABASE_URL needed. The fix proves out under realistic structural retrieval, not just a contrived unit test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v0.34 W0b): flip CLI source-scoping default to truly source-scoped

Codex outside-voice review (finding #7) caught that the v0.20.0 docstring claim "by default we only match the caller's source_id" contradicted the implementation in code-callers.ts:54 + code-callees.ts:43:

    allSources: allSources || !sourceId

The right side made `allSources` TRUE whenever `--source` was omitted, INVERTING the documented default. Multi-source brains silently cross-contaminated structural retrieval; `gbrain code-callers parseMarkdown` on a brain with two repos returned callers from both even though the docstring promised per-source scoping.

Fix:
* New canonical helper `resolveDefaultSource(engine)` in sources-ops.ts.
Contract per eng review D7: - exactly 1 source registered → return its id (single-source brains, the 80% case; --source flag is unnecessary friction there) - 2+ sources → throw SourceResolutionError(multiple_sources_ambiguous) with the list of valid ids - 0 sources → throw SourceResolutionError(no_sources) * code-callers.ts + code-callees.ts now resolve to the default source when both --source AND --all-sources are absent. To get the pre-v0.34 cross-source behavior, callers must pass --all-sources explicitly. * Same hint text on both commands. Pinned by test/e2e/cli-source-scoping-pglite.test.ts. IRON RULE regression R2: docstring promise now holds. Multi-source brain running `gbrain code-callers <symbol>` without --source gets a clear error listing valid source ids instead of silent cross-resolution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W0c): within-file two-pass symbol resolver + edges_backfilled_at watermark Codex's outside-voice review caught that the v0.20.0 graph stores BARE callee tokens (`render`, `find`, `execute`) — not qualified names. Pre-v0.34 recursive blast/flow would alias every same-named function across classes. W0c is the foundation that fixes this: resolve `code_edges_symbol` rows by matching `to_symbol_qualified` against the SAME-FILE chunks' `symbol_name_qualified`, then write the outcome to `edge_metadata`. This commit is the resolver primitive + schema. The cycle-phase wiring that calls it on every quick-cycle tick lands in the next commit. Schema (v51 migration `edges_backfilled_at_v0_34`): * `content_chunks.edges_backfilled_at TIMESTAMPTZ` — resume watermark. Chunks where the column is NULL OR older than EDGE_EXTRACTOR_VERSION_TS get re-walked next tick. SIGINT/OOM/sleep mid-backfill loses at most one batch. * Indexes per D11 from eng review: - `idx_code_edges_symbol_resolver(source_id, to_symbol_qualified)` — composite for the resolver's per-source lookup. 
- `idx_content_chunks_symbol_lookup(page_id, symbol_name_qualified)` WHERE `symbol_name_qualified IS NOT NULL` — file-batched candidate fetch; also reused by W4-5 cluster recompute. - `idx_content_chunks_edges_backfill(edges_backfilled_at)` WHERE `edges_backfilled_at IS NULL` — fast unresumed-row scan. Module (`src/core/chunkers/symbol-resolver.ts`): * `resolveSymbolEdgesIncremental(engine, {sourceId, maxChunks?, onProgress?})` walks stale chunks in 200-chunk batches. For each chunk, loads its unresolved edges, finds same-page candidates by symbol_name_qualified, and writes outcome to `edge_metadata`: - exactly 1 candidate → `{resolved_chunk_id: <id>}` - 2+ candidates → `{ambiguous: true, candidates: [...]}` - 0 candidates → unchanged (cross-file; two-pass.ts handles those) Each batch bumps `edges_backfilled_at = NOW()` for the chunks. * `readEdgeResolution(metadata)` — public helper for downstream code (two-pass.ts, code_blast op, eval-capture) to consume the resolver's output without parsing JSON directly. Returns a tagged union. * `EDGE_EXTRACTOR_VERSION_TS` exported constant — bump when extractor shape changes and the next cycle re-walks all chunks. Tests (5 E2E in test/e2e/symbol-resolver-pglite.test.ts, all PGLite, no DATABASE_URL): unambiguous match, ambiguous multi-match, no match, watermark advance + idempotency, source isolation (no cross-source candidate leak). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W0c): wire resolve_symbol_edges as a new cycle phase W0c's symbol resolver lands as a 12th cycle phase between extract and patterns. The autopilot's quick-cycle path (60s watchdog interval per D2 from eng review) now resolves stale chunks incrementally so agents see resolved edges within ~60s of writes rather than waiting on the slow full-walk path. * CyclePhase + ALL_PHASES + NEEDS_LOCK_PHASES extended with 'resolve_symbol_edges'. 
Position: between extract (which emits new bare-token edges from sync diffs) and patterns (which reads the graph). Acquires the cycle lock because it writes edge_metadata. * CycleReport.totals adds edges_resolved + edges_ambiguous so doctor and autopilot summaries surface the numbers. * runPhaseResolveSymbolEdges walks every registered source via listSources() + resolveSymbolEdgesIncremental(). Per-call cap is BATCH_SIZE*10 = 2000 chunks so a single watchdog tick stays bounded even on a 100K-chunk brain. Subsequent ticks pick up the leftovers via the edges_backfilled_at watermark. * Test count bumped from 11 → 12 phases in cycle.serial.test.ts and cycle.test.ts (both pinned by the regression guards). Existing 28 cycle tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W3): MCP-expose code_callers / code_callees / code_def / code_refs Pre-v0.34 these four code-intelligence commands lived in CLI_ONLY at cli.ts:30 — agents calling gbrain via MCP couldn't reach them and fell through to text search. This commit ships the agent-facing MCP surface for v0.34 against the existing v0.20+ tree-sitter call graph; recursive blast/flow and clusters land in subsequent commits. * `code_callers(symbol, [limit, source_id, all_sources])` — wraps engine.getCallersOf. Reverse view of the A1 call graph. * `code_callees(symbol, [limit, source_id, all_sources])` — wraps engine.getCalleesOf. Forward view. * `code_def(symbol, [limit, lang])` — wraps findCodeDef. Returns definition sites with file/line/snippet. * `code_refs(symbol, [limit, lang])` — wraps findCodeRefs. Returns every reference (comments, strings, imports, call sites). All four are scope:'read', source-scoped by default via ctx.sourceId (W0a contract). Per-call source_id param wins over ctx; pass '__all__' or all_sources=true to force cross-source. 
* operations-descriptions.ts: 4 new constants per the eng review D10 finding -- every description carries an inline example response so agents don't burn first-call context discovering shape. Resolver-grade wording ("BEFORE editing any function, run code_callers...") routes plan-mode questions straight to the right op.
* SEARCH_DESCRIPTION gains a cross-link clause pointing at the four new ops so agents stop falling through to text search for code-symbol questions.

Tests (11 E2E in test/e2e/code-intel-mcp-ops-pglite.test.ts):
- All four ops registered + scope:read + description pinned by constant
- All four ops have required symbol param
- code_callers / code_callees return the documented envelope shape
- Source scoping honors ctx.sourceId
- all_sources=true / source_id='__all__' force cross-source
- code_def returns the def-site snippet

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v0.33.0): agent-readable migration doc for the code-intel foundation

skills/migrations/v0.33.0.md gives existing-user upgrade guidance for the v0.33.0 foundation pre-release (this branch's accumulated work toward v0.34 Cathedral III):
* Source-routing fix (Codex #2) -- query / two-pass now honor sourceId
* CLI source-scoping default flipped (Codex #7) -- gbrain code-callers defaults to source-scoped, --all-sources is the explicit opt-out
* MCP exposure of code-callers / code-callees / code-def / code-refs with resolver-grade descriptions agents auto-route to
* Within-file symbol resolver runs as a new `resolve_symbol_edges` cycle phase between extract and patterns
* Schema migration v51: edges_backfilled_at watermark + 3 composite/partial indexes for the resolver hot path
* Verification commands the agent runs after `gbrain upgrade`

Bumps the existing-user migration ladder so the auto-update agent (SKILLPACK Section 17) discovers + runs the v0.33.0 migration steps.
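A minimal sketch of the source-scoping default described in the W0b commit above (one registered source resolves implicitly, zero or several fail loudly). The real `resolveDefaultSource(engine)` takes an engine; this stand-in takes a plain id list, and the error shape, `ScopeFlags`, `resolveScope`, and the choice to let `--all-sources` win when both flags are passed are all assumptions made for illustration.

```typescript
// Error codes are the two named in the commit message; the class shape is hypothetical.
type SourceResolutionCode = "no_sources" | "multiple_sources_ambiguous";

class SourceResolutionError extends Error {
  code: SourceResolutionCode;
  validIds: string[];

  constructor(code: SourceResolutionCode, validIds: string[] = []) {
    super(validIds.length > 0 ? `${code}: valid source ids are ${validIds.join(", ")}` : code);
    this.name = "SourceResolutionError";
    this.code = code;
    this.validIds = validIds;
  }
}

// Stand-in for resolveDefaultSource(engine): works on ids instead of querying an engine.
function resolveDefaultSourceFromIds(sourceIds: readonly string[]): string {
  const [first, ...rest] = sourceIds;
  if (first === undefined) {
    throw new SourceResolutionError("no_sources");
  }
  if (rest.length > 0) {
    throw new SourceResolutionError("multiple_sources_ambiguous", [...sourceIds]);
  }
  return first; // exactly one source registered
}

// Hypothetical flag bag for a command such as `gbrain code-callers`.
interface ScopeFlags {
  source?: string;
  allSources?: boolean;
}

type Scope = { allSources: true } | { allSources: false; sourceId: string };

function resolveScope(flags: ScopeFlags, registered: readonly string[]): Scope {
  // Assumption: an explicit --all-sources wins if both flags are present.
  if (flags.allSources === true) return { allSources: true };
  if (flags.source !== undefined) return { allSources: false, sourceId: flags.source };
  // Neither flag given: fall back to the default source, or fail with the valid ids.
  return { allSources: false, sourceId: resolveDefaultSourceFromIds(registered) };
}
```

With this shape, omitting both flags can never widen the scope silently: the call either resolves to the single registered source or throws with the list of ids to choose from.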
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(v0.33.0): bump VERSION + package.json + CHANGELOG v0.33.0 ships the v0.34 Cathedral III foundation: MCP exposure of code_callers / code_callees / code_def / code_refs with resolver-grade tool descriptions, plus the source-routing fix + within-file symbol resolver + cycle-phase wiring that v0.34's recursive blast/flow and Leiden clusters will build on. Full release notes in CHANGELOG.md. Trio in lockstep: VERSION: 0.33.0 package.json: 0.33.0 CHANGELOG.md: ## [0.33.0] - 2026-05-11 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.33.0): update dream-cycle phase-order assertions for resolve_symbol_edges E2E test pinned the canonical phase sequence as a regression guard. The v0.33.0 resolve_symbol_edges phase (added between extract and patterns) correctly bumps the count to 12 — caught by the canonical-order test on fresh-Postgres run, fixed by adding the new phase to EXPECTED_PHASES and bumping the version history comment. Both cycle.serial.test.ts and cycle.test.ts were already updated in the W0c cycle-phase commit (6f7dbe1); this third pin lives in test/e2e/dream-cycle-phase-order-pglite.test.ts and was missed. Full E2E suite now: 550 passed / 0 failed / 81 files (real Postgres on port 5435 via Docker pgvector/pgvector:pg16). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 STEP 0): promote OperationContext.sourceId to REQUIRED (D4) Flip src/core/operations.ts:350 `sourceId?: string` → `sourceId: string`. Mirrors v0.26.9 `remote` REQUIRED pattern that closed the HTTP RCE class — the compiler is the first defense against any v0.34 code-intel op forgetting to thread sourceId and silently cross-contaminating retrieval across sources. - src/mcp/dispatch.ts: buildOperationContext auto-fills 'default' when opts.sourceId is undefined. 
Single-source brains (~80% of installs) keep working with no caller change; multi-source brains pass sourceId explicitly via dispatch opts. - src/cli.ts:makeContext: always populates sourceId via the existing resolveSourceId() 6-tier chain, falling back to 'default' on fresh/pre-init brains where the sources table doesn't exist yet. - src/commands/book-mirror.ts, src/core/minions/tools/brain-allowlist.ts: Two production context-builders that previously omitted sourceId. Both now pass sourceId: 'default' (operator-trust path, single-source by design). - 10 test/* files: every OperationContext literal now passes sourceId. test/operation-context-sourceid-required.test.ts: paired contract test (6 cases) pinning the type contract. @ts-expect-error directives on omitted-sourceId / undefined-sourceId guard against future regression; runtime tests verify buildOperationContext's auto-fill safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W1): receiver-type resolution at edge-extraction time The edge-extractor emits qualified callee names (Class::method, module::method) for the 3 MUST-resolve patterns from the design doc when running against JS/TS/TSX + Python source: 1. `import { x } from 'y'; x.method()` → emit `y::method` 2. `class C { m() { this.m() } }` → emit `C::m` 3. `const c = new C(); c.m()` → emit `C::m` When the receiver can't be resolved within WALK_DEPTH_CAP (32) ancestor hops of the call site, falls back to bare-token emit (pre-W1 behavior). Ambiguous-but-named-correctly beats wrong-but-confident; the symbol resolver's second pass still gets a chance to disambiguate via same-page symbol_name_qualified lookups. Per D18 from eng review — only JS/TS/TSX + Python get receiver resolution. Ruby/Go/Rust/Java keep pre-W1 bare-token emit semantics. RECEIVER_RESOLUTION_LANGS pins the eligible set. Per D12 from eng review — WALK_DEPTH_CAP=32 covers any realistic code shape; JSX-in-JSX or closure chains rarely exceed depth-20. 
The cap prevents one pathological file from multiplying cycle cost across the whole brain on every dream run. - src/core/chunkers/edge-extractor.ts: new `resolveReceiverType` helper + WALK_DEPTH_CAP export + RECEIVER_RESOLUTION_LANGS set. extractCallEdges attempts resolution on every member-call emit; falls back on miss. - src/core/chunkers/symbol-resolver.ts: EDGE_EXTRACTOR_VERSION_TS bumped to 2026-05-14 so the next dream cycle re-walks every chunk and lets the resolver pick up qualified-name matches. test/code-intel/scope-walker-resolution.test.ts: 10 hermetic snapshot tests covering all 3 MUST patterns + bare-call fallback + unresolvable member call. Tests load tree-sitter WASMs on demand and short-circuit when grammars are unavailable in the test runtime. Scope reduction from the original plan: the .scm pattern-file architecture envisioned by the design doc is deferred to v0.34.1. The codebase doesn't use tree-sitter's Query API anywhere today; introducing it across chunkers/scope/patterns/* is a multi-day investment that duplicates the manual-AST-walker idiom edge-extractor.ts already uses. This commit ships the same functional outcome (qualified names for the 3 MUST patterns + depth cap + honest language scope) via the existing idiom; v0.34.1 can refactor to .scm files if/when query-API benefits materialize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W2): edge densification — imports + references edge types Edge extractor now emits three edge kinds: - calls (v0.20 baseline; v0.34 W1 added qualified-name receiver resolution for JS/TS/TSX + Python) - imports (NEW in v0.34 W2; JS/TS/TSX + Python at depth) - references (NEW in v0.34 W2; TS-only) Why this matters: Leiden clusters on a calls-only graph produce overfit garbage (GitNexus showed 0.052 cluster/node on calls-only — useless). Adding imports + references densifies the graph so W4-5's clusters can land meaningful communities. Per design doc Constraint #1. 
- src/core/chunkers/edge-extractor.ts: new extractImportEdges and extractReferenceEdges functions + combined extractAllEdges wrapper. ExtractedEdge.edgeType widened to 'calls' | 'imports' | 'references'.
- src/core/chunkers/code.ts: switched the chunker's edge-extraction call site from extractCallEdges to extractAllEdges so imports + references flow into code_edges_symbol alongside calls.
- src/core/chunkers/symbol-resolver.ts: EDGE_EXTRACTOR_VERSION_TS bumped to 2026-05-14T01:00:00Z so the next dream cycle re-walks every chunk.

Language scope per D18 from eng review:
- JS/TS/TSX: imports + references emitted
- Python: imports emitted, references skipped (Python type hints too sparse for v0.34; v0.35 may revisit)
- Ruby/Go/Rust/Java: calls only -- no imports, no references. Honest coverage matrix; code_blast/code_flow return an 'unsupported_language' response for these langs (W2 commit 4 wires this).

Edge schema reused: code_edges_symbol.edge_type is the existing TEXT column covered by the unique constraint (from_chunk_id, to_symbol_qualified, edge_type). Adding new types doesn't conflict with existing calls edges.

test/code-intel/edge-densification.test.ts: 13 hermetic tests covering named/default/namespace/aliased/side-effect imports for JS/TS, from-x-import-y + import-pkg for Python, function parameter + return type references for TS, and the unsupported-language returns-empty contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.34 W3b): code_traversal_cache table, module, and clear admin op

Schema migration v56 (code_traversal_cache_v0_34):
- new table: code_traversal_cache (id, symbol_qualified, depth, source_id, response_json JSONB, max_chunk_updated_at, xmin_max, cluster_generation, computed_at)
- unique index on (symbol_qualified, depth, source_id)
- secondary index on source_id for cheap source-scoped clears

D3 -- generation-counter cache invalidation.
cluster_generation is a BIGINT column on every cache row; bumped once per recompute_code_clusters phase via bumpClusterGeneration(). Cache rows referencing stale generations naturally miss on read. Eliminates the bug class where cluster recompute leaves stale cache entries that reference dropped or renamed clusters. D8 — destructive-guard parity. clearTraversalCache requires either source_id OR all_sources=true. Without either it throws. Mirrors v0.26.5 destructive-guard pattern; the MCP op (code_traversal_cache_clear, scope: admin, localOnly: true) inherits the gate. - src/core/code-intel/traversal-cache.ts: cache module with public API - getClusterGeneration / bumpClusterGeneration (config-backed counter) - getCachedTraversal / putCachedTraversal (low-level read/write) - getCachedOrCompute (try-cache-then-compute wrapper for W3 ops) - clearTraversalCache (admin clear with source-scope gate) - src/core/operations.ts: code_traversal_cache_clear op registered with scope: 'admin' + localOnly: true. Dry-run aware; resolves source_id from params or ctx. v0.34.0.0 scope: cache writes use xmin_max=0 sentinel (no snapshot isolation). REPEATABLE READ + xmin_max snapshot isolation + PGLite serialization_failure retry is wired in the module but disabled by default; v0.34.1 enables it once W3 ops produce enough load to justify the correctness gain. Under low-write workloads (the common case for an agent's plan-mode session, 5-15 blast calls without concurrent sync), the cache stays correctness-safe via the cluster_generation invalidation + the natural UPSERT on conflict. test/code-intel/traversal-cache.test.ts: 13 hermetic PGLite tests covering cache hit/miss, D3 generation-counter invalidation, UPSERT replacement, source-scoped + all-sources clear paths, and getCachedOrCompute try-cache-then-compute happy path. 
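The D3 generation-counter invalidation and the D8 clear guard described above can be sketched with an in-memory map. This is an illustration under stated assumptions, not the module in src/core/code-intel/traversal-cache.ts: the class name `TraversalCacheSketch`, the synchronous `compute` callback, and the `Map`-backed storage are hypothetical stand-ins for the table-backed, config-counter implementation the commit describes.

```typescript
// One cached traversal, stamped with the generation it was computed under.
interface CacheRow<T> {
  sourceId: string;
  response: T;
  clusterGeneration: number;
}

class TraversalCacheSketch<T> {
  private rows = new Map<string, CacheRow<T>>();
  private generation = 0;

  // Called once per cluster recompute; every older row then misses on read.
  bumpClusterGeneration(): number {
    this.generation += 1;
    return this.generation;
  }

  // Try the cache first; on a miss or a stale generation, compute and overwrite.
  getCachedOrCompute(symbol: string, depth: number, sourceId: string, compute: () => T): T {
    const key = JSON.stringify([symbol, depth, sourceId]); // mirrors the unique index columns
    const row = this.rows.get(key);
    if (row !== undefined && row.clusterGeneration === this.generation) {
      return row.response;
    }
    const response = compute();
    // Same key replaces the old row, like an UPSERT on conflict.
    this.rows.set(key, { sourceId, response, clusterGeneration: this.generation });
    return response;
  }

  // Destructive guard: refuse to clear unless a scope is stated explicitly.
  clear(opts: { sourceId?: string; allSources?: boolean }): number {
    if (opts.allSources !== true && opts.sourceId === undefined) {
      throw new Error("clear requires sourceId or allSources=true");
    }
    let removed = 0;
    for (const [key, row] of Array.from(this.rows.entries())) {
      if (opts.allSources === true || row.sourceId === opts.sourceId) {
        this.rows.delete(key);
        removed += 1;
      }
    }
    return removed;
  }
}
```

The point of the counter is that invalidation costs one increment rather than a scan: stale rows are never deleted at recompute time, they simply stop matching and get overwritten on the next read.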
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W3): code_blast + code_flow recursive ops + sinks Recursive caller (code_blast) + recursive callee (code_flow) walks land as first-class MCP ops. The user-facing payoff for v0.34: v0.33.3 shipped flat callers/callees; v0.34 ships depth-grouped recursive walks with cycle detection, truncation flags, freshness reporting, sink tagging on terminal nodes, and bare-name disambiguation with did_you_mean suggestions. - src/core/code-intel/recursive-walk.ts: BFS over existing engine single-hop methods (getCallersOf, getCalleesOf). Depth-grouped output; confidence = clamp(1 / (1 + 0.3 * depth), 0.05, 1.0). Cycle detection via visited-set; truncation enum captures both depth_cap and max_nodes exhaustion. Source-scoped per D4 sourceId REQUIRED. - src/core/code-intel/sinks/{ts,py,index}.ts: per-language sink patterns as TypeScript constants (D9 — auditable literal-string + glob; NOT regex). Pattern cache hits warm after first match per process. TS_SINKS covers fetch, axios.*, fs.*, Bun.*, execSync, spawnSync; PY_SINKS covers requests.*, urllib.*, subprocess.*, open, pathlib.*. - src/core/operations.ts: code_blast + code_flow registered with scope: 'read'. Both wrap their walks through getCachedOrCompute (W3b) so repeat blasts in a plan-mode session hit cache. depth + max_nodes hard-capped at handler entry per design doc Constraints. exact: true skips bare-name disambiguation. Response envelope (shared): { result: 'ok' | 'not_found' | 'ambiguous' | 'unsupported_language', depth_groups?, cycles_detected?, truncation?, freshness?, did_you_mean?, candidates?, supported? } code_flow adds: terminal_nodes: [{symbol, sink_kind}] where sink_kind ∈ 'db_call' | 'http_call' | 'file_io' | 'process_exec' | 'unknown' Per D18 from eng review — only JS/TS/TSX + Python get walks. 
Other languages return {result: 'unsupported_language', supported: ['ts', 'tsx', 'js', 'py']} cleanly rather than aliasing same-named callees.

test/code-intel/recursive-walk.test.ts: 11 hermetic PGLite tests:
- 7 sinks classifier cases (http_call, file_io, db_call, process_exec for TS + Python, unknown for made-up symbol, unknown for ruby lang)
- not_found returns did_you_mean
- happy-path: caller chain emerges in depth_groups; confidence ~0.77 at depth 1
- truncation: depth_cap fires when walk exceeds depth
- sink-tagging: fetch lands in terminal_nodes with http_call kind

v0.34.0.0 scope reductions: stdio rate limiter at dispatch.ts and CLI wrappers (gbrain blast / gbrain flow) deferred -- the ops are MCP-reachable today and the W8 release packaging step adds CLI thin-shims. The eng-review's stdio limiter at dispatch.ts (D10) is queued behind the eval gate run; concurrent code-intel load needed to justify it hasn't materialized at v0.34.0.0 ship time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.34 W6): gbrain edges-backfill CLI

Operator escape hatch for the symbol-resolution backfill chain. Thin wrapper over resolveSymbolEdgesIncremental that takes explicit --source / --all-sources / --max-chunks flags.

Resumable via the edges_backfilled_at watermark (W0c). Per-batch transactions commit, so Ctrl-C leaves a clean resumable state. A re-run picks up where the prior invocation stopped.

Usage:
    gbrain edges-backfill                  # default source
    gbrain edges-backfill --source <id>    # specific source
    gbrain edges-backfill --all-sources    # every registered source
    gbrain edges-backfill --json           # machine-readable output

Wired into src/cli.ts CLI_ONLY + dispatch table.

Scope reduction from the original plan: gbrain wiki (the zero-LLM cluster aggregator) is deferred to v0.34.1 alongside W4-5 clusters -- without clusters, the wiki aggregator has nothing to aggregate.
gbrain upgrade backfill prompt is also deferred to v0.34.1; v0.34.0.0's upgrade chain runs apply-migrations only, and users who want to materialize the new W1/W2 edge shapes invoke gbrain edges-backfill manually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.34 W7): per-op graph-traversal metrics module src/core/eval-capture-graph.ts — pure-function metrics module for comparing code_blast / code_flow / code_cluster_get result shapes across two runs (eval-replay's regression check). Per Codex finding #3 from the plan-review: page-slug Jaccard is the wrong metric for graph traversal. v0.34 W7 ships proper per-op metrics: - nodeSetJaccard(a, b): set Jaccard over (file, line, symbol) tuples. Right metric for code_blast/code_flow node sets. - depthGroupStability(a, b): 1 - (displaced / |union|). Catches the case where node membership is identical but nodes moved between depth buckets between runs. - truncationMatch(a, b): boolean match on the truncation enum. Discrete signal that pairs with Jaccard. - adjustedRandIndex(a, b): cluster-membership stability via ARI for code_cluster_get. v0.34.1 consumer; lands in W7 alongside the rest so the cluster-replay path is ready when clusters ship. - compareCodeWalk(a, b): convenience wrapper returning {jaccard, depth_stability, truncation_match} in one call. Hermetic — no engine, no DB, fully unit-testable. 20 test cases covering identical / disjoint / partial-overlap / empty / dedup / file+line-distinguished, depth-bucket reshuffles, truncation-enum matching, ARI identical-clustering recognition through label-rename, ARI singleton-vs-all-one expected-zero, equal-length contract, and combined compareCodeWalk envelope. Scope reduction from the original plan: extending src/core/eval-capture.ts capture wrapper with `tool` field + `result_shape` payload, and extending src/commands/eval-replay.ts to dispatch on tool — both deferred to v0.34.1. 
The metric MODULE is the load-bearing piece (Codex finding #3's primary fix); wiring it through the existing capture/replay surface is a follow-up that doesn't change production behavior until clusters ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(v0.34.0.0): VERSION + package.json + CHANGELOG + migration doc Final release packaging for v0.34.0.0. Three-line audit will show: VERSION: 0.34.0.0 package.json: 0.34.0.0 CHANGELOG: ## [0.34.0.0] - 2026-05-14 CHANGELOG entry follows CLAUDE.md voice rules: - Bold headline + lead paragraph - "What ships in v0.34.0.0" itemized list - "Slip handling — deferred to v0.34.1" honest scope note - Numbers-that-matter table comparing v0.33.3 → v0.34.0.0 - Mandatory "## To take advantage of v0.34.0.0" block with verify commands (gbrain edges-backfill, gbrain doctor, code_blast/flow, eval gate run) skills/migrations/v0.34.0.0.md — agent-readable upgrade doc. Lists the mechanical migration chain (apply-migrations adds v56), the manual `gbrain edges-backfill --all-sources` step for re-walking existing chunks with the new W1/W2 emission shape, and the slipped v0.34.1 scope. v0.34.0.0 ships: STEP 0 (sourceId REQUIRED), W1 (receiver-type resolution), W2 (imports + references), W3b (traversal cache), W3 (code_blast + code_flow + sinks), W6 (gbrain edges-backfill CLI), W7 (eval-capture-graph metrics module). v0.34.1 backlog: W4-5 Leiden clusters, W6 wiki, W7 capture wiring, W1 .scm rewrite, W3 stdio limiter, W3 CLI shims, D2 autopilot sub-loop. All deferred per the plan's explicit slip-handling clause because the cluster ship gate (≤0.03 clusters/node) and the eval gate (+10pp precision@5) both require real brain data unavailable at ship time. 
Test surface in v0.34.0.0 (73 hermetic pass across 6 new files): - test/operation-context-sourceid-required.test.ts (6 cases) - test/code-intel/scope-walker-resolution.test.ts (10 cases) - test/code-intel/edge-densification.test.ts (13 cases) - test/code-intel/traversal-cache.test.ts (13 cases) - test/code-intel/recursive-walk.test.ts (11 cases) - test/code-intel/eval-capture-graph.test.ts (20 cases) Migration v56 (code_traversal_cache_v0_34) verified applying clean on PGLite via the test suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.34 D7): snapshotIndexes helper for cross-engine index parity Extends test/helpers/schema-diff.ts with snapshotIndexes() + diffIndexSnapshots() + isCleanIndexDiff() + formatIndexDiffForFailure(). Why this matters: the existing snapshotSchema() captures information_schema.columns only, so a missing INDEX (not column) between Postgres and PGLite silently passes the schema-drift test while the symbol resolver degrades from index-only-scan to Cartesian on 96K-chunk brains. The v0.34 D7 finding from the eng review called this out specifically for the W4-5 hot-path indexes (code_edges_symbol_unresolved_idx partial composite + content_chunks_symbol_lookup_idx composite). Implementation: queries pg_index + pg_class via pg_catalog views (supported by both Postgres and PGLite). Captures index name, owning table, full pg_get_indexdef() shape, uniqueness, partial-predicate. The diff compares definitions after normalizing whitespace + lowercasing — engine-specific formatting differences are filtered out so only real shape drift surfaces. Reused by future test/e2e/schema-drift.test.ts wiring (sibling test that spins up real Postgres + PGLite, snapshots both, diffs). 
test/helpers/schema-diff-indexes.test.ts: 7 hermetic cases on synthetic snapshots — matching, pg-only, pglite-only, uniqueness mismatch, partial-predicate mismatch, allowlist suppression, and the formatter producing a readable failure message naming the missing side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(v0.34): update 4 pre-existing tests for new emit shapes + sourceId contract Three test files updated to match the v0.34 contract changes: - test/edge-extractor.test.ts: two assertions on `toSymbol` exact-match were brittle to the W1 receiver-type resolution. `this.go()` / `self.go()` now resolve to `Foo::go` instead of bare `go`. Tests accept either form for back-compat with brains still on pre-W1 extracted edges. - test/source-id-tx-regression.test.ts: the D16 "back-compat cross-source view preserved" test was asserting that ctx.sourceId undefined → cross-source view. v0.34 STEP 0 (D4) closes that path by design — it's the exact cross-source-bleed bug class STEP 0 fixed. Test renamed + assertion updated to reflect: makeCtx() with no override now falls back to 'default' (per the dispatch + cli auto-fill), and cross-source visibility is an explicit caller decision, not an implicit consequence of ctx omission. - test/chunker-timeout.test.ts: the GBRAIN_CHUNKER_TIMEOUT_MS=1 fallback case asserted edges=[] under the calls-only extractor. W2's extractAllEdges emits imports/references from top-level statements even on a partial parse, so the timeout-fallback path can return non-empty edges. Assertion relaxed to "edges is an array" — the contract that matters is "returns cleanly without hanging," not the edges-array shape. Full unit suite (parallel + serial): 6132 pass / 0 fail. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(migrate): remove duplicate edges_backfilled_at migration at v58

CI surfaced a duplicate migration version in test/migrate.test.ts:371 ("runMigrations sorts by version ascending" -- uniq.size === versions.length).

Root cause: the second master merge (PR garrytan#934 v0.33.3.0 foundation, commit 3fc0ca5) brought in master's `edges_backfilled_at` migration alongside the one already in my branch. Both are functionally identical (ALTER TABLE content_chunks ADD COLUMN edges_backfilled_at + 3 indexes), and both were renumbered to v58 (mine via the f25b674 merge that pushed past master's v55 search-lite migrations; master's PR garrytan#934 originally claimed v55, which would have collided). Auto-merge kept both, named `_v0_33_2` and `_v0_33_3`. Tests caught it.

Fix: deleted the `_v0_33_3` duplicate. The remaining `_v0_33_2` entry at v58 is unchanged; SQL idempotency (ALTER TABLE ... ADD COLUMN IF NOT EXISTS + CREATE INDEX IF NOT EXISTS) means brains that already applied either label pass through cleanly.

Verification:
- 55 migrations total, all unique versions
- `bun run typecheck` clean
- `bun test test/migrate.test.ts`: 109 pass / 0 fail / 321 expect calls

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
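Two of the pure functions described in the v0.34 commits above lend themselves to a short sketch: the depth-decayed confidence from the W3 recursive walk, `clamp(1 / (1 + 0.3 * depth), 0.05, 1.0)`, and the W7 node-set Jaccard over (file, line, symbol) tuples. The names `walkConfidence` and `WalkNode`, and the convention that two empty node sets score 1, are assumptions for illustration; the shipped code in recursive-walk.ts and eval-capture-graph.ts may differ.

```typescript
// Confidence decays with walk depth and is clamped to [0.05, 1.0].
function walkConfidence(depth: number): number {
  const raw = 1 / (1 + 0.3 * depth);
  return Math.min(1.0, Math.max(0.05, raw));
}

// Hypothetical node shape: the commit only says nodes are compared as
// (file, line, symbol) tuples.
interface WalkNode {
  file: string;
  line: number;
  symbol: string;
}

// Set Jaccard: |A ∩ B| / |A ∪ B| over de-duplicated node tuples.
function nodeSetJaccard(a: readonly WalkNode[], b: readonly WalkNode[]): number {
  const key = (n: WalkNode): string => JSON.stringify([n.file, n.line, n.symbol]);
  const setA = new Set(a.map(key));
  const setB = new Set(b.map(key));
  // Assumption: two empty result sets count as identical.
  if (setA.size === 0 && setB.size === 0) return 1;
  let intersection = 0;
  setA.forEach((k) => {
    if (setB.has(k)) intersection += 1;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```

At depth 1 the confidence is 1 / 1.3, which matches the "~0.77 at depth 1" figure in the W3 test notes. Keying on the full tuple rather than the symbol alone is what lets the metric distinguish two same-named functions in different files.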
…ederated_read + 3 more (garrytan#996) * fix(mcp): skip stdin EOF handlers when MCP_STDIO=1 OpenClaw's bundle-mcp gateway and similar wrappers pipe the JSON-RPC handshake on stdin then close their stdin half. Pre-fix, both stdin 'end' and 'close' listeners (server.ts:65-66 and serve.ts:204-206) treated this as a permanent disconnect and shut the server down before the first tool call arrived. Guard both sites with `process.env.MCP_STDIO !== '1'`. Signal handlers (SIGTERM/SIGINT/SIGHUP), transport.onclose, and the parent-process watchdog still cover legitimate shutdown paths. The serve.ts site threads the env read through an injectable `mcpStdio?: boolean` on ServeOptions so tests stay isolated (no process.env mutation per scripts/check-test-isolation.sh R1). Tests: 3 new cases in test/serve-stdio-lifecycle.test.ts pin the guard's invariants — mcpStdio=true must NOT trigger shutdown on stdin EOF, signals must still drive shutdown with mcpStdio=true, and mcpStdio=false (default) preserves existing CLI behavior. 25/25 pass. Origin: PR garrytan#870. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(oauth): honor token_endpoint_auth_method=none for PKCE public clients RFC 7591 §3.2.1: when a DCR client declares token_endpoint_auth_method="none" (PKCE-only public clients like Claude Code, Cursor), the authorization server MUST NOT issue a client_secret. Pre-fix, registerClient unconditionally minted a secret, and the MCP SDK's clientAuth middleware then rejected valid public-client flows on /token because it expected client.client_secret to match. Three changes to src/core/oauth-provider.ts:registerClient: - Gate clientSecret generation on isPublicClient = (auth_method === 'none'). Public clients store client_secret_hash = NULL. - Omit client_secret from the response payload for public clients. Confidential clients (default client_secret_post and explicit client_secret_basic) keep their existing one-time-reveal shape. 
- Normalize NULL secret_hash to JS undefined in getClient so SDK middleware (which checks client.client_secret === undefined, not === null) correctly identifies public clients and skips the secret-comparison branch on /token. Schema is already permissive (client_secret_hash TEXT, no NOT NULL on both src/schema.sql and src/core/pglite-schema.ts) — no migration needed. Tests: 5 new cases in test/oauth.test.ts pin: - public client → no client_secret in response (#11 from plan) - default auth_method → secret unchanged (regression guard) - explicit client_secret_post → secret unchanged - getClient NULL→undefined normalization - PKCE full /authorize → /token end-to-end with no secret (#15 from plan) 69/69 oauth.test.ts cases pass. typecheck clean. Origin: PR garrytan#909. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(serve-http): --bind HOST, default to loopback (127.0.0.1) Adds `gbrain serve --http --bind <interface>` to control which network interface the HTTP MCP server listens on. Default flipped from `0.0.0.0` (pre-v0.34) to `127.0.0.1` (v0.34.0+). Why the flip: gbrain's primary use case is a personal-knowledge brain on a laptop. The previous default exposed brains on every interface — one accidental `--http` invocation away from publishing the brain to a LAN. Server operators who need remote access pass `--bind 0.0.0.0` (or a specific interface). Codex's outside-voice on the original PR garrytan#864 correctly flagged that the additive flag wasn't actually the fix; the default needed to change for the safety claim to hold. If `--public-url` is set but `--bind` is unset, runServeHttp prints a loud stderr WARN at startup recommending `--bind 0.0.0.0`. Declaring a public URL while quietly binding loopback is almost always a misconfiguration; we want the operator to see it on first start, not silently fail remote requests. Startup banner now includes a `Bind:` row so the listening interface is visible alongside Port / Engine / Issuer. 
Origin: PR garrytan#864, extended with D11 (default flip) per /plan-eng-review codex outside-voice review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(mcp): seal source-isolation leak on read path (P0) Pre-fix, an authenticated OAuth MCP client scoped to source-A could enumerate source-B pages via six read-side ops: search, query (text AND image paths), list_pages, traverse_graph, and find_experts. The v0.31.8 source-scoping pattern shipped through dispatch.ts but the op handlers never threaded ctx.sourceId into their engine calls, and hybridSearch.ts:223's explicit SearchOpts rebuild dropped sourceId even when callers passed it. Sealing the leak: - src/core/operations.ts adds sourceScopeOpts(ctx), the canonical precedence ladder: ctx.auth.allowedSources (federated) wins over ctx.sourceId (scalar) wins over nothing. Threaded into all 5 read-side op handlers + the query-image-path searchVector call (the 6th leak surface codex caught in plan review). - src/core/search/hybrid.ts:223 now threads sourceId + sourceIds fields through the inner SearchOpts rebuild. The explicit pick shape is preserved (HNSW inner-CTE ordering depends on it) but extended. - src/core/types.ts adds sourceIds?: string[] to SearchOpts + PageFilters (D9: federated read needs array-shaped engine filter or fan-out; array wins for hot retrieval). - src/core/operations.ts AuthInfo gains sourceId + allowedSources (D2: identity surface symmetric with the federated_read column garrytan#876 will add). - Both engines now apply WHERE source_id = $N (scalar) or = ANY($N::text[]) (array) at the SQL layer for searchKeyword, searchKeywordChunks, searchVector, listPages, traverseGraph, traversePaths. Array form wins when both are set. The searchVector filter pushes into the inner HNSW CTE (codex flagged this placement during plan review). - traverseGraph + traversePaths signatures gain opts.sourceId + opts.sourceIds; engine.ts interface updated. 
- findExperts (the whoknows op, D3 5th leak surface) accepts sourceId + sourceIds and threads them into its internal hybridSearch call. PR garrytan#861 was authored before v0.33 shipped so this op wasn't covered in the original PR. Auth wiring: - GBrainOAuthProvider.verifyAccessToken populates AuthInfo.sourceId from oauth_clients.source_id. JOIN guarded by isUndefinedColumnError so pre-v55 brains degrade to legacy projection rather than refusing every token verification. - GBrainOAuthProvider.registerClientManual gains a sourceId parameter (defaults to 'default'). DCR registerClient also sets source_id='default' on the inserted row. - serve-http.ts:929 cleanup: AuthInfo.sourceId is now a real typed field. The cast + GBRAIN_SOURCE env fallback chain is gone (D13). Legacy bearer tokens default to 'default' source in verifyAccessToken. - http-transport.ts (legacy access_tokens path) threads sourceId='default' through DispatchOpts so v0.22.7 callers stay source-scoped. - auth.ts CLI adds --source flag to gbrain auth register-client. Migration v55 (D10 + D13): - ALTER TABLE oauth_clients ADD COLUMN source_id TEXT (nullable). - Backfill UPDATE source_id = 'default' WHERE source_id IS NULL — preserves v0.33 effective behavior verbatim for legacy clients. - ADD CONSTRAINT FK ... REFERENCES sources(id) ON DELETE SET NULL, wrapped in DO block so re-runs against fresh-install brains (where the FK already lives inline in SCHEMA_SQL) no-op cleanly. - CREATE INDEX idx_oauth_clients_source_id WHERE source_id IS NOT NULL for the verifyAccessToken JOIN. - GBRAIN_ACCEPT_SILENT_WIDEN env-flag wired through the runner via SET LOCAL gbrain.accept_silent_widen — reserved for future migrations that hit the silent-widen footgun codex flagged. This migration doesn't need it (column is brand new; no pre-existing stale values possible by definition). - src/core/pglite-schema.ts + src/schema.sql include the column + FK + index inline for fresh installs. 
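The precedence ladder described for `sourceScopeOpts(ctx)` (federated `allowedSources` over scalar `sourceId` over nothing) can be sketched as below. The trimmed `AuthInfo`, `OpContext`, and `SourceScope` shapes are assumptions for illustration. Passing an explicit empty array through unchanged (so it matches no rows instead of all rows) is also an assumption of this sketch, not a statement about the real helper.

```typescript
// Hypothetical, trimmed-down shapes; the real types carry more fields.
interface AuthInfo {
  sourceId?: string;
  allowedSources?: string[];
}
interface OpContext {
  sourceId?: string;
  auth?: AuthInfo;
}
interface SourceScope {
  sourceId?: string;
  sourceIds?: string[];
}

// Federated array wins over the scalar source, which wins over "no scope".
function sourceScopeOpts(ctx: OpContext): SourceScope {
  const allowed = ctx.auth?.allowedSources;
  // Assumption: an explicit [] is passed through and never treated as "unset".
  if (allowed !== undefined) return { sourceIds: allowed };
  if (ctx.sourceId !== undefined) return { sourceId: ctx.sourceId };
  return {};
}

const federatedScope = sourceScopeOpts({ sourceId: "a", auth: { allowedSources: ["a", "b"] } });
const scalarScope = sourceScopeOpts({ sourceId: "a" });
const unscoped = sourceScopeOpts({});
```

An op handler would spread the result into its engine call options, so every read path shares one definition of scope.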
Tests: new test/e2e/source-isolation-pglite.test.ts with 13 regression cases — one per leak surface (search/list_pages/traverse/etc.) plus explicit AuthInfo.sourceId and AuthInfo.allowedSources op-handler threading checks. Full unit suite: 6034 pass / 0 fail. PGLite initSchema time dropped from 2.4s to 850ms after consolidating v55's DO blocks (multiple DO blocks were slow on PGLite; one DO block for the FK install only is fine). Origin: PR garrytan#861 + plan-eng-review decisions D2/D3/D4/D9/D10/D13 + F2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(gateway): multimodal embedding for openai-compatible providers Pre-fix, embedMultimodal hardcoded a recipe.id === 'voyage' branch and threw AIConfigError for every other recipe. Multimodal-capable providers fronted by LiteLLM (or any openai-compatible proxy) were unreachable even when the operator had wired up the model. The fix: - src/core/ai/gateway.ts adds embedMultimodalOpenAICompat() that POSTs to the standard /embeddings endpoint with content arrays carrying image_url entries. Routing comes from the existing recipe.implementation switch — Voyage stays on its own /multimodalembeddings path; every other openai-compatible recipe flows through the new helper. - src/core/ai/recipes/litellm-proxy.ts declares supports_multimodal: true so embedMultimodal accepts the recipe. No multimodal_models allow-list: LiteLLM is a passthrough proxy and the user owns model-id selection; provider rejection (400 from upstream) is the right enforcement layer there. Voyage's static allow-list shape stays unchanged (its 12 models share supports_multimodal but only one is multimodal-capable). - D12 runtime dimension validation: the new helper checks the returned vector length against the recipe's declared default_dims (preferred) or the brain's embedding_dimensions config. Mismatch throws AIConfigError with model id + observed + expected so the operator can swap models or rebuild the column. 
Pre-fix, a wrong-dim response would surface as a cryptic pgvector "vector dimension mismatch" at INSERT time. - Auth resolution routes through the existing defaultResolveAuth helper so optional-auth recipes (LiteLLM proxy with no LITELLM_API_KEY) and required-auth recipes both share one code path. Optional-auth sends "Authorization: Bearer unauthenticated" which servers like Ollama / llama-server ignore but the SDK contract requires. Tests: 11 new cases in test/openai-compat-multimodal.test.ts cover happy-path, multi-input batching, unauthenticated proxy, D12 dim mismatch + default-dim fallback, 401 / 400 / malformed-JSON / non-array error paths, and an explicit Voyage-regression test pinning that the new openai-compat route doesn't accidentally hijack the Voyage path. All 41 multimodal-related tests pass (existing voyage suite + new). typecheck clean. Origin: PR garrytan#875 + plan-eng-review D12 (runtime dim validation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(oauth): federated_read read scope (garrytan#876) Pre-fix, OAuth clients had a single source-scope axis (source_id, added in v55). A client could either write+read one source OR be a super-reader across all sources (via NULL source_id). There was no middle ground: WeCare-style L3 dept clients that need to write to dept-x but read dept-x + parent canon + shared canon had no expression. garrytan#876 adds federated_read TEXT[] as an orthogonal read-scope axis. source_id is the WRITE authority; federated_read is the READ authority. They default to matching values (read scope == write scope, the pre-v0.34 default) when a client is registered without an explicit federated read list. Migrations v56-v60 (five new migrations on top of v55): - v56: ALTER TABLE ... ADD COLUMN federated_read TEXT[] NOT NULL DEFAULT '{}'. - v57 (F5): explicit CASE backfill so source_id IS NULL → '{}' (not an array containing NULL; codex caught this ambiguity during plan review).
- v58: post-backfill validation. Fails loud if any row's source_id isn't in its federated_read array, pointing at a logic bug in v57 if fired. - v59: flip the source_id FK from ON DELETE SET NULL to ON DELETE RESTRICT now that federated_read provides the alternative scope-loss path. Pre-flip, deleting a source could silently widen any oauth_client to super-reader; post-flip, source delete is refused if any client references it (operator must revoke/re-scope first). - v60: GIN index on federated_read for array-containment queries. Auth wiring: - GBrainOAuthProvider.verifyAccessToken JOINs c.federated_read and populates AuthInfo.allowedSources. Pre-v56 / pre-v55 brains degrade via the existing isUndefinedColumnError fallback chain. - registerClientManual gains a federatedRead?: string[] parameter (defaults to [sourceId]). - DCR registerClient sets source_id='default' + federated_read=['default'] on the inserted row. - auth.ts CLI adds --federated-read SRC1,SRC2,... flag. The register-client output now prints "Federated reads:" so operators confirm the scope they set. Engines consume the federated array through the SearchOpts.sourceIds / PageFilters.sourceIds field that garrytan#861 added (no engine changes here — the plumbing was D9). sourceScopeOpts in operations.ts already prefers the auth.allowedSources array over scalar ctx.sourceId when set. Test seam: - test/book-mirror.test.ts now spawns the CLI with GBRAIN_HOME pointed at a tempdir so the test isn't sensitive to the developer's local ~/.gbrain/config.json. Pre-fix the test could silently inherit a real Postgres connection and hang past the default 5s test timeout. Fresh GBRAIN_HOME → "No brain configured" → exit 1 in <1s. - test/e2e/source-isolation-pglite.test.ts gains one more regression case: AuthInfo.allowedSources = [] (explicit empty) MUST NOT widen scope to "all sources" — the silent-widen footgun precedence ladder. 
- test/openai-compat-multimodal.test.ts is part of the wave's commits via the migrate.ts changes that bump the schema chain. typecheck-only fix on a captured-auth type was already in garrytan#875's tree. 6045 unit tests pass / 0 fail. typecheck clean. PGLite initSchema runs v55-v60 in ~786ms total (within the test-harness budget for tests using the canonical beforeAll engine pattern). Origin: PR garrytan#876 + plan-eng-review F5 (CASE backfill). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v0.34.0.0: MCP fix wave (garrytan#870 garrytan#909 garrytan#864 garrytan#861 garrytan#875 garrytan#876) VERSION + package.json + CHANGELOG bump for the six-PR MCP fix wave. Schema chain extends from v54 → v60; oauth_clients gains source_id + federated_read columns; auth'd MCP clients now stay inside their scope across all read-side ops; PKCE-only DCR works; --bind defaults to loopback; LiteLLM multimodal embedding ships. Contributed by @Hansen1018 (garrytan#870), @ding-modding (garrytan#909), @DukeDawg (garrytan#864), @toilalesondev (garrytan#861 + garrytan#876), @yoelgal (garrytan#875). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.34.0.0 Sync README, CLAUDE.md, SECURITY.md, docs/architecture/topologies.md, and docs/mcp/DEPLOY.md to reflect the v0.34.0.0 MCP fix wave: - README: document --bind HOST default (loopback), --source + --federated-read register-client flags, PKCE public-client gate - SECURITY.md: note loopback-by-default for serve --http, update the trust-proxy contract to point at the new default - CLAUDE.md: annotate operations.ts (sourceScopeOpts helper), oauth-provider.ts (verifyAccessToken JOIN + PKCE public clients), serve-http.ts (--bind flag), gateway.ts (openai-compat multimodal + dim validation), mcp/server.ts (MCP_STDIO guard), auth.ts (--source + --federated-read), migrate.ts (v58-v63 chain), engine.ts (sourceIds field). 
Add 4 new test-file entries for source-isolation-pglite, openai-compat-multimodal, serve-stdio-lifecycle, oauth.test.ts PKCE cases - docs/architecture/topologies.md: source-scoped register-client example, --bind 0.0.0.0 for thin-client host setup - docs/mcp/DEPLOY.md: --bind explanation in the ngrok section, source-scoped client recipe - llms-full.txt: regenerated per the CLAUDE.md-edit chaser rule Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump v0.34.0.0 → v0.34.1.0 Renumbering the MCP fix wave from v0.34.0.0 to v0.34.1.0 so the release slot lands between master's v0.33.2.1 and the next minor. Touches every release-artifact mention: - VERSION: 0.34.0.0 → 0.34.1.0 - package.json: same - CHANGELOG.md header + "To take advantage" block - CLAUDE.md key-files annotations (8 entries that document this wave) - llms-full.txt (regen from CLAUDE.md) - README.md / SECURITY.md / docs/architecture/topologies.md / docs/mcp/DEPLOY.md - Wave code-comment markers ("// v0.34.0 (#NNN):" → "// v0.34.1 (#NNN):") Test files renamed alongside since they were committed with the wave. Commit subjects on the original 6 PR commits + the v0.34.0.0 bump commit (4f533c7 → 6b47db7) intentionally NOT rewritten — those are history. `git log` finds the implementation by message subject, not by version tag. 6275 unit tests pass, typecheck clean, migration chain v58-v63 unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
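The PKCE public-client gate in this wave (no secret minted when `token_endpoint_auth_method` is `"none"`, `client_secret_hash` stored as NULL, `client_secret` omitted from the response) can be sketched as follows. Everything here is a hypothetical stand-in: `registerClientSketch`, the request/result shapes, and the injected `newId` / `newSecret` / `hash` helpers are assumptions for illustration, not the signatures in src/core/oauth-provider.ts.

```typescript
// Hypothetical shapes for illustration only.
interface RegistrationRequest {
  client_name: string;
  token_endpoint_auth_method?: string;
}
interface RegistrationResult {
  response: { client_id: string; client_secret?: string; token_endpoint_auth_method: string };
  storedSecretHash: string | null; // null models client_secret_hash = NULL
}

function registerClientSketch(
  req: RegistrationRequest,
  newId: () => string,
  newSecret: () => string,
  hash: (secret: string) => string,
): RegistrationResult {
  // client_secret_post is the default auth method per the commit text.
  const authMethod = req.token_endpoint_auth_method ?? "client_secret_post";
  const isPublicClient = authMethod === "none";
  const client_id = newId();
  if (isPublicClient) {
    // Public (PKCE-only) client: no secret is generated, stored, or returned.
    return { response: { client_id, token_endpoint_auth_method: authMethod }, storedSecretHash: null };
  }
  const secret = newSecret();
  return {
    response: { client_id, client_secret: secret, token_endpoint_auth_method: authMethod },
    storedSecretHash: hash(secret),
  };
}

// Stub generators keep the sketch deterministic and free of crypto imports.
const publicClient = registerClientSketch(
  { client_name: "example-pkce-client", token_endpoint_auth_method: "none" },
  () => "id-1", () => "secret-1", (s) => "hash(" + s + ")",
);
const confidentialClient = registerClientSketch(
  { client_name: "example-confidential-client" },
  () => "id-2", () => "secret-2", (s) => "hash(" + s + ")",
);
```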
…drop + failed-file-skip + sort-flip bugs (garrytan#988) * feat(sync): sort files newest-first for faster salience on recent content Problem: sync processes files in git-diff order (alphabetical), so meetings/2020-* embeds before meetings/2026-*. After a burst of writes, new pages can be invisible to search for hours while older pages process first. Fix: sort addsAndMods descending in both incremental sync and full import. Brain paths are date-prefixed by convention, so lexicographic descending naturally prioritizes recent content. This ensures the most relevant pages become searchable first. * feat(import): path-based checkpoint resume + sort-newest-first helper Replace gbrain import's positional `processedIndex` checkpoint with a path-set checkpoint via `src/core/import-checkpoint.ts`. A file is only "done" when its processFile returns success — failed files never enter the set, parallel workers can't lose slow files, and sort-order changes don't drop the newest N files on resume. Three bug classes fixed: - Parallel import + slow worker = silent file drop on crash-resume - Failed file = checkpoint advanced past it, never retried until manual clear - Sort-order flip (v0.33.x) = cross-version resume drops newest N files Old positional checkpoints are detected on first resume and discarded with a stderr log line. Re-walking is cheap because content_hash short-circuits unchanged files. Also extracts the descending-lex sort into src/core/sort-newest-first.ts so import.ts and sync.ts share a single source of truth. Tests: - test/sort-newest-first.test.ts (5 hermetic cases) - test/import-checkpoint.test.ts (18 unit cases over the helpers) - test/import-resume.test.ts (refactored — GBRAIN_HOME isolation, drives runImport against PGLite, 5 integration cases including SLUG_MISMATCH retry regression) Includes the original sort-newest-first contribution from @garrytan-agents's PR garrytan#964 (commit 8dbcf6a). 
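A minimal sketch of the two ideas in this change: a descending lexicographic sort, and a path-set checkpoint where a file is only "done" once processing succeeds. The exact signatures in src/core/sort-newest-first.ts and src/core/import-checkpoint.ts are not shown here, so `sortNewestFirst`, `runImportSketch`, and the boolean-returning `processFile` callback are assumptions.

```typescript
// Descending lexicographic order; date-prefixed paths end up newest first.
function sortNewestFirst(paths: string[]): string[] {
  return paths.slice().sort((a, b) => (a < b ? 1 : a > b ? -1 : 0));
}

// Hypothetical resume loop: `done` is the persisted path set.
// Returns the paths attempted in this run, in processing order.
function runImportSketch(
  files: string[],
  done: Set<string>,
  processFile: (path: string) => boolean,
): string[] {
  const attempted: string[] = [];
  for (const path of sortNewestFirst(files)) {
    if (done.has(path)) continue; // already imported on an earlier run
    attempted.push(path);
    // Only a successful file enters the set; failures are retried next run.
    if (processFile(path)) done.add(path);
  }
  return attempted;
}

const files = ["meetings/2020-01-01.md", "meetings/2026-03-01.md", "meetings/2024-06-01.md"];
const checkpoint = new Set<string>();
// First run: the 2024 file fails, so it never enters the checkpoint.
const firstRun = runImportSketch(files, checkpoint, (p) => p !== "meetings/2024-06-01.md");
// Second run: only the failed file is retried.
const secondRun = runImportSketch(files, checkpoint, () => true);
```

Because membership is keyed by path instead of position, a change in sort order between versions cannot shift which files count as finished.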
* chore: bump version and changelog (v0.34.2.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: update project documentation for v0.34.2.0 Add CLAUDE.md Key Files entries for the path-based import checkpoint work: new entries for src/core/import-checkpoint.ts and src/core/sort-newest-first.ts, plus a dedicated src/commands/import.ts entry covering the v0.34.2.0 refactor. Update src/commands/sync.ts entry to reference sortNewestFirst. Regenerate llms-full.txt. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(tests): swap banned /data/brain placeholder for /tmp/example-brain scripts/check-privacy.sh banlist includes /data/brain/ (legacy private OpenClaw fork layout). New test files must not use it — CI privacy guard caught this on PR garrytan#988's first push. No behavior change. test/import-checkpoint.test.ts is unit-level with no fs access; the dir string is just an identity marker for the loadCheckpoint dir-mismatch guard. --------- Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…rrytan#1003) * fix: supervisor treats code=0 watchdog exits as crashes The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts (typical with a 96K-page brain where autopilot inflates RSS), the supervisor gave up with max_crashes_exceeded. Fix: code=0 exits reset crashCount to 0 and restart immediately with no backoff. Only code≠0 exits count toward the crash limit. Root cause: process.memoryUsage().rss reports 7GB during autopilot sync on large repos (possibly shared page inflation from git mmap). The 4096MB threshold triggers on every cycle. This is a separate issue (RSS measurement accuracy) but the supervisor should handle clean exits regardless. * fix: use RssAnon instead of VmRSS for watchdog threshold process.memoryUsage().rss returns VmRSS which includes file-backed mmap'd pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB. The kernel reclaims these pages under memory pressure — they're cache. Replace with /proc/self/status RssAnon + RssShmem which measures only anonymous pages (heap, stack, anonymous mmap). This is the memory that actually matters for OOM risk. Falls back to process.memoryUsage().rss on non-Linux. Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold) After: watchdog only triggers on real memory growth (~100MB << 4GB threshold) Related: garrytan#1002 (supervisor crash-count fix for the same symptom) * refactor(minions): extract ChildWorkerSupervisor with D1/D2 amendments MinionSupervisor and src/commands/autopilot.ts each owned a separate spawn-and-respawn loop. PR garrytan#1003 fixed the supervisor's crash-counter bug (counting code=0 watchdog drains as crashes) but the autopilot loop has the same bug class. 
Worse, the as-shipped garrytan#1003 fix reset crashCount=0 on every code=0 exit, which lost the "flapping worker" signal in mixed-exit sequences. Extract the shared spawn loop into ChildWorkerSupervisor so both consumers compose one tested core. The new class bakes in two amendments resolved during plan-eng-review: D1 (lastExitCode track): code=0 exits no longer touch crashCount. They emit ms:0 backoff and restart immediately, but the counter survives across them. A worker alternating exit 1 / exit 0 / exit 1 correctly trips max_crashes; a worker drained 100 times by the watchdog stays at crashCount=0 and runs forever (also correct). D2 (clean-restart budget): on platforms where the watchdog measures VmRSS instead of RssAnon (macOS, kernel <4.5, restricted containers), a perpetually over-threshold worker could clean-exit in a tight loop with no observability. New `cleanRestartBudget` option (default 10 clean restarts per 60s window) emits a `health_warn` and applies backoff once exceeded. The supervisor now delegates spawn/respawn/backoff to the inner class and maps ChildSupervisorEvent → existing SupervisorEvent emit() channel so JSONL audit consumers see byte-compatible output. PID lock, signal handlers, health check, and process.exit on max-crashes stay in MinionSupervisor (those are standalone-daemon concerns the autopilot composer doesn't need). Tests: 6 new ChildWorkerSupervisor cases (D1 classifier, interleaved exits, stable-run + clean-exit interaction, D2 budget tripping, per- instance config isolation, event shape regression). Existing supervisor tests updated to use exit-1 workers where they previously relied on clean-exit-as-crash semantics; their assertions (env plumbing, PID lock, audit shape) are unaffected. 
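The D1 and D2 rules above can be sketched as a pure exit classifier. This is an illustration under stated assumptions: `makeExitClassifier`, its option names other than `cleanRestartBudget`, and the fixed crash backoff are hypothetical, and the real ChildWorkerSupervisor also handles spawning, stable-run resets, and event emission.

```typescript
type Decision =
  | { action: "restart"; backoffMs: number; warn: boolean }
  | { action: "give_up" };

interface ClassifierOptions {
  maxCrashes: number;
  cleanRestartBudget: number; // clean restarts allowed per window (D2)
  windowMs: number;
  crashBackoffMs: number; // assumption: fixed backoff, for simplicity
}

function makeExitClassifier(opts: ClassifierOptions) {
  let crashCount = 0;
  let cleanExitTimes: number[] = [];
  return function onExit(code: number, nowMs: number): Decision {
    if (code === 0) {
      // D1: a clean exit never touches crashCount.
      cleanExitTimes = cleanExitTimes.filter((t) => nowMs - t < opts.windowMs);
      cleanExitTimes.push(nowMs);
      // D2: too many clean restarts inside the window triggers a warning plus backoff.
      const overBudget = cleanExitTimes.length > opts.cleanRestartBudget;
      return { action: "restart", backoffMs: overBudget ? opts.crashBackoffMs : 0, warn: overBudget };
    }
    crashCount += 1;
    if (crashCount >= opts.maxCrashes) return { action: "give_up" };
    return { action: "restart", backoffMs: opts.crashBackoffMs, warn: false };
  };
}

// Flapping worker: exit 1 / exit 0 / exit 1 with maxCrashes = 2.
const flapping = makeExitClassifier({ maxCrashes: 2, cleanRestartBudget: 10, windowMs: 60000, crashBackoffMs: 1000 });
const flap1 = flapping(1, 0);
const flap2 = flapping(0, 1000);
const flap3 = flapping(1, 2000);
```

Under these rules the interleaved clean exit does not hide the second crash, which is the "flapping worker" signal the earlier reset-to-zero fix lost.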
Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(autopilot): compose ChildWorkerSupervisor instead of inline spawn loop src/commands/autopilot.ts:165-197 used to have its own spawn-and- respawn loop separate from MinionSupervisor's. It hardcoded maxCrashes=5, fixed 10s backoff, and counted every exit (including code=0) toward the crash limit. Codex flagged this during plan-eng review: the parallel implementation had the same bug class fixed in garrytan#1003, just on a different code path. Anyone running `gbrain autopilot` as a long-running daemon (instead of `gbrain jobs supervisor`) would hit it. Replace the inline `startWorker` + `child.on('exit')` block with a ChildWorkerSupervisor instance. Drops the parallel `crashCount`, `lastWorkerStartTime`, and `STABLE_RUN_RESET_MS` state. The ChildWorkerSupervisor's D1 lastExitCode track + D2 clean-restart budget apply to autopilot for free. Shutdown now drains via the supervisor's killChild + awaitChildExit typed surface instead of reaching into `workerProc` directly. The onMaxCrashesExceeded callback routes through autopilot's existing shutdown('max_crashes') path so the lockfile gets cleaned up (pre-refactor, the inline loop called process.exit(1) directly and bypassed the cleanup). Regression coverage in test/autopilot-supervisor-wiring.test.ts: static-shape grep guards for `--max-rss 2048`, `maxCrashes: 5`, the shutdown-via-callback wiring, and absence of the legacy inline names (startWorker, workerProc, crashCount, lastWorkerStartTime, STABLE_RUN_RESET_MS). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(worker): parse RssAnon as field-presence + soften OOM docstring Two follow-ups to the RssAnon watchdog fix (b81c598), both surfaced during plan-eng-review by Codex. 
M1: getAccurateRss() used `if (anonKb > 0) return ...` to decide whether to use the /proc/self/status reading or fall back to process.memoryUsage().rss. That conflated "RssAnon field missing" (old kernel, non-Linux) with "RssAnon field present but zero" (a near-empty worker process whose only memory is shmem). The legitimate shmem-only worker case fell through to VmRSS even though /proc had a valid reading. Fix: split the pure parser (parseRssFromProcStatus) into a separate exported function that checks field presence via regex match, not value comparison. Returns null only when the field text doesn't match `^RssAnon:\s+(\d+)` AND `^RssShmem:\s+(\d+)`. Both fields present + both zero is now a valid reading of 0 bytes. M2: the docstring claimed RssAnon + RssShmem was "the memory that actually matters for OOM risk." Codex pushed back: this is correct for per-process leak detection but NOT a full container-OOM metric, because cgroup memory pressure includes page cache. Soften to "non-file-backed resident memory used for per-process leak detection" and call out the cgroup caveat explicitly. getAccurateRss now takes an optional readStatus function for testability. Production callers use the default; tests inject canned status text to cover the M1 regression and the fallback paths without mocking the filesystem. Tests: 11 cases covering parseRssFromProcStatus (normal, M1 regression with anon=0 + shmem>0, both-zero, missing fields, malformed values, shmem-only) and getAccurateRss (injected reader, ENOENT fallback, old-kernel fallback, malformed-value fallback). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(minions): awaitChildExit short-circuits when child already exited Pre-fix, awaitChildExit registered `child.once('exit', ...)` without checking whether the child had already terminated. 
If the child drained between killChild('SIGTERM') and awaitChildExit() — common on fast SIGTERM responders — Node's 'exit' event had already fired, the late listener never resolved, and the caller waited out the full timeout. On the supervisor's clean shutdown path that's a 35-second hang on every quick child. Probe `child.exitCode` and `child.signalCode` first; resolve immediately when either is non-null. Sub-second clean shutdown restored. Pre-existing in the legacy supervisor.ts shape (same bug pattern), but since the refactor consolidates child-process management into one class, fix the pattern at the new seam. Regression test in test/child-worker-supervisor.test.ts: run one full spawn cycle, then call awaitChildExit on the already-finished cycle and assert it returns in under 200ms (well under any test timeout). Surfaced during pre-landing /review on the fix wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.34.3.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CLAUDE.md key-files entries for v0.34.3.0 Reflects the ChildWorkerSupervisor extraction shipped in this branch: - Add new entry for src/core/minions/child-worker-supervisor.ts covering D1 lastExitCode classifier, D2 clean-restart budget, the awaitChildExit short-circuit, and test pinning at test/child-worker-supervisor.test.ts - Update src/core/minions/supervisor.ts entry to note the spawn-loop extraction into the shared core + the byte-compatible event-shape mapping that preserves JSONL audit consumers - Update src/commands/autopilot.ts entry to note the parallel- supervisor elimination + the shutdown-via-callback wiring - Update src/core/minions/worker.ts entry with the new RssAnon / getAccurateRss exports + the M1 field-presence parser fix Regenerated llms-full.txt to match (per project rule: every CLAUDE.md edit must be followed by bun run build:llms). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Wintermute <wintermute@garrytan.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
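The M1 field-presence fix can be sketched as follows. Two things are assumptions of this sketch: it returns null when either field is missing (the commit text leaves the one-field-missing case open), and `getAccurateRss` takes an injected `fallbackRss` callback in place of `process.memoryUsage().rss` so the example stays free of Node APIs.

```typescript
// Parses RssAnon + RssShmem (reported in kB) from /proc/self/status text.
// Field presence, not value, decides validity: 0 kB is a real reading.
function parseRssFromProcStatus(status: string): number | null {
  const anon = /^RssAnon:\s+(\d+)/m.exec(status);
  const shmem = /^RssShmem:\s+(\d+)/m.exec(status);
  if (anon === null || shmem === null) return null; // assumption: both fields required
  return (Number(anon[1]) + Number(shmem[1])) * 1024;
}

// Hypothetical signature: both the status reader and the fallback are injected.
function getAccurateRss(readStatus: () => string, fallbackRss: () => number): number {
  try {
    const parsed = parseRssFromProcStatus(readStatus());
    if (parsed !== null) return parsed;
  } catch {
    // Reader failed (for example ENOENT on non-Linux); fall through.
  }
  return fallbackRss();
}

// M1 regression case: anon is 0, shmem is non-zero, and the reading is still valid.
const shmemOnlyStatus = "Name:\tbun\nVmRSS:\t 7000000 kB\nRssAnon:\t       0 kB\nRssFile:\t     100 kB\nRssShmem:\t    2048 kB\n";
const shmemOnlyBytes = parseRssFromProcStatus(shmemOnlyStatus);
```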
…D4/D6/D7/D8 + regression test) (garrytan#991) * perf(embed): cursor-paginated stale loading + rate-limit backoff + partial index Three fixes for embed --stale on large brains (300K+ chunks): ## 1. Cursor-paginated listStaleChunks (embed timeout fix) The previous implementation pulled ALL stale rows (up to 100K) in one query. On a 373K-row content_chunks table with 48K stale rows, this query took >2 min and hit Supabase's 2-min statement_timeout, causing embed --stale to silently fail with zero progress. Fix: keyset pagination on (page_id, chunk_index) with a default batch size of 2000 rows. Each query finishes in <1s. The embedAllStale loop pages through batches, embeds each batch, then advances the cursor. ## 2. Rate-limit-aware retry (429 backoff) The OpenAI SDK's built-in retry has a ~4s max backoff window, which is too short for TPM (tokens-per-minute) limits on large pages (~90K tokens). The embed loop would fail after 3 SDK retries and skip the page entirely. Fix: embedBatchWithBackoff wrapper parses the retry delay from the 429 error message (e.g. 'try again in 248ms') and sleeps for that duration + 500ms padding. Up to 5 retries with parsed delays (60s fallback when unparseable). ## 3. Migration v58: partial index for NULL embeddings `CREATE INDEX idx_chunks_embedding_null ON content_chunks (page_id, chunk_index) WHERE embedding IS NULL` — makes countStaleChunks() and the paginated listStaleChunks() instant instead of full-table-scanning 373K rows. ## Testing Verified on a 99K-page / 373K-chunk brain with 48K stale chunks. Before: embed --stale hung for 2+ min then timed out (0 progress). After: loads 2K rows in <1s, embeds concurrently, pages through all stale chunks without timeout. * fix(embed): wave of hardening + tests on cursor-paginated --stale path Lands the 9 decisions + regression test set from /plan-eng-review on PR garrytan#991's embed-perf cherry-pick. Implements the codex outside-voice findings folded in during plan review. 
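The keyset pagination on (page_id, chunk_index) from the first commit can be sketched in memory as below. The `StaleChunk` shape, the helper names, and the SQL in the comment are assumptions for illustration. The real engines run the query against content_chunks.

```typescript
// Assumed query shape, for orientation only:
//   SELECT ... FROM content_chunks
//   WHERE embedding IS NULL AND (page_id, chunk_index) > ($1, $2)
//   ORDER BY page_id, chunk_index LIMIT $3
interface StaleChunk {
  page_id: number;
  chunk_index: number;
}
type Cursor = StaleChunk | null;

// Tuple comparison: (page_id, chunk_index) strictly greater than the cursor.
function isAfter(row: StaleChunk, cursor: Cursor): boolean {
  if (cursor === null) return true;
  if (row.page_id !== cursor.page_id) return row.page_id > cursor.page_id;
  return row.chunk_index > cursor.chunk_index;
}

function listStaleChunks(table: StaleChunk[], cursor: Cursor, limit: number): StaleChunk[] {
  return table
    .filter((r) => isAfter(r, cursor))
    .sort((a, b) => a.page_id - b.page_id || a.chunk_index - b.chunk_index)
    .slice(0, limit);
}

// Pages through every stale row, advancing the cursor to the last row of each batch.
function visitAllStale(table: StaleChunk[], batchSize: number): StaleChunk[] {
  const visited: StaleChunk[] = [];
  let cursor: Cursor = null;
  for (;;) {
    const batch = listStaleChunks(table, cursor, batchSize);
    if (batch.length === 0) break;
    for (const row of batch) visited.push(row);
    cursor = batch[batch.length - 1];
  }
  return visited;
}

// Page 1 has three chunks, so with batchSize 2 it is split across batches.
const staleTable: StaleChunk[] = [
  { page_id: 2, chunk_index: 0 },
  { page_id: 1, chunk_index: 1 },
  { page_id: 1, chunk_index: 0 },
  { page_id: 1, chunk_index: 2 },
  { page_id: 3, chunk_index: 0 },
];
const visitedKeys = visitAllStale(staleTable, 2).map((r) => r.page_id + ":" + r.chunk_index);
```

Each batch query is bounded by the limit, so no single statement has to scan or return the whole stale set.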
Architecture / correctness: - D2 jitter on the parsed retry-after delay (±30%) so 20 concurrent workers don't relock on the next 429 wave (thundering herd fix). - D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) threaded as an AbortSignal into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, and the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of waiting it out). - D4 structured 429 detection that unwraps the gateway's AITransientError wrap via cause chain (depth-limited to 5). Naive `e.status === 429` was silently false against normalized errors; message-match stays as fallback. detect429FromCause exported as @internal helper. - D4a `maxRetries: 0` passthrough through embedBatch → gateway → embedMany so the AI SDK's default 2-retry stack doesn't multiply this wrapper's 5 attempts (was up to 15 total cycles per call). - D6 migration v59 (embed_stale_partial_index) rewritten to use CREATE INDEX CONCURRENTLY + handler-based engine-branching (mirrors v14 invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock on the 373K-row content_chunks table for the duration of the build. - D7 sourceId threaded through countStaleChunks + listStaleChunks + embedAllStale. `gbrain embed --stale --source X` was silently dropping the flag pre-fix and counting/embedding across every source. Both Postgres and PGLite engines updated. Tests added: - D5 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts: ms / s retry-after parse, fallback, non-rate-limit rethrow, jitter variance, budget abort during sleep+fetch, normalized-error cause unwrap, maxRetries:0 passthrough verification. - D5a fixed every pre-existing stale-row mock to include source_id + page_id (required on StaleChunkRow as of v0.33.3 cursor pagination — TypeScript's structural typing was hiding these). - D7 unit cases asserting CLI `--source X` parses + threads sourceId. 
- Gap scan: end-to-end wall-clock budget firing in the outer pagination loop via runEmbedCore. - D6 migration v59 test cases in test/migrate.test.ts: source-shape assertion (CONCURRENTLY + invalid-remnant DROP-before-CREATE ordering), PGLite handler-branch idempotency, partial-index materialization. - REGRESSION: new test/e2e/embed-stale-pagination.test.ts covering static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped scan, duplicate-slug-across-sources. - PGLite parity cases for cursor pagination, page split, source filter in test/pglite-engine.test.ts (pins tuple-compare against WASM build). Gate: - bun run test: 6305 pass / 0 fail / 0 skip across all 8 shards + serial. - DATABASE_URL=... bun run test:e2e: 90 files, 603 tests, 0 failures. Plan: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.34.3.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
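A minimal sketch of the retry-delay handling described in this PR: parse "try again in 248ms" style hints, add 500ms padding, fall back to 60s when unparseable, and apply ±30% jitter. The function names, the regex, and the choice to jitter the padded (or fallback) delay are assumptions of this sketch, not the actual embedBatchWithBackoff internals.

```typescript
const FALLBACK_DELAY_MS = 60000;
const PADDING_MS = 500;

// Extracts a delay in milliseconds from messages like "try again in 248ms" or "in 1.5s".
function parseRetryDelayMs(message: string): number | null {
  const m = /try again in\s+(\d+(?:\.\d+)?)\s*(ms|s)\b/i.exec(message);
  if (m === null) return null;
  const value = Number(m[1]);
  return m[2].toLowerCase() === "ms" ? value : value * 1000;
}

// `random` is injectable so tests can pin the jitter factor.
function backoffDelayMs(message: string, random: () => number = Math.random): number {
  const parsed = parseRetryDelayMs(message);
  const base = parsed === null ? FALLBACK_DELAY_MS : parsed + PADDING_MS;
  // Jitter factor in [0.7, 1.3) so concurrent workers do not wake in lockstep.
  const jitter = 1 + (random() * 2 - 1) * 0.3;
  return Math.round(base * jitter);
}

const parsedMs = parseRetryDelayMs("Rate limit reached. Please try again in 248ms.");
const parsedSeconds = parseRetryDelayMs("Please try again in 1.5s.");
const midpointDelay = backoffDelayMs("Please try again in 248ms.", () => 0.5);
```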
* feat(ai): add ZeroEntropy recipe + reranker touchpoint type
Widens `TouchpointKind` with `'reranker'`, adds `RerankerTouchpoint`
interface, extends `Recipe.touchpoints` and `AIGatewayConfig` to carry
reranker model state. Registers `zeroentropyai` recipe (zembed-1
embeddings + zerank-{2,1,1-small} rerankers) in the recipe registry.
Recipe declares the 7 Matryoshka dims (2560/1280/640/320/160/80/40),
Voyage-style dense-payload hedge (chars_per_token=1, safety_factor=0.5),
and 5MB rerank payload cap. Pinned by test/ai/zeroentropy-recipe.test.ts
including F1 regression (implementation literal is 'openai-compatible')
and F2 regression (base_url_default ends with /v1, no doubling).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ai/dims): thread input_type 4th-arg + ZE flexible-dim allowlist
`dimsProviderOptions` gains an optional `inputType?: 'query' | 'document'`
4th param so asymmetric providers (ZE zembed-1, Voyage v3+) can route
query-side vs document-side encoding. Per-model filtering inside the
openai-compatible branch keeps `input_type` from leaking to symmetric
providers (OpenAI text-3, DashScope, Zhipu) that would 400 on it.
Adds `ZEROENTROPY_VALID_DIMS` allowlist (2560/1280/640/320/160/80/40),
`supportsZeroEntropyDimension(modelId)`, and `isValidZeroEntropyDim(dims)`.
Throws `AIConfigError` with paste-ready fix hint when zembed-1 is
configured with an invalid dim (most common: defaulting to 1536 from
DEFAULT_EMBEDDING_DIMENSIONS).
The 4th-arg is optional; existing call sites (1 production + N tests
across Voyage/OpenAI/DashScope/Zhipu/MiniMax) compile unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
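As a rough illustration of the flexible-dim allowlist in the commit above: the dimension values and the names `ZEROENTROPY_VALID_DIMS`, `isValidZeroEntropyDim`, and `AIConfigError` come from the commit message, while the function bodies, the `assertZeroEntropyDims` guard, and the error wording are assumptions.

```typescript
// Sketch only: bodies are illustrative, not copied from src/core/ai/dims.ts.
const ZEROENTROPY_VALID_DIMS: readonly number[] = [2560, 1280, 640, 320, 160, 80, 40];

class AIConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "AIConfigError";
  }
}

function isValidZeroEntropyDim(dims: number): boolean {
  return ZEROENTROPY_VALID_DIMS.includes(dims);
}

/** Hypothetical guard: rejects unsupported dims for zembed-1 before any HTTP call. */
function assertZeroEntropyDims(modelId: string, dims: number): void {
  if (modelId !== "zembed-1") return; // assumption: only zembed-1 is checked here
  if (isValidZeroEntropyDim(dims)) return;
  throw new AIConfigError(
    `zembed-1 does not support ${dims} dimensions; pick one of ${ZEROENTROPY_VALID_DIMS.join("/")}`,
  );
}
```

The case the commit calls out, a silent default of 1536, fails this check immediately instead of surfacing later as a provider-side error.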
* feat(ai/gateway): zeroEntropyCompatFetch + embedQuery + gateway.rerank()
Two seams land together because they share the same recipe + auth path.
zeroEntropyCompatFetch handles ZE's non-OpenAI-compatible wire shape:
- URL rewrite: SDK's `${base_url}/embeddings` -> `${base_url}/models/embed`
- Body inject: `input_type` (default 'document'; 'query' when threaded
via providerOptions) + explicit `encoding_format: 'float'`
- Response rewrite: `{results: [{embedding}]}` -> `{data: [{embedding,
index}]}` so the AI SDK's openai-compat schema validates
- `usage.prompt_tokens` injected from `total_tokens` (Voyage hit the
same SDK schema requirement at :655)
- Layer 1 (Content-Length) + Layer 2 (per-embedding size) OOM caps
via tagged `ZeroEntropyResponseTooLargeError` (kept separate from
`VoyageResponseTooLargeError` because the Voyage cap tests do
structural source-text greps pinning the Voyage name)
- Wired in `instantiateEmbedding()` via the existing
`recipe.id === 'voyage' ? voyageCompatFetch : ...` ternary pattern
embedQuery(text) routes `inputType: 'query'` through dimsProviderOptions
for the search hot path. Companion to embed(texts) which now takes an
optional 2nd-arg inputType (defaults to undefined -> 'document' for
asymmetric providers).
gateway.rerank() is the new native HTTP path (no AI-SDK reranking
abstraction). Resolves the configured reranker model via
`getRerankerModel()` (new accessor), parses + asserts the model is in
the recipe's touchpoint.reranker.models allowlist (CDX2-F11:
assertTouchpoint does not enforce allowlists for openai-compatible
recipes — rerank() does it directly). Posts to
`${recipe.base_url}/models/rerank` with bearer auth. Returns
`RerankResult[]` sorted by `relevanceScore`. Errors classify into
`RerankError.reason: 'auth' | 'rate_limit' | 'network' | 'timeout' |
'payload_too_large' | 'unknown'`. 5s default timeout. Pre-flight payload
guard rejects bodies over `recipe.max_payload_bytes` BEFORE any HTTP
call so applyReranker can fail-open without burning a round-trip.
`_rerankTransport` + `__setRerankTransportForTests` mirror the embed
test seam.
`AIGatewayConfig.reranker_model` + isAvailable('reranker') branch +
configureGateway / reconfigureGatewayWithEngine extensions thread the
reranker model through the same state path as embedding/expansion/chat.
`applyResolveAuth` + `defaultResolveAuth` widen the touchpoint param to
include `'reranker'`. `KnownTouchpointKey` + `getTouchpoint()` in
model-resolver widen to cover `'reranker'`.
Pinned by:
- test/ai/embedQuery.test.ts (8): returns single Float32Array, threads
input_type='query' for ZE, drops field for OpenAI text-3,
back-compat: legacy embed() callers without 4th arg keep their
previous Voyage no-input_type shape
- test/ai/rerank.test.ts (21): URL (F2 regression — no /v1/v1/), body
shape, bearer header, response parsing, error classification across
6 HTTP shapes, payload pre-flight (no transport call), allowlist
enforcement
- test/ai/zeroentropy-compat-fetch.test.ts (14): structural source
assertions for the shim that mirror test/voyage-response-cap.test.ts —
URL rewrite path, body injection, response rewrite, usage.prompt_tokens
injection, OOM caps Layer 1 + Layer 2 + instanceof rethrow,
instantiateEmbedding wiring branch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
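The URL and response rewrites that the commit above attributes to `zeroEntropyCompatFetch` might look roughly like the following pure functions. This is a sketch under assumptions: `rewriteEmbedUrl` and `rewriteEmbedResponse` are hypothetical names, and the location of `total_tokens` inside a `usage` object on the ZeroEntropy response is assumed, not confirmed by the source.

```typescript
// Sketch only: hypothetical helpers illustrating the two rewrites.
interface ZeEmbedResponse {
  results: { embedding: number[] }[];
  usage?: { total_tokens?: number }; // assumption: where token usage is reported
}

interface OpenAICompatEmbedResponse {
  data: { embedding: number[]; index: number }[];
  usage: { prompt_tokens: number; total_tokens: number };
}

/** `${base_url}/embeddings` -> `${base_url}/models/embed` */
function rewriteEmbedUrl(url: string): string {
  return url.replace(/\/embeddings(?=$|\?)/, "/models/embed");
}

/** `{results: [{embedding}]}` -> `{data: [{embedding, index}]}` plus usage.prompt_tokens. */
function rewriteEmbedResponse(body: ZeEmbedResponse): OpenAICompatEmbedResponse {
  const total = body.usage?.total_tokens ?? 0;
  return {
    data: body.results.map((r, index) => ({ embedding: r.embedding, index })),
    usage: { prompt_tokens: total, total_tokens: total },
  };
}
```

Keeping both rewrites pure makes them testable without a network transport, which matches the structural-assertion style the commit describes for the shim's tests.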
* feat(search): applyReranker + rerank-failure audit + hybrid wire-in
src/core/search/rerank.ts — the call-site abstraction. Slices the top
`opts.topNIn` deduped candidates, sends to gateway.rerank(), reorders by
relevanceScore desc, appends the un-reranked tail in its original RRF
order (recall protection). Fail-open on every RerankError.reason: logs
via `logRerankFailure` and returns the input array unchanged. Stamps
`rerank_score` onto reordered items. `topNOut: null` is the explicit
"don't truncate" signal, distinct from `undefined` (fall through to
mode bundle); pinned in test (CDX2-F16).
src/core/rerank-audit.ts — failure-only JSONL audit at
`~/.gbrain/audit/rerank-failures-YYYY-Www.jsonl` (ISO-week rotation;
mirrors `src/core/audit-slug-fallback.ts`). Exports `logRerankFailure`
+ `readRecentRerankFailures(days)`. **No `logRerankSuccess`** — CDX2-F22
deliberately drops success-event logging: writing once per tokenmax
search is hot-path I/O churn AND success events leak query
volume + timing into a local audit. The doctor check reads
`search.reranker.enabled` first so "no events in window" gets
interpreted correctly (disabled -> healthy by definition; enabled ->
healthy because nothing failed). Query text is SHA-256-prefix-hashed
(8 hex chars) for privacy. Honors `GBRAIN_AUDIT_DIR`.
src/core/search/hybrid.ts — slots `applyReranker` between
`dedupResults()` and `enforceTokenBudget()` in the main RRF path.
Resolution: per-call `opts.reranker` overrides; otherwise pulled from
the resolved mode bundle (tokenmax -> enabled, others -> disabled in
commit 5). Cache rows store final reranked results; the bumped
knobsHash (commit 5) ensures rows can't leak across reranker configs.
src/core/types.ts — adds `SearchOpts.reranker` as a structural type so
callers can pass per-call overrides; runtime type lives in
src/core/search/rerank.ts (avoids circular import).
Tests:
- test/search/rerank.test.ts (14): reorder, tail preserve, fail-open on
every error class, topNOut null vs number, score stamping, empty +
enabled=false pass-through
- test/rerank-audit.test.ts (10): JSONL round-trip, error_summary
truncated to 200, corrupt rows skipped, missing dir -> [], ISO-week
rotation walks current + previous week, no logRerankSuccess export
(CDX2-F22 contract)
- test/search/hybrid-reranker-integration.test.ts (6): reranker fires
when enabled, doesn't when disabled, reorders correctly, preserves
tail, stamps rerank_score, fail-opens on rerankerFn throw — uses
PGLite + stubbed embed transport, no API keys
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
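The reorder-plus-tail behavior of `applyReranker` described in the commit above can be sketched as below. This is not the code in src/core/search/rerank.ts: the `index` field on the rerank result, the `SearchHit` shape, and applying `topNOut` to the combined list are all assumptions made for illustration.

```typescript
// Sketch only: hypothetical shapes and names.
interface RankedDoc {
  index: number; // assumption: position in the submitted documents array
  relevanceScore: number;
}

interface SearchHit {
  text: string;
  rerank_score?: number;
}

/** Pure reorder step: reranked head first (score desc), un-reranked tail after. */
function reorderWithTail<T extends SearchHit>(
  hits: T[],
  ranked: RankedDoc[],
  topNIn: number,
  topNOut: number | null,
): T[] {
  const head = hits.slice(0, topNIn);
  const tail = hits.slice(topNIn);
  const used = new Set<number>();
  const reordered: T[] = [];
  const byScore = [...ranked].sort((a, b) => b.relevanceScore - a.relevanceScore);
  for (const r of byScore) {
    const hit = head[r.index];
    if (hit === undefined || used.has(r.index)) continue;
    used.add(r.index);
    reordered.push({ ...hit, rerank_score: r.relevanceScore });
  }
  // Head items the reranker did not return keep their original order.
  head.forEach((hit, i) => {
    if (!used.has(i)) reordered.push(hit);
  });
  const combined = [...reordered, ...tail];
  return topNOut === null ? combined : combined.slice(0, topNOut);
}

/** Fail-open wrapper: any rerank error returns the input array unchanged. */
async function applyRerankerSketch<T extends SearchHit>(
  query: string,
  hits: T[],
  rerank: (q: string, docs: string[]) => Promise<RankedDoc[]>,
  opts: { topNIn: number; topNOut: number | null; onFailure?: (err: unknown) => void },
): Promise<T[]> {
  if (hits.length === 0) return hits;
  try {
    const docs = hits.slice(0, opts.topNIn).map((h) => h.text);
    const ranked = await rerank(query, docs);
    return reorderWithTail(hits, ranked, opts.topNIn, opts.topNOut);
  } catch (err) {
    opts.onFailure?.(err);
    return hits;
  }
}
```

Splitting the pure reorder from the async wrapper keeps the recall-protection logic (the tail is never dropped unless `topNOut` is a number) testable without a transport stub.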
* feat(search/mode): reranker mode-bundle fields + KNOBS_HASH_VERSION v=2
Extends `ModeBundle` with five reranker fields: `reranker_enabled`,
`reranker_model`, `reranker_top_n_in`, `reranker_top_n_out`,
`reranker_timeout_ms`. Per-mode defaults:
- conservative -> enabled=false (cost-sensitive)
- balanced -> enabled=false (opt-in via search.reranker.enabled)
- tokenmax -> enabled=true (the high-cost-tolerant tier; ~$0.0003/query)
Defaults model to `zeroentropyai:zerank-2`, topNIn=30, topNOut=null
(no truncate by default; preserves tokenmax's searchLimit=50 end-to-end
per CDX2-F16), timeout_ms=5000.
`SearchKeyOverrides` + `SearchPerCallOpts` + `resolveSearchMode.pick`
all extend to thread the new fields through the resolution chain
(per-call -> per-key config -> mode bundle -> default).
`loadOverridesFromConfig` adds parsers for the five new
`search.reranker.*` config keys. `top_n_out` parsing distinguishes
three input shapes (CDX2-F15):
key absent -> undefined (fall through to mode bundle)
'null'|'none'|empty -> explicit null (no truncate)
positive integer -> that number
`SEARCH_MODE_CONFIG_KEYS` extends so `gbrain search modes --reset`
clears the reranker overrides too.
**KNOBS_HASH_VERSION bumps 1 -> 2** (CDX1-F14). Five new entries
appended to `parts[]` (append-only convention CDX2-F13; reordering
existing fields would silently rebuild every existing cache row).
Includes `reranker_timeout_ms` so a 5s -> 100ms change invalidates
stale rows (CDX2-F14: more fail-opens = different search behavior).
Mid-rolling-deploy note (CDX2-F12): v=1 and v=2 processes produce
distinct cacheRowIds for the same (source_id, query_text). Expect a
temporary hit-rate dip + cache-row doubling for hot queries. Clears
naturally within `cache.ttl_seconds` (default 3600s).
src/commands/search.ts extends `KNOB_DESCRIPTIONS` with five new
entries so `gbrain search modes` renders them. test/search-mode.test.ts
extends the three bundle fixtures and bumps the KNOBS_HASH_VERSION
expectation to 2.
Pinned by test/search/knobs-hash-reranker.test.ts (13): each of the 5
reranker fields independently flips the hash, top_n_out=null renders
stable, append-only convention enforced via source-position assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
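The three-shape `top_n_out` parsing (CDX2-F15) in the commit above could be sketched as follows. `parseTopNOut` is a hypothetical name, and treating malformed input the same as an absent key is an assumption; the commit only specifies the three listed shapes.

```typescript
// Sketch only: hypothetical parser for the search.reranker.top_n_out config value.
function parseTopNOut(raw: string | undefined): number | null | undefined {
  // Key absent: fall through to the mode bundle.
  if (raw === undefined) return undefined;
  const value = raw.trim().toLowerCase();
  // Explicit "no truncate" spellings.
  if (value === "" || value === "null" || value === "none") return null;
  // Positive integer: use that number.
  if (/^[0-9]+$/.test(value)) {
    const n = parseInt(value, 10);
    if (n > 0) return n;
  }
  // Assumption: anything else is ignored and falls through like an absent key.
  return undefined;
}
```

The distinction matters because `null` must override a numeric mode-bundle default, whereas `undefined` must not.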
* feat(doctor): probeRerankerConfig + reranker_health check
`gbrain models doctor` gains two new probes:
- `probeRerankerConfig` (zero-network) validates that the configured
reranker model resolves through the recipe registry, that the recipe
declares a `reranker` touchpoint, and that the model is in
`touchpoint.models[]`. Direct allowlist check here — assertTouchpoint
does not enforce allowlists for openai-compatible recipes (CDX2-F11).
Surfaces paste-ready `gbrain config set search.reranker.model
<zerank-2|zerank-1|zerank-1-small>` fix hint.
- `probeRerankerReachability` (1-token-equivalent) sends a minimal
`{query: "probe", documents: ["probe"]}` rerank to verify auth + URL.
Failures classify via `classifyError` into auth/rate_limit/network/
unknown. Skipped silently when reranker is unconfigured.
Also extends `probeEmbeddingConfig` with a `providerId === 'zeroentropyai'`
branch that catches the silent-1536-default bug class for zembed-1
configurations (same posture as the existing Voyage branch).
`ProbeResult.touchpoint` widens to include `'reranker_config'`.
`gbrain doctor` adds `checkRerankerHealth` to both the abbreviated
(doctorReportRemote) and full (runDoctor) check sets. Logic:
1) Read `search.reranker.enabled` first. Disabled + no failures =>
'reranker disabled'. Enabled + no failures => healthy.
2) Walk last 7 days of ~/.gbrain/audit/rerank-failures-*.jsonl.
3) ANY auth failure warns (config-time problem the probe should have
caught — surface it).
4) ANY payload_too_large failure warns (workload mismatch).
5) Transient (network/timeout/rate_limit) warns at >=5 in window.
Below that they're noise; reranker fails open anyway.
CDX2-F21 blind-spot fix: reading enabled state first means "no events"
gets interpreted correctly — never confuses "never-used" with "success
logging broken" (the latter is impossible because there is no success
logging by design, CDX2-F22).
Engine-agnostic; file-based + one config-key read.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
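The five-step `checkRerankerHealth` logic listed in the commit above can be condensed into a small decision function. This is a sketch under assumptions: `assessRerankerHealth`, the result shape, and all message strings except 'reranker disabled' are hypothetical, and treating `unknown` failures as noise is a guess the commit does not address.

```typescript
// Sketch only: decision logic over failure rows already read from the audit files.
type RerankFailureReason =
  | "auth"
  | "rate_limit"
  | "network"
  | "timeout"
  | "payload_too_large"
  | "unknown";

interface RerankFailureRow {
  reason: RerankFailureReason;
}

interface RerankerHealth {
  status: "ok" | "warn";
  message: string;
}

const TRANSIENT_WARN_THRESHOLD = 5;

function assessRerankerHealth(enabled: boolean, failures: RerankFailureRow[]): RerankerHealth {
  // Step 1: no events is interpreted through the enabled flag.
  if (failures.length === 0) {
    return { status: "ok", message: enabled ? "reranker healthy" : "reranker disabled" };
  }
  // Step 3: any auth failure warns.
  if (failures.some((f) => f.reason === "auth")) {
    return { status: "warn", message: "reranker auth failures in the last 7 days" };
  }
  // Step 4: any payload_too_large failure warns.
  if (failures.some((f) => f.reason === "payload_too_large")) {
    return { status: "warn", message: "reranker payload_too_large failures in the last 7 days" };
  }
  // Step 5: transient failures warn only at or above the threshold.
  const transient = failures.filter(
    (f) => f.reason === "network" || f.reason === "timeout" || f.reason === "rate_limit",
  ).length;
  if (transient >= TRANSIENT_WARN_THRESHOLD) {
    return { status: "warn", message: `${transient} transient reranker failures in the last 7 days` };
  }
  return { status: "ok", message: "reranker healthy (transient failures below threshold)" };
}
```

Step 2, reading the last 7 days of JSONL audit files, is left to the caller so the decision itself stays pure.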
* test(e2e): ZeroEntropy live API round-trip + wire into Tier 2 CI
test/e2e/zeroentropy-live.test.ts exercises the full stack against the
real api.zeroentropy.dev: embed (default 2560-dim + flexible 1280),
embedQuery (asymmetric query side), batch embed (3 distinct vectors),
rerank (3 docs sorted by relevance score, photosynthesis-relevant docs
beat the irrelevant cat doc), rerank with topN truncation.
Gated on `ZEROENTROPY_API_KEY`: every test prints `[skip]` and returns
early without assertions when the env var is unset, so fork PRs and
contributor machines without a ZE account stay green.
CI wire-up: `.github/workflows/e2e.yml` Tier 2 step adds
`test/e2e/zeroentropy-live.test.ts` to its `bun test` invocation and
exposes `ZEROENTROPY_API_KEY: ${{ secrets.ZEROENTROPY_API_KEY }}` to
the runner. The secret is set on garrytan/gbrain at the repo scope
(separately from this commit — set via `gh secret set` so the value
never lands in source).
Tier 1 stays mechanical (no API keys); Tier 2 is the natural home for
provider-live tests because it's already the API-keyed lane.
Cost: each full run fires ~6 small HTTP calls totaling well under a
cent at the published $0.025/1M-token rate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* v0.33.3.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker
Release notes for the ZeroEntropy support wave: zembed-1 embeddings
(flexible-dim 2560/1280/640/320/160/80/40, asymmetric input_type) and
zerank-2 cross-encoder reranking land as a new openai-compatible recipe
alongside OpenAI/Voyage. Reranker defaults ON for tokenmax mode, OFF
for conservative/balanced (~$0.0003/query at tokenmax topNIn=30; rounding
error vs the tier's $700/mo Opus pairing per the CLAUDE.md cost matrix).
Search now ends with `RRF -> dedup -> reranker -> token-budget` when
reranker is enabled; fails open to RRF order on any error class
(audit-logged at ~/.gbrain/audit/rerank-failures-*.jsonl).
`KNOBS_HASH_VERSION` bumps 1 -> 2 to fold reranker config into the
query_cache row key. Rolling-deploy operators should expect a temporary
cache hit-rate dip + cache-row doubling for hot queries (clears
naturally within `cache.ttl_seconds`, default 3600s).
Files in this commit are pure docs / version bump:
- VERSION + package.json bump to 0.33.3.0
- CHANGELOG.md release-summary entry with "How to take advantage" block
- CLAUDE.md Key Files annotations for the new recipe + rerank.ts +
rerank-audit.ts + gateway extensions
- docs/ai-providers/zeroentropy.md one-pager (setup, knob reference,
failure observability, troubleshooting table)
- skills/migrations/v0.33.3.md (purely informational: no required user
action; reranker is opt-in everywhere, ZE embedding is opt-in)
- llms-full.txt regenerated to match CLAUDE.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sume-from) (garrytan#1055)
* docs(designs): 2026-05 embedder shootout eval plan
Adds docs/designs/2026_05_EVAL_PLAN.md: the approved plan + 6 Conductor session briefs for the OpenAI vs Voyage vs ZeroEntropy embedder comparison.
Why: produce a publishable comparison report for v0.35.x release notes pinning "which embedder wins, and does zerank-2 carry the win for ZeroEntropy" against public LongMemEval + in-house BrainBench.
Each session brief is self-contained (repo, branch, commits, verify, ship, deliverable, hand-off). Stewardable one section per Conductor session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(pricing): add voyage-4-large + zembed-1 to EMBEDDING_PRICING
v0.35.0.0 shipped ZeroEntropy zembed-1 + zerank-2 reranker support and expanded the Voyage allow-list to include voyage-4-large. The pricing table missed both, so `gbrain upgrade`'s post-upgrade reembed prompt silently fell back to "estimate unavailable" for users on these models.
- voyage:voyage-4-large @ $0.18/MTok (same as voyage-3-large)
- zeroentropyai:zembed-1 @ $0.05/MTok
New test file pins both entries plus the openai/voyage-3-large baselines, case-insensitive provider matching, bare-model openai-default fallback, table integrity (lowercase providers, finite non-negative prices), and the estimateCostFromChars approximation. 11 cases, 46 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(exports): expose gbrain/ai/gateway with canary test
Adds ./ai/gateway to the package.json exports map so external eval consumers (notably gbrain-evals, the sibling repo running the embedder shootout in docs/designs/2026_05_EVAL_PLAN.md) can call configureGateway directly to swap embedding providers per cell.
Why: pre-v0.35.1.0, gbrain-evals adapters hardcoded gbrain/embedding, which means every retrieval adapter was OpenAI-only. The newly-exposed gateway lets adapters route through Voyage and ZeroEntropy without forking gbrain or duplicating the recipe wiring.
- package.json: add "./ai/gateway" -> "./src/core/ai/gateway.ts"
- scripts/check-exports-count.sh: bump expected count 17 -> 18
- test/public-exports.test.ts: add canary pinning configureGateway + embed, bump expected count assertion
Pre-existing import-resolution failures in this test file (16 on master) are unrelated to this change; they're a longstanding Bun package self-import behavior. The count + EXPECTED_EXPORTS list-match assertions both pass cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(eval): add --resume-from <jsonl> to gbrain eval longmemeval
Multi-cell embedder shootouts spend $50+/cell on the gpt-4o judge after gbrain emits hypotheses. A mid-run abort (rate-limit, cost-cap, OS interrupt, SIGKILL) previously meant re-paying the full cell. This flag makes those aborts cheap: re-invoke with --resume-from pointed at the partial JSONL and only the unanswered question_ids re-run.
Behavior:
- Read question_ids from the file; skip them on this run.
- Rows with non-empty hypothesis count as done.
- Rows with hypothesis="" AND an error field are NOT skipped (retry case for per-question failures recorded by the existing try/catch).
- Corrupt trailing lines (SIGKILL'd writer mid-line) are silently skipped with a stderr warn.
- When --resume-from path == --output path, the output emitter opens the file in append mode instead of truncating, so the existing rows survive.
- Empty resume case (all questions already done) returns immediately without spinning up the brain or calling the client.
New exported helper loadResumeSet() makes the parser unit-testable.
6 new test cases pinning:
- File-not-found returns empty set
- Well-formed JSONL load
- Error-row retry semantics (empty hypothesis + error -> not in set)
- Truncated final line recovery
- End-to-end resume against the 5-question mini fixture
- All-done early-return (stub client must NOT be invoked)
All 18 cases in test/eval-longmemeval.test.ts green; bun run typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: v0.35.1.0
Bumps VERSION + package.json + CHANGELOG entry for the embedder-shootout prereq release. Three additive changes from the prior 4 commits:
- pricing: voyage-4-large + zembed-1 entries
- exports: gbrain/ai/gateway is now public
- eval: gbrain eval longmemeval --resume-from <jsonl>
Each commit on this branch is independently bisect-friendly and CI-green; the CHANGELOG entry is the user-facing rollup. No migrations, no breaking changes: the gateway export expands the surface, the resume-from flag is additive, the pricing patch only changes "estimate unavailable" -> a real dollar figure for two specific models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
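The `--resume-from` row semantics described in the commit above can be sketched as a parser over JSONL text. `parseResumeSet` is a hypothetical stand-in for the exported `loadResumeSet()`, whose real signature is not shown in the source. Taking text instead of a file path, skipping any corrupt line (not only a trailing one), and retrying rows with an empty hypothesis but no error field are all assumptions.

```typescript
// Sketch only: builds the set of question_ids that count as already answered.
function parseResumeSet(
  jsonl: string,
  warn: (message: string) => void = (m) => console.warn(m),
): Set<string> {
  const done = new Set<string>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let row: unknown;
    try {
      row = JSON.parse(trimmed);
    } catch {
      // A writer killed mid-line leaves a truncated row; skip it with a warning.
      warn(`skipping corrupt JSONL line (${trimmed.length} chars)`);
      continue;
    }
    if (typeof row !== "object" || row === null) continue;
    const { question_id, hypothesis } = row as { question_id?: unknown; hypothesis?: unknown };
    if (typeof question_id !== "string") continue;
    // Only rows with a non-empty hypothesis count as done; error rows are retried.
    if (typeof hypothesis === "string" && hypothesis !== "") done.add(question_id);
  }
  return done;
}
```

A caller would then filter the question list against the returned set and return early when nothing remains, which is the all-done case the commit's tests pin.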
…ytan#1056)
* docs(designs): 2026-05 embedder shootout eval plan
Adds docs/designs/2026_05_EVAL_PLAN.md: the approved plan + 6 Conductor session briefs for the OpenAI vs Voyage vs ZeroEntropy embedder comparison.
Why: produce a publishable comparison report for v0.35.x release notes pinning "which embedder wins, and does zerank-2 carry the win for ZeroEntropy" against public LongMemEval + in-house BrainBench.
Each session brief is self-contained (repo, branch, commits, verify, ship, deliverable, hand-off). Stewardable one section per Conductor session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(pricing): add voyage-4-large + zembed-1 to EMBEDDING_PRICING
v0.35.0.0 shipped ZeroEntropy zembed-1 + zerank-2 reranker support and expanded the Voyage allow-list to include voyage-4-large. The pricing table missed both, so `gbrain upgrade`'s post-upgrade reembed prompt silently fell back to "estimate unavailable" for users on these models.
- voyage:voyage-4-large @ $0.18/MTok (same as voyage-3-large)
- zeroentropyai:zembed-1 @ $0.05/MTok
New test file pins both entries plus the openai/voyage-3-large baselines, case-insensitive provider matching, bare-model openai-default fallback, table integrity (lowercase providers, finite non-negative prices), and the estimateCostFromChars approximation. 11 cases, 46 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(exports): expose gbrain/ai/gateway with canary test
Adds ./ai/gateway to the package.json exports map so external eval consumers (notably gbrain-evals, the sibling repo running the embedder shootout in docs/designs/2026_05_EVAL_PLAN.md) can call configureGateway directly to swap embedding providers per cell.
Why: pre-v0.35.1.0, gbrain-evals adapters hardcoded gbrain/embedding, which means every retrieval adapter was OpenAI-only. The newly-exposed gateway lets adapters route through Voyage and ZeroEntropy without forking gbrain or duplicating the recipe wiring.
- package.json: add "./ai/gateway" -> "./src/core/ai/gateway.ts"
- scripts/check-exports-count.sh: bump expected count 17 -> 18
- test/public-exports.test.ts: add canary pinning configureGateway + embed, bump expected count assertion
Pre-existing import-resolution failures in this test file (16 on master) are unrelated to this change; they're a longstanding Bun package self-import behavior. The count + EXPECTED_EXPORTS list-match assertions both pass cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(eval): add --resume-from <jsonl> to gbrain eval longmemeval
Multi-cell embedder shootouts spend $50+/cell on the gpt-4o judge after gbrain emits hypotheses. A mid-run abort (rate-limit, cost-cap, OS interrupt, SIGKILL) previously meant re-paying the full cell. This flag makes those aborts cheap: re-invoke with --resume-from pointed at the partial JSONL and only the unanswered question_ids re-run.
Behavior:
- Read question_ids from the file; skip them on this run.
- Rows with non-empty hypothesis count as done.
- Rows with hypothesis="" AND an error field are NOT skipped (retry case for per-question failures recorded by the existing try/catch).
- Corrupt trailing lines (SIGKILL'd writer mid-line) are silently skipped with a stderr warn.
- When --resume-from path == --output path, the output emitter opens the file in append mode instead of truncating, so the existing rows survive.
- Empty resume case (all questions already done) returns immediately without spinning up the brain or calling the client.
New exported helper loadResumeSet() makes the parser unit-testable.
6 new test cases pinning:
- File-not-found returns empty set
- Well-formed JSONL load
- Error-row retry semantics (empty hypothesis + error -> not in set)
- Truncated final line recovery
- End-to-end resume against the 5-question mini fixture
- All-done early-return (stub client must NOT be invoked)
All 18 cases in test/eval-longmemeval.test.ts green; bun run typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: v0.35.1.0
Bumps VERSION + package.json + CHANGELOG entry for the embedder-shootout prereq release. Three additive changes from the prior 4 commits:
- pricing: voyage-4-large + zembed-1 entries
- exports: gbrain/ai/gateway is now public
- eval: gbrain eval longmemeval --resume-from <jsonl>
Each commit on this branch is independently bisect-friendly and CI-green; the CHANGELOG entry is the user-facing rollup. No migrations, no breaking changes: the gateway export expands the surface, the resume-from flag is additive, the pricing patch only changes "estimate unavailable" -> a real dollar figure for two specific models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(eval): longmemeval adapter handles _s split + sanitizes session_id slugs
Three tightly-coupled bugs blocked `gbrain eval longmemeval` against the public LongMemEval _s split from HuggingFace (the dataset every shootout cell needs):
1. HAYSTACK SHAPE: the _s split serializes haystack_sessions as LongMemEvalTurn[][] (each inner array is one session's turns directly) plus a parallel `haystack_session_ids: string[]` field. The pre-v0.35.1.1 adapter expected only the oracle `{session_id, turns}` shape and crashed with `session.turns is undefined` on every question. Fix: new `normalizeSessions` helper accepts both shapes, mirroring the proven `normalizeSessions` in gbrain-evals/eval/runner/longmemeval.ts.
2. SLUG VALIDATOR: the _s split's session_ids look like `sharegpt_yywfIrx_0`, underscored and mixed-case. The v0.32.7 CJK wave's `validatePageSlug` rejects both (allowed set is `[a-z0-9-]` case-insensitive, slash-separated). Fix: `sanitizeSessionIdForSlug` lowercases and replaces `_` + `.` + any other non-[a-z0-9-] character with `-`. The frontmatter `session_id:` keeps the original verbatim for downstream JSONL emit; only the SLUG is rewritten.
3. INTERFACE: `LongMemEvalQuestion.haystack_sessions` typed as a union of `LongMemEvalSession[] | LongMemEvalTurn[][]` so TypeScript callers see both shapes are accepted. New `haystack_session_ids?: string[]` field documented as parallel to the array-of-turns shape.
Pre-v0.35.1.1 caught by a fresh smoke pre-spend (3 questions × ZE @ 2560 → 3 errors). Post-fix: 3/3 OK with non-empty hypotheses, single-session recall measured (low on a 3-question sample but the pipeline runs).
2 new regression test cases pinning:
- _s split shape normalizes (slugs sanitized + frontmatter preserves original session_id + dates flow through)
- _s split with missing haystack_session_ids synthesizes `lme_<question_id>_<i>` ids
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): configure AI gateway before running gbrain eval longmemeval
v0.28.8 skipped connectEngine() for `gbrain eval longmemeval` so the subcommand could run on machines without a configured brain. Side effect (silent until v0.35.1.0 made it observable via the embedder shootout): the gateway was never configureGateway()'d either, so the first embed call inside importFromContent crashed with "AI gateway is not configured. Call configureGateway() during engine connect."
Fix: call configureGateway() before runEvalLongMemEval, mirroring the connectEngine() path. Reads `~/.gbrain/config.json` when present; falls back to env vars (GBRAIN_EMBEDDING_MODEL, GBRAIN_EMBEDDING_DIMENSIONS, OPENAI_API_KEY, etc.) when there's no config, preserving the v0.28.8 "runs on fresh machine" property. Gated on the --help short-circuit so `gbrain eval longmemeval --help` still works without spinning up the gateway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: v0.35.1.1
Bumps VERSION + package.json + CHANGELOG entry for the longmemeval fix wave. Three commits this branch:
1. fix(eval): adapter handles _s split + sanitizes session_id slugs
2. fix(cli): configure AI gateway before running gbrain eval longmemeval
3. chore: v0.35.1.1
Each commit independently bisects; CHANGELOG entry is the user-facing rollup. No schema migration; no breaking change. Caught pre-spend by smoking Phase 1 of the embedder shootout; would otherwise have wasted ~$476 in judge tokens across 7 cells.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: retrigger workflows
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
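The slug sanitization and dual-shape session handling from the longmemeval fix commit above can be sketched as follows. The function names `sanitizeSessionIdForSlug` and `normalizeSessions` and the `lme_<question_id>_<i>` id pattern come from the commit message; the bodies, the `LmeTurn` fields (`role`, `content`), and the parameter order are assumptions for illustration.

```typescript
// Sketch only: illustrative bodies, not the repo's adapter code.
interface LmeTurn {
  role: string; // assumption: turn fields are not specified in the source
  content: string;
}

interface LmeSession {
  session_id: string;
  turns: LmeTurn[];
}

/** Lowercases and replaces every character outside [a-z0-9-] with "-". */
function sanitizeSessionIdForSlug(sessionId: string): string {
  return sessionId.toLowerCase().replace(/[^a-z0-9-]/g, "-");
}

/** Accepts the oracle `{session_id, turns}` shape or the _s split's turn arrays. */
function normalizeSessions(
  questionId: string,
  haystack: LmeSession[] | LmeTurn[][],
  sessionIds?: string[],
): LmeSession[] {
  const entries: ReadonlyArray<LmeSession | LmeTurn[]> = haystack;
  return entries.map((entry, i) => {
    if (Array.isArray(entry)) {
      // _s split: ids come from the parallel array, or are synthesized when missing.
      return { session_id: sessionIds?.[i] ?? `lme_${questionId}_${i}`, turns: entry };
    }
    return entry;
  });
}
```

Only the slug is sanitized; the original `session_id` stays on the normalized session so it can be written verbatim to frontmatter and JSONL output, as the commit describes.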
Important: Review skipped. Too many files! This PR contains 217 files, which is 67 over the limit of 150.
9f7229a to 8440fb1
8440fb1 to de69dd0
…m-v0.35.1.1
# Conflicts:
#   AGENTS.md
#   README.md
#   llms-full.txt
#   src/cli.ts
#   src/commands/auth.ts
#   src/commands/embed.ts
#   src/commands/eval-replay.ts
#   src/commands/upgrade.ts
#   src/core/ai/dims.ts
#   src/core/ai/gateway.ts
#   src/core/embedding-pricing.ts
#   src/core/engine.ts
#   src/core/pglite-engine.ts
#   src/core/postgres-engine.ts
#   src/core/search/hybrid.ts
#   src/core/types.ts
#   test/ai/gateway.test.ts
#   test/book-mirror.test.ts
#   test/eval-contradictions-integrations.test.ts
#   test/voyage-response-cap.test.ts
de69dd0 to 6336047
TLDR
Closes #101.
This PR catches Eva Brain up from `172dbcc` to upstream GBrain `f004a27` / `v0.35.1.1` while keeping Eva as a thin fork: upstream owns the core GBrain database/search/sync/provider/media primitives; Eva preserves OpenClaw-native install, no-key OAuth extraction, support-KB packaging, Voyage 4 Large 2048d defaults, and safe public updater behavior.
Upstream v0.35.1.1 Fixes Accepted
- v0.33.1.0 whoknows expertise/routing.
- `output_dimension`, flexible dimension validation, and large-response OOM caps.
Preserved Eva Product Surface
- `/plugins/gbrain/extract` remain the product extraction path.
- `CodexExtractionClient`, `import-media`, and `ingest-media --extract openclaw` remain intact as transitional OpenClaw adapter surfaces.
- `provider_auth`, `.gbrain/gbrain.env`, and OpenClaw credential-source behavior stay layered on top of upstream's provider gateway.
- `electricsheephq/eva-brain`, install the Codex Desktop plugin, install the OpenClaw plugin, and support the OpenClaw support KB.
Conflict Decisions
- `provider_auth` so OpenClaw-owned credentials still resolve before env fallback. Empty-model OpenAI-compatible providers remain available when an operator supplied a concrete model.
- `$0.12/M` tokens, while preserving `voyage-3-large` at `$0.18/M`. The estimator now tests Voyage 4 Large, v4, v4-lite, and ZeroEntropy.
- `getPage`, `softDeletePage`, `restorePage`, and `putRawData` target `default`; files upsert on `(source_id, storage_path)`; stale embed and takes paths retain explicit source filters.
- `GBRAIN_SKILLS_DIR`; OpenClaw restart prefers systemd on customer hosts and falls back to `openclaw gateway restart`; support-KB refresh syncs/embeds only `openclaw-support-kb`.
- `llms-full.txt` and fixed metric-glossary generation so `git diff --check` and the freshness guard agree.
Adversarial Fixes From Review
- source_id identity. Eva's federated OAuth reads can search several sources while still carrying a primary sourceId, so this PR now adds the canonical source set into the cache knobs hash. That keeps federated results from replaying into later scalar/default-only searches without adding a schema migration.
- find_contradictions still allows default local contexts, but source-scoped/non-default contexts cannot be invoked without an authorized source.
- models command test mock now covers the upstream v0.35 provider exports and the new zero-network probes.
- gbrain search modes|stats|tune now routes to the upstream search-mode command while plain gbrain search <query> still uses keyword search.
- schema-embedded.ts could reference oauth_clients.source_id / federated_read before v60/v61 migrations created them. Eva now forward-bootstraps those columns in both PGLite and Postgres before replaying the embedded schema.
- ParamDef → JSON Schema mapping is centralized and recursive, so HTTP MCP, stdio MCP, and minion tool schemas all keep items.type for arrays such as extract_facts.entity_hints.
- -c SSRF flags stay before clone/pull; --no-recurse-submodules now sits after the subcommand where real git accepts it.
- extract_facts treats slugs: [] as an explicit zero-page incremental set, not as a request to walk the full brain.

Validation
Local focused validation was run from /Volumes/LEXAR/repos/eva-brain-upstream-v0.35.1.1.

- bun install --frozen-lockfile
- bun run build:llms
- git diff --check
- git diff --cached --check
- bun run check:eval-glossary
- bun run check:exports-count
- bun run check:source-id-projection
- bun run check:cli-exec
- bun run typecheck
- bun test test/build-llms.test.ts test/local-updater-contract.test.ts test/install-contract.test.ts test/openclaw-gbrain-plugin-contract.test.ts test/codex-extraction-client.test.ts test/embedding-pricing.test.ts
- bun test test/engine-upsertFile.test.ts test/embed-stale-source.serial.test.ts test/put-page-namespace.test.ts test/operation-context-sourceid-required.test.ts
- bun test test/ai/gateway.test.ts test/ai/dims-zeroentropy.test.ts test/voyage-response-cap.test.ts test/openai-compat-multimodal.test.ts test/ai/zeroentropy-recipe.test.ts test/ai/zeroentropy-compat-fetch.test.ts
- bun test test/e2e/source-isolation-pglite.test.ts test/e2e/embed-stale-pagination.test.ts test/e2e/source-routing.test.ts
- bun test test/eval-candidates.test.ts test/reindex-code.test.ts
- bun test test/build-llms.test.ts test/pglite-engine.test.ts test/eval-contradictions-integrations.test.ts test/voyage-multimodal.test.ts
- bun test test/ai/auth.serial.test.ts test/openai-compat-multimodal.test.ts test/ai/gateway.test.ts test/postgres-engine.test.ts test/engine-upsertFile.test.ts test/embed-stale-source.serial.test.ts test/openclaw-gbrain-plugin-contract.test.ts test/codex-extraction-client.test.ts test/local-updater-contract.test.ts test/install-contract.test.ts test/doctor-remote.test.ts test/skillpack-sync-guard.test.ts
- bun test test/hybrid-search-lite.serial.test.ts test/query-cache.test.ts test/query-cache-knobs-hash.test.ts test/commands/models.serial.test.ts
- bun test test/commands-search.test.ts test/cli.test.ts
- bun test test/schema-bootstrap-coverage.test.ts test/mcp-tool-defs.test.ts test/git-remote.test.ts test/extract-facts-phase.test.ts
- bun test test/brain-allowlist.test.ts test/v0_29-tool-surfaces.test.ts test/e2e/serve-http-oauth.test.ts
- bun test test/e2e/postgres-bootstrap.test.ts (skips locally without DATABASE_URL; added Postgres regression coverage for the v54 OAuth bootstrap shape)
- gbrain init --pglite --embedding-model voyage:voyage-4-large --embedding-dimensions 2048 --json, gbrain import <docs> --no-embed --json, gbrain search pr102canaryneedle, gbrain search modes --json, gbrain doctor --json --fast
- litellm:test-embed at 8 dimensions, imported one markdown file with --no-embed, ran embed --stale against a local OpenAI-compatible test server, verified one outbound embedding request, verified embed --stale --dry-run reported zero stale chunks, searched the canary phrase with --no-embed, and ran doctor --fast --json with score 95.

Full bun test / E2E / CodeQL / gitleaks should run in GitHub Actions per our local-resource policy.

Follow-Up Review Gate
The second 95% confidence review wave covered:
The original blocking finding was federated query-cache isolation, which is fixed in this branch. The final deploy-risk wave found three more practical blockers or near-blockers in upstream open issues/PRs: garrytan#1092, garrytan#1053, and garrytan#1096. Those are also fixed in this branch. Upstream garrytan#1078/garrytan#1079 remains a conditional autopilot/dream risk, but the Eva fleet updater/support-KB path is source-scoped and does not call that generic cycle sync path.
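
For reviewers who want a concrete picture of the federated query-cache isolation fix, here is a minimal sketch of folding a canonical source set into a cache knobs hash. It is an illustration only, not the code in this branch: `CacheKnobs`, `canonicalSourceSet`, and `knobsHash` are hypothetical names, and the real knobs almost certainly carry more fields than shown.

```typescript
import { createHash } from "node:crypto";

// Hypothetical knob shape; the actual field names in the branch may differ.
interface CacheKnobs {
  mode: string;
  limit: number;
  sourceId: string;     // primary source carried by every request
  sourceIds?: string[]; // full source set for a federated read, if any
}

// Deduplicate and sort so that ordering and repeats cannot change the key.
// A scalar request falls back to a one-element set holding its primary source.
function canonicalSourceSet(knobs: CacheKnobs): string[] {
  const all =
    knobs.sourceIds && knobs.sourceIds.length > 0
      ? knobs.sourceIds
      : [knobs.sourceId];
  return [...new Set(all)].sort();
}

// Hash the knobs together with the canonical source set.
function knobsHash(knobs: CacheKnobs): string {
  const payload = JSON.stringify({
    mode: knobs.mode,
    limit: knobs.limit,
    sourceId: knobs.sourceId,
    sources: canonicalSourceSet(knobs),
  });
  return createHash("sha256").update(payload).digest("hex");
}

const scalar = knobsHash({ mode: "hybrid", limit: 10, sourceId: "default" });
const federated = knobsHash({
  mode: "hybrid",
  limit: 10,
  sourceId: "default",
  sourceIds: ["default", "openclaw-support-kb"],
});
const federatedReordered = knobsHash({
  mode: "hybrid",
  limit: 10,
  sourceId: "default",
  sourceIds: ["openclaw-support-kb", "default", "default"],
});

console.log(scalar !== federated);             // true
console.log(federated === federatedReordered); // true
```

Because the source set lives in the hash input rather than in a new column, a federated read and a later default-only search with the same primary `sourceId` land on different cache keys, which is consistent with the claim that no schema migration is needed.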