Phase 7 unified-graph upgrade + Phase 7.5 wiring fixes (FYI/visibility branch from ernes-toe fork) #5
Open
ernes-toe wants to merge 197 commits into
… PPR, HNSW, reranker, Louvain, LME bench
Seven additive patches. Every existing default is preserved; new capabilities
are opt-in via new constructor params with graceful fallbacks if deps missing.
1. Salience decay (memory_client.py)
- _effective_salience(): base * exp(-k*age) + log1p(access) * alpha
- Applied in both C++ fast-path and Python path of recall()
- Non-persistent (computed on read) — no write contention
- Existing stored salience column becomes the "base" the dream engine can nudge
2. Bi-temporal edges (memory_client.py SQLite schema)
- connections table gains event_time, ingestion_time, valid_from, valid_to
(all NULL by default; pre-existing edges are always-valid)
- Idempotent ALTER TABLE migration on open
- add_connection() accepts the new fields; get_connections(at_time=...)
filters to edges valid at a given instant. Graphiti-style.
3. Cross-encoder reranker (memory_client.py)
- Opt-in via NeuralMemory(rerank=True, rerank_model=...)
- Uses sentence-transformers CrossEncoder lazily; silent no-op if absent
- Reranks top-k*3 after initial scoring in both C++ and Python paths
4. PPR engine for think() (memory_client.py)
- think(engine='ppr', alpha=0.15) runs Personalized PageRank (HippoRAG-2 style)
- Default engine='bfs' preserves the original decay-BFS spreading activation
- networkx preferred; pure-numpy power-iteration fallback when unavailable
5. HNSW index + lazy graph load (memory_client.py)
- Opt-in hnswlib index for Python-only retrieval path (when C++ bridge absent)
- lazy_graph=True defers _load_from_store; nodes hydrate on demand via
_ensure_node(). PPR in lazy mode expands two hops before running.
- Auto capacity growth; graceful disable if hnswlib import fails
6. Louvain community detection (dream_engine.py Insight phase)
- _detect_communities(): networkx louvain_communities first, BFS fallback
- Deterministic seed=42 so repeated dreams yield comparable cuts
7. LongMemEval-style benchmark (benchmarks/lme_eval.py)
- Synthetic 15-record smoke corpus built-in; --dataset for real LME JSONL
- Reports Recall@{1,5,10}, MRR, p50/p95 latency
- Flags for --rerank / --use-hnsw / --engine to A/B configurations
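The salience formula in item 1 can be sketched as a pure read-time function. This is a minimal illustration under stated assumptions, not the code in memory_client.py: the function name, the units of `age` (seconds), and the default values of `k` and `alpha` are all hypothetical.

```python
import math
import time


def effective_salience(base, created_at, access_count, now=None,
                       k=1e-6, alpha=0.05):
    """Sketch of base * exp(-k * age) + log1p(access) * alpha.

    Computed on read and never written back, so there is no write
    contention. `k` and `alpha` defaults are assumed values.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - created_at)  # clamp clock skew to zero age
    decayed = base * math.exp(-k * age)
    access_boost = math.log1p(access_count) * alpha
    return decayed + access_boost
```

Because the stored column is only the base, anything that nudges the base (such as the dream engine) and the read-time decay stay independent of each other.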
Also updates install.sh to probe for networkx and hnswlib as optional deps
with the same warn-if-absent pattern used for sentence-transformers.
Tests: existing test_suite.py and test_integration.py pass; the two failing
tests on this machine are pre-existing (C++ library not built locally).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
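The power-iteration fallback named in item 4 can be sketched without networkx or numpy. The version below is an assumption-laden illustration rather than the shipped think() engine: it treats `alpha` as the restart probability, takes the graph as a plain dict of weighted out-edges, and sends the mass of dangling nodes back to the seeds.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, iters=100, tol=1e-12):
    """Power iteration for Personalized PageRank (sketch).

    edges: {node: {neighbor: weight}}  (hypothetical input shape)
    seeds: {node: weight} restart distribution, normalized here.
    alpha: restart probability (an assumption about the parameter's meaning).
    """
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("seeds must carry positive weight")
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs} | set(seeds)
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            w_sum = sum(out.values())
            if w_sum == 0:
                # Dangling node: hand its mass back to the seed set.
                for n in nodes:
                    nxt[n] += (1 - alpha) * rank[u] * restart[n]
                continue
            for v, w in out.items():
                nxt[v] += (1 - alpha) * rank[u] * w / w_sum
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank
```

On a three-node cycle seeded at one node, the scores fall off with distance from the seed, which is the behavior that makes PPR useful for ranking memories near a query's seed nodes.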
sync.sh propagates python/ → hermes-plugin/. The hermes-agent plugin dir
(~/.hermes/hermes-agent/plugins/memory/neural) is a symlink into the latter,
so this commit seals the Phase-B upgrades as the live plugin code.
Mirrors commit 2dbf4e0: salience decay, bi-temporal edges, cross-encoder
reranker, PPR think() engine, HNSW+lazy-load, Louvain community detection.
Verified end-to-end through ~/.hermes/hermes-agent/venv/bin/python3 (3.11):
- import NeuralMemory OK, HNSW active, networkx+hnswlib available
- recall() surfaces salience_factor, bi-temporal at_time filter expires
edges correctly
- think(engine='ppr') returns results
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Seven follow-ups on the Phase B branch. All additive; defaults
preserve prior behavior.
H1 — hnsw_ef constructor param (was hardcoded)
H5 — salience_multiply opt-out flag (clean revert for Bucket-C shift)
H7 — stats() reports feature availability (hnsw_active, louvain_available,
reranker_loaded, salience_multiply, rerank_enabled, hnsw_ef, cpp_available)
H10 — neural_dashboard as 7th plugin tool (wraps tools/dashboard/generate.py)
H11 — tools/compact.py weekly compaction (dry-run default; sticky-label whitelist)
H12 — ~/.local/bin/remember shell CLI (cross-agent write + recall)
H13 P2 — NeuralMemoryProvider.on_memory_write() now mirrors built-in memory
writes into neural-memory with rotation-candidate vs mirror-from-default
labels via _is_identity_grade() heuristic; Phase 1 skill shipped separately
at ~/.hermes/skills/meta/dual-memory-rotation-hygiene/SKILL.md
Also: tools/obsidian_sync.py (live-graph generator — Phase 8 of the
obsidian vault build at neural-memory-vault).
Tests: 33/35 pass (2 failing pre-existing — C++ library not built locally).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Polls known git repos (neural-memory, pulse-hermes, LangGraph, hermes-agent)
and Obsidian vaults for changes since last run. Writes compact notes into
neural-memory via the `remember` CLI (H12).
v1 scope:
- git commits: last_sha..HEAD per tracked repo, no-merges
- vault edits: files modified within MAX_AGE_MIN (60 min default)
- state persisted to ~/.neural_memory/observer-state.json
- max-events cap prevents flooding on first run after downtime
- all events carry `observer:git:*` or `observer:vault:*` source labels
so compaction (H11) can target them if they turn out to be noise
v2 (deferred): Haiku filter + Opus extract stages for richer content.
Current v1 is zero-LLM — just passes commit subjects through.
Launchd plists at ~/Library/LaunchAgents/:
- com.ae.pulse-ingest.plist (daily 06:00 — A5)
- com.ae.neural-observer.plist (every 15 min — A6)
First live tick ingested 10 git commits from LangGraph + hermes-agent into
neural-memory. Corpus 58 → 68 memories.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Handles the actual LongMemEval JSON shape (haystack_sessions +
answer_session_ids + question + answer). Different from the synthetic
lme_eval.py which expected fact+paraphrased-query pairs.
For each record:
1. Flatten haystack_sessions into individual turns
2. Seed each turn into memory with label `lme:{qid}:{sess_id}:{turn_idx}`
3. recall(question, k=10)
4. Score: rank of first result whose session_id is in answer_session_ids
Reports R@1/R@5/R@10, MRR, p50/p95 latency.
--max flag caps records (default 20) to keep runs tractable.
Dataset: huggingface.co/datasets/xiaowu0162/longmemeval (note: no underscore
before "eval" in the repo name — the suffix is "eval" not "_eval").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
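The scoring step described above (rank of the first result whose session is an answer session, then R@k and MRR) can be sketched as follows. The helper names are hypothetical, and the label parsing assumes that neither the question id nor the turn index contains a colon.

```python
def session_id_from_label(label):
    """Pull sess_id out of a label shaped like lme:{qid}:{sess_id}:{turn_idx}."""
    _prefix, _qid, rest = label.split(":", 2)
    sess_id, _turn_idx = rest.rsplit(":", 1)
    return sess_id


def first_hit_rank(result_session_ids, answer_session_ids):
    """1-based rank of the first result from an answer session, else None."""
    answers = set(answer_session_ids)
    for rank, sess_id in enumerate(result_session_ids, start=1):
        if sess_id in answers:
            return rank
    return None


def summarize(ranks, ks=(1, 5, 10)):
    """Recall@k and MRR over per-question first-hit ranks (None = miss)."""
    n = len(ranks)
    if n == 0:
        return {k: 0.0 for k in ks}, 0.0
    recall = {k: sum(1 for r in ranks if r is not None and r <= k) / n
              for k in ks}
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    return recall, mrr
```

A miss contributes zero to both metrics, so capping recall at k=10 and capping the MRR denominator at the number of questions stay consistent with each other.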
- neural_dashboard added to tool list (H10)
- New Phase B upgrades section: salience decay, bi-temporal edges,
cross-encoder reranker, PPR think() engine, HNSW+lazy graph, Louvain
community detection, LongMemEval benchmarks
- Optional deps table (sentence-transformers / networkx / hnswlib / pyodbc)
- Feature-state introspection example (mem.stats())
- Maintenance tools section: tools/compact.py, tools/observer.py,
tools/obsidian_sync.py, tools/dashboard/generate.py
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…SW/rerank/stats
11 new tests (tag: phase-b) covering the highest-risk Phase B items:
- salience: factor range clamp, access boost, age decay
- bi-temporal: at_time filter includes valid, excludes expired edges
- ppr: engine returns results; lazy_graph mode hydrates subgraph on think()
- louvain: dense triangles + weak bridge → ≥2 communities when networkx present
- hnsw: use_hnsw=False silent fallback to brute-force
- rerank: rerank=True with nonexistent model silent no-op (no crash)
- stats: H7 feature-flag keys present + honored
- salience off-switch (H5): salience_multiply=False works
Full test suite: 45 passed / 1 failed / 1 skipped (the 1 fail is pre-existing,
C++ library not built on this machine).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sistence
H3 Conflict × Bi-Temporal
When `remember()` supersede fires (cosine > 0.7 + content differs), call
`store.set_edges_valid_to(conflict_id, now)` to invalidate the old edges
temporally. They remain queryable via `get_connections(id, at_time=past)`
but default recall ignores them. Also clears stale in-memory graph edges.
H4 HNSW Persistence
Save/load the hnswlib index to disk alongside the DB (`<db>.hnsw.bin`).
Cold-start with valid cache: ~60ms vs minutes of bulk rebuild. Periodic save
every 50 writes (`hnsw_save_every`). Staleness check via
`get_current_count() == expected_count`. Rebuild on mismatch. `close()`
flushes final save.
H6 Dream × Bi-Temporal
`DreamBackend.prune_weak()` now soft-deletes via `UPDATE ... SET valid_to=now`
(when bi-temporal columns exist) instead of `DELETE FROM connections`. Falls
back to hard-delete on pre-migration schemas. `DreamBackend.add_bridge()`
stamps `ingestion_time = valid_from = now` on REM bridges, + edge_type='rem_bridge'.
Also: `SQLiteStore.get_connections()` default behavior now filters expired
edges (valid_to IS NULL or valid_to > now). Explicit `include_expired=True`
kwarg returns everything for audit/replay.
Full test suite: 45 passed / 1 failed / 1 skipped (pre-existing C++ absent).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
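The soft-delete plus default expiry filter described in H3 and H6 can be sketched against an in-memory SQLite table. This is only an illustration: the `source_id`/`target_id` column names and the function names are assumptions, and the additional `valid_from` check on `at_time` queries is my reading of "valid at a given instant", not something the commit spells out.

```python
import sqlite3
import time


def create_schema(conn):
    # Hypothetical minimal connections table with bi-temporal columns.
    conn.execute(
        "CREATE TABLE connections (source_id INTEGER, target_id INTEGER, "
        "weight REAL, valid_from REAL, valid_to REAL)"
    )


def soft_delete_edge(conn, source_id, target_id, now):
    """Expire an edge by stamping valid_to instead of deleting the row."""
    conn.execute(
        "UPDATE connections SET valid_to = ? "
        "WHERE source_id = ? AND target_id = ? AND valid_to IS NULL",
        (now, source_id, target_id),
    )


def get_connections(conn, node_id, at_time=None, include_expired=False):
    """Edges touching node_id; expired edges are hidden unless requested."""
    sql = ("SELECT source_id, target_id, weight FROM connections "
           "WHERE (source_id = ? OR target_id = ?)")
    params = [node_id, node_id]
    if not include_expired:
        t = time.time() if at_time is None else at_time
        # NULL bounds are treated as unbounded (always valid on that side).
        sql += (" AND (valid_from IS NULL OR valid_from <= ?)"
                " AND (valid_to IS NULL OR valid_to > ?)")
        params += [t, t]
    return conn.execute(sql, params).fetchall()
```

Querying with a past `at_time` still returns the edge, which is what keeps superseded links available for audit and replay.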
Adds unit test for the H13 Phase 2 rotation heuristic: identity-grade
strings are preserved in default memory, episodic/factual strings route to
neural-memory. Proxies the plugin's implementation to avoid importing
hermes-agent's runtime (plugin __init__.py has top-level
agent.memory_provider import). Covers the untested-heuristic item flagged in
Review 04 of the vault.
Full suite: 46 passed / 1 failed (pre-existing C++ absent) / 1 skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…efore generic architectural match (caught by H8 test)
Three Auto-Dream-inspired lifecycle improvements ported into neural-memory.
All additive; defaults preserve prior behavior unless opt-out specified.
H18 Date Normalization
`NeuralMemory._normalize_dates(text, ref_time)` static method.
Converts relative dates ("yesterday", "last week", "N days ago",
"tomorrow", "this morning", etc.) to absolute ISO ("on 2026-04-25 ...").
Applied in `remember()` via new `normalize_dates=True` default kwarg.
Conservative: leaves ambiguous phrasings ("a couple days ago") untouched.
H19 Active Contradiction Replacement
Schema additive: `superseded_memories` table with original_id, content,
label, embedding, salience, superseded_by, superseded_at, superseded_reason.
`SQLiteStore.archive_superseded()` + `replace_memory()` methods.
`remember()` supersede branch rewired: archive old row to audit table,
replace `memories` row in-place with new content. No more `[SUPERSEDED]`
prefix bloating stored content. Defensive fallback to legacy prefix on
archive failure. H3 edge invalidation + in-memory graph cleanup preserved.
H20 Sub-Agent Dream Dispatch
`DreamEngine.dream_now(dispatch='inline'|'subprocess')`.
- `inline` (default): runs in-process, blocks until complete (preserved).
- `subprocess`: spawns Python subprocess that re-opens the DB + runs cycle.
Returns immediately with job_id. Status file at
`~/.neural_memory/dream-jobs/<job_id>.json`. SQLite WAL allows concurrent
read+write; recall in parent not blocked by dream.
`DreamEngine.dream_status(job_id)` polls completion.
Verified end-to-end:
- H18: 7 normalization test cases pass (yesterday/last week/N days ago/etc.)
- H19: in-place supersede preserves id; current row clean; audit row exists
with cosine reason; chain of N supersessions → N archive rows
- H20: subprocess returns in 10ms; dream completes async; status polls
Pairs with new vault notes:
06 — Roadmap/Tier 1 Hardening/H18 Date Normalization.md
06 — Roadmap/Tier 1 Hardening/H19 Active Contradiction Replacement.md
06 — Roadmap/Tier 1 Hardening/H20 Sub-Agent Dream Dispatch.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
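The H18 normalization can be illustrated with a small regex pass. This sketch covers only three of the listed phrasings ("yesterday", "tomorrow", "N days ago"); the function name and the exact output wording are assumptions modeled on the "on 2026-04-25" example, not the shipped `_normalize_dates`.

```python
import re
from datetime import datetime, timedelta

# Day offsets for the unambiguous single-word forms.
_SIMPLE_OFFSETS = {"yesterday": -1, "tomorrow": 1}


def normalize_dates(text, ref_time):
    """Rewrite a few relative dates to absolute ISO dates (sketch)."""

    def iso(days):
        return (ref_time + timedelta(days=days)).date().isoformat()

    def repl_simple(match):
        return f"on {iso(_SIMPLE_OFFSETS[match.group(0).lower()])}"

    def repl_n_days(match):
        return f"on {iso(-int(match.group(1)))}"

    text = re.sub(r"\b(yesterday|tomorrow)\b", repl_simple, text,
                  flags=re.IGNORECASE)
    # Only digit counts match, so "a couple days ago" is left untouched.
    text = re.sub(r"\b(\d+) days? ago\b", repl_n_days, text,
                  flags=re.IGNORECASE)
    return text
```

Requiring a literal digit count is one way to get the conservative behavior the commit describes: phrasings without a definite offset simply never match.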
H14 (benchmarks/lme_real.py):
--no-auto-connect : bulk-seed bypass for retrieval-only benchmarks
--batch-embed N : ~10x throughput via embed_batch() instead of per-turn
A6 (tools/observer.py):
Always-on observer poller (15-min cadence via launchd
com.ae.neural-observer). Watches git commits + vault edits + project file
changes; filter+extract via LLM; write to neural-memory with provenance
labels.
Both already shipped to disk + verified working (114/114
jackrabbit-wonderland tests pass with these flags; observer-state.json
updates every 15 min).
Long-overdue VC catch-up: these have been uncommitted since the 2026-04-25
late-night ship-burst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
python/schema_upgrade.py: extends the SCHEMA + _migrate_bitemporal pattern
from memory_client.py:120-142 with 16 additive memories columns and 7
truly-new connections columns (3 of the spec's 10 already present from
earlier _migrate_bitemporal work, skipped by idempotent guard). All ALTER
TABLE ADD COLUMN; no row rewrites, no destructive changes.
New memories cols: kind, confidence, valid_from, valid_to, transaction_time,
origin_system, source, metadata_json, memory_visibility, pin_state,
decay_rate, reuse_count, last_reinforced_at, extracted_entities_json,
locus_id, procedural_score.
New connections cols: confidence, transaction_time, origin_system, salience,
last_strengthened_at, evidence_count, metadata_json.
transaction_time is added alongside the existing ingestion_time. Both
coexist; transaction_time is the canonical name going forward (Phase 7 spec);
legacy ingestion_time data preserved for backward compat. Backfill deferred
to Commit 2 (retain-time typing).
python/test_schema_upgrade.py: 5 stdlib-unittest contracts:
adds_memory_columns, adds_connection_columns, is_idempotent,
preserves_existing_records, legacy_columns_unchanged. Stdlib instead of
pytest to avoid adding a dep (subtract-not-extend).
.gitignore: backups/ excluded (operational rollback artifacts; not source).
Pre-commit verification:
- 5/5 unit tests pass on tmp DBs
- Migration applied to live-shape backup DB (3.5GB / 231 memories / 10468
connections / WAL-mode): 8→24 memory cols, 10→17 connection cols; row counts
unchanged; second-run no-op confirms idempotency on real data
- Pre-write reviewer audit found no column-name collisions, no constraint
conflicts, no test-fixture breakage
Live ~/.neural_memory/memory.db NOT yet migrated; pending explicit
authorization. Backup at backups/memory_pre_phase_b_20260501T171948Z.db
(SHA-256 0210a4e6...c7d2b30) provides instant rollback.
Refs (in claude-memory PRIVATE repo):
- reference_neural_memory_execution_addendum.md lines 60-150 (Commit 1 contract)
- reference_neural_memory_unified_integration_handoff.md Section 6.1
- project_hermes_ecosystem_sprint2_v3_recon.md (Sprint 2 anchor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
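The idempotent-guard pattern this commit relies on (ALTER TABLE ADD COLUMN only for columns that are not already present) can be sketched with `PRAGMA table_info`. The helper name and the column type declarations below are assumptions; only three of the sixteen memories columns are shown.

```python
import sqlite3

# A few of the new memories columns; the SQL types are assumed here.
NEW_MEMORY_COLUMNS = {
    "kind": "TEXT",
    "confidence": "REAL",
    "transaction_time": "REAL",
}


def add_missing_columns(conn, table, columns):
    """Add each column that is not already present; return how many were added.

    Table and column names are interpolated into SQL, so they must come from
    trusted constants, never from user input.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = 0
    for name, decl in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {decl}")
            added += 1
    conn.commit()
    return added
```

A second call finds every column in `existing` and returns 0 without touching any rows, which is the no-op behavior the pre-commit verification checked on real data.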
…nt 2 P7C2)
Wires the 23 columns added in Commit 1 into the retain hot path. New memories
get auto-classified into one of 11 kinds (procedural/world/experience/
mental_model/etc.) with provenance fields populated.
New files:
- python/memory_types.py: frozenset constants for 11 MEMORY_KINDS + 16
EDGE_TYPES per handoff section 13.1.
- python/classify_memory_kind.py: heuristic classifier (no LLM, no model
load, deterministic, ~10us/call). Detects procedural/world/mental_model
patterns; defaults to experience. AE-domain patterns (NEC, code, when/if,
conclude/seems) tuned for electrical contracting + back-office language.
- python/test_memory_typing.py: 13 unittest contracts. 8 classifier tests
(procedural/world/mental_model/inference/empty/metadata-override/invalid-
override/membership) + 5 store-layer tests (backward-compat positional,
typed kwargs persist, explicit transaction_time preserved, empty metadata,
fresh-init schema upgrade).
Modified python/memory_client.py:
- Imports json, time, classify_memory_kind, SchemaUpgrade.
- SQLiteStore.__init__ now invokes SchemaUpgrade(db_path).upgrade() after
_migrate_bitemporal -- fresh installs auto-migrate to Phase 7 schema.
- SQLiteStore.store() extended with 8 keyword-only typed params: kind,
confidence, source, origin_system, valid_from, valid_to, transaction_time,
metadata. Builds dynamic INSERT -- only includes typed columns when caller
provides non-None values; schema defaults handle the rest. transaction_time
auto-stamps to time.time() when None.
- NeuralMemory.remember() extended with same typed kwargs + auto-calls
classify_memory_kind(text) when kind is not provided. Pass-through to
store.store().
Backward compatibility verification:
- All new params keyword-only; existing positional callers unchanged.
- 13/13 new tests pass.
- Existing test_suite.py: 41/47 pass; the 6 failures are pre-existing
environmental issues (libneural_memory.so not built locally, hermes plugin
not symlinked) -- none related to memory storage. memory:persistence,
memory:large_batch_100, unified:basic_workflow, perf:store_100, and the
entire phase-b: suite all pass.
Reviewer findings (agent scope: backward-compat audit):
- 4 categories clean (positional callers, import cycle, store callers,
metadata_json no collision).
- 2 punted to Commit 3:
* MSSQL backend (mssql_store.py) not extended; AE local install lacks
pyodbc so MSSQL writes don't fire. Extend in Commit 3 if MSSQL becomes
a deployment target.
* H19 supersession path (replace_memory) does not propagate typed kwargs;
superseded memories retain pre-supersession typing. Tracked for Commit 3.
Commit 1 wiring carry-over: SchemaUpgrade now invoked at every
SQLiteStore.__init__. New tmp DBs in tests get Phase 7 schema automatically.
Live ~/.neural_memory/memory.db (already migrated) will re-invoke .upgrade()
on next process start -- idempotent no-op.
Refs (in claude-memory PRIVATE repo):
- reference_neural_memory_execution_addendum.md lines 159-220 (C2 contract)
- reference_neural_memory_unified_integration_handoff.md sec 8.2 + 13.1
- project_hermes_ecosystem_sprint2_v3_recon.md (Sprint 2 anchor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
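The dynamic INSERT described for SQLiteStore.store() (typed columns included only when the caller passes a non-None value, transaction_time auto-stamped) can be sketched as a pure SQL builder. The function name and the base `content`/`label` columns are assumptions made for illustration.

```python
import json
import time


def build_insert(content, label=None, *, kind=None, confidence=None,
                 source=None, origin_system=None, valid_from=None,
                 valid_to=None, transaction_time=None, metadata=None):
    """Return (sql, params) for a memories INSERT (sketch).

    Typed columns are omitted when None so schema defaults apply.
    """
    cols = {"content": content, "label": label}
    typed = {
        "kind": kind,
        "confidence": confidence,
        "source": source,
        "origin_system": origin_system,
        "valid_from": valid_from,
        "valid_to": valid_to,
    }
    cols.update({name: value for name, value in typed.items()
                 if value is not None})
    # transaction_time is always written: caller's value or the current time.
    cols["transaction_time"] = (time.time() if transaction_time is None
                                else transaction_time)
    if metadata is not None:
        cols["metadata_json"] = json.dumps(metadata)
    names = ", ".join(cols)
    marks = ", ".join("?" for _ in cols)
    return f"INSERT INTO memories ({names}) VALUES ({marks})", list(cols.values())
```

Keeping every typed parameter keyword-only is what lets existing positional callers such as `store(content, label)` keep working unchanged.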
Borrows entity-intelligence patterns from Hindsight/Graphiti/Memary without
splitting memory stores: entities live as kind='entity' nodes in the same
memories table, linked to source memories via mentions_entity edges.
New files:
- python/entity_extraction.py: extract_entities() heuristic (capitalized
words minus stopwords; AE-domain acronyms NEC/GFCI/EMT pass through as
entities). EntityRegistry class wraps SQLiteStore with case-insensitive
get_or_create / lookup / frequency-tracking / mentions_entity edge linking.
process_memory() runs the full extract->create->link pipeline.
- python/test_entity_extraction.py: 15 unittest contracts. 6 extraction unit
tests + 9 registry tests including case-insensitive dedup, frequency
increment, typed-edge creation, and end-to-end process_memory.
Modified python/memory_client.py:
- NeuralMemory.__init__: instantiates self.entities =
EntityRegistry(self.store). Skipped for MSSQL backend (registry needs _lock
attr; MSSQL handled in C4+).
- NeuralMemory.remember(): after store.store(), runs
entities.process_memory() to extract/link entities. Wrapped in try/except so
entity failure does NOT block memory storage.
- 3 new public methods on NeuralMemory (delegate to self.entities):
get_entity(name), get_entities_for_memory(memory_id),
count_entities_named(name).
- SQLiteStore.get_stats(): excludes kind='entity' rows from the user-facing
'memories' count + adds separate 'entities' count. Preserves historical
semantic of "memories the user added" vs "derived entity nodes".
Backward compatibility verification:
- 15/15 new entity tests pass.
- 13/13 P7C2 tests still pass.
- 5/5 P7C1 tests still pass.
- test_suite.py: 41 passed / 6 failed / 1 skipped; IDENTICAL baseline to
pre-Commit-3. The fix to get_stats() prevents 'memory: large batch' from
miscounting Entry-derived entity row.
- Existing remember(text) and remember(text, label) callers unchanged.
Reviewer findings carry-over: H19 supersession path still does not propagate
typed/entity kwargs through replace_memory; tracked for P7C4+. MSSQL backend
still not extended (entity processing skipped when use_mssql=True).
Refs (in claude-memory PRIVATE repo):
- reference_neural_memory_execution_addendum.md lines 218-262 (C3 contract)
- reference_neural_memory_unified_integration_handoff.md sec 5.1 + 5.15
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
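The extraction heuristic (capitalized words minus stopwords, with acronyms such as NEC/GFCI/EMT passing through) might look like the sketch below. The stopword list and the first-spelling-wins dedup rule are assumptions, not the contents of entity_extraction.py.

```python
import re

# Hypothetical stopword list: capitalized words that usually only start
# a sentence rather than name something.
_STOPWORDS = frozenset({
    "The", "A", "An", "This", "That", "These", "Those", "It", "We", "I",
    "He", "She", "They", "You", "And", "But", "Or", "If", "When",
})


def extract_entities(text):
    """Capitalized tokens minus stopwords, deduplicated case-insensitively."""
    seen = set()
    entities = []
    for token in re.findall(r"\b[A-Z][A-Za-z0-9]*\b", text):
        if token in _STOPWORDS:
            continue
        key = token.lower()  # case-insensitive identity, as in the registry
        if key not in seen:
            seen.add(key)
            entities.append(token)  # keep the first spelling encountered
    return entities
```

All-caps acronyms satisfy the same "starts with a capital" pattern, so they need no special case in this version.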
…t 2 P7C4)
Wires the kind classifier from P7C2 into retrieval. recall(kind='procedural')
now returns only procedural-kind memories. Procedural memories can declare
their supporting experience base via evidence_ids, creating typed
derived_from edges in the unified graph.
Modified python/memory_client.py:
- recall() gains kind: Optional[str] = None keyword-only kwarg. When set, the
inner search over-fetches by 5x (max(k*5, 25)) to compensate for filter
loss, then post-filters to matching kind, then slices to k. Existing
recall(query, k, temporal_weight) behavior unchanged when kind is None.
- _filter_by_kind() helper: single batched SELECT of ids matching the
requested kind, in-memory set membership filter on results. One DB roundtrip
regardless of result-set size. Sidesteps reviewer's concern about
kind-not-in-result-dicts by querying the DB directly.
- _recall_inner() is the renamed original recall(); preserves all 3 paths
(C++ / HNSW / brute-force) untouched. recall() is a thin wrapper.
- remember() gains evidence_ids: Optional[list[int]] = None keyword-only
kwarg. When set, creates derived_from edges from the new memory to each
evidence id via add_connection(edge_type='derived_from'). Invalid IDs are
silently skipped via per-edge try/except (best-effort link).
New file python/test_procedural_memory.py: 7 unittest contracts covering
kind-filter return correctness, default behavior unchanged, single + multi
evidence_ids edge creation, procedural in general recall, invalid evidence
id resilience, empty/None evidence_ids no-op.
Reviewer findings (agent backward-compat scope):
- Filter point analysis: identified all 3 return paths; my wrapper approach
applies filter once at the outer wrapper, not per-path.
- No kwarg collision (recall has no existing kind param).
- No existing recall(kind=...) callers across python/, benchmarks/,
test_suite.py.
- dream_engine.py:782 calls recall(content, k=10) without kind kwarg, so
default None preserves global view.
- add_connection(edge_type='derived_from') accepted; nullable temporal fields
support non-bi-temporal edges.
Backward compatibility verification:
- 7/7 new procedural tests pass.
- 15/15 P7C3 tests still pass.
- 13/13 P7C2 tests still pass.
- 5/5 P7C1 tests still pass.
- test_suite.py: 41 passed / 6 failed / 1 skipped; IDENTICAL baseline.
Refs (in claude-memory PRIVATE repo):
- reference_neural_memory_execution_addendum.md lines 263-298 (C4 contract)
- reference_neural_memory_unified_integration_handoff.md sec 7.2 + 8.5
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
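The over-fetch then post-filter wrapper described for recall(kind=...) can be sketched with an injected inner search and one batched SELECT. The free-function shape, the `inner` callable, and the table layout are assumptions for illustration; the real methods live on NeuralMemory.

```python
import sqlite3


def filter_by_kind(conn, results, kind):
    """Keep results whose id has the requested kind, using one DB roundtrip."""
    ids = [r["id"] for r in results]
    if not ids:
        return []
    marks = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id FROM memories WHERE kind = ? AND id IN ({marks})",
        [kind, *ids],
    )
    keep = {row[0] for row in rows}
    # Preserve the ranking order produced by the inner search.
    return [r for r in results if r["id"] in keep]


def recall(conn, inner, query, k=5, *, kind=None):
    """Thin wrapper: unchanged path when kind is None, else over-fetch + filter."""
    if kind is None:
        return inner(query, k)
    candidates = inner(query, max(k * 5, 25))  # compensate for filter loss
    return filter_by_kind(conn, candidates, kind)[:k]
```

Filtering once in the wrapper means the C++, HNSW, and brute-force paths behind `inner` do not each need their own kind handling.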
…rint 2 P7C5)
Borrows Hindsight's BM25 sparse channel and Graphiti's bi-temporal validity
channel without forking off into separate stores: sparse search runs against
a SQLite FTS5 virtual table mirroring memories.content, returning canonical
memory IDs from the same memories table. Temporal search runs regular
semantic recall then post-filters by valid_from/valid_to validity at as_of.
python/schema_upgrade.py:
- Adds _ensure_fts5() helper: creates memories_fts virtual table
(internal-content FTS5 mode for trivial sync) and idempotently backfills
from any existing memories whose rowid is missing from the FTS index.
- upgrade() return dict gains 'fts_rows_backfilled' key.
- Silent no-op if SQLite was compiled without FTS5; sparse channel just
returns empty results in that case.
python/memory_client.py:
- SQLiteStore.store(): after main INSERT, also INSERTs (rowid, content) into
memories_fts. Wrapped in try/except sqlite3.OperationalError so
missing-FTS5 builds don't break stores.
- SQLiteStore.replace_memory(): NEW. Refreshes FTS5 row via DELETE+INSERT on
H19 supersession path. Reviewer flagged the stale-content gap; this fixes it
within Commit 5 rather than punting to Commit 6.
- NeuralMemory.sparse_search(query, k=5): SELECT rowid FROM memories_fts
WHERE content MATCH ? ORDER BY rank LIMIT ? (FTS5 BM25). Returns memory
dicts via SQLiteStore.get(). Empty list on FTS unavailable / no match /
empty query.
- NeuralMemory.temporal_search(query, as_of, k=5): runs _recall_inner with
k*5 over-fetch, batched SELECT of (id, valid_from, valid_to), filters via
_is_valid_at() helper.
- NeuralMemory._is_valid_at(valid_from, valid_to, as_of): NULL-as-unbounded
bi-temporal predicate (matches existing get_connections at_time semantics).
python/test_sparse_temporal.py: 9 unittest contracts:
- 5 sparse: finds_exact_jargon, respects_k_limit,
returns_empty_for_no_match, works_on_fresh_install, handles_empty_query
- 4 temporal: prefers_valid_at_as_of, returns_old_for_past_query,
null_validity_is_always_valid, open_ended_validity_persists_into_future
python/test_schema_upgrade.py: relaxed idempotency assertion to verify only
column-add counts (other keys like fts_rows_backfilled may be present).
Reviewer findings (FTS5 + temporal channel scope):
- FTS5 module available on local Python ✓
- No pre-existing virtual tables in repo or live DB ✓
- replace_memory FTS sync gap CAUGHT and FIXED in this commit
- valid_from/valid_to confirmed on memories table from P7C1 ✓
- Existing get_connections at_time semantics reused ✓
Backward compatibility verification:
- 9/9 new sparse+temporal tests pass.
- 7/7 P7C4 tests still pass.
- 15/15 P7C3 tests still pass.
- 13/13 P7C2 tests still pass.
- 5/5 P7C1 tests still pass (after relaxing idempotency assertion).
- test_suite.py: 41 passed / 6 failed / 1 skipped; IDENTICAL baseline.
Refs (in claude-memory PRIVATE repo):
- reference_neural_memory_execution_addendum.md lines 300-336 (C5 contract)
- reference_neural_memory_unified_integration_handoff.md sec 5.1 + 7.5
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
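The sparse channel above can be sketched end to end against an in-memory database. This is an illustration under assumptions: it uses a plain FTS5 table that stores its own copy of the content, and the free functions `open_store`, `store`, and `sparse_search` are hypothetical stand-ins for the SQLiteStore and NeuralMemory methods. Raw user text can contain FTS5 query syntax, so syntax errors are treated as "no results" here.

```python
import sqlite3


def open_store():
    """Create memories plus an FTS5 mirror; report whether FTS5 is available."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
    try:
        conn.execute("CREATE VIRTUAL TABLE memories_fts USING fts5(content)")
        fts_ok = True
    except sqlite3.OperationalError:
        fts_ok = False  # SQLite built without FTS5: sparse channel disabled
    return conn, fts_ok


def store(conn, fts_ok, content):
    cur = conn.execute("INSERT INTO memories (content) VALUES (?)", (content,))
    if fts_ok:
        # Same rowid in both tables keeps ids canonical.
        conn.execute("INSERT INTO memories_fts (rowid, content) VALUES (?, ?)",
                     (cur.lastrowid, content))
    return cur.lastrowid


def sparse_search(conn, fts_ok, query, k=5):
    """BM25-ranked ids from FTS5, resolved back to memories rows."""
    if not fts_ok or not query.strip():
        return []
    try:
        rows = conn.execute(
            "SELECT rowid FROM memories_fts WHERE content MATCH ? "
            "ORDER BY rank LIMIT ?", (query, k)).fetchall()
    except sqlite3.OperationalError:
        return []  # malformed FTS5 query syntax
    out = []
    for (rowid,) in rows:
        row = conn.execute("SELECT id, content FROM memories WHERE id = ?",
                           (rowid,)).fetchone()
        if row is not None:
            out.append({"id": row[0], "content": row[1]})
    return out
```

Because results are resolved through the memories table, the sparse channel returns the same ids as dense recall and can be fused with it without a second store.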
…t 2 P7C6)
Per addendum lines 338-374. Formalizes the existing EmbeddingProvider
multi-backend pattern under a registry interface. No mandatory new heavy
dependencies; BGE-M3 is optional via FlagEmbedding.
python/embedding_registry.py:
- BackendUnavailable exception for clean missing-backend signaling.
- BgeM3Backend: optional BGE-M3 hybrid (dense + sparse + multi-vector)
adapter. Lazy imports FlagEmbedding; raises BackendUnavailable when the
library isn't installed. Surface: .embed() returns dense, .embed_sparse()
returns lexical token-weight dict (Hindsight-style sparse channel,
persistable in metadata_json for query-time scoring without re-embedding).
- get_embedding_backend(name=None, *, allow_missing=False): top-level
factory. Resolution: name arg -> NEURAL_MEMORY_EMBED_BACKEND env var ->
'auto'. Recognized: auto/default/sentence-transformers/tfidf/hash/bge-m3.
- 'default' is an alias for 'auto' (matches addendum test contract).
python/test_embedding_registry.py: 7 unittest contracts:
- default_backend_loads_and_embeds, auto_backend_loads_and_embeds,
hash_backend_is_deterministic, bge_m3_is_optional (None or .embed),
bge_m3_raises_without_allow_missing, env_var_dispatches_backend,
explicit_name_overrides_env_var.
Backward compatibility:
- 7/7 new tests pass.
- Existing EmbeddingProvider untouched; no caller changes needed.
- Old deployments do not break.
- Heavy embedding models remain optional.
- test_suite.py: 41/47; IDENTICAL baseline.
Refs:
- reference_neural_memory_execution_addendum.md lines 338-374 (C6 contract)
- reference_neural_memory_unified_integration_handoff.md sec 5.6
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
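The resolution order for the backend name (explicit argument, then the NEURAL_MEMORY_EMBED_BACKEND environment variable, then 'auto') can be sketched as a small resolver. The function name, the injectable `env` mapping, and the ValueError for unrecognized names are assumptions; the commit does not say how unknown names are handled.

```python
import os

RECOGNIZED = frozenset(
    {"auto", "default", "sentence-transformers", "tfidf", "hash", "bge-m3"}
)


def resolve_backend_name(name=None, env=None):
    """Resolve a backend name: name arg -> env var -> 'auto' (sketch)."""
    env = os.environ if env is None else env
    chosen = name or env.get("NEURAL_MEMORY_EMBED_BACKEND") or "auto"
    chosen = chosen.strip().lower()
    if chosen not in RECOGNIZED:
        # Assumed behavior; the real factory may signal this differently.
        raise ValueError(f"unknown embedding backend: {chosen!r}")
    # 'default' is an alias for 'auto'.
    return "auto" if chosen == "default" else chosen
```

Passing `env` explicitly keeps the precedence rule testable without mutating the process environment.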
… 2 P7C7)
Borrows HippoRAG-2 PPR + MAGMA's relation-view dimensions WITHOUT splitting
the unified graph. Single connections table; relation views are edge-type
weight filters, not separate stores (per non-negotiable handoff 2.1).
python/memory_client.py:
- _EDGE_WEIGHTS_BY_INTENT: 5 intent classes (factual/causal/temporal/
procedural/entity) -> edge-type weight multiplier dict. Per handoff sec 17.5.
- _classify_intent(query): heuristic from query starter ("Who"->entity,
"When"->temporal, "Why"->causal, "How"->procedural, default->factual).
Also catches mid-sentence " who "/" when "/etc and AE-domain "contact" cue.
- intent_edge_weights(query): public method returning weight dict.
- available_relation_views(): returns ['semantic', 'temporal', 'causal',
'entity', 'procedural'] per addendum acceptance test.
- uses_single_connection_table(): returns True. Confirms unified-graph
substrate constraint.
- graph_search(query, k=5, hops=2): PPR-style retrieval. Strategy:
1. seed via _recall_inner (dense)
2. BFS up to hops levels weighted by intent_edge_weights
3. accumulated activation = max-of-paths through node
4. damping 0.7 per hop
5. unknown edge_type baseline weight 0.3
Seeds remain in results (combined dense+graph signal ranks highest).
Single connections table consulted; no fork.
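The five-step graph_search strategy above (seed, weighted BFS, max-of-paths activation, 0.7 damping per hop, 0.3 baseline for unknown edge types) can be sketched on plain dicts. The input shapes and the free-function form are assumptions; in the PR the seeds come from _recall_inner and the edges from the connections table.

```python
def graph_search(seeds, neighbors, edge_weights, k=5, hops=2,
                 damping=0.7, baseline=0.3):
    """Spread activation from seeds, keeping the best path per node (sketch).

    seeds:        {node: dense_score}
    neighbors:    {node: [(target, edge_type), ...]}  (hypothetical shape)
    edge_weights: {edge_type: multiplier} for the query's intent
    """
    activation = dict(seeds)  # seeds stay in the result set
    frontier = dict(seeds)
    for _ in range(hops):
        next_frontier = {}
        for node, act in frontier.items():
            for target, edge_type in neighbors.get(node, []):
                weight = edge_weights.get(edge_type, baseline)
                candidate = act * damping * weight
                # Max-of-paths: only a stronger path updates the node.
                if candidate > activation.get(target, 0.0):
                    activation[target] = candidate
                    next_frontier[target] = candidate
        frontier = next_frontier
    ranked = sorted(activation.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

With damping and weights at or below 1, a seed keeps its dense score and every hop can only score lower, so seeds are never displaced by their own neighbors.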
python/test_graph_search.py: 9 unittest contracts:
- two_hop_related_memory_is_reachable (a-mentions->b-applies_to->c chain)
- relation_view_filter_uses_single_graph
- entity_query_weights_entity_edges_higher_than_semantic
- temporal_query_weights_happened_before_higher
- causal_query_weights_caused_by_high (>= 0.8)
- available_relation_views_contains_five
- uses_single_connection_table_is_true
- intent_classifier_routing (5 query types correctly classified)
- graph_search_empty_db_returns_empty
Bug fix during build: get_connections() returns dict key 'type' (not
'edge_type' as I initially assumed); graph_search now reads both for
forward-compat.
Backward compatibility verification:
- 9/9 new graph_search tests pass.
- All prior P7C1-P7C6 tests still pass.
- test_suite.py: 41/47 IDENTICAL baseline.
Refs:
- reference_neural_memory_execution_addendum.md lines 376-418 (C7 contract)
- reference_neural_memory_unified_integration_handoff.md sec 4.2 + 5.5 + 5.10 + 17.5
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…int 2 P7C8)
Per addendum lines 420-466. Establishes the non-negotiable: RRF and rank-only
fusion are CANDIDATE FEATURES, never the final ranking authority. Final law
is salience-weighted continuous scoring across feature-vector channels
(semantic + sparse + graph + temporal + entity + procedural + locus + RRF
feature) with confidence multiplier and contradiction/stale penalties.
python/scoring.py (NEW):
- ScoringConfig dataclass: final_authority='continuous_salience_score',
features tuple naming all 12 ranking signals (including 'rrf_feature').
- DEFAULT_WEIGHTS dict (semantic 0.30, sparse 0.15, graph 0.20, temporal
0.10, entity 0.10, procedural 0.05, locus 0.03, rrf 0.07).
- CandidateFeatures dataclass: per-candidate scoring inputs.
- score_candidate(f, weights, *, cross_encoder_score, beta): pure function
implementing the continuous formula. Cross-encoder rerank is optional blend,
never authority.
python/memory_client.py:
- SQLiteStore.store(): new salience kwarg. When provided, written to
memories.salience column; otherwise schema default (1.0).
- NeuralMemory.remember(): new salience kwarg, pass-through to store.
- NeuralMemory.recall(): new as_of kwarg. When set, results are filtered to
memories whose [valid_from, valid_to] window contains as_of. Composable with
kind kwarg (both filters apply when both set). Pre-Phase-7 behavior
unchanged when both kwargs omitted.
- NeuralMemory.scoring_config(): returns ScoringConfig instance for
callers/tests to verify ranking law.
python/test_unified_scoring.py: 11 unittest contracts:
- 3 ScoringConfig: rrf_is_feature_not_final_authority,
features_include_all_required_channels, default_weights_in_range.
- 4 score_candidate: salience_changes_score, contradiction_penalty,
stale_penalty, cross_encoder_blend.
- 4 NeuralMemory: scoring_config_surface, salience_kwarg_flows_to_db,
salience_multiplier_changes_rank, recall_as_of_excludes_stale.
Backward compatibility:
- 11/11 new tests pass.
- All P7C1-P7C7 tests still pass.
- test_suite.py: 41/47 IDENTICAL baseline.
- recall(query) and recall(query, k) and recall(query, k, temporal_weight)
all unchanged (pre-Phase-7 path triggers when kind=None and as_of=None).
Refs:
- reference_neural_memory_execution_addendum.md lines 420-466 (C8 contract)
- reference_neural_memory_unified_integration_handoff.md sec 7.3 + sec 17
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
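One possible shape for the continuous score is sketched below. The weights are the DEFAULT_WEIGHTS listed in the commit, but everything else is an assumption made for illustration: the exact formula, the multiplicative penalty factors, the default `beta`, and the field layout of `CandidateFeatures` are not given in the commit message.

```python
from dataclasses import dataclass

# Weights as listed in the commit; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "semantic": 0.30, "sparse": 0.15, "graph": 0.20, "temporal": 0.10,
    "entity": 0.10, "procedural": 0.05, "locus": 0.03, "rrf": 0.07,
}


@dataclass
class CandidateFeatures:
    # Hypothetical field layout: one normalized signal per channel.
    semantic: float = 0.0
    sparse: float = 0.0
    graph: float = 0.0
    temporal: float = 0.0
    entity: float = 0.0
    procedural: float = 0.0
    locus: float = 0.0
    rrf: float = 0.0  # RRF enters as one feature, not as the final ranking
    salience: float = 1.0
    confidence: float = 1.0
    contradicted: bool = False
    stale: bool = False


def score_candidate(f, weights=DEFAULT_WEIGHTS, *, cross_encoder_score=None,
                    beta=0.2, contradiction_penalty=0.5, stale_penalty=0.3):
    """Weighted channel sum, scaled by salience and confidence (sketch)."""
    base = sum(w * getattr(f, name) for name, w in weights.items())
    score = base * f.salience * f.confidence
    if f.contradicted:
        score *= 1.0 - contradiction_penalty  # assumed multiplicative penalty
    if f.stale:
        score *= 1.0 - stale_penalty
    if cross_encoder_score is not None:
        # Optional blend: the reranker nudges the score but never replaces it.
        score = (1.0 - beta) * score + beta * cross_encoder_score
    return score
```

Under this reading, rank fusion contributes at most its 0.07 share of the base score, which is one way to keep RRF a candidate feature rather than the ranking authority.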
… 2 P7C9)
Per addendum lines 468-509. Extends the dream-engine surface with hygiene
operations on the unified graph: duplicate downweighting, evidence-attached
insight creation, and bi-temporal contradiction detection. Dream stays on
the graph; hygiene does NOT hard-delete (H6/H19 invariant preserved).
python/memory_client.py adds NeuralMemory methods:
- get_memory(memory_id): convenience wrapper exposing salience + kind +
validity + provenance fields beyond what store.get() returns.
- get_edges(memory_id): wrapper around store.get_connections that maps the
internal 'type' key to addendum-spec 'edge_type'. Includes expired edges
for completeness.
- has_edge(source, target, edge_type=None): direction-insensitive existence
check. Used by contradiction detection idempotency + by tests.
- run_memify_once(decay_factor=0.5): finds exact-content duplicates by
GROUP BY content; downweights salience of all but the highest-salience
copy. Skips kind='entity' rows (entity merge handled separately).
Returns {"duplicates_downweighted": N}. Safe to re-run: a repeat pass on
already-downweighted rows applies a smaller delta each time and never
deletes anything.
- create_insight_from_cluster(memory_ids): creates a kind='dream_insight'
memory summarizing the cluster, with summarizes edges back to source
memories. Per handoff sec 9.3, insights MUST have evidence edges (no
free-floating insights). origin_system='dream_engine'.
- run_contradiction_detection_once(jaccard_threshold=0.4): O(n^2) scan
for pairs where one's valid_to ends before another's valid_from AND
content jaccard >= threshold. Adds contradicts edge; skips if already
present (idempotent). Stopword-filtered word jaccard for content overlap.
- _content_jaccard helper + _CONTRADICTION_STOPWORDS frozenset.
python/test_dream_memify.py: 9 unittest contracts:
- 3 memify: downweights_exact_duplicates, does_not_delete_records,
no_op_when_no_duplicates.
- 3 insight: has_evidence_edges, kind_is_dream_insight,
empty_cluster_is_no_op.
- 3 contradiction: edge_for_conflicting_validity,
skips_overlapping_validity, skips_unrelated_content.
Backward compatibility:
- 9/9 new tests pass.
- All P7C1-P7C8 tests still pass.
- test_suite.py: 41/47 IDENTICAL baseline.
Phase 7 progress: 9 of 10 commits shipped. C10 = locus overlay +
governance + benchmark gating remaining.
Refs:
- reference_neural_memory_execution_addendum.md lines 468-509 (C9 contract)
- reference_neural_memory_unified_integration_handoff.md sec 9.4 + 17
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
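The contradiction sweep from P7C9 can be illustrated with a small pure-Python sketch. It assumes memories are plain dicts with id, content, valid_from, and valid_to keys, uses a tiny stand-in stopword set, and treats touching windows (valid_to equal to the next valid_from) as sequential; none of those details are confirmed by the commit message.

```python
import re
from itertools import combinations

# Stand-in stopword list; the real _CONTRADICTION_STOPWORDS is not shown in the source.
STOPWORDS = frozenset({"the", "a", "an", "is", "was", "of", "to", "and", "in", "for", "on"})


def content_jaccard(a, b):
    """Word-level Jaccard overlap after dropping stopwords."""
    ta = {w for w in re.findall(r"\w+", a.lower()) if w not in STOPWORDS}
    tb = {w for w in re.findall(r"\w+", b.lower()) if w not in STOPWORDS}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_contradictions(memories, jaccard_threshold=0.4):
    """O(n^2) scan: pairs whose validity windows are sequential and whose text overlaps."""
    pairs = []
    for m1, m2 in combinations(memories, 2):
        for old, new in ((m1, m2), (m2, m1)):
            if old["valid_to"] is None or new["valid_from"] is None:
                continue  # open-ended windows cannot be ordered
            if (old["valid_to"] <= new["valid_from"]
                    and content_jaccard(old["content"], new["content"]) >= jaccard_threshold):
                pairs.append((old["id"], new["id"]))
                break
    return pairs


memories = [
    {"id": 1, "content": "Invoice 1042 status is pending", "valid_from": 0, "valid_to": 100},
    {"id": 2, "content": "Invoice 1042 status is paid", "valid_from": 100, "valid_to": None},
    {"id": 3, "content": "Crew lunch order tacos", "valid_from": 200, "valid_to": None},
]
contradictions = find_contradictions(memories)
```

Only the two invoice facts are paired; the unrelated memory fails the overlap threshold even though its window starts later.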
… 2 P7C10)
PHASE 7 COMPLETE — 10 of 10 commits shipped.
Per addendum lines 511-565. Final commit adds MemPalace-style locus
overlay, OpenAI-style governance controls, and Letta-style explanation
paths — all on the unified graph, no separate stores.
python/memory_client.py adds NeuralMemory methods:
- create_locus(wing, room): creates kind='locus' nodes for wing + room,
links room->wing via located_in. Auto-dedupes by label. Returns room id.
- _get_or_create_locus_node helper.
- assign_locus(memory_id, locus_id): adds located_in edge memory->locus.
Idempotent — re-assigning returns immediately if edge exists.
- memory_count(*, exclude_overlay=True): user-memory count. By default
excludes kind='entity' + kind='locus' (system overlay nodes are not
user-authored memories). Pass exclude_overlay=False for full count.
- forget(memory_id, *, mode='background'): governance op per handoff sec 11.
Modes:
'background' (default) - sets memory_visibility='backgrounded'; row
+ edges intact; deprioritizes from default recall.
'redact' - replaces content with '[REDACTED]'; visibility='hidden';
preserves edges (H19/H6 audit invariant).
'delete' - hard DELETE (rare; use 'background' first).
- explain_recall(query, k, *, kind, as_of): returns recall results with
per-result 'explanation' dict containing query, intent, channels,
final_score, and features (semantic, temporal_score, salience, combined).
Per addendum lines 541-547 + handoff sec 12.4.
- get_memory() expanded: now also returns memory_visibility, pin_state,
metadata_json.
python/test_locus_governance.py: 10 unittest contracts:
- 4 locus: create_and_assign, overlay_does_not_replace_graph,
assign_locus_is_idempotent, create_locus_dedupes_existing.
- 1 explain: explain_recall_returns_explanation_with_salience_feature.
- 5 governance: forget_background_sets_visibility, forget_does_not_break_edges,
forget_redact_replaces_content, forget_delete_removes_row,
forget_unknown_mode_raises.
Note: benchmark gate (addendum lines 555-558) is left to the existing daily
smoke regression detector at ~/.neural_memory/bench-history/, which has
been firing since 2026-04-25 (file project_neural_memory_500_record_baseline.md).
The H23 plist + smoke runner are pre-Phase-7 infrastructure; Phase 7 does
not regress them.
Backward compatibility verification:
- 10/10 new tests pass.
- All P7C1-P7C9 tests pass:
test_schema_upgrade 5/5
test_memory_typing 13/13
test_entity_extraction 15/15
test_procedural_memory 7/7
test_sparse_temporal 9/9
test_embedding_registry 7/7
test_graph_search 9/9
test_unified_scoring 11/11
test_dream_memify 9/9
test_locus_governance 10/10
TOTAL: 95 Phase 7 unittest contracts, all green.
- test_suite.py: 41 passed / 6 failed (pre-existing env issues — C++ lib
not built, hermes plugin not symlinked) / 1 skipped — IDENTICAL
baseline maintained across all 10 commits.
Phase 7 definition-of-done check (per addendum lines 567-578):
- [x] All 10 commit-level acceptance suites pass (95 contracts green)
- [x] Migration is idempotent (P7C1 verified)
- [x] Existing AE records readable (live DB at 231/10468 preserved)
- [x] Final scoring is salience-weighted continuous (P7C8)
- [x] No donor system became substrate (single graph, all donors as
node kinds / edge types / metadata / channels / dream phases)
- [ ] AE LME 500-record bench delta (run separately to verify >=-0.020 R@5;
out of scope for this commit; smoke runner gates daily)
- [ ] AE-domain bench category thresholds (240-query bench harness from
addendum lines 580+ deferred to follow-up; data labeling needed)
Refs:
- reference_neural_memory_execution_addendum.md lines 511-578 (C10 + DoD)
- reference_neural_memory_unified_integration_handoff.md sec 4.2 + 10 + 11 + 12.4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
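The three forget() modes from P7C10 can be sketched against an in-memory SQLite database. The two-table schema, the 'visible' default, and the column names other than memory_visibility are assumptions for illustration; whether the real hard delete also removes edges is not stated above, so this sketch leaves them alone.

```python
import sqlite3


def forget(conn, memory_id, *, mode="background"):
    """Governance op: deprioritize, redact, or hard-delete one memory row."""
    if mode == "background":
        conn.execute(
            "UPDATE memories SET memory_visibility = 'backgrounded' WHERE id = ?",
            (memory_id,))
    elif mode == "redact":
        conn.execute(
            "UPDATE memories SET content = '[REDACTED]', memory_visibility = 'hidden' "
            "WHERE id = ?", (memory_id,))
    elif mode == "delete":
        conn.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
    else:
        raise ValueError(f"unknown forget mode: {mode!r}")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema; the real tables carry many more columns.
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT, "
                 "memory_visibility TEXT DEFAULT 'visible')")
    conn.execute("CREATE TABLE connections (source_id INTEGER, target_id INTEGER, type TEXT)")
    conn.executemany("INSERT INTO memories (id, content) VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    conn.execute("INSERT INTO connections VALUES (1, 2, 'similar')")
    forget(conn, 1)                  # default: background
    forget(conn, 2, mode="redact")
    forget(conn, 3, mode="delete")
    rows = conn.execute(
        "SELECT id, content, memory_visibility FROM memories ORDER BY id").fetchall()
    edges = conn.execute("SELECT COUNT(*) FROM connections").fetchone()[0]
    return rows, edges
```

Both soft modes keep the row and its edge in place, which is the audit property the H19/H6 invariant is described as protecting.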
Per execution addendum lines 580-1160. Builds the AE-specific bench corpus
that complements the existing LongMemEval 500-record synthetic baseline.
benchmarks/ae_domain_memory_bench/queries.py:
- 240 queries across 6 categories x 40 each:
electrical_contracting R@5 >= 0.78
spanish_whatsapp R@5 >= 0.70
materials_sku R@5 >= 0.75
lennar_lots R@5 >= 0.80
financial_calendar R@5 >= 0.72
customer_temporal R@5 >= 0.82
- Each query carries id, category, prompt, expected_channels (diagnostic),
minimum_rank, temporal_mode (current/past_window/cross_time), and
initially-empty ground_truth_ids list for post-labeling.
- CATEGORY_THRESHOLDS dict + get_queries() + category_counts() helpers.
benchmarks/ae_domain_memory_bench/run_ae_domain_bench.py:
- Two modes:
--mode diagnostic (default): runs each query, reports dense+sparse top-k
IDs, intent classification, edge weights, latency. NO ground truth
needed; output IS the input for labeling.
--mode scored: requires ground_truth_ids filled. Computes per-category
R@5/R@10/MRR + global R@5. Exits 2 if any category misses threshold,
0 if all pass. Suitable for CI gating.
- --db path override (default: ~/.neural_memory/memory.db)
- --category filter (run only one category)
- --k retrieval depth (default 10)
- --out JSON output path
benchmarks/ae_domain_memory_bench/README.md:
- Category table + thresholds.
- Run examples for diagnostic + scored modes.
- Labeling workflow: run diagnostic -> inspect IDs -> fill
ground_truth_ids in queries.py -> run scored.
- Exit code semantics for CI integration.
- Notes that this is ADDITIVE to the existing LME 500-record bench at
~/.neural_memory/bench-history/.
Smoke verification: ran electrical_contracting category against live DB
(239 memories at this point); 40 queries completed in ~8ms each. Dense
channel returned IDs from TF-IDF backend; sparse channel returned empty
(expected — hermes-saved content doesn't contain electrical-jargon yet).
Next step (deferred — needs labeling): walk through diagnostic output,
identify ground-truth memory IDs for each query against current 239-row
DB, fill queries.py ground_truth_ids, run scored mode.
Refs:
- reference_neural_memory_execution_addendum.md lines 580-1160 (240 queries)
- reference_neural_memory_execution_addendum.md lines 627-637 (thresholds)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
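The scored mode's gate can be sketched as below. The thresholds are the ones listed in the message; the exact R@5 definition (fraction of ground-truth IDs found in the top k, averaged per category) and the shape of the results dict are assumptions, since the harness source is not shown here.

```python
CATEGORY_THRESHOLDS = {
    "electrical_contracting": 0.78, "spanish_whatsapp": 0.70, "materials_sku": 0.75,
    "lennar_lots": 0.80, "financial_calendar": 0.72, "customer_temporal": 0.82,
}


def recall_at_k(retrieved, truth, k):
    """Fraction of ground-truth ids that appear in the top-k retrieved ids."""
    truth = set(truth)
    if not truth:
        return 0.0
    return len(set(retrieved[:k]) & truth) / len(truth)


def mrr(retrieved, truth):
    """Reciprocal rank of the first relevant id, 0.0 if none is retrieved."""
    truth = set(truth)
    for rank, memory_id in enumerate(retrieved, start=1):
        if memory_id in truth:
            return 1.0 / rank
    return 0.0


def gate(results, thresholds=CATEGORY_THRESHOLDS, k=5):
    """results: category -> list of (retrieved_ids, ground_truth_ids) pairs."""
    per_category = {
        category: sum(recall_at_k(r, t, k) for r, t in rows) / len(rows)
        for category, rows in results.items() if rows
    }
    failed = [c for c, value in per_category.items() if value < thresholds[c]]
    # Exit code 2 when any category misses its threshold, 0 otherwise.
    return per_category, (2 if failed else 0)
```

A CI job would pass the second return value to sys.exit, so a single under-threshold category fails the run.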
Built tools/phase7_audit.py: read-only inspection of how a live
neural-memory DB exercises Phase 7's typed/entity/scoring features.
The audit revealed a real bug: SQLiteStore.store() unconditionally inserted
into memories_fts, including kind='entity' rows. Entities are derived nodes
(content like "Entity: Lennar"), not user memories; indexing them adds
sparse-search noise without value. Live DB at ~/.neural_memory/memory.db
had 6 stale entity rows in its FTS5 index (sync delta = 246 fts rows vs
240 expected non-entity memories).
Fixes:
1. python/memory_client.py: SQLiteStore.store() now skips FTS5 insert when
kind='entity'. One-line guard.
2. python/test_sparse_temporal.py: new test_entity_rows_not_indexed_in_fts5
contract. Now 10/10 sparse + temporal tests pass.
3. Live DB cleanup: ran one-shot DELETE FROM memories_fts WHERE rowid IN
(SELECT id FROM memories WHERE kind='entity'). 6 stale rows removed.
Audit re-run confirms sync delta = 0.
tools/phase7_audit.py reports:
- memory counts by kind (catches "everything classified unknown" drift)
- top entities by mention frequency (current top: Ernesto freq=8)
- edge type breakdown (similar 11404, mentions_entity 19, rem_bridge 1)
- validity coverage (currently 0; infrastructure dormant pending callers)
- Memify duplicate candidates (3 groups; 4 rows would be downweighted)
- contradiction candidates by validity sequence
- locus overlay coverage
- FTS5 index sync delta (now 0)
- salience distribution
- Phase 7 schema column completeness (16/16 mem, 10/10 conn ✓)
Backward compatibility verification:
- 10/10 sparse + temporal tests pass (was 9; added 1)
- All other Phase 7 test suites unchanged
- test_suite.py: 41/47 IDENTICAL baseline
Live DB stats post-cleanup:
- 246 memories (232 unknown legacy + 8 experience + 6 entity)
- 240 fts rows (perfect sync with non-entity memories)
- 11424 connections
Refs:
- reference_neural_memory_execution_addendum.md (Phase 7 audit informally)
- reference_neural_memory_unified_integration_handoff.md sec 5.1 + 5.6
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ergonomic helpers wrapping NeuralMemory.remember() with the 6 most-
common AE event shapes. Each helper builds the right Phase 7 typed-
kwarg dict (kind, source, origin_system, valid_from, metadata, etc.)
so AE main-builder lane callers don't re-derive patterns.
python/ae_workflow_helpers.py:
- record_customer_interaction(customer, topic, body, channel)
-> kind='experience', source=channel, customer auto-extracted as entity
- record_job_event(job_id, event_type, body, source='dashboard')
-> kind='experience', job_id (e.g. 'Lennar lot 27') auto-extracted
- record_whatsapp_message(crew_member, text, thread_id, lang='es')
-> kind classified by classifier (Spanish 'Cuando...' -> procedural)
- record_sop(label, content, evidence_ids, confidence=0.95)
-> kind='procedural', derived_from edges to evidence experiences
- record_invoice_status_change(invoice_id, old_status, new_status, ts)
-> bi-temporal pair (old.valid_to=ts, new.valid_from=ts) + contradicts
edge. detect_conflicts=False to prevent H19 supersession from
merging the deliberately-near-duplicate facts.
- record_financial_event(event_type, due_date_iso, note, amount_cents)
-> kind='experience', source='financial_calendar', valid_from=ts
- initialize_ae_locus_overlay(): idempotent setup of 6 standard locus
rooms (Compliance, Customers, Finance, Active Jobs, Permits, Engineering).
python/classify_memory_kind.py: added 9 Spanish patterns to the
procedural classifier (Cuando/Si/Siempre/Nunca/Antes de/Despues de/
como hago/pasos para/recuerda/asegurate de). AE has Spanish-speaking
crew via WhatsApp; the classifier needed to handle their messages
correctly. World + mental_model patterns still English-only (those
domains less critical for crew comms).
python/test_ae_workflow_helpers.py: 9 unittest contracts covering each
helper. Verified bi-temporal correctness, Spanish classifier flips,
entity auto-extraction from job_id, derived_from edge creation, and
locus init idempotency.
Caught + fixed during build:
- H19 supersession was merging old/new invoice facts because their
text similarity was high. Fix: detect_conflicts=False on
record_invoice_status_change pair.
- Classifier missed Spanish 'Cuando...' procedural pattern. Fix:
added Spanish regex set.
Backward compatibility verification:
- 9/9 new helper tests pass.
- 13/13 P7C2 typing tests still pass (Spanish patterns are additive).
- All other Phase 7 test suites unchanged.
- test_suite.py: 41/47 IDENTICAL baseline.
This commit is ergonomics-only: adds helpers + classifier patterns.
No new memory schema, no retrieval changes, no shared-state ops.
AE main-builder lane (6eec244c) decides where to call these.
Refs (in claude-memory PRIVATE):
- reference_neural_memory_ae_usage_patterns.md (recipes this implements)
- reference_neural_memory_unified_integration_handoff.md sec 5.2 + 8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
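The Spanish additions to the procedural classifier could look roughly like this. The regex below is a reconstruction from the pattern names listed in the message, not the actual set in python/classify_memory_kind.py, and the 'experience' fallback is an assumption.

```python
import re

# Sentence-opening cues plus two how-to phrases that may appear mid-sentence.
# Accented and unaccented spellings are both accepted.
SPANISH_PROCEDURAL = re.compile(
    r"^\s*(cuando|si|siempre|nunca|antes de|despu[eé]s de|recuerda|aseg[uú]rate de)\b"
    r"|\b(c[oó]mo hago|pasos para)\b",
    re.IGNORECASE,
)


def classify_kind(text, default="experience"):
    """Return 'procedural' for Spanish instruction-like text, else the default kind."""
    return "procedural" if SPANISH_PROCEDURAL.search(text) else default


examples = {
    "Cuando llegues al lote, apaga el breaker principal": classify_kind(
        "Cuando llegues al lote, apaga el breaker principal"),
    "Llegamos al lote 27 a las 8": classify_kind("Llegamos al lote 27 a las 8"),
}
```

A bare leading "Si" is a weak cue and will also match conditional statements that are not procedures; a production pattern set would likely need more context than this sketch uses.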
Single-binary CLI exposing the typed/temporal/entity/scoring surface for
terminal use without going through hermes. Tito can inspect, record,
recall, audit, and govern memories directly.
tools/nm.py: argparse-based dispatcher with 12 subcommands:
remember       store new memory with --kind/--source/--valid-from/--metadata
recall         semantic recall with --kind/--as-of/--k/--format
sparse         FTS5 BM25 retrieval
graph          PPR graph_search with intent-aware weights, --hops
explain        recall + per-result explanation paths (channels, features)
audit          phase7_audit health report (delegates to tools/phase7_audit.py)
count          memory + connection + entity counts
entities       top entities by mention frequency, --top N
forget         background / redact / delete a memory by id
bench          AE-domain bench (diagnostic or scored mode)
memify         one-shot dream Memify hygiene pass
contradiction  one-shot contradiction detection sweep
Date parsing: --as-of accepts "now" / unix epoch / ISO date / common formats.
Output: --format=compact (human-readable, default) or --format=json (for
scripting/piping).
All commands accept --db PATH override (default ~/.neural_memory/memory.db).
Smoke verified against live DB:
- count: 241 mem / 11531 conn / 6 entities
- entities top: Ernesto Valencia Godinez (freq=9), Sprint (freq=7), ...
- recall, sparse, graph, explain all return results
- explain shows the salience-weighted feature breakdown per result
Use cases:
- Tito investigates "what does the system know about X" without firing up
hermes
- Diagnose Phase 7 features misbehaving
- Bulk operations (batch-forget by piping ids through `nm forget`)
- Bench runs from cron / CI
This commit is ergonomics-only: pure additive CLI; no schema, no retrieval
changes, no backward-compat surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…om C2/C3)
Closes the punt documented in P7C2/P7C3 reviewer notes: when
remember(text, kind=X, source=Y, ...) triggers H19 supersession against an
existing memory with high cosine similarity, the new typed kwargs now flow
through replace_memory() into the replacement row.
python/memory_client.py:
- SQLiteStore.replace_memory() extended with 8 keyword-only typed params:
kind, confidence, source, origin_system, valid_from, valid_to,
transaction_time, metadata. Builds dynamic UPDATE: only updates typed cols
when caller provides non-None values; old typing preserved on omitted
kwargs (no silent NULL-out).
- NeuralMemory.remember() supersession path at line 1069 now passes the
user's typed kwargs through to replace_memory(). transaction_time
auto-stamps to time.time().
python/test_memory_typing.py: +2 contracts (15 total now):
- test_replace_memory_propagates_typed_kwargs: typed kwargs land in row
- test_replace_memory_preserves_typed_kwargs_when_omitted: silence ≠ NULL
Verification:
- 15/15 typing tests pass.
- All 11 Phase 7 test files pass (104 total contracts).
- test_suite.py: 41/47 IDENTICAL baseline.
Phase 7 punt list now empty:
- C2/C3 punt (MSSQL): explicit out-of-scope (no pyodbc on AE box)
- C2/C3 punt (H19 supersession): RESOLVED here
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
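The "dynamic UPDATE" behavior described above (only non-None typed kwargs are written, omitted ones keep their old values) can be sketched like this. It is an illustrative stand-in, not SQLiteStore.replace_memory(): the table layout is hypothetical and the metadata parameter is left out for brevity.

```python
import sqlite3

# Whitelisted column names; only these may be interpolated into the SQL text.
TYPED_COLUMNS = ("kind", "confidence", "source", "origin_system",
                 "valid_from", "valid_to", "transaction_time")


def replace_memory(conn, memory_id, content, **typed):
    """Replace content and update only the typed columns the caller supplied."""
    unknown = set(typed) - set(TYPED_COLUMNS)
    if unknown:
        raise TypeError(f"unexpected typed kwargs: {sorted(unknown)}")
    assignments = ["content = ?"]
    params = [content]
    for column in TYPED_COLUMNS:
        value = typed.get(column)
        if value is not None:          # omitted or None: keep the stored value
            assignments.append(f"{column} = ?")
            params.append(value)
    params.append(memory_id)
    conn.execute(f"UPDATE memories SET {', '.join(assignments)} WHERE id = ?", params)
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT, kind TEXT, "
                 "confidence REAL, source TEXT, origin_system TEXT, valid_from REAL, "
                 "valid_to REAL, transaction_time REAL)")
    conn.execute("INSERT INTO memories (id, content, kind, source) "
                 "VALUES (1, 'old text', 'procedural', 'whatsapp')")
    replace_memory(conn, 1, "new text", kind="experience")
    return conn.execute("SELECT content, kind, source FROM memories WHERE id = 1").fetchone()
```

In the demo, kind is overwritten while source survives untouched, which is the "silence is not NULL" contract the two new tests are described as locking in.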
10 unittest contracts exercising the nm CLI as a subprocess against tmp
DBs. Ensures the CLI surface is locked in and json-output paths are
parseable for scripting.
tools/nm.py: redirect NeuralMemory init banner from stdout to stderr. The
embed_provider auto-detect prints 'Embedding backend: ...' at startup; this
pollution broke --format=json consumers. Now stdout is JSON-clean; users
still see banner via stderr.
python/test_nm_cli.py contracts:
- count_on_empty_db (json-parseable)
- remember_then_count
- remember_recall_roundtrip
- sparse_search hits FTS5
- entities_top auto-extraction
- audit_runs_without_error (human-readable)
- explain_returns_features incl salience
- forget_background_visibility
- memify_runs_without_error
- help_is_not_an_error
Backward compat verified:
- 10/10 new CLI tests pass.
- All 11 prior Phase 7 test files still pass (114 total contracts).
- test_suite.py: 41/47 IDENTICAL baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer #3 caught a recurring bug: my fix in bcd72db was a forward-guard
only. Hermes (running at PID 55181 since 12:47, was 19835 at session start)
hadn't reloaded the updated memory_client.py module, so its in-memory copy
still inserted entity rows into FTS5. Result: 8 stale entity rows in the
FTS index by review-time (was 6 at original audit; +2 from continued hermes
saves).
The forward-guard is right but insufficient when long-running processes
hold stale code. This commit adds a self-healing defensive cleanup:
SchemaUpgrade._ensure_fts5() now DELETEs any kind='entity' rows from
memories_fts on every invocation. Combined with the SQLiteStore.__init__
hook from P7C2, this means every fresh NeuralMemory() instance cleans the
index. Backfill is also now kind-aware (skips entities).
python/schema_upgrade.py: _ensure_fts5() extended with defensive DELETE +
kind-filtered backfill. ~10 LOC added.
Verified on live DB: ran schema_upgrade.py against
~/.neural_memory/memory.db; sync delta dropped 8 → 0. Tests still pass
(5/5 schema + 10/10 sparse_temporal).
Trade-off: defensive DELETE on every init is O(entity_count) extra work,
negligible at AE scale (few entities ever).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sertion
Reviewer #2 caught two regression risks in the post-Phase-7 audit:
1. The c2c2321 defensive FTS cleanup (kind='entity' rows must be DELETEd
from memories_fts on every SchemaUpgrade.upgrade() invocation) had ZERO
test coverage. Without a test, future refactors could silently regress the
self-healing behavior.
2. test_explicit_name_overrides_env_var was trivially-true: it set the env
var to 'hash', requested name='default', and only asserted
`assertIsNotNone(backend)`, which passes regardless of whether the override
was honored.
Fixes:
python/test_schema_upgrade.py: +2 contracts (now 7 total):
- test_ensure_fts5_cleans_entity_rows_defensively: simulates stale-code
pollution path (insert entity row + manually pollute FTS5), runs
SchemaUpgrade.upgrade(), asserts cleanup fires.
- test_ensure_fts5_backfill_skips_entity_rows: backfill on a DB containing
entity rows must NOT add them to FTS5 (kind-aware backfill clause).
python/test_embedding_registry.py: tightened
test_explicit_name_overrides_env_var:
- now asserts env-baseline first (HashBackend with
NEURAL_MEMORY_EMBED_BACKEND=hash), then asserts explicit name='default'
returns NOT-HashBackend (proving override was honored).
Verification:
- 7/7 schema_upgrade tests pass (was 5)
- 7/7 embedding_registry tests pass (assertion now actually means something)
- All 12 P7 test files green; 119 total contracts (was 117).
- test_suite.py: 41/47 baseline preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n parity
Synth P0-2 closeout (sonnet packet S2, opus reviewed/tested/integrated from
dirty worktree at af643fe; pre-commit diff preserved at
~/.neural_memory/handoffs/2026-05-03-pre-commit-dirty-S2-S7.patch):
Replay-safe authority:
- Add _compute_evidence_id(evidence_type, source_system, source_record_id)
-> deterministic sha256[:16] hash. Same inputs yield same id across
processes / restarts / re-ingests.
- record_evidence_artifact() now performs lookup-before-insert: if a memory
with metadata.evidence_id matches, returns existing memory_id with
inserted=False. Otherwise inserts and returns inserted=True. evidence_id is
injected into substrate metadata so future replays can find it.
- Return shape changes from bare int memory_id to structured
{memory_id, evidence_id, inserted}. Sonnet packet S1 from earlier wave
verified zero LIVE production consumers across NM + AE-LangGraph + Hermes +
Claude, so the breaking change is safe.
- record_wa_crew_event, record_estimate_evidence,
record_material_price_evidence propagate the structured return. Input
signatures unchanged.
WA dry-run parity:
- tools/ingest_wa_dryrun.py::to_typed_record now includes evidence_id in
the typed dry-run record shape so dry-run output mirrors what live ingest
would write.
Tests: 22 OK (was 17, +5 new) in test_ae_evidence_ingest.py:
- test_evidence_id_is_deterministic_across_calls
- test_record_evidence_artifact_upsert_returns_existing_memory_id
- test_record_evidence_artifact_returns_structured_dict
- test_record_wa_crew_event_returns_structured_dict
- existing tests updated for new return shape
Adjacent regression: 28/28 across ae_bench_harness +
ingest_ae_corpus_dedup + ae_bench_label_integrity +
hermes_plugin_hybrid_recall.
No substrate write. No live ingest. No --live mode (S4 packet, gated on
this commit + Tito #1 source path).
Co-Authored-By: Claude Sonnet 4.6 (S2 packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
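The replay-safe pattern described above (deterministic id, then lookup-before-insert) can be sketched with a dict standing in for the substrate. How the three inputs are joined before hashing is an assumption; the commit message only specifies that the result is the first 16 hex characters of a SHA-256 digest. The EvidenceStore class is hypothetical.

```python
import hashlib


def compute_evidence_id(evidence_type, source_system, source_record_id):
    """Deterministic 16-hex-char id; the unit-separator join is an assumption."""
    key = "\x1f".join((evidence_type, source_system, str(source_record_id)))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


class EvidenceStore:
    """In-memory stand-in for the substrate, keyed by evidence_id."""

    def __init__(self):
        self._memory_id_by_evidence_id = {}
        self._next_memory_id = 1

    def record_evidence_artifact(self, evidence_type, source_system,
                                 source_record_id, content):
        evidence_id = compute_evidence_id(evidence_type, source_system, source_record_id)
        existing = self._memory_id_by_evidence_id.get(evidence_id)
        if existing is not None:
            # Replay of a known artifact: return the original row, write nothing.
            return {"memory_id": existing, "evidence_id": evidence_id, "inserted": False}
        memory_id = self._next_memory_id
        self._next_memory_id += 1
        self._memory_id_by_evidence_id[evidence_id] = memory_id
        return {"memory_id": memory_id, "evidence_id": evidence_id, "inserted": True}


store = EvidenceStore()
first = store.record_evidence_artifact("sent_pdf", "sent_estimate_pdf_miner", "msg-1", "text")
replay = store.record_evidence_artifact("sent_pdf", "sent_estimate_pdf_miner", "msg-1", "text")
```

Using a separator between fields avoids collisions such as ("ab", "c") versus ("a", "bc"), which plain concatenation would hash identically.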
…ed authority (partial)
Synth P1 #7 closeout PARTIAL (sonnet packet S7, opus reviewed/tested/
integrated; mtime/provenance selection follow-up dispatched as S7b):
Stale "0.82" prose removed:
- tools/nm_recall_mcp.py:13 (module docstring) and :55 (HNSW comment)
no longer advertise the stale R@5=0.82 from before the Phase 7.5
migration. Synth-current authority is 0.5758 (latest artifact) /
0.6061 (preserved peak) — a static number in prose was misleading.
Helper added:
- _bench_authority(bench_dir) -> (r_at_5_str, artifact_name) reads the
most recent ae-domain-*.json under ~/.neural_memory/bench-history/
and returns the live R@5 string. Falls back to ("unknown", "unknown")
on no-artifact / read-fail / malformed JSON / missing key.
KNOWN PARTIAL (Sonnet flagged + dispatched as S7b follow-up):
Helper currently uses sorted(glob)[-1] which is lexicographic — picks
ae-domain-bge-small-clean-073802.json (a copy-ablation with R@5=0.6061)
over ae-domain-2026-05-02-124730.json (production, R@5=0.5758) because
letters > digits in ASCII. S7b will replace lexical sort with mtime +
provenance + production-DB filter so copy-ablation artifacts can't be
authoritative.
Tests added (5/5 pass) — python/test_nm_recall_mcp_authority.py:
- test_authority_helper_reads_latest_artifact
- test_authority_helper_returns_unknown_when_no_artifact
- test_authority_helper_returns_unknown_on_malformed_artifact
- test_authority_helper_returns_unknown_when_key_missing
- test_no_static_082_string_in_module (lockdown — defends against
re-introduction of the stale literal)
Smoke import clean. No substrate write.
Co-Authored-By: Claude Sonnet 4.6 (S7 packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…=0.82 audit
Synth #7 + #13 closeout (sonnet packet S7b, opus reviewed/tested/integrated):
#7 Bench artifact selection (lex-sort -> mtime + provenance filter):
- tools/nm_recall_mcp.py::_bench_authority rewritten. Was sorted(glob)[-1]
(lexicographic) which picked ae-domain-bge-small-clean-073802.json
(R@5=0.6061 copy-ablation) over ae-domain-2026-05-02-124730.json (R@5=
0.5758 production) because letters > digits in ASCII.
- New _is_eligible_artifact(path) helper: prefers artifacts whose
provenance.db_path ends with /.neural_memory/memory.db (production
canonical); falls through to strict timestamp regex
ae-domain-\d{4}-\d{2}-\d{2}-\d{6}\.json for legacy/pre-bfd3b70
artifacts that have no provenance block yet.
- Selection then picks max-mtime among eligibles.
- Existing fallback to ("unknown", "unknown") preserved.
#13 Stale R@5=0.82 prose final audit:
- python/ae_workflow_helpers.py:270 — docstring updated to neutral
"see latest bench-history artifact + canonical reader" pointer.
- tools/neural-memory-snapshot-daily.sh:5 — neutral phrasing.
- tools/launchd/com.ae.neural-memory-snapshot.plist:6 — neutral phrasing.
- Production-source grep for "0\.82" outside tests now returns ZERO
hits. Load-bearing per-category threshold values in
benchmarks/queries.py + README.md (customer_temporal target = 0.82)
intentionally untouched — those are config values, not stale prose.
Tests: 10/10 pass in test_nm_recall_mcp_authority.py (was 5/5, +5 new):
- test_authority_selects_artifact_by_mtime_not_lex_sort
- test_authority_excludes_copy_ablation_artifact
- test_authority_falls_through_when_only_pre_provenance_artifacts
- generalized lockdown extended to scan all 5 modified prod files
- smoke import clean
Adjacent regression: 71/71 pass across 6 test files.
Behavior verified post-patch: previously selected
ae-domain-bge-small-clean-073802.json (R@5 0.6061, copy-ablation); now
selects ae-domain-2026-05-02-124730.json (R@5 0.5758, production
substrate). Selection currently flows through the legacy timestamp
fallback because no live artifact yet has a provenance block (all
predate bfd3b70); the next F9 rerun will produce the first artifact
that hits the production-canonical branch.
No substrate write. No F9 bench. No commit of unrelated work.
Co-Authored-By: Claude Sonnet 4.6 (S7b packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
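The lexicographic-sort pitfall and the replacement selection rule can be reproduced in a few lines. The regex and the production db_path suffix are taken from the message; the reading that an artifact with a provenance block is eligible only when its db_path is the production one, and the function names, are assumptions of this sketch rather than the code in tools/nm_recall_mcp.py.

```python
import json
import os
import re
import tempfile
from pathlib import Path

TIMESTAMP_NAME = re.compile(r"ae-domain-\d{4}-\d{2}-\d{2}-\d{6}\.json")
PROD_SUFFIX = "/.neural_memory/memory.db"


def is_eligible_artifact(path):
    """Production provenance wins; legacy files must match the strict timestamp name."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return False
    provenance = data.get("provenance") if isinstance(data, dict) else None
    if isinstance(provenance, dict):
        return str(provenance.get("db_path", "")).endswith(PROD_SUFFIX)
    return TIMESTAMP_NAME.fullmatch(Path(path).name) is not None


def select_authority(bench_dir):
    """Pick the most recently modified eligible artifact, or None."""
    eligible = [p for p in Path(bench_dir).glob("ae-domain-*.json")
                if is_eligible_artifact(p)]
    return max(eligible, key=lambda p: p.stat().st_mtime) if eligible else None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        prod = Path(tmp) / "ae-domain-2026-05-02-124730.json"
        ablation = Path(tmp) / "ae-domain-bge-small-clean-073802.json"
        prod.write_text(json.dumps({"r_at_5": 0.5758}))
        ablation.write_text(json.dumps({"r_at_5": 0.6061}))
        os.utime(prod, (1000, 1000))
        os.utime(ablation, (2000, 2000))   # newer mtime, but not eligible
        lexical_pick = sorted(p.name for p in Path(tmp).glob("ae-domain-*.json"))[-1]
        chosen = select_authority(tmp)
        return lexical_pick, chosen.name
```

The demo shows why mtime alone would not be enough either: the copy-ablation file is both lexicographically last and newest, so the eligibility filter is what keeps it from becoming authoritative.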
Synth P0-3 closeout via Option E (sonnet packet S-OptE, opus reviewed/
tested/integrated):
AE-builder lane discovered Option E: the existing sent_estimate_pdf_miner.py
in AE-LangGraph already writes typed sidecar JSONs at
data/sent-estimates-pdfs/*.json with shape:
{msg_id, thread_id, subject, from, to, date, filename, size_bytes,
text, extraction, dollar_total_guess, downloaded_at}
NM tails them. ZERO AE patch. ZERO cross-lane permission. 47 historical
sidecars picked up free on first --backfill. Tito approved no privacy
gating: "idc about privacy. i'm only user dude."
New files (~650 lines total):
- tools/ingest_sent_pdf_sidecars.py (365 lines): dry-run default,
--live opt-in (ingests via record_evidence_artifact with replay-safe
semantics already shipped at 527aeec), --sidecar-dir override,
--watermark for incremental runs, --backfill for one-shot full,
per-row try/except + structured per-row report output.
- python/test_ingest_sent_pdf_sidecars.py (284 lines): 10 contracts
covering dry-run/live/idempotent/watermark/backfill/malformed/
metadata-shape/ISO-parsing/epoch-passthrough/empty-dir.
NM mapping (per AE-builder spec):
evidence_type = "sent_pdf" (canonical EVIDENCE_TYPES entry; packet
spec said "sent_estimate_pdf" but that
isn't in the registry — would crash
_validate_evidence_type)
source_system = "sent_estimate_pdf_miner"
source_record_id = sidecar.msg_id
content = sidecar.text
valid_from = parsed from sidecar.downloaded_at (defensive ISO/epoch)
privacy_class = "financial" (mirrors record_estimate_evidence default)
metadata = {thread_id, subject, from, to, date, filename,
dollar_total_guess, capability_id: "ITEM-SENT-PDF"}
KNOWN DESIGN FLAG (Sonnet surfaced; ship-as-is for first cycle):
47 sidecars → 30 distinct msg_ids. Multi-attachment Gmail threads
(e.g. Christa's 7-PDF LOI bundle) share one msg_id. Under upsert
semantics from 527aeec, those 17 extra sidecars dedupe to
inserted=False — one memory_id per thread.
If per-PDF identity is required (each PDF as distinct evidence row),
change source_record_id to f"{msg_id}:{filename}" in a follow-up
packet. Tests would need to flip with it. Default thread-level
identity ships now since:
- it's what's tested + dry-run validated
- the use case (semantic recall) is satisfied either way for v0
- per-PDF adjustment is a one-helper change if Tito wants it later
Verification:
- python3 -m unittest python/test_ingest_sent_pdf_sidecars.py -v
→ 10/10 OK
- --help renders cleanly
- --backfill dry-run smoke against 47 real sidecars: 47/47 validate,
0 errors, JSONL written to ~/.neural_memory/ingest-dryruns/
- Spot-check first row: ISO timestamp parsed correctly to epoch
(2026-05-02T05:14:42.784156+00:00 → 1777698882.784156)
- Adjacent regression: 71/71 across 6 test files
No --live execution by Sonnet or Opus this commit (Opus runs --live
explicitly when promoting to canonical ingest).
Co-Authored-By: Claude Sonnet 4.6 (S-OptE packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
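The "defensive ISO/epoch" parsing of downloaded_at and the per-PDF identity option mentioned in the design flag can be sketched as follows. The function names are hypothetical, and treating timezone-naive ISO strings as UTC is an assumption the source does not confirm.

```python
from datetime import datetime, timezone


def parse_downloaded_at(value):
    """Return epoch seconds for an ISO-8601 string or an epoch number, else None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)                 # epoch passthrough
    if isinstance(value, str):
        text = value.strip()
        try:
            return float(text)              # epoch encoded as a string
        except ValueError:
            pass
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)   # assumption: naive means UTC
        return parsed.timestamp()
    return None


def source_record_id(sidecar, *, per_pdf=False):
    """Thread-level identity by default; per-PDF identity appends the filename."""
    if per_pdf:
        return f"{sidecar['msg_id']}:{sidecar['filename']}"
    return sidecar["msg_id"]


epoch = parse_downloaded_at("2026-05-02T05:14:42.784156+00:00")
```

With per_pdf=False, two attachments from the same Gmail message map to one evidence row under the upsert semantics; per_pdf=True would give each attachment its own row.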
Self-portrait feature handoff from AE-builder absorbed (sonnet packet
S-PORTRAIT-1, opus reviewed/tested/integrated). Tito hard rules
(non-negotiable) baked into design:
1. Agents pick their own aesthetic — no prompts written FOR them
2. Reasoning mandatory each cycle alongside visual
3. References inspiration only, not templates
4. Always-on, no manual triggers
This packet ships STEP 1 of the cycle: agent-agnostic read-only
substrate query helpers. Cycle dispatcher (S-PORTRAIT-2, separate)
calls these per-agent.
New file: python/self_portrait_substrate.py (406 lines)
- read_self_relevant_memories(mem, agent_name, limit=20)
- read_recent_reflections(mem, agent_name, limit=10)
- read_top_entities(mem, agent_name, limit=10)
- read_recent_dream_insights(mem, limit=5)
- read_peer_portraits(mem, exclude_agent, limit=3)
- compose_substrate_packet(mem, agent_name) — orchestrator entrypoint
returning {agent, ts, self_memories, self_reflections, top_entities,
dream_insights, peer_portraits}
Critical design decision (Sonnet flagged + Opus accepts):
metadata.author / metadata.actor / metadata.agent fields DO NOT
EXIST in current schema. Live attribution is fragmented across
origin_system column, source column, and metadata_json.from on
bridge_mailbox rows. Helper builds defensive multi-field SQL
predicate that picks up the canonical attribution AS SOON AS
S-PORTRAIT-2 starts writing metadata.author=<agent_name> on
kind='self_portrait' inserts. Zero-code-change forward-compatibility.
Also: kind='insight' was the spec literal; live schema uses
'dream_insight'. Helper queries kind IN ('dream_insight','insight')
for resilience.
Also: agent-filter moved into SQL WHERE (not post-fetch) after
smoke caught the failure — top-N-by-recency window can be 100%
Hermes-dominated, leaving claude-code / codex with self=0 results
if filtering happens after LIMIT.
Tests added (14/14 pass) — python/test_self_portrait_substrate.py:
- 4× read_self_relevant_memories (agent filter applied; SQL predicate
construction; empty substrate; no metadata.author tolerance)
- 2× read_recent_reflections (kind='self_portrait' missing tolerance;
legacy kind='reflection' coverage)
- 2× read_top_entities (connections graph traversal; weight ordering)
- 2× read_recent_dream_insights (kind alias; agent-agnostic)
- 2× read_peer_portraits (exclude_agent never appears; empty peers ok)
- compose_substrate_packet (all 6 keys present)
- end-to-end empty-substrate (no helper crashes on empty DB)
Adjacent regression: 81/81 across 8 test files.
Smoke import clean.
No substrate write. No image-gen. No diffusion prompts. No cron logic.
No schema change (TITO_BLOCKED canonical schema change deferred to
S-PORTRAIT-3 schema-migration packet if Tito approves).
Co-Authored-By: Claude Sonnet 4.6 (S-PORTRAIT-1 packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-gen)
Sonnet packet S-PORTRAIT-2, opus reviewed/tested/integrated. Builds on
S-PORTRAIT-1 substrate read (committed at 2cb3de9).
Tito hard rules baked in:
1. Agents pick their own aesthetic — orchestrator NEVER pre-fills,
templates, or constrains the agent's diffusion prompt. The agent passes
prompt_text + reasoning_text; we validate bounds (non-empty, ≤4000) and
pass through verbatim.
2. Reasoning mandatory each cycle (stored as the searchable content).
3. References inspiration only.
4. Always-on, no manual triggers.
New file: tools/self_portrait_cycle.py (810 lines)
Two invocation modes:
--mode scaffold (default): runs STEP 2 (substrate read via
compose_substrate_packet from S-PORTRAIT-1), writes input.json to
~/.neural_memory/portraits/<agent>/cycle-<ts>/. No image-gen, no store.
Agent picks up input.json on its next turn.
--mode complete: requires --reasoning-text + --prompt-text (both
agent-authored). Runs STEPS 4-7: validate bounds, image-gen,
diff-from-prior, store.
STEP 5 image-gen:
- Direct stdlib urllib.request → https://api.openai.com/v1/images/generations
- Defaults: gpt-image-1 model, 1536x1024 size (real OpenAI model +
supported size; handoff doc named gpt-image-2 which is a Hermes-side
alias — this orchestrator calls OpenAI directly per NM topology)
- Agents override via --image-model / --image-size (Tito rule #1)
- OPENAI_API_KEY missing → image_path=None, cycle continues
- 4xx/5xx/timeout → image_path=None, cycle continues (always-on req)
- Saves to ~/.neural_memory/portraits/<agent>/<cycle_ts>.png
- Supports url + b64_json response shapes
STEP 6 diff-from-prior:
- Deterministic token-set Jaccard with 4 bands (stable / mostly-stable /
notable-shift / major-shift). Stdlib only.
- Sonnet-subagent diff explicitly deferred to P1.
STEP 7 store:
- Uses NeuralMemory.remember() (verified at memory_client.py:1226)
- kind='self_portrait', origin_system=agent_name (per S-PORTRAIT-1
attribution recommendation), source='self_portrait_cycle',
salience=0.8, detect_conflicts=False
- metadata: {author, agent_name, cycle_ts, image_path, image_url,
prompt_text, model_used, anchor_seed, diff_from_prior,
substrate_packet_path}
Tests added (13/13 pass) — python/test_self_portrait_cycle.py:
- scaffold mode writes input packet only
- complete mode requires both reasoning and prompt
- image-gen handles missing API key gracefully
- image-gen handles API failure gracefully
- store writes correct kind + origin_system + metadata.author
- diff returns first-portrait message when no prior
- orchestrator calls compose_substrate_packet at STEP 2
- dry-run skips store
- URL-error path covered
- b64_json success path covered
- prompt validation bounds enforced
- identical-reasoning diff band
- low-overlap diff band
Smoke: --help clean. --mode scaffold against /tmp empty substrate
produces valid input.json with all 6 keys.
No actual image-gen fired during testing (OPENAI_API_KEY unset; tests
mock urllib). No commit/push by Sonnet, no substrate write, no live
cron load.
Co-Authored-By: Claude Sonnet 4.6 (S-PORTRAIT-2 packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
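The STEP 6 diff-from-prior in the S-PORTRAIT-2 message is described as a deterministic token-set Jaccard with four bands. A minimal stdlib sketch of that idea follows. The band cut-offs (0.8 / 0.5 / 0.2), the tokenizer, and the function names are assumptions for illustration; the commit message names the bands but not the thresholds.

```python
import re

# Hypothetical cut-offs: the commit message does not state the real thresholds.
BANDS = [(0.8, "stable"), (0.5, "mostly-stable"), (0.2, "notable-shift")]


def token_set(text):
    """Lowercase alphanumeric tokens as a set (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over token sets."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def diff_band(prior, current):
    """Map the similarity score onto one of the four named bands."""
    score = jaccard(prior, current)
    for floor, name in BANDS:
        if score >= floor:
            return name
    return "major-shift"
```

Because the score depends only on the two token sets, the same pair of reasoning texts always lands in the same band, which is what makes the identical-reasoning and low-overlap tests deterministic.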
Sonnet packet S-PORTRAIT-3, opus reviewed/tested/integrated.
Schedules the self-portrait cycle every 6 hours per Tito hard rule #4
(always-on, no manual triggers).
New files:
- tools/launchd/com.ae.neural-self-portrait.plist (124 lines)
- tools/self_portrait_cron.sh (82 lines)
- python/test_self_portrait_cron_plist.py (219 lines)
Plist (com.ae.neural-self-portrait):
- StartCalendarInterval array of 4 dicts firing at 06:00 / 12:00 /
18:00 / 00:00 local (offset +3h from D5 03:00 daily aggregator, so
cycle reads fresh insights)
- KeepAlive: SuccessfulExit=false (only respawn on failure)
- RunAtLoad: true (fires immediately when first loaded)
- ThrottleInterval: 30
- Logs: ~/Library/Logs/ae/neural-self-portrait.{stdout,stderr}.log
- Env: HOME + PATH only (OPENAI_API_KEY NOT in plist; cycle reads from
agent env at complete-cycle time)
Wrapper (tools/self_portrait_cron.sh):
- bash (NOT zsh — more portable for launchd)
- set -uo pipefail (NOT -e — per-agent failures don't starve the loop)
- Iterates AGENTS=("claude-code" "hermes" "codex") — v0 set
- Per-agent: python3 tools/self_portrait_cycle.py --agent <a> --mode scaffold
- Writes input packets only — agents do their own complete-cycle later
(Tito rule #1 — agents author their own reasoning + prompts)
- ISO-8601 timestamped log lines + rolling log at
~/Library/Logs/ae/neural-self-portrait.log
- Exits 0 only if all agents succeed (KeepAlive=false-on-success retries)
Caught XML-comment trap: plutil -lint passes <agent> and -- inside
comments, but plistlib.load (expat) rejects both. Comment prose
rewritten to avoid; 8 plistlib-based tests added so the regression is
caught immediately.
Tests added (15/15 pass) — python/test_self_portrait_cron_plist.py:
- plist XML valid (plutil -lint subprocess)
- plist has 4 calendar intervals
- calendar intervals are 06/12/18/00
- wrapper invokes self_portrait_cycle
- wrapper iterates v0 agent set
- wrapper uses set -uo pipefail (not -e)
- 8 plistlib-based regression tests (XML-comment trap)
- script paths exist + executable plan
Verification:
- plutil -lint OK
- bash -n OK
- 15/15 unit tests pass
Plist NOT loaded by Sonnet or Opus this commit. To activate:
launchctl bootstrap gui/$(id -u) tools/launchd/com.ae.neural-self-portrait.plist
(after S-PORTRAIT-2 is on disk — which it is now at parent commit)
Wrapper NOT chmod'd executable yet — Tito or follow-up commit handles
permissions before live cron.
Co-Authored-By: Claude Sonnet 4.6 (S-PORTRAIT-3 packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
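The plistlib-based regression tests in the S-PORTRAIT-3 message guard against comment text that a lenient linter accepts but Python's expat-backed parser rejects. The sketch below shows only the double-hyphen case on a generated plist. The plist content is a reduced stand-in (just `Label` and `StartCalendarInterval`), the comment strings are made up, and `loads_ok` is a hypothetical helper rather than anything from the test file.

```python
import plistlib

# Reduced stand-in for the launchd job: four calendar slots, six hours apart.
job = {
    "Label": "com.ae.neural-self-portrait",
    "StartCalendarInterval": [{"Hour": h, "Minute": 0} for h in (0, 6, 12, 18)],
}
xml = plistlib.dumps(job)


def loads_ok(data):
    """Return True if plistlib can parse the bytes, False on any parse error."""
    try:
        plistlib.loads(data)
        return True
    except Exception:  # expat surfaces malformed XML as a parse exception
        return False


# A comment without a double hyphen parses; one containing "--" is not
# well-formed XML, so the expat-based parser refuses it.
safe = xml.replace(b"<dict>", b"<dict>\n<!-- scaffold mode only -->", 1)
trap = xml.replace(b"<dict>", b"<dict>\n<!-- runs with --mode scaffold -->", 1)

hours = sorted(d["Hour"] for d in plistlib.loads(safe)["StartCalendarInterval"])
```

Parsing the plist with the same library the tests use keeps the check independent of whatever a platform linter happens to tolerate.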
Sonnet packet S-PORTRAIT-PERF, opus reviewed/tested/integrated.
Cron load test (commit 920ac4d) revealed the scaffold-mode path stuck
for 150+ seconds at 70% CPU + 1.9GB RAM, never producing input.json.
Bootout + kill needed.
ROOT CAUSE: tools/self_portrait_cycle.py::_open_memory() instantiated
full NeuralMemory(db_path=...) which loads embedder + HNSW + reranker +
in-memory graph (~30-60s cold load). But compose_substrate_packet only
uses raw mem.store.conn for SELECT-only SQL. Scaffold mode never needs
the heavy stack.
FIX:
- python/self_portrait_substrate.py: helpers now accept either
(a) NeuralMemory instance (via .store.conn — backward compat),
(b) sqlite3.Connection directly, or
(c) string/Path to memory.db (opens read-only sqlite3 URI).
New _get_conn() dispatcher routes by type. Backward compat for
MagicMock-mem path preserved (NM check first since MagicMock has
implicit .cursor()).
- tools/self_portrait_cycle.py: _open_substrate_lightweight() opens
sqlite3 read-only directly (no NM init). main() routes scaffold mode
through it; complete mode still uses _open_memory (needs NM for STEP 7
store with auto-embed).
- Scaffold path now avoids importing NeuralMemory entirely.
Tests: 34/34 pass (was 27, +7 new):
- 6 in CompositionInputDispatchTests (sqlite3 conn / db_path string /
pathlib.Path / mem-object backward compat / dispatch precedence /
unsupported-input rejection)
- 1 ScaffoldPerfTests subprocess assertion (timeout=10s against real
substrate; SKIP if substrate absent for CI)
LIVE SMOKE (real substrate ~/.neural_memory/memory.db, 15K+ memories):
time python3 tools/self_portrait_cycle.py --agent claude-code --mode scaffold
→ 0.72s wall-clock (was 150s+ stuck before fix)
→ input.json written at ~/.neural_memory/portraits/claude-code/cycle-<ts>/
→ 58.7 KB, 7 keys, 20 self_memories + 10 top_entities + 3 dream_insights
→ memory_client/embed_provider/sentence_transformers NOT in sys.modules
WAL gotcha noted: read-only sqlite3 conn can't create -shm/-wal files;
transient SQLITE_BUSY caught by _safe_execute defensive try/except.
No commit/push by Sonnet, no substrate write, no plist reload (Opus
reloads after this commit).
Co-Authored-By: Claude Sonnet 4.6 (S-PORTRAIT-PERF packet) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
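The S-PORTRAIT-PERF fix routes three input shapes to one SQLite connection and opens paths read-only. A minimal sketch of such a dispatcher follows, assuming the memory object exposes `.store.conn` as stated in the message. `get_conn` is a hypothetical stand-in for `_get_conn()`, and it checks `sqlite3.Connection` first via `isinstance`, which differs from the NM-first precedence the commit describes.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def get_conn(source):
    """Route a connection, a db path, or a memory-like object to a connection."""
    if isinstance(source, sqlite3.Connection):
        return source
    if isinstance(source, (str, Path)):
        # file: URI with mode=ro opens the database read-only.
        uri = Path(source).resolve().as_uri() + "?mode=ro"
        return sqlite3.connect(uri, uri=True)
    store = getattr(source, "store", None)
    if store is not None and hasattr(store, "conn"):
        return store.conn
    raise TypeError(f"unsupported substrate input: {type(source).__name__}")


# Demo against a throwaway database file.
db_path = os.path.join(tempfile.mkdtemp(), "memory.db")
rw = sqlite3.connect(db_path)
rw.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
rw.execute("INSERT INTO memories (content) VALUES ('hello')")
rw.commit()
rw.close()

ro = get_conn(db_path)
count = ro.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
try:
    ro.execute("INSERT INTO memories (content) VALUES ('blocked')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True  # read-only handle refuses writes
```

Opening by path this way needs nothing beyond the standard library, which is why the scaffold path can skip the embedder and index load entirely.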
…cal --db (S1)
tools/ae_domain_bench_run.sh previously selected --prev-results via raw
ls -t and omitted explicit canonical --db. Result: stale-HEAD or
copy/ablation artifacts could become the comparison baseline, and the
run could implicitly target a non-canonical DB.
S1 packet (NM-builder lane, dispatched by Opus):
- Added _select_eligible_prev python-heredoc selector with 14 rejection
criteria (stale HEAD, no provenance, db_path != canonical, missing
per_query, null memory_count/active_connection_count, copy/ablation
filename markers, etc.).
- Hard-coded CANONICAL_DB="${HOME}/.neural_memory/memory.db" and added
--db "$CANONICAL_DB" to the python invocation.
- Exposed --select-eligible-prev <dir> [--current-head <sha>] dry-run
mode for direct testability.
- New python/test_ae_domain_bench_run_authority.py: 20 tests covering
positive selection + 11 rejection criteria + 2 fallback paths +
ordering + 4 shell-invocation contract tests.
Real bench-history scan (10 priors): all 10 rejected (most-recent
ae-domain-2026-05-03-032052.json rejected on db_path=(default), 7 on
no-provenance, 2 on tag:bge-small, 1 on mode=None). Fallback engages
cleanly. Next bench run becomes the first eligible authority artifact.
Tests: 20/20 + 10/10 collateral test_ae_bench_harness.py pass.
Closes LIVE_FEED Active P0 #2 (BENCHMARK_GAP — shell authority).
Synth contract: LIVE_FEED 2026-05-03T10:41:47Z, S1 dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S1-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion (S2 + Opus race fix)
Canonical DB had PRAGMA user_version=0 and no evidence-identity authority.
record_evidence_artifact previously deduped via JSON-scan of metadata —
not a real DB-level guard.
S2 packet (NM-builder lane, dispatched by Opus):
- python/schema_upgrade.py: additive evidence_ledger table
(evidence_id PK, memory_id, evidence_type, source_system,
source_record_id, status, inserted_at, updated_at, metadata_hash) with
UNIQUE indexes on (source_system, source_record_id) and
(evidence_type, source_record_id). PRAGMA user_version 0 → 1.
Migration is idempotent (CREATE IF NOT EXISTS, no DROP/ALTER).
- python/ae_workflow_helpers.py:
* _ledger_reserve uses INSERT OR IGNORE for atomic claim.
* Winner: mem.remember() then _ledger_set_memory_id patches in.
* Loser: re-reads ledger; if memory_id NULL still, falls back to
legacy json_extract path (or fresh remember as last resort).
* Helper return shape {memory_id, evidence_id, inserted} preserved
exactly across all 4 evidence helpers.
* Pre-upgrade DBs and non-SQLite stores transparently fall back to
legacy json_extract scan.
Opus race-fix follow-up (commit-time): S2's loser path returned None
when memory_id was still NULL (winner mid-flight), which made all 8
threads in the race test fall through to mem.remember() — exposing a
non-thread-safe iteration in HNSW/connection_graph internals (RuntimeError:
dictionary changed size during iteration). Wrapping the full pipeline
in store._lock would deadlock since mem.remember re-acquires the same
non-reentrant Lock internally. Fix: loser polls _ledger_lookup with
40 × 25ms (1s budget), releasing store._lock between polls so the winner
can complete; falls through to a fresh remember only if the budget
exhausts. Race test now passes consistently.
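The reserve-then-poll shape described above can be sketched against an in-memory ledger. This is a simplified illustration: the table has only two columns, the function names drop the leading underscore to mark them as stand-ins, and the `store._lock` release between polls is omitted because no lock exists in this single-connection demo.

```python
import sqlite3
import time


def ledger_reserve(conn, evidence_id):
    """Atomically claim an evidence_id; True only for the caller that inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO evidence_ledger (evidence_id) VALUES (?)",
        (evidence_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def ledger_lookup(conn, evidence_id):
    """Return the recorded memory_id, or None while the winner is mid-flight."""
    row = conn.execute(
        "SELECT memory_id FROM evidence_ledger WHERE evidence_id = ?",
        (evidence_id,),
    ).fetchone()
    return row[0] if row else None


def wait_for_winner(conn, evidence_id, polls=40, interval=0.025):
    """Loser path: poll up to polls * interval seconds for the winner's memory_id."""
    for _ in range(polls):
        memory_id = ledger_lookup(conn, evidence_id)
        if memory_id is not None:
            return memory_id
        time.sleep(interval)
    return None  # budget exhausted; the caller decides what happens next


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evidence_ledger (evidence_id TEXT PRIMARY KEY, memory_id INTEGER)"
)
first = ledger_reserve(conn, "abc123")   # winner
second = ledger_reserve(conn, "abc123")  # loser
pending = wait_for_winner(conn, "abc123", polls=2, interval=0.001)
conn.execute("UPDATE evidence_ledger SET memory_id = 42 WHERE evidence_id = 'abc123'")
conn.commit()
resolved = wait_for_winner(conn, "abc123", polls=2, interval=0.001)
```

The default arguments mirror the 40 × 25ms budget from the message; the demo uses a much shorter budget only to keep it fast.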
Tests: 12 schema + 26 evidence (incl. 8-thread race test) +
26 sent-pdf consumer = 64/64 pass.
Closes LIVE_FEED Active P0 #3 (REPO_DB_CONTRACT_GAP — evidence
identity DB guard).
Schema upgrade is NOT auto-applied to canonical DB. Tito ACK gates
that. Pre-existing JSON-scan path remains for un-upgraded DBs.
Synth contract: LIVE_FEED 2026-05-03T10:41:47Z, S2 dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S2-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-live (S3)
Bare msg_id was collision-prone — the sent-PDF corpus has 11
duplicate-msg_id groups covering 35/63 sidecars (multiple PDF attachments
per email). Without composite identity, a second sidecar with the same
msg_id would have silently merged into or overwritten the first.
S3 packet (NM-builder lane, dispatched by Opus):
- tools/ingest_sent_pdf_sidecars.py:
* source_record_id = f"{msg_id}:{filename}" (composite). Fallback to
f"{msg_id}:{sha256(pdf)[:16]}" only when filename missing/empty
(current corpus: 0/63 missing).
* Watermark schema bumped v1 → v2: processed_keys is set of
composites; legacy v1 watermarks auto-discarded with INFO log.
* Added --db-path flag (defaults to canonical).
* Pre-flight refusal for --live: REFUSE (exit 5) unless target DB has
evidence_ledger table OR user_version >= 1 (mirrors S2's target).
Applies to canonical default AND --db-path copies. Dry-run mode
unchanged.
* Metadata enriched: msg_id preserved + source_record_key_strategy
∈ {filename, filehash}.
- python/test_ingest_sent_pdf_sidecars.py: 26 tests including the
duplicate-msg_id fixture proving 2 sidecars sharing msg_id produce
2 distinct ledger entries.
Real-corpus dry-run (63 sidecars): 63 distinct composite keys, 63
distinct evidence_ids, 0 collisions. All 11 duplicate-msg_id groups
resolved into per-attachment ledger rows.
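The composite key and its fallback follow directly from the two f-strings quoted in the S3 packet. A minimal sketch follows, with a hypothetical function name and return shape (key plus the `source_record_key_strategy` value).

```python
import hashlib


def composite_source_record_id(msg_id, filename, pdf_bytes):
    """Build msg_id:filename, falling back to msg_id:sha256(pdf)[:16]."""
    if filename:
        return f"{msg_id}:{filename}", "filename"
    digest = hashlib.sha256(pdf_bytes).hexdigest()[:16]
    return f"{msg_id}:{digest}", "filehash"


# Two attachments on the same email now yield two distinct keys.
key_a, strategy_a = composite_source_record_id("m1", "estimate.pdf", b"%PDF-a")
key_b, strategy_b = composite_source_record_id("m1", "invoice.pdf", b"%PDF-b")
key_c, strategy_c = composite_source_record_id("m1", "", b"%PDF-c")
```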
Real canonical-DB --live refusal proof: ~/.neural_memory/memory.db has
user_version=0 + evidence_ledger absent → tool exits 5 with explicit
"DB guard not present" reason. Will auto-clear when S2's schema_upgrade
is applied to canonical (Tito-gated).
Tests: 26/26 pass.
Closes LIVE_FEED Active P0 #5 (REWORK — sent-PDF identity/live safety).
--live first invocation against canonical remains gated on:
(1) S2 schema_upgrade applied to canonical, (2) Tito ACK, (3) Opus runs.
Synth contract: LIVE_FEED 2026-05-03T10:41:47Z, S3 dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S3-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real WA crew chat ingest is TITO_BLOCKED until canonical batch path /
owner / sample JSONL appear. NM stays consumer-side; producer is
Hermes-lane work (HANDOFF-hermes-wa-producer-spec.md already shipped).
Hardening the validator now means edge cases are pre-caught before
real batches arrive.
S5 packet (NM-builder lane, dispatched by Opus):
- tools/ingest_wa_dryrun.py:
* Required-field check: missing/null thread_id, sender, raw_text,
ts → row invalid with explicit reason string.
* ts must parse as ISO8601 (rejects epoch ints, ambiguous formats);
retains numeric-ts back-compat shim _ts_to_epoch so S2's
to_typed_record evidence_id parity test stays green.
* thread_id pattern WARNING for non-WA shapes (not reject — shape
may evolve).
* privacy_class enum: rejects values not in
{internal, financial, pii_low}.
* lang code WARNING if not 2-letter ISO 639-1.
* media_paths element-level + shell-meta safety checks.
* consumer_hint / boundary_violation_suspect / normalized_text /
auth_proof type checks.
* evidence_id format check (sha256 first 16 hex) + parity proof
against record_wa_crew_event helper (matches deterministic
formula sha256("wa_crew_message|hermes_wa_bridge|<source_record_id>")[:16]).
* Per-row report: {row_index, valid, errors, warnings,
computed_evidence_id}.
* Codified exit codes: rc=0 all valid, rc=2 other failure, rc=3
any invalid.
* --report-jsonl mode for machine-readable output.
- python/test_ingest_wa_dryrun.py: 60 tests across 13 test classes.
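The deterministic evidence_id formula quoted in the list above is small enough to sketch in full. The UTF-8 encoding, the function names, and the sample record id are assumptions; the pipe-joined key and the 16-hex-character truncation come from the packet text.

```python
import hashlib
import re

EVIDENCE_ID_RE = re.compile(r"^[0-9a-f]{16}$")


def compute_evidence_id(evidence_type, source_system, source_record_id):
    """First 16 hex chars of sha256("<type>|<system>|<record_id>")."""
    key = f"{evidence_type}|{source_system}|{source_record_id}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def evidence_id_format_ok(value):
    """Format check only: exactly 16 lowercase hex characters."""
    return isinstance(value, str) and EVIDENCE_ID_RE.match(value) is not None


# Hypothetical source_record_id, used only to exercise the formula.
eid = compute_evidence_id("wa_crew_message", "hermes_wa_bridge", "thread-1:msg-1")
```

Since the id depends only on those three strings, a dry-run validator and a live helper that share the formula agree by construction.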
Cross-helper note (read-only check, NOT auto-fixed): record_wa_crew_event
has auth_proof typed as Optional[str], while validator treats it as
Optional[dict] per the AEEvidenceIngest v0 contract. Surfaced for Opus
review — S5 didn't modify the helper (S2's scope).
Tests: 60/60 pass; S2 lockdown test_dryrun_evidence_id_matches_live_ingest
still green (26/26 evidence test file).
Closes LIVE_FEED Active P0 #6 partial (TITO_BLOCKED — validator
hardening, no canonical writes).
Synth contract: LIVE_FEED 2026-05-03T10:41:47Z, S5 dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S5-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ set -u (S1 followup)
S1 packet's eligibility filter correctly initializes PREV_ARG=() when no
eligible prior exists, but the subsequent "${PREV_ARG[@]}" expansion is
fatal under macOS bash 3.2 + set -u (treats empty-array expansion as
"unbound variable"). Caught when Opus actually ran F9 against canonical
(S1's brief told it not to run the bench, so this surfaced post-merge).
Fix: use the bash-3.2-safe ${PREV_ARG[@]+"${PREV_ARG[@]}"} idiom.
Expansion is no-op when PREV_ARG is unset/empty, regular array
splat when populated.
Verified by F9 rerun against canonical: rc=0, produced eligible
artifact ae-domain-2026-05-03-062619.json (HEAD 38e5cd8, db_path
canonical, 38 per_query rows, model_name + substrate_counts populated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ranch (S7-FOLLOWUP fix)
Prior runtime-proof attempts could only confirm hybrid_recall was OFF
(via logger.warning fallback messages) or guess from inference. When the
env var is correctly plumbed and hybrid_recall succeeds, there was zero
log evidence — making "is hybrid recall actually firing?" unanswerable
from log inspection alone.
Add logger.info before each hybrid_recall call:
- L569 area: queue_prefetch path (rerank=False, background prefetch)
- L1029 area: _handle_recall path (rerank=True, explicit recall)
Pattern: "hybrid_recall enabled: <path> (k=<limit>)"
Future runtime proofs can now grep:
grep 'hybrid_recall enabled' ~/.hermes/logs/gateway.log
Pairs with the gateway plist relocation (env var moved from
com.ae.hermes.plist to ai.hermes.gateway.plist where the actual recall
consumer lives — see addendum 2026-05-03T13:00Z + S7-FOLLOWUP-result.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synth contract (LIVE_FEED 2026-05-03T10:41:47Z): "S6 - Label expansion:
after valid row-rich F9, add 12-20 high-value labels for
Spanish/materials/Lennar/customer-temporal with duplicate-GT integrity
tests."
S6 dispatched after the first eligible current-head/current-DB scored
authority artifact landed (ae-domain-2026-05-03-062619.json,
R@5=0.5263 on 38 labeled queries, HEAD 38e5cd8).
Coverage targets (Tito reframing honored — no Spanish-WA labels):
- customer-temporal: +3 (TMP-011, TMP-026, TMP-033) — needed >=2
- materials SKU ambiguity: +5 (MAT-001/003/005/011/014) — needed >=3
- Lennar permit/lot R6-10: +5 lot-anchored (LOT-008/014/016/025/033) —
needed >=3
- stubs 274/286/277/288: covered by 11 existing — no new needed
- permits 13687/13688/13692/13693: covered by 6 existing — no new needed
- Spanish-WA: 0 (deferred until real Hermes WA batch arrives)
Total scored: 38 → 57 (+19). 28 distinct new GT memory_ids verified
present in canonical at HEAD c4f69e2 with 1536-dim embedding +
non-empty content.
Integrity test added:
test_duplicate_ground_truth_set_pair_count_under_cap enforces a
documented cap of 18 GT-set collisions (existing corpus had 17; 3 NEW
intentional collisions added for cross-lens reuse). Strict
no-duplicates would fail at HEAD c4f69e2; cap design catches FUTURE
lazy-labeling regressions.
Tests: 4/4 pass (label integrity) + 20/20 pass (S1 shell auth). 24/24
combined sweep green.
Expected R@5 movement (per LIVE_FEED caveat "regression-gate fires on
label-expansion drift, not quality"): small drop possible (0.50-0.55
range vs prior 0.5263) due to label-expansion drift on entity-dense
HD-SKU and permit-doc anchors. Investigate only if global drops below
0.40 or any category collapses >0.20 from F9 baseline.
Closes LIVE_FEED Active P1 #2 (BENCHMARK_GAP — labels too sparse for
model promotion).
Synth contract: LIVE_FEED 2026-05-03T10:41:47Z, S6 dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S6-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4 (S6a + S1e)
S6a: Validated TMP-011 (GT=[5531]), TMP-026 (GT=[264,280]), TMP-033
(GT=[268,282]) against canonical DB. All five GT ids resolved to
WRONG_CONTENT — 5531 is an Amperage Q1 invoice table (not a contact);
264/280/268/282 are current-tense "Sarah from Lennar" assertions with
no predecessor or change-event semantics. No better GT ids exist in
corpus. Moved entries to category="quarantined_temporal" with empty
ground_truth_ids; bench runner skips empty-GT entries via the existing
`if not q["ground_truth_ids"]: continue` guard, so customer_temporal
R@5 no longer gets dragged to 0 by mislabels.
S1e: Lowered _SCORED_QUERY_FLOOR from 57 to 54 with full quarantine
rationale inline. Floor decrement is deliberate, audited removal of bad
labels — not silent label drift. Test message preserved as a guardrail
for future drops.
Tests: bench label integrity 4/4, bench subsets/gate 18/18, helper
evidence ingest 29/29.
…S1c + S1d)
S1c: run_ae_domain_bench now emits a `subsets` block with up to four
slices — preserved_33, subset_38, new_label_only, full_57 — each
carrying query_md5, git_head, db_path, substrate_counts, global_r@5,
per_category_r@5, per_query rows, and dropped_ids.
PRESERVED_33_QUERY_IDS derived from commit 03f4785; SUBSET_38_QUERY_IDS
derived from artifact ae-domain-2026-05-03-062619.json. Sets are
strictly nested 33 ⊂ 38 ⊂ 57 at HEAD.
S1d: _category_regression_gate rewritten to compute comparable_ids =
cur_per_query ∩ prev_per_query, report per-category R@5 deltas only on
the intersect, surface label_expansion_categories separately (never
fires regression), and return regression_detected on every path. Adds
--enforce-regression flag and AE_BENCH_ENFORCE_REGRESSION=1 env (rc=3
on gate fire). Existing rc=2 threshold-fail behavior preserved.
tools/ae_domain_bench_run.sh: ENFORCE_REGRESSION_ARG=() forwarded with
bash 3.2 + set -u safe expansion (preserves df9373b fix pattern).
Tests: 18 new in test_bench_subsets_gate.py (anchor counts, dedup,
nesting, subset block keys, dropped_ids, provenance, gate disabled w/o
prev, gate fires on intersect drop, label expansion does not fire).
18/18 green.
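The S1d gate compares runs only on queries that both share, so newly added labels cannot register as a regression. A minimal sketch of that intersect logic follows. The input shapes (query id mapped to a hit flag, query id mapped to a category), the `max_drop` tolerance, and the function name are assumptions; the real `_category_regression_gate` signature is not shown in the message.

```python
def regression_gate(cur, prev, category_of, max_drop=0.05):
    """Compare per-category hit rates on the queries present in both runs."""
    comparable = set(cur) & set(prev)
    deltas = {}
    for cat in sorted({category_of[q] for q in comparable}):
        ids = [q for q in comparable if category_of[q] == cat]
        cur_rate = sum(cur[q] for q in ids) / len(ids)
        prev_rate = sum(prev[q] for q in ids) / len(ids)
        deltas[cat] = cur_rate - prev_rate
    # Categories touched by queries that exist only in the current run.
    expansion = sorted({category_of[q] for q in set(cur) - set(prev)})
    return {
        "comparable_ids": sorted(comparable),
        "per_category_delta": deltas,
        "label_expansion_categories": expansion,
        "regression_detected": any(d < -max_drop for d in deltas.values()),
    }


category_of = {"A1": "materials", "A2": "materials", "B1": "temporal"}
prev = {"A1": True, "A2": True}
expanded = regression_gate({"A1": True, "A2": True, "B1": False}, prev, category_of)
dropped = regression_gate({"A1": True, "A2": False, "B1": False}, prev, category_of)
```

In the first call the new miss on `B1` is reported as label expansion only; in the second, the drop on a shared query is what trips the gate.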
…ail-closed (S4b)
Contract 1 — source_record_id required. record_evidence_artifact now accepts
allow_unkeyed_nonprod: bool = False (keyword-only). When source_record_id is
None and allow_unkeyed_nonprod is False, raises ValueError naming both params.
Caller audit confirmed all three internal helpers (record_wa_crew_event,
record_estimate_evidence, record_material_price_evidence) already pass real
source_record_ids — no caller changes required. Production callers get
replay-authority by default; ad-hoc callers must explicitly opt in.
Contract 2 — ledger loser timeout fails closed. The "fall through to a fresh
mem.remember()" path after the 40-poll loop is replaced with an explicit
return: {memory_id: None, evidence_id, inserted: False, status: "pending_winner"}.
A second mem.remember() call after timeout would create a duplicate row under
the same (evidence_type, source_system, source_record_id) triple, violating
the unique-key invariant. Callers should retry; check status == "pending_winner"
before using memory_id.
Tests: 3 new (rejects None by default; allows None with opt-in; loser-timeout
returns pending_winner without invoking mem.remember and leaves exactly one
ledger row). 4 pre-existing tests updated to pass an explicit source_record_id
where they previously relied on None to reach a downstream validator. 29/29 green.
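Both contracts change what a caller has to handle: a missing key now raises, and a timed-out loser returns a `pending_winner` status instead of a memory id. The sketch below illustrates the two behaviours with stand-in functions. `record_evidence_sketch` and `resolve_with_retry` are hypothetical names, and the retry count is an arbitrary choice rather than anything the commit prescribes.

```python
def record_evidence_sketch(source_record_id=None, *, allow_unkeyed_nonprod=False):
    """Stand-in for the Contract 1 guard: None is rejected unless opted in."""
    if source_record_id is None and not allow_unkeyed_nonprod:
        raise ValueError(
            "source_record_id is required; pass allow_unkeyed_nonprod=True to opt out"
        )
    return {"source_record_id": source_record_id}


def resolve_with_retry(attempt, retries=3):
    """Caller-side handling for Contract 2: retry while status is pending_winner."""
    result = None
    for _ in range(retries):
        result = attempt()
        if result.get("status") != "pending_winner":
            return result
    return result  # still pending; memory_id must not be used


# Simulated helper responses: one timeout, then the winner's row is visible.
responses = iter([
    {"memory_id": None, "evidence_id": "e1", "inserted": False,
     "status": "pending_winner"},
    {"memory_id": 7, "evidence_id": "e1", "inserted": False},
])
final = resolve_with_retry(lambda: next(responses))

try:
    record_evidence_sketch()
    rejected = False
except ValueError:
    rejected = True
```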
Note: S2b ledger-index-review verdict (analysis-only): no migration required;
all current producers globally namespace source_record_id within their
(source_system, evidence_type). Spec written to ~/.neural_memory/sonnet-packets/
2026-05-03/S2b-result.md as constraint for future producers.
…(S5b)
Synth contract (LIVE_FEED 2026-05-03T13:43:23Z): "S5b - WA contract
parity: align auth_proof to object/null across dry-run, live helper,
metadata, and fixtures while preserving evidence_id parity."
Synth Active P1 #9: "WA dry-run requires auth_proof object/null while
record_wa_crew_event still takes/stores optional string; align before
accepting Hermes batches."
S5 (commit 38e5cd8) made the validator treat auth_proof as
Optional[dict] per AEEvidenceIngest v0 contract. The helper still
accepted Optional[str]. S5b closes the parity gap:
- record_wa_crew_event: auth_proof type Optional[str] → Optional[dict].
- Explicit ValueError on str input with message about structured shape.
- Persistence guard: `if auth_proof:` → `if auth_proof is not None:` so
empty {} is preserved as caller intent (not silently dropped).
- Docstring updated with new contract block.
Tests: 5 new tests in python/test_ae_evidence_ingest.py:
- test_record_wa_crew_event_accepts_dict_auth_proof
- test_record_wa_crew_event_accepts_none_auth_proof
- test_record_wa_crew_event_rejects_str_auth_proof
- test_record_wa_crew_event_persists_auth_proof_dict_in_metadata
- test_dryrun_validator_and_helper_agree_on_dict_auth_proof (parity canary)
1 existing test updated (test_wa_crew_event_persists_full_contract_schema)
from string to dict auth_proof.
evidence_id parity proven: validator + dryrun typed_record + live
helper + manual sha256 formula all produce 3a4e03c2e1d42547 for the
dict-auth_proof input row. auth_proof is metadata-only — never enters
the deterministic key — so by construction this contract change cannot
perturb evidence_id.
Validator (tools/ingest_wa_dryrun.py) NOT touched: validator was
already correct; this packet brings the helper into parity with it.
External callers: none outside the test file. Hermes WA producer is
still in spec phase (HANDOFF-hermes-wa-producer-spec.md), so no field
caller passes string auth_proof. No migration needed.
Tests: 94/94 pass (89 baseline + 5 new) including S2 lockdown
test_dryrun_evidence_id_matches_live_ingest.
Closes LIVE_FEED Active P1 #9 (CONSUMER_CONTRACT_GAP — WA auth_proof
parity).
Synth contract: LIVE_FEED 2026-05-03T13:43:23Z, S5b dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S5b-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
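The S5b persistence guard turns on the difference between a falsy check and an explicit `None` check. The following sketch isolates that difference. `build_wa_metadata` is a hypothetical helper, not the real `record_wa_crew_event`, and the error message wording is invented.

```python
def build_wa_metadata(auth_proof=None):
    """Keep auth_proof in metadata whenever the caller supplied a dict, even {}."""
    if isinstance(auth_proof, str):
        raise ValueError("auth_proof must be a structured object (dict) or None")
    metadata = {}
    # `if auth_proof:` would drop {} because an empty dict is falsy;
    # `is not None` preserves it as explicit caller intent.
    if auth_proof is not None:
        metadata["auth_proof"] = auth_proof
    return metadata


with_empty = build_wa_metadata({})
with_none = build_wa_metadata(None)
with_dict = build_wa_metadata({"signer": "bridge"})
```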
… (S2b)
Synth contract (LIVE_FEED 2026-05-03T13:43:23Z): "S2b - ledger index review:
prove all producers globally namespace source_record_id, OR migrate unique
policy to include source_system."
Synth Active P1 #11: "ledger uniqueness/comment policy now says evidence id
includes source_system, but the unique type-record index is still
(evidence_type, source_record_id); prove global namespacing or migrate
before multi-source producers."
S2b chose Approach B (migrate). Driving finding: evidence_type="sent_pdf"
is already produced by TWO source_systems with overlapping key shapes —
record_estimate_evidence (source_system=ae_dashboard, key={estimate_id}:
{event_type}) AND tools/ingest_sent_pdf_sidecars (source_system=
sent_estimate_pdf_miner, key={msg_id}:{filename}). The v1 narrower index
would block legitimate cross-source coexistence. Approach A (proving global
uniqueness) would freeze key-format contracts forever as public API — a
bigger cost than fixing the index.
Migration (idempotent + additive-friendly):
- user_version 1 → 2 (short-circuit on already-v2)
- DROP INDEX IF EXISTS idx_evidence_ledger_type_record (old narrow)
- CREATE INDEX IF NOT EXISTS idx_evidence_ledger_type_source_record
ON evidence_ledger(evidence_type, source_system, source_record_id)
- evidence_ledger row count is 0 in canonical at migration time, so the
index swap loses no data (and would lose none even with rows present).
- Strict-superset on the index (preserves all pre-existing
row-uniqueness invariants).
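The migration steps listed above map onto a short, repeatable SQLite routine. This is a sketch under assumptions: the ledger table is reduced to four columns, the new index is created as UNIQUE (the message writes CREATE INDEX but describes it as a uniqueness policy), and `upgrade_v1_to_v2` is a hypothetical name rather than the SchemaUpgrade method.

```python
import sqlite3


def upgrade_v1_to_v2(conn):
    """Swap the narrow uniqueness index for the wider one; no-op if already v2."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version >= 2:
        return False  # short-circuit on already-v2
    conn.execute("DROP INDEX IF EXISTS idx_evidence_ledger_type_record")
    # Assumption: UNIQUE, since the index is meant to enforce identity.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_evidence_ledger_type_source_record "
        "ON evidence_ledger(evidence_type, source_system, source_record_id)"
    )
    conn.execute("PRAGMA user_version = 2")
    conn.commit()
    return True


# Hand-built v1 fixture with the old narrow index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evidence_ledger (evidence_id TEXT PRIMARY KEY, "
    "evidence_type TEXT, source_system TEXT, source_record_id TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX idx_evidence_ledger_type_record "
    "ON evidence_ledger(evidence_type, source_record_id)"
)
conn.execute("PRAGMA user_version = 1")

migrated = upgrade_v1_to_v2(conn)
repeated = upgrade_v1_to_v2(conn)

# Same evidence_type and key from two source systems now coexist.
conn.execute("INSERT INTO evidence_ledger VALUES ('e1', 'sent_pdf', 'ae_dashboard', 'k1')")
conn.execute(
    "INSERT INTO evidence_ledger VALUES ('e2', 'sent_pdf', 'sent_estimate_pdf_miner', 'k1')"
)
row_count = conn.execute("SELECT COUNT(*) FROM evidence_ledger").fetchone()[0]
version = conn.execute("PRAGMA user_version").fetchone()[0]
```

Checking `user_version` first is what makes a second run a no-op, and the `IF EXISTS` / `IF NOT EXISTS` clauses keep each statement safe to repeat on its own.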
Tests:
- python/test_schema_upgrade.py: 16/16 pass (2 updated for v2 contract,
4 new v1→v2 migration tests including idempotency proof against
hand-built v1 fixture via _create_v1_ledger).
- python/test_evidence_ledger_namespace.py (NEW): 7/7 pass (per-helper
source_record_id shape inventory + dual-source acceptance test
proving sent_pdf from ae_dashboard AND sent_estimate_pdf_miner can
coexist under v2 index).
Aggregate sweep: 185/185 + 11 subtests across S5b + S2b + collateral.
OPEN TITO ACK: applying SchemaUpgrade(canonical).upgrade() will trigger
v1→v2 migration on canonical DB. Migration is the safest possible
window — evidence_ledger rows = 0 — but it's a substrate mutation
requiring explicit Tito ACK per hard rules.
Closes LIVE_FEED Active P1 #11 (REPO_DB_CONTRACT_GAP — ledger uniqueness).
Synth contract: LIVE_FEED 2026-05-03T13:43:23Z, S2b dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S2b-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ulti-attachment (S4c)
Synth contract (LIVE_FEED 2026-05-03T13:43:23Z):
"S4c-estimate-pdf-identity: bring record_estimate_evidence identity up
to sent-PDF sidecar standard: include PDF path/hash, message id,
recipient, sent_at, or attachment ordinal for resend/multi-attachment
proof."
Synth Active P1 #10: "record_estimate_evidence identity can still
collapse resend or multi-attachment sent-PDF events; bring helper
identity up to sidecar composite standard."
Wave 1 (S3 commit 25b3dc7) shipped composite identity for the
producer-side tool tools/ingest_sent_pdf_sidecars.py: source_record_id
= msg_id:filename (fallback msg_id:filehash). The helper
record_estimate_evidence still used {estimate_id}:{event_type} only —
would silently merge resends and multi-attachment sends into one
ledger row.
S4c adds 4 OPTIONAL keyword-only params:
- pdf_path (or pdf_sha256 fallback) — disambiguates multi-attachment
- msg_id — Gmail message id, disambiguates resends
- recipient — metadata-only (does NOT enter identity to avoid
over-fragmenting batch sends)
- attachment_ordinal — 1-indexed, disambiguates same-sha attachments
Compose order locked: pdf → msg → sent → att. Format:
base = f"{estimate_id}:{event_type}"
+ ":pdf=<basename(pdf_path) or sha[:16]>" if any
+ ":msg=<msg_id>" if any
+ ":sent=<int(sent_at)>" if any
+ ":att=<n>" if any
Backward-compat preserved: existing callers passing only (estimate_id,
event_type) produce the unchanged f"{estimate_id}:{event_type}"
formula. Locked in
test_record_estimate_evidence_backward_compat_unchanged_source_record_id.
8 new tests in python/test_ae_evidence_ingest.py:
- backward_compat_unchanged_source_record_id
- resend_disambiguation_via_msg_id
- multi_attachment_disambiguation_via_pdf_path
- attachment_ordinal_disambiguation
- pdf_sha256_fallback_when_path_missing
- composite_formula_deterministic
- replay_dedup_with_full_composite
- recipient_metadata_only_does_not_affect_identity
Tests: 75/75 in-scope pass (test_ae_evidence_ingest.py +
test_ingest_sent_pdf_sidecars.py + test_evidence_ledger_namespace.py).
Caller audit: ZERO production callers in repo besides tests + reference
table in test_evidence_ledger_namespace.py (literal-dict only). The
producer-side tools/ingest_sent_pdf_sidecars.py uses
record_evidence_artifact directly via S3, not record_estimate_evidence.
Closes LIVE_FEED Active P1 #10 (REWORK — estimate-pdf identity
collision risk).
Synth contract: LIVE_FEED 2026-05-03T13:43:23Z, S4c dispatch.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/S4c-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
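The S4c compose order (pdf, then msg, then sent, then att) can be written out as a small pure function. This is a sketch following the format quoted in the message. The function name is hypothetical, `sent_at` is assumed to be an existing parameter of the helper, and `recipient` is accepted but deliberately left out of the key.

```python
import os


def estimate_source_record_id(estimate_id, event_type, *, pdf_path=None,
                              pdf_sha256=None, msg_id=None, sent_at=None,
                              attachment_ordinal=None, recipient=None):
    """Compose the identity key in the locked order: pdf, msg, sent, att."""
    parts = [f"{estimate_id}:{event_type}"]
    if pdf_path:
        parts.append(f"pdf={os.path.basename(pdf_path)}")
    elif pdf_sha256:
        parts.append(f"pdf={pdf_sha256[:16]}")
    if msg_id:
        parts.append(f"msg={msg_id}")
    if sent_at is not None:
        parts.append(f"sent={int(sent_at)}")
    if attachment_ordinal is not None:
        parts.append(f"att={attachment_ordinal}")
    # recipient is metadata-only and never enters the key.
    return ":".join(parts)


legacy = estimate_source_record_id("E1", "sent")
full = estimate_source_record_id(
    "E1", "sent", pdf_path="/tmp/out/a.pdf", msg_id="m1",
    sent_at=1700000000, attachment_ordinal=2, recipient="x@example.com",
)
```

With no optional arguments the key collapses to the original two-part form, which is the backward-compatibility property the first new test locks in.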
…49 (T13)
S6-DIAG (wave 3 read-only diagnosis) identified 5 label_error candidates
where the assigned ground_truth_ids don't actually answer the query
semantically. Sonnet T13 independently re-verified each via canonical
sqlite3 read-only; zero mis-classifications confirmed.
Quarantined queries (mirror b82214c S6a/S1e pattern):
- ELC-040: GT [274,286] = 33-char "Lennar lot 27 needs panel labels."; no permit/inspection/rework overlap with the query semantics.
- MAT-004: GT [5961] = single-item grounding-bushings doctrine, not a BOM.
- FIN-002: GT [2628,2659] = OVER-BUY doctrine framing over-buy as "deliberate, NOT a bug", semantically opposite of OVERRUN query.
- LOT-008: GT [5531] = pure Q1 Amperage invoice table; zero delivery/delay tokens.
- SPA-010: GT = 5x byte-identical dupes of "falta el breaker" (intent=materials_missing), semantically opposite of "comprar" (BUYING) in the query.
Quarantine pattern: keep query definition, set ground_truth_ids=[],
category="quarantined_<original>", inline 5-line rationale block per query.
Pre-existing _QUARANTINED audit-trail block (codex's b82214c TMP
quarantine) extended with T13's 5 entries.
python/test_ae_bench_label_integrity.py:
- _SCORED_QUERY_FLOOR lowered 54 → 49 with full rationale block.
- New test test_quarantined_queries_excluded_from_scoring with 8 subtests asserting each quarantined query (3 from S6a + 5 from T13) has empty GT, correct quarantined_<cat> category, and is NOT in the scored set.
Tests: 5 passed, 8 subtests passed in 0.03s.
Cross-check test_bench_subsets_gate.py: 18 passed (no regression from
category renames).
Expected R@5 movement (T13 isolated): baseline 0.5370 on 54 → predicted
~0.5918 on 49 from removing always-miss entries from denominator
(+0.083 honesty-only lift). Combined with T12's bench-meta filter,
S6-DIAG projects 0.72-0.75 region.
Closes LIVE_FEED P1 #8 partial (5 label_error subset of AE_EFFECTIVENESS
diagnosis).
Synth contract: LIVE_FEED 2026-05-03T13:43:23Z, S6-DIAG findings.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/T13-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
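For readers skimming the T13 commit, the quarantine pattern it describes might look roughly like the sketch below. The dict shape of a query, the `quarantine` and `scored_queries` helpers, the `quarantine_rationale` key, the "electrical" category, and the ELC-041 entry are all assumptions for illustration; only the `ground_truth_ids=[]` and `category="quarantined_<original>"` convention comes from the commit message.

```python
from copy import deepcopy


def quarantine(query: dict, rationale: str) -> dict:
    """Return a quarantined copy: definition kept, GT emptied, category prefixed."""
    q = deepcopy(query)
    q["ground_truth_ids"] = []
    if not q["category"].startswith("quarantined_"):
        q["category"] = "quarantined_" + q["category"]
    q["quarantine_rationale"] = rationale  # hypothetical field name
    return q


def scored_queries(queries: list) -> list:
    """Only non-quarantined queries count toward the R@k denominator."""
    return [q for q in queries if not q["category"].startswith("quarantined_")]


# Hypothetical bench entries; the category name and ELC-041 are made up.
queries = [
    {"id": "ELC-040", "category": "electrical", "ground_truth_ids": [274, 286]},
    {"id": "ELC-041", "category": "electrical", "ground_truth_ids": [300]},
]
queries[0] = quarantine(queries[0], "GT does not answer the query semantically")
scored_ids = [q["id"] for q in scored_queries(queries)]
```

Keeping the query definition in place (rather than deleting it) preserves the audit trail while removing the entry from the scored set.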
S6-DIAG (wave 3 read-only diagnosis) identified the dominant retrieval
failure mode: ~8 docs in claude_memory source literally enumerate bench
query IDs + GT memory_ids in their content (likely created when
archeologist subagents wrote about bench performance INTO substrate).
They out-rank real GT memories in retrieval. mid 7931 alone appears in
top-5 of 5 different misses.
T12 implements a content-pattern filter at the BENCH-EVAL layer (NOT
production retrieval: bench-meta docs may be informative for real
production queries; only bench scoring excludes them).
benchmarks/ae_domain_memory_bench/run_ae_domain_bench.py:
- BENCH_META_EXCLUDE_IDS = (7928, 7931, 14459, 7975, 14280, 7976, 7914, 7171): curated from S6-DIAG §4 Cluster A frequency table; each ID verified via read-only sqlite content read against canonical DB; each annotated with comment citing S6-DIAG miss-frequency.
- BENCH_META_EXCLUDE_CONTENT_PATTERNS: 8 regex (defense-in-depth for new bench-meta docs that may land in substrate after curation).
- _is_bench_meta(memory_id, content) -> bool predicate.
- run_scored over-fetches k+8 from hybrid_recall, filters via _is_bench_meta, trims to k. Per-query exclusion counts + IDs recorded in artifact's new bench_meta_filter block.
- Disable via env var NM_BENCH_DISABLE_META_FILTER=1 (added to _PROVENANCE_ENV_KEYS for sanity check / debugging).
benchmarks/ae_domain_memory_bench/test_bench_subsets_gate.py:
- test_bench_meta_filter_excludes_known_meta_ids
- test_bench_meta_filter_does_not_exclude_real_gt
- test_bench_meta_filter_provenance_recorded_in_artifact
- test_bench_meta_filter_disabled_via_env_var
- 1 additional test added by T12
Tests: 23/23 bench-subsets-gate (5 new T12 + 18 existing) + 20/20
authority + 5/5 label-integrity (post-T13 quarantine) = 48/48 + 8 subtests.
Predicted R@5 lift basis (computed empirically from F9 authority):
- Conservative (filter on existing top-10, no over-fetch): 0.5370 → ~0.5741 (+0.0371). 2 hard flips: MAT-015, LOT-015.
- Upper bound (5 misses where h10=1 AND bench-meta in top-5): R@5 → ~0.6296 (+0.0926).
- Production lift will land between these because runtime over-fetches k+8 so rank-11..18 GTs (currently invisible at k=10) can also promote.
- 13 of 25 misses contain at least one curated bench-meta ID in top-5.
Combined T13 quarantine + T12 filter expected to approach S6-DIAG's
projected 0.72-0.75 region.
Substrate untouched (bench-meta docs remain in canonical; only bench
scoring excludes them).
Closes LIVE_FEED P1 #8 partial (consumer_contract_failure cluster of
AE_EFFECTIVENESS diagnosis).
Synth contract: LIVE_FEED 2026-05-03T13:43:23Z, S6-DIAG findings.
Evidence packet: ~/.neural_memory/sonnet-packets/2026-05-03/T12-result.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ooleans
Adds two top-level boolean fields to every scored artifact produced by
run_ae_domain_bench.py main():
threshold_failed = bool(categories_failed)
regression_detected = category_regression_gate.regression_detected
These surface the pass/fail verdict and regression gate status without
requiring callers to drill into categories_failed or the nested gate
block. Required by S1g (gated authority rerun) and the watcher AOR to
report closure-grade status.
Test: 4 new contracts in TestS1hThresholdBooleans
(test_bench_subsets_gate.py) covering: miss→threshold_failed=True,
pass→threshold_failed=False, regression drop→regression_detected=True,
no-prev→regression_detected=False. 27/27 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
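A minimal sketch of how the two booleans could be derived, assuming the artifact is a plain dict with a `categories_failed` list and a nested `category_regression_gate` dict (the real artifact shape is not shown here, and `add_verdict_booleans` is a hypothetical helper name):

```python
def add_verdict_booleans(artifact: dict) -> dict:
    """Return a copy of the artifact with the two top-level verdict fields."""
    out = dict(artifact)
    out["threshold_failed"] = bool(artifact.get("categories_failed"))
    # With no previous run there may be no gate block; treat that as no regression.
    gate = artifact.get("category_regression_gate") or {}
    out["regression_detected"] = bool(gate.get("regression_detected", False))
    return out


# The four contract cases from the commit, with made-up category names.
miss = add_verdict_booleans({"categories_failed": ["materials"]})
ok = add_verdict_booleans({"categories_failed": []})
drop = add_verdict_booleans(
    {"categories_failed": [], "category_regression_gate": {"regression_detected": True}}
)
no_prev = add_verdict_booleans({"categories_failed": []})
```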
Two S4 hardening changes (2026-05-03):
1. tools/ingest_sent_pdf_sidecars.py — tighten _check_db_guard to v2:
- EVIDENCE_LEDGER_TARGET_USER_VERSION 1 → 2
- Guard now requires user_version >= 2 AND evidence_ledger table AND
idx_evidence_ledger_type_source_record (v2 composite index).
- Rejects: user-version-only, table-only, v1, and any DB lacking the
index. All must be refused for --live; dry-run bypass unchanged.
2. python/schema_upgrade.py — repair malformed v2 DBs:
- _ensure_evidence_ledger no longer returns early on user_version >= 2.
- On every call, ensures the table and all v2 indexes actually exist,
creating any that are absent (malformed DB, partial install, copy).
- ledger_indexes_created reports only truly-absent indexes (idempotent
re-runs on a correct v2 DB return 0 — same semantics as before).
Tests: 45/45 pass (29 sent-PDF + 16 schema_upgrade).
- Updated make_guarded_db fixture to v2 shape (user_version=2 + index).
- Removed obsolete test_guard_passes_via_user_version_alone (v1 uv-only).
- Added test_guard_fails_via_user_version_alone and test_guard_passes_v2_full_shape.
- Updated _check_db_guard helper tests for v2 rejection semantics.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
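The v2 guard semantics described in this commit (user_version >= 2 AND table AND composite index) can be illustrated with the standard sqlite3 module. This is a sketch under assumptions, not the actual _check_db_guard: the function name `check_db_guard_v2` and the evidence_ledger column names are hypothetical.

```python
import sqlite3

EVIDENCE_LEDGER_TARGET_USER_VERSION = 2
REQUIRED_INDEX = "idx_evidence_ledger_type_source_record"


def check_db_guard_v2(conn: sqlite3.Connection) -> bool:
    """True only if user_version >= 2 AND the table AND the v2 index exist."""
    user_version = conn.execute("PRAGMA user_version").fetchone()[0]

    def exists(kind: str, name: str) -> bool:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = ? AND name = ?",
            (kind, name),
        ).fetchone()
        return row is not None

    return (
        user_version >= EVIDENCE_LEDGER_TARGET_USER_VERSION
        and exists("table", "evidence_ledger")
        and exists("index", REQUIRED_INDEX)
    )


# user-version-only DB: must be rejected under v2 semantics.
uv_only = sqlite3.connect(":memory:")
uv_only.execute("PRAGMA user_version = 2")

# Full v2 shape; column names here are made up for the example.
full = sqlite3.connect(":memory:")
full.execute("CREATE TABLE evidence_ledger (evidence_type TEXT, source_record_id TEXT)")
full.execute(
    f"CREATE INDEX {REQUIRED_INDEX} ON evidence_ledger (evidence_type, source_record_id)"
)
full.execute("PRAGMA user_version = 2")

uv_only_ok = check_db_guard_v2(uv_only)
full_ok = check_db_guard_v2(full)
```

Checking the schema objects directly, instead of trusting user_version alone, is what lets the guard refuse a malformed or partially installed database.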
…tadata
`to_typed_record()` used truthiness (`if row.get('auth_proof')`) to include
auth_proof in metadata, which silently dropped an explicit empty dict {}.
Fix: key-presence + is-not-None semantics (`if "auth_proof" in row and
row["auth_proof"] is not None`). Preserves {} unchanged; still drops None
and missing keys.
Tests: 4 new S5cAuthProofParityTests contracts — empty/null/non-empty/absent.
64/64 tests pass (test_ingest_wa_dryrun.py).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
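To make the truthiness bug concrete, here is a small before/after comparison. The two helper functions are hypothetical reductions of the relevant branch in `to_typed_record()`; the rest of that function is not shown on this page.

```python
def metadata_truthiness(row: dict) -> dict:
    """Old behaviour: an explicit empty dict is falsy and gets dropped."""
    metadata = {}
    if row.get("auth_proof"):
        metadata["auth_proof"] = row["auth_proof"]
    return metadata


def metadata_key_presence(row: dict) -> dict:
    """Fixed behaviour: keep anything present and not None, including {}."""
    metadata = {}
    if "auth_proof" in row and row["auth_proof"] is not None:
        metadata["auth_proof"] = row["auth_proof"]
    return metadata


empty_old = metadata_truthiness({"auth_proof": {}})    # {} silently lost
empty_new = metadata_key_presence({"auth_proof": {}})  # {} preserved
null_new = metadata_key_presence({"auth_proof": None})
absent_new = metadata_key_presence({})
filled_new = metadata_key_presence({"auth_proof": {"sig": "abc"}})
```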
Owner
Didn't notice until right now, sorry! It's a lot, and it will take me some time to fully get the whole picture. I'm on it.
itsXactlY added a commit that referenced this pull request on May 10, 2026
BGE-M3 already emits per-token contextual embeddings; we were paying
the forward-pass cost via the shared embed-server but discarding the
token-level outputs. This commit wires them in as a ColBERT-style
late-interaction rerank channel.
* python/colbert_helper.py — singleton BGE-M3 token extractor.
encode_tokens() returns top-K (default 32) by L2 norm in fp16,
L2-normalised so cosine == dot-product. score_late_interaction()
is the max-sim aggregator (per query token take MAX over doc
tokens, sum, divide by Q). GPU-batched when CUDA available with
a numpy fallback. Pack/unpack helpers stamp a 'CB1' magic header.
* python/migrate_colbert_tokens.py — restart-safe batched
backfill for existing memories. Reads via DreamBackend's
streaming helper, persists checkpoints after each batch.
* python/memory_client.py — colbert_tokens BLOB column on the
memories table (idempotent ALTER), set_/get_/stream_ helpers
on SQLiteStore. Recall now exposes enable_colbert + colbert_weight
kwargs; when armed, the top-100 fused candidates are rescored
via late-interaction and the result is folded in as a fusion
channel ('colbert') with default weight per preset
(skynet=1.2, advanced/hybrid=0.5, semantic/lean/trim=0).
Default-off until the operator sets MM_COLBERT_ENABLED=1, so
the cheap-recall path doesn't pay the storage tax (~64 KB/row,
~14.7 GB across a 230k-memory corpus).
Also lifts an FTS5 stopword filter cherry-picked from the upstream
PR #5 brainstorm: the multi-word AND-form was returning 0 BM25
hits on natural-language queries because of scaffolding tokens
("the", "a", "what", etc.). Filter recovers 240/240 sparse hits
on the AE-domain bench while leaving rare-token queries untouched.
* python/postgres_store.py — mirror schema + helpers for the
Pro/Enterprise pgvector backend. Idempotent
ALTER TABLE ... ADD COLUMN IF NOT EXISTS colbert_tokens BYTEA.
Verified on LongMemEval-S 500-question retrieval (470 gradeable):
ColBERT@1.5 lifts R@1 0.8064→0.8574 (+5.10pp), R@5 0.9596→0.9787
(+1.91pp), MRR 0.8733→0.9114 (+3.81pp) over the hybrid baseline,
with three of six question types reaching perfect R@5. p50 latency
cost: +15.8ms (41.1 → 56.9ms).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
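For reference, the max-sim aggregation described in the commit above (per query token take the MAX over doc tokens, sum, divide by Q) can be written in a few lines. This is a pure-Python sketch for readability, not the GPU-batched code in colbert_helper.py; `top_k_by_norm` and `l2_normalize` are hypothetical helper names, and the fp16 packing and 'CB1' header are omitted.

```python
import math


def l2_normalize(vec: list) -> list:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_norm(tokens: list, k: int = 32) -> list:
    """Keep the k token vectors with the largest L2 norm, then normalise them."""
    ranked = sorted(tokens, key=lambda v: sum(x * x for x in v), reverse=True)
    return [l2_normalize(v) for v in ranked[:k]]


def score_late_interaction(q_tokens: list, d_tokens: list) -> float:
    """Max-sim: for each query token take the best doc-token dot product,
    sum those maxima, and divide by the number of query tokens."""
    if not q_tokens or not d_tokens:
        return 0.0
    total = 0.0
    for q in q_tokens:
        total += max(sum(a * b for a, b in zip(q, d)) for d in d_tokens)
    return total / len(q_tokens)


# Toy 2-d vectors: the query has two orthogonal unit tokens.
query = top_k_by_norm([[1.0, 0.0], [0.0, 1.0]])
doc_partial = top_k_by_norm([[1.0, 0.0]])           # matches one query token
doc_full = top_k_by_norm([[1.0, 0.0], [0.0, 2.0]])  # matches both
partial_score = score_late_interaction(query, doc_partial)
full_score = score_late_interaction(query, doc_full)
```

Because the stored vectors are L2-normalised, each dot product is a cosine similarity, so the final score stays in [-1, 1] regardless of how many query tokens there are.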
Heads-up, not a merge request
This PR exists for visibility: putting our work on your radar so you can pick what (if anything) to pull. We're operating from the ernes-toe/neural-memory fork on a private branch and don't expect or need this to be merged as-is. The branch carries 47 commits, much of which is fork-private tooling, but a few pieces of the architecture work may interest you.
What's here, in priority order for upstream interest
1. Phase 7 unified-graph donor-organ upgrade (commits 8f11dbf → 183fdcf)
10 sequenced commits implementing schema + APIs for typed memory kinds, entity edges, bi-temporal validity, locus overlay, FTS5 sparse retrieval, embed-backend registry (incl. BGE-M3), Personalized PageRank graph search, unified salience-weighted continuous scorer, Memify + contradiction hygiene, and governance fields. Per a strategic verdict that resolved "BENCHMARK-high vs preserve-identity" tension by borrowing features from ~12 systems via substrate-compatible mechanisms rather than forking.
2. FTS5 multi-word natural-language fix (commit 2d9b5b9)
The default whitespace-AND tokenization made FTS5 unusable for natural-language queries (~0% sparse hit rate on conversational text). Fixed via tokenize + stopword filter + OR-join + phrase quoting. Goes from 0% → 99% sparse hit rate on a representative AE-domain query set.
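A rough sketch of the tokenize, stopword-filter, OR-join, and quoting steps is below. It is not the code from commit 2d9b5b9: the stopword list is a small made-up sample, `build_fts5_query` is a hypothetical name, and "phrase quoting" is interpreted here as wrapping each surviving token in double quotes so FTS5 treats it as a literal string.

```python
import re

# Hypothetical stopword sample; the real list in the commit is not shown.
STOPWORDS = frozenset(
    {"the", "a", "an", "what", "is", "of", "to", "in", "for", "on", "and", "did", "we"}
)


def build_fts5_query(text: str) -> str:
    """Turn a natural-language question into an OR-joined FTS5 MATCH string."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    if not kept:
        kept = tokens  # query was all stopwords: fall back rather than match nothing
    unique = list(dict.fromkeys(kept))  # de-duplicate, preserving order
    return " OR ".join(f'"{t}"' for t in unique)


match_expr = build_fts5_query("What is the breaker panel for lot 27?")
```

The resulting string would be passed as the right-hand side of a `MATCH` clause. OR-joining trades precision for recall, which is usually acceptable when BM25 ranking and a later fusion stage order the hits.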
3. Self-healing FTS5 entity-row cleanup (commit c2c2321)
Defensive guard for the fact that long-running processes with old code may have polluted the FTS index with kind='entity' rows.
4. Phase 7.5 wiring fixes (commits 7ae40eb → 8d061ef)
While auditing the live DB we found that 8 of 10 Phase 7 features had ZERO production rows despite the schema/APIs being complete — the wiring from caller to scorer was missing. 4 of those 8 closed:
- procedural_score auto-population in remember() + populated in CandidateFeatures from the meta SELECT
- entity_score from mentions_entity edges via batched IN-clause query at hybrid_recall time
- stale_penalty computed from last_reinforced_at / created_at age (linear ramp, capped at 0.3)
- contradiction_penalty from contradicts-edge count (no-op in our DB but wired for future)
Plus an integration test suite (python/test_phase7_5_wiring_integration.py) that varies each feature field independently and asserts the final score moves; this guards against the "DB column populated but call-site never reads it" bug class.
5. Tooling (tools/)
- phase7_audit.py: read-only DB inspection: row counts by kind, edge breakdown, validity coverage, contradiction candidates, locus overlay, FTS5 sync delta, salience distribution, dream_insights bloat metric, Phase 7 feature usage
- post_ingest_sanity.py: 16 retrieval-contract daily health check
- nm_digest.py: single-command DB+repo+process snapshot incl. Phase 7.5 wiring scoreboard
- cleanup_dream_insights.py: dry-run-by-default dedup tool. We surfaced 99.95% duplication in dream_insights (4.3M rows / 1,879 unique) caused by unconditional INSERT in add_insight(). Cleanup tool ships but is held until our companion idempotency-guard fix lands.
What's fork-private and probably NOT for upstream
- tools/ingest_ae_corpus.py: walks our private corpus paths
Real LongMemEval-S empirical numbers (for reference)
BGE-M3 1024d + cross-encoder rerank. 100-record run in flight at submission time.
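Returning to the stale_penalty item under the Phase 7.5 wiring fixes: a linear ramp capped at 0.3 could look like the sketch below. The 365-day ramp length and the function signature are assumptions for illustration; only the cap, the linear shape, and the last_reinforced_at / created_at inputs come from the description.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_PENALTY_CAP = 0.3
RAMP_DAYS = 365.0  # hypothetical: age at which the penalty reaches the cap


def stale_penalty(
    created_at: datetime,
    last_reinforced_at: Optional[datetime],
    now: datetime,
) -> float:
    """Linear ramp from 0 at age 0 to the cap at RAMP_DAYS, flat afterwards."""
    reference = last_reinforced_at or created_at
    age_days = max(0.0, (now - reference).total_seconds() / 86400.0)
    return min(STALE_PENALTY_CAP, STALE_PENALTY_CAP * age_days / RAMP_DAYS)


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
fresh = stale_penalty(now - timedelta(days=900), now, now)        # just reinforced
half = stale_penalty(now - timedelta(days=182.5), None, now)      # mid-ramp
ancient = stale_penalty(now - timedelta(days=2000), None, now)    # past the cap
```

Falling back to created_at when a memory has never been reinforced keeps the penalty defined for every row, which matters for the "column populated but never read" class of bug the integration suite targets.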
Take what you want; ignore what you don't
The commits are atomic and individually mergeable. Happy to break this up into focused PRs against specific changes if any of it interests you. No expectation of merge.