
Garden Pulse

Empathic memory engine for AI companions. Event-level semantic retrieval + typed belief vocabulary + emotion-aware ranking.


Part of the Garden project — therapeutic-grade AI companion infrastructure.


The problem

Most memory engines for LLM companions answer the wrong question:

"What text is most similar to this query?"

Garden Pulse answers the harder one:

"Given who this person is, what moment from their life should surface right now?"

Pure cosine retrieval treats these pairs identically:

| Pair | Cosine says | Empathic companion needs |
| --- | --- | --- |
| "relapsed after 10 years sober" vs "had a beer with a friend" | same-ish | wildly different — one is crisis, one is mundane |
| invitation from a close friend vs invitation from a distant coworker | same-ish | wildly different — relationship weight matters |
| "found a paper on memory retrieval" vs "read a tweet about memory" | same-ish | wildly different — one changes your project |

Garden Pulse disambiguates via typed belief weights, emotional signatures, and recency per belief class.


Architecture

┌─────────────────────────────────────────────────────────┐
│  Ingestion: conversations → typed observations         │
│  triage (Sonnet) → extract (Opus) → graph apply        │
├─────────────────────────────────────────────────────────┤
│  Storage: SQLite + event-level embeddings              │
│  • entities / relations / facts                        │
│  • events with typed belief_class + confidence_floor   │
│  • event_embeddings (text-embedding-3-large, 3072d)    │
├─────────────────────────────────────────────────────────┤
│  Retrieval:                                            │
│                                                         │
│  score = cosine(query, event)                          │
│        × exp(-λ[belief_class] · days_ago)              │
│        (floored at cosine × confidence_floor)          │
│                                                         │
│  5 belief classes (axiom → hypothesis)                 │
│  confidence_floor preserves core wounds                │
│  archivable flag pins defining moments                 │
└─────────────────────────────────────────────────────────┘
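
A minimal sketch of this scoring rule in Python, assuming plain floats for the inputs (the shipped implementation lives in extract/retrieval_v2.py and may differ in detail):

import math

def score_event(cosine: float, days_ago: float,
                lam: float, confidence_floor: float) -> float:
    """Recency-weighted cosine, floored so that weighty beliefs never drop
    below cosine * confidence_floor. `lam` is the per-belief-class decay rate."""
    decayed = cosine * math.exp(-lam * days_ago)
    return max(decayed, cosine * confidence_floor)

# A 2-year-old user_model event (lam = 0.001) with floor = 0.85:
# decay factor exp(-0.73) ~ 0.48, so the floor term wins.
print(score_event(cosine=0.62, days_ago=730, lam=0.001, confidence_floor=0.85))  # 0.527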

LLM provider layer

Chat/extract calls go through an alias-driven model registry, never raw vendor IDs. Aliases are resolved against a RegistryConfig with strict policy validation, then dispatched through provider adapters that share a common Go interface.

┌─────────────────────────────────────────────────────────┐
│  Caller (server, extractor, smoke CLI)                  │
│      │                                                  │
│      ▼                                                  │
│  model.Router.Chat({alias, system, messages, ...})      │
│      │   policy: pro/do_only must be DigitalOcean       │
│      │   policy: api.openai.com forbidden in P0         │
│      │   policy: runtime provider override blocked      │
│      ▼                                                  │
│  Provider adapter                                       │
│   • anthropic        → api.anthropic.com (subscription) │
│   • doinference      → inference.do-ai.run (DO Pro)     │
│   • openaicompat     → local LM Studio / vLLM / proxy   │
└─────────────────────────────────────────────────────────┘

The production server (cmd/pulse) wires the router with the Anthropic provider and the anthropic/opus alias by default — this preserves the existing subscription billing path. To swap models or add a backend, edit examples/models.json (or set PULSE_MODELS_PATH) — no code change required. Secrets stay in environment variables (api_key_env in the registry); JSON config files are validated with DisallowUnknownFields and never carry keys.
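
The registry schema itself is not reproduced in this README. As a purely illustrative sketch of the idea (an alias mapped to a provider adapter plus an api_key_env reference, with the key itself never inlined), a hypothetical entry could look roughly like this; only api_key_env, the alias, and the provider names come from the text above, and every other field name is an assumption:

import json, os

# Hypothetical registry entry, not the actual examples/models.json schema.
entry = {
    "alias": "anthropic/opus",            # what callers pass to the router
    "provider": "anthropic",               # one of: anthropic / doinference / openaicompat
    "api_key_env": "ANTHROPIC_API_KEY",    # secret resolved from the environment at runtime
}
print(json.dumps(entry, indent=2))
print("key present:", os.environ.get(entry["api_key_env"]) is not None)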

Smoke a model end-to-end without touching the server:

go run ./cmd/pulse-smoke --model anthropic/opus --prompt "respond with ok"

Design notes: docs/superpowers/specs/2026-05-03-pulse-multi-provider-model-layer.md.


Benchmark

Empathic Memory Bench v3 (2026-04-24) — Pulse v3 SOTA

8 LLM judges (8 model checkpoints across 6 vendor families) × 35 tests × 5 axes, on the bench v3 corpus:

| System | overall | core | stateful | chain | multi_signal |
| --- | --- | --- | --- | --- | --- |
| cosine | 4.83 | 6.05 | 0.17 | 1.75 | 5.34 |
| bm25 | 3.07 | 4.24 | 0.01 | 1.83 | 1.59 |
| hybrid | 4.43 | 5.65 | 0.29 | 2.58 | 3.64 |
| Pulse v3 | 6.38 | 6.90 | 6.44 | 4.50 | 6.26 |

Delta vs best baseline per axis: overall +1.55 (+32% vs cosine), stateful +6.15 (×22 vs hybrid / ×38 vs cosine), chain +1.92 (+74% vs hybrid), core +0.85 (no regression vs cosine). Opus wins 26/35 tests (74%). Krippendorff α on stateful axis = 0.81 (strong) — cross-judge consensus.

Judges: Moonshot Kimi K2.6 + K2-0711-preview, Z.ai GLM-5 + GLM-5.1, Alibaba Qwen3-Max, DeepSeek V3.2, OpenAI GPT-5.4, Anthropic Claude Opus 4.7 — 8 model checkpoints across 6 vendor families.

External validation (three independent benchmarks)

| benchmark | score | notes |
| --- | --- | --- |
| LongMemEval_S (ICLR 2025, 500 Qs) | 68.89% overall | -3.2 pts vs oracle |
| ES-MemEval (Feb 2026, 1427 Qs) | 76% (1.519/2.0 LLM-judge) | comparable to gpt-4o+RAG |
| LoCoMo (ACL 2024, 1986 Qs × 10 convs) | 32.51% F1, 62.78% adv refusal | cosine + Cohere embed |

These three benchmarks are run with Pulse v2_pure — the cosine-plus-recency baseline that Pulse v3 collapses to when no state / emotion / anchor signals are provided (none of the external datasets carry those fields). On this data v3 == v2_pure by construction, so the external numbers validate Pulse's foundation rather than v3's conditional boosts; those are validated only on bench v3's stateful / chain / multi-signal axes, which is exactly what empathic-memory bench v3 exists for.

See github.com/nikshilov/bench for reproduction scripts and raw JSON.

v2_pure baseline (2026-04-18) — unchanged

On a 47-query empathic subset of the project owner's real personal corpus (85 events, 3-judge cross panel: gpt-4o / gpt-4o-mini / gemini-2.5-flash):

| System | Mean /30 |
| --- | --- |
| Pulse v2_pure | 28.71 ± 1.40 🏆 |
| LangMem | 28.95 ± 1.61 |
| sqlite-vec | 28.82 ± 1.44 |
| LlamaIndex | 28.09 ± 2.86 |
| Mem0 (infer=False) | 21.75 ± 0.61 |

Pulse v2_pure still wins the OpenAI-embedding cluster by +6.96 pts (+33%) over Mem0 on that bench. Key finding at the time: storage format matters more than ranking sophistication — full event text retrieval beats digest/fact extraction on the same embedder. v3 extends v2 with conditional boosts that stack only when their signals genuinely exist.


Pulse v3 (2026-04-24) — conditional multi-signal ranking

v3 wraps v2_pure with five conditional boosts — each term activates only when its input signal exists, so queries without state / emotion / anchor information produce bit-identical results to v2_pure (no regression on plain retrieval).

score = cosine
      × exp(-λ[belief_class, user_flag] · days_ago)   # anchor-aware decay
      × (1 + β · emotion_alignment)    if query_emotion ≥ 0.5   # conditional emotion boost
      × (1 + γ · state_fit)            if body stressed/restored # conditional state boost
      × (1 + δ_anchor · user_flag)     if rank ≤ 8              # anchor-priority
      × (1 + δ_date  · date_proximity) if snapshot_days_ago set  # date-proximity
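
A condensed sketch of that gating in Python; the 0.5 emotion threshold, the rank ≤ 8 anchor gate, and the stressed/restored state gate come from the formula above, while the coefficient values are placeholders rather than the shipped ones:

def score_v3(base: float, *,
             query_emotion_max: float = 0.0, emotion_alignment: float = 0.0,
             body_signal: str | None = None, state_fit: float = 0.0,
             user_flag: bool = False, rank: int = 99,
             snapshot_days_ago: float | None = None, date_proximity: float = 0.0,
             beta: float = 0.3, gamma: float = 0.3,
             delta_anchor: float = 0.2, delta_date: float = 0.1) -> float:
    """`base` is the first two lines of the formula (cosine times anchor-aware
    decay, floored). Each boost multiplies in only when its signal exists, so
    with no signals the result is exactly `base`, i.e. v2_pure behaviour."""
    s = base
    if query_emotion_max >= 0.5:                 # conditional emotion boost
        s *= 1 + beta * emotion_alignment
    if body_signal in ("stressed", "restored"):  # conditional state boost
        s *= 1 + gamma * state_fit
    if user_flag and rank <= 8:                  # anchor-priority boost
        s *= 1 + delta_anchor
    if snapshot_days_ago is not None:            # date-proximity boost
        s *= 1 + delta_date * date_proximity
    return s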

Key ideas:

  • Anchor-aware decay (λ_anchor = 0.001) — events with user_flag=true (structural-truth anchors like marriage, grief, identity events) decay twice as slowly as regular events. Half-life 693d vs 347d. Matches the v2 user_model tier exactly.
  • Conditional gating — Phase D (2026-04-20) proved that an always-on emotion cosine term monotonically hurts retrieval (β=0 → β=3 drops NDCG from 43.77 to 27.72). v3 activates emotion boost only when the query has a dominant emotion (max ≥ 0.5 after query-emotion inference). Same discipline applies to state boost (gated by body-stressed or body-restored signals) and anchor boost (gated by user_flag=true).
  • Emotion-hint query augmentation (Phase 5.5) — when user_state.mood_vector has a dominant emotion, a short hint string is appended to the query before embedding (e.g. "conflict navigation repair" for anger, "wound self-blame rejection" for shame). This is what lifts the stateful axis from 3.60 to 6.60 single-handedly.
  • Date-proximity boost (Phase 5.2) — when user_state.snapshot_days_ago is provided (e.g. from a real Apple Health snapshot), events whose days_ago is close get a small boost via a stepped curve (same day = 1.0, within a week = 0.7, etc.).
  • Chain expansion — if return_chain=True, top-K events are expanded via event_chains table (BFS depth 3) and returned as an ordered sequence rather than a set.

Schema additions (migration 015):

  • event_emotions — Plutchik-10 floats per event (joy, sadness, anger, fear, trust, disgust, anticipation, surprise, shame, guilt)
  • event_chains — parent_id → child_id with strength and kind (for causal/temporal links); traversal sketched after this list
  • query_emotion_cache — inference cache for the emotion classifier
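
A rough sketch of the chain-expansion step over event_chains, assuming exactly the columns listed above (parent_id, child_id); the shipped BFS may order, dedupe, or bound results differently:

import sqlite3
from collections import deque

def expand_chain(con: sqlite3.Connection, seed_ids: list[int], depth: int = 3) -> list[int]:
    """Breadth-first walk from the top-K seed events, following
    parent_id -> child_id links up to `depth` hops, preserving discovery order."""
    seen, order = set(seed_ids), list(seed_ids)
    frontier = deque((eid, 0) for eid in seed_ids)
    while frontier:
        eid, d = frontier.popleft()
        if d >= depth:
            continue
        for (child,) in con.execute(
            "SELECT child_id FROM event_chains WHERE parent_id = ?", (eid,)
        ):
            if child not in seen:
                seen.add(child)
                order.append(child)
                frontier.append((child, d + 1))
    return order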

Source files:

Tests: scripts/tests/test_retrieval_v3.py — includes no-regression property (v3 with no state == v2_pure).


Belief vocabulary

Five typed classes with per-class decay rates (migration 014):

| Class | Decay λ | Half-life | Floor default | Use case |
| --- | --- | --- | --- | --- |
| axiom | 0.0 | ∞ (never decays) | 0.0 | Permanent truths — core wounds, companion identity |
| self_model | 0.0005 | ~1400 days | 0.0 | Companion's introspective facts |
| user_model | 0.001 | ~700 days | 0.0 | User's psychological profile, long-term preferences |
| operational | 0.003 | ~230 days | 0.0 | Day-to-day context (default) |
| hypothesis | 0.005 | ~140 days | 0.0 | Provisional reads awaiting confirmation |

confidence_floor ∈ [0, 1] — minimum post-decay score, expressed as a fraction of the cosine match. A user_model belief with floor=0.85 keeps at least 85% of its semantic-match score and stays salient even at 10+ years old.
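
For intuition: half-life is ln(2)/λ, and the floor takes over once decay falls below it. A quick check with the user_model numbers from the table above:

import math

lam = 0.001                                    # user_model decay rate
print(math.log(2) / lam)                       # half-life ~ 693 days (the "~700" above)

cosine, floor, days_ago = 0.55, 0.85, 3700     # a ~10-year-old belief
decayed = cosine * math.exp(-lam * days_ago)   # ~0.014, almost fully decayed
print(max(decayed, cosine * floor))            # 0.4675, the floor keeps it salient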

archivable: 0 — pins event against consolidation/archival.

provenance — audit trail: memory_pattern / interactive_memory / idle_background / sleep_reflection / manual.


Quickstart

Requirements

  • Python 3.11+
  • Go 1.22+
  • OpenAI API key (for text-embedding-3-large)
  • Anthropic API key (for extraction pipeline: Sonnet triage + Opus extract)

Install

git clone git@github.com:nikshilov/pulse.git
cd pulse

# Go build
go build -o bin/pulse ./cmd/pulse

# Python deps (if using scripts)
pip install openai anthropic

One-line dev workflow (Make)

make build       # compile bin/pulse
make test        # Go (./...) + Python (scripts/tests/)
make run         # start the server on 127.0.0.1:18789
make demo        # ingest -> retrieve end-to-end (see examples/03-end-to-end)
make help        # list all targets

Runnable examples live in examples/ — three minimal Python scripts (stdlib only) demonstrating ingest, retrieval, and a chained end-to-end demo against a running server.

Initialize the graph

# Creates pulse.db with all migrations applied
bin/pulse

Python retrieval API

import sqlite3
from extract.retrieval_v2 import embed_events, retrieve_events

con = sqlite3.connect("pulse.db")

# One-time: backfill embeddings for existing events
embed_events(con, embedder_model="openai-text-embedding-3-large")

# Retrieve top-3 relevant events
events = retrieve_events(
    con,
    query="how are things with my partner today?",
    top_k=3,
    embedder_model="openai-text-embedding-3-large",
)

for e in events:
    print(f"  #{e['id']} ({e['belief_class']}, {e['days_ago']}d ago): {e['text'][:80]}")
    print(f"     score={e['score']:.3f}  cosine={e['cosine']:.3f}  λ={e['effective_lambda']}")

Seed an axiom (event that never decays)

INSERT INTO events (title, description, sentiment, ts, belief_class, confidence_floor, archivable)
VALUES (
  'core-wound',
  'Never been chosen without proving value first. Mother''s conditional worth.',
  -2.0,
  '2020-01-15T00:00:00Z',
  'axiom',     -- no decay
  0.85,        -- preserved even if retrieval finds 0.2 cosine match
  0            -- never archivable
);

Tests

python3 -m pytest scripts/tests/ -q
# 387 passed, 7 skipped

Test coverage includes:

  • Retrieval correctness (cosine, recency, top-k bounds, cross-model isolation)
  • Belief vocabulary (axiom zero-decay, hypothesis fast-decay, floor preservation, CHECK constraints)
  • Graceful fallback on pre-migration-014 databases
  • All 18 migrations applied cleanly
  • E2E extraction pipeline
  • Domain marker coercion + glossary post-process propagation
  • Qwen3 multi-pass extraction (single / two-pass critic / four-pass specialist)
  • Backend dispatch (anthropic-legacy / sonnet-max / local-qwen3) + fallback rate-limit guards

Extraction backend system (Week of 2026-05-06)

After the DigitalOcean billing crisis (2026-05-06), the extraction pipeline moved from Anthropic SDK to a pluggable backend system. All backends are selectable per-run via env or CLI:

# Default: Sonnet 4.6 via Max subscription (free, best quality)
python scripts/pulse_extract.py --db pulse.db --backend sonnet-max

# Local-only: Qwen3-30B-A3B 4bit MLX via mlx-lm
PULSE_EXTRACT_QWEN3_TWO_PASS=true \
  python scripts/pulse_extract.py --db pulse.db --backend local-qwen3

# Per-dimension specialist: 4 sequential Qwen3 calls (entities → R/F/Ev)
PULSE_EXTRACT_QWEN3_SPECIALIST=true \
  python scripts/pulse_extract.py --db pulse.db --backend local-qwen3

# Opt-in Opus quality (free under Max but +67% latency)
PULSE_EXTRACT_MAX_MODEL=opus \
  python scripts/pulse_extract.py --db pulse.db --backend sonnet-max

# Legacy Anthropic SDK path (preserved as rollback)
python scripts/pulse_extract.py --db pulse.db --backend anthropic-legacy --budget 5

Bench results (60 fixtures, see bench repo):

| backend | F1 (lenient) | latency | $ per 10K obs |
| --- | --- | --- | --- |
| sonnet-max + glossary | 0.41 | 30s | $0 (Max plan) |
| opus + glossary | 0.41 (similar) | 50s | $0 (Max plan) |
| local-qwen3 4bit MLX (1-pass) | 0.29 | 14s | $0 |
| local-qwen3 + 2-pass critic | 0.45 (Phase A) | 28s | $0 |
| anthropic-legacy (Sonnet+Opus) | 0.45 | 35s | ~$200 |

Production default: sonnet-max. local-qwen3 for offline / opt-in bulk.

Domain markers (migration 018)

Facts and events carry an explicit domain enum: real | fiction_content | fiction_meta | meta_authorial. This separates real-life observations from authorial work on a fiction project (e.g. the user's autobiographical novel "Соня" / "Sonya"), preventing retrieval from mixing fictional events with real-life decisions.

Powered by extract/pulse_glossary.py — 21 known fictional characters plus book-work regex patterns. The glossary hint is injected into extraction system prompts; a post-processing step propagates domain from fiction-kind entities to their facts/events as a safety net.
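
As an illustration, a caller that should only see real-life material can filter on that marker. This is a hedged sketch: the column is assumed here to be named domain on events, and the migration may expose it differently.

import sqlite3

con = sqlite3.connect("pulse.db")
rows = con.execute(
    "SELECT id, title FROM events WHERE domain = 'real' ORDER BY ts DESC LIMIT 10"
).fetchall()
for event_id, title in rows:
    print(event_id, title)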

Sonya book skeleton ingestion

scripts/pulse_book_skeleton.py ingests Obsidian-vault chapters as lightweight observations (~500 chars per chapter: title, top characters, first-paragraph anchor), avoiding extraction of every scene as graph events (which would roughly quadruple graph size). Full chapter content stays on disk; lazy_load_chapter() reads it on demand when retrieval semantically points to a chapter and the full text is needed.

Standalone graph viewer

scripts/bench_graph_viewer.py --db PULSE.db --out viewer.html produces a self-contained HTML viewer built on vis-network: convex-hull domain-zone overlays, per-entity primary_domain colour-coding, a click-to-detail panel with facts/events/relations/observations, search/filter, and optional cluster-by-domain physics. Used to inspect Pulse memory visually after extraction.


Roadmap

  • v1 — entity-level keyword-BFS retrieval (superseded)
  • v2_pure — event-level semantic retrieval (current production default)
  • Belief vocabulary — migration 014, 5 typed classes
  • v3 emotion + state graph — Plutchik-10 tags, chain table, conditional emotion/state/anchor/date boosts, SOTA on bench v3 (overall +32% vs cosine, stateful ×22 vs hybrid)
  • External validation — LongMemEval_S 68.89%, ES-MemEval 76%, LoCoMo 32.51%
  • Judge-built GT bench at scale — 200+ queries, multi-corpus
  • MCP server — expose retrieve_memory as a tool for any MCP-compatible harness
  • Longitudinal evaluation — track retrieval quality as user's corpus grows over 6+ months

How this compares to other memory engines

| Dimension | Mem0 | Zep | Graphiti | LangMem | sqlite-vec | Garden Pulse |
| --- | --- | --- | --- | --- | --- | --- |
| Storage format | LLM-extracted facts | Messages | Temporal KG | Key-value | Vector only | Full events + typed classes |
| Retrieval | Vector | Vector+graph | Cypher+vector | Vector | Vector | Vector + per-class decay + floor |
| Emotional weight | none | none | none | none | none | built-in (v3 shipped) |
| Stateful retrieval | no | no | no | no | no | yes (mood_vector + body state) |
| Belief types | none | none | none | none | none | 5 typed classes |
| Core-wound preservation | no | no | no | no | no | yes (confidence_floor) |
| Empathic bench |  |  |  |  |  | SOTA on bench v3 (6.38 overall vs cosine 4.83; stateful ×22 vs hybrid / ×38 vs cosine) |

Garden Pulse is purpose-built for personal, emotional memory where events carry weight beyond their semantic content. Other engines are excellent at "find similar text" — Pulse answers "what matters for this person now?".


Contributing

See CONTRIBUTING.md for dev setup, test commands, code style, and PR guidelines. Runnable examples in examples/.

Issue tracker: https://github.com/nikshilov/pulse/issues


License

MIT — see LICENSE.


Built as part of Garden. Maintained by Elle and Nikita Shilov.
