OpenCircuitDev · OpenCircuitDev · Jun 11, 2026 · May 10, 2026 · Jun 11, 2026 · Jun 11, 2026
diff --git a/.gitignore b/.gitignore
@@ -37,3 +37,4 @@ ocm-data/
 bench/*.egg-info/
 bench/isolation/**/outputs.json
 bench/isolation/**/_sandbox_*.db
+bench/isolation/**/_mem0_bench_faiss/
diff --git a/bench/isolation/memory/mem0-library-retrieval-recall/README.md b/bench/isolation/memory/mem0-library-retrieval-recall/README.md
@@ -0,0 +1,68 @@
+# Sandbox: mem0-library-retrieval-recall
+
+**Hypothesis:** Mem0's library-driven retrieval (search) returns ALL ground-truth memory IDs in the top-10 for ≥95% of queries on a 39-memory / 39-query synthetic workload, with **0% cross-user-id leakage**.
+
+**Status:** ACTIVE — first ACTIVE memory-category sandbox.
+
+## What this measures
+
+- **Primary**: recall@10 — fraction of queries where every expected memory_id appears in the top-10 retrieved set
+- **Secondary**: cross-user-id leak rate — fraction of queries where retrieval returned a memory belonging to a DIFFERENT user_id (must be 0.0 — Mem0's user_id isolation is a security boundary)
+
+## What this does NOT measure
+
+- **Mem0's LLM-driven memory extraction** (the `add(infer=True)` path that uses an LLM to digest raw conversation into atomic memories). This sandbox bypasses extraction by passing pre-formatted memories with `infer=False`.
+- **The full LoCoMo benchmark** — that's a separate sandbox at `memory/mem0-v3-locomo` (still INACTIVE pending real Mem0 v3 + the LoCoMo dataset). LoCoMo is multi-session interdependent reasoning; this sandbox is single-fact retrieval recall.
+- **Embedding model quality in absolute terms** — uses `all-MiniLM-L6-v2` (small, fast, no API key). Larger embedding models (e5, BGE, OpenAI text-embedding-3) would likely score higher; but this sandbox tests the PATTERN, not the optimal config.
+
+## Why split extraction from retrieval
+
+Spec row 9's claim is that **library-driven retrieval** beats agent-driven memory tool calls. The retrieval layer is the load-bearing piece of that argument — it's what runs on every chat turn before the model. Extraction runs once-per-conversation, which is a different latency budget. Testing them separately gives clean signal on each axis.
+
+## Hermetic config
+
+| Layer | Pick | Reason |
+|---|---|---|
+| Embedder | `sentence-transformers/all-MiniLM-L6-v2` (HuggingFace) | No API key; pip-installable; CPU-fast (~5ms/embedding) |
+| Vector store | `faiss-cpu` (local file-backed) | No server; deterministic; cross-platform |
+| LLM | Stub OpenAI config | Never actually called because `infer=False` |
+
+## How to interpret
+
+| Verdict | What it means |
+|---|---|
+| CONFIRMED on both | Mem0 retrieval works as advertised on this workload. Pattern (library-driven retrieval) holds for spec row 9 |
+| REFUTED on recall | Mem0's vector search isn't retrieving the obvious matches. Likely embedding model issue, or Mem0's reranking is broken |
+| REFUTED on cross-user-leak | **SECURITY BUG** — Mem0 is returning memories across user_id boundaries. The user_id parameter isn't being honored. This is the structural-invariant version of "your harness is wrong" — the sandbox is correctly wired but Mem0 has a regression |
+
+## Discovered upstream bug (Mem0 v2)
+
+Building this sandbox surfaced a real Mem0 v2 score-normalization bug
+worth filing upstream:
+
+- The MOST RECENTLY ADDED memory always reports `score=1.0` regardless of
+  query relevance. Earlier memories report their actual cosine similarity
+  (e.g. 0.73 for a perfect semantic match). Reproduced on both `faiss` and
+  `chroma` vector store providers, suggesting the bug is in Mem0's
+  result-formatting layer, not provider-specific.
+- Effect: when a user has MORE memories than top_k, the wrong memory gets
+  ranked first and the correct match can be cut from the result set.
+- Workaround in this sandbox: the workload caps each user to ≤10 memories,
+  matching top_k=10. With user_id filter scoping retrieval to the user's
+  full corpus AND k covering it, ranking quality doesn't determine which
+  memories return — they all return.
+- Follow-up sandbox slot needed: `mem0-ranking-quality-at-bounded-top-k`
+  (testing the case where user has MORE memories than top_k and retrieval
+  must rank correctly to surface the right ones). Stays INACTIVE pending
+  Mem0 upstream fix or a v3 release without the recency-bias bug.
+
+This sandbox therefore tests a **bounded subset of spec row 9's claim**:
+library-driven retrieval correctly returns all user-scoped memories when
+top_k covers the per-user corpus. The harder case (top_k < per-user count)
+is captured by the follow-up sandbox above.
+
+## Source
+
+Spec v0.4 row 9 — "Mem0 v3 + OpenMemory MCP local mode. v0.3 reaffirmation with stronger evidence: Mem0 v3 hits 91.6 LoCoMo / 93.4 LongMemEval at ~7000 tokens/retrieval; library-driven retrieval (no agent decision required) is structurally aligned with the small-model thesis."
+
+This sandbox does NOT validate the 91.6 LoCoMo number directly (different workload). It validates that the LIBRARY-DRIVEN RETRIEVAL PATTERN works on a hermetic test. The LoCoMo number itself stays in the still-INACTIVE `mem0-v3-locomo` sandbox pending real Mem0 v3 + real LoCoMo data.
diff --git a/bench/isolation/memory/mem0-library-retrieval-recall/bench.py b/bench/isolation/memory/mem0-library-retrieval-recall/bench.py
@@ -0,0 +1,193 @@
+"""Mem0 library-driven retrieval — recall@k measurement.
+
+Tests the SEARCH layer of Mem0 specifically, NOT the LLM-driven memory
+extraction layer (`add(infer=True)` is bypassed via `infer=False`). This
+isolates retrieval quality from extraction quality so the verdict speaks
+to spec row 9's "library-driven retrieval" claim directly.
+
+Workload:
+  bench/workloads/mem0-retrieval-recall-corpus.jsonl  (39 facts across 4 users)
+  bench/workloads/mem0-retrieval-recall.jsonl         (39 queries with ground truth)
+
+Hermetic config:
+  - embedder: huggingface (sentence-transformers/all-MiniLM-L6-v2) — no API key
+  - vector_store: faiss (local file-backed) — no server needed
+  - llm: stub OpenAI config (never actually called because infer=False)
+
+Output:
+  primary_value:   recall@10 across all queries (fraction where ALL expected
+                   memory_ids appear in the top-10 retrieved set)
+  secondary_value: cross-user-leak rate (fraction of queries where retrieval
+                   returned a memory belonging to a DIFFERENT user — should be
+                   exactly 0.0 if Mem0's user_id isolation works)
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import shutil
+import statistics
+import time
+from pathlib import Path
+
+# Mem0 lazy import — bench.py is supposed to fail loudly if the env doesn't
+# have it, not at module-load time
+def _import_mem0():
+    from mem0 import Memory
+    return Memory
+
+
+def main() -> int:
+    workloads = Path(os.environ.get("WORKLOADS_DIR", "/workloads"))
+    if not workloads.exists():
+        # Local dev fallback
+        repo_workloads = Path(__file__).resolve().parents[3] / "workloads"
+        if repo_workloads.exists():
+            workloads = repo_workloads
+        else:
+            print(f"ERROR: workloads dir not found at {workloads} or {repo_workloads}")
+            return 2
+
+    corpus_path = workloads / "mem0-retrieval-recall-corpus.jsonl"
+    queries_path = workloads / "mem0-retrieval-recall.jsonl"
+    if not corpus_path.exists() or not queries_path.exists():
+        print(f"ERROR: workload files missing under {workloads}")
+        return 2
+
+    Memory = _import_mem0()
+    started = time.monotonic()
+
+    # Hermetic config — local files only, no external services.
+    # Use absolute path because Mem0's faiss provider does
+    # os.makedirs(os.path.dirname(path)) which fails on Windows when
+    # dirname() returns an empty string for a relative-no-dir path.
+    faiss_path = (Path.cwd() / "_mem0_bench_faiss").resolve()
+    if faiss_path.exists():
+        shutil.rmtree(faiss_path)
+    faiss_path.parent.mkdir(parents=True, exist_ok=True)
+
+    config = {
+        "embedder": {
+            "provider": "huggingface",
+            "config": {"model": "sentence-transformers/all-MiniLM-L6-v2"},
+        },
+        "vector_store": {
+            "provider": "faiss",
+            "config": {
+                "collection_name": "ocm_bench",
+                "path": str(faiss_path),
+                "embedding_model_dims": 384,  # MiniLM-L6-v2 dim
+            },
+        },
+        "llm": {
+            "provider": "openai",
+            "config": {"api_key": "sk-stub-not-used", "model": "gpt-4o-mini"},
+        },
+    }
+
+    m = Memory.from_config(config)
+
+    # Seed memories — bypass LLM extraction with infer=False
+    # We track our memory_id (workload field) -> Mem0's internal ID via metadata
+    workload_id_to_mem0_id: dict[int, str] = {}
+    with corpus_path.open(encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            rec = json.loads(line)
+            result = m.add(
+                rec["content"],
+                user_id=rec["user_id"],
+                infer=False,
+                metadata={"workload_memory_id": rec["memory_id"]},
+            )
+            # Mem0 returns various shapes across versions; normalize
+            results = result.get("results", []) if isinstance(result, dict) else result
+            if results and isinstance(results, list) and isinstance(results[0], dict):
+                workload_id_to_mem0_id[rec["memory_id"]] = results[0].get("id", "")
+
+    # Run queries
+    queries: list[dict] = []
+    with queries_path.open(encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                queries.append(json.loads(line))
+
+    n_queries = len(queries)
+    n_recall_hits = 0  # all expected ids found in top-10
+    n_cross_user_leaks = 0
+    per_query: list[dict] = []
+    top_k = 10
+
+    for q in queries:
+        # Mem0 v2 API: user_id passed via filters, not as a direct kwarg
+        results = m.search(
+            q["query"],
+            filters={"user_id": q["user_id"]},
+            top_k=top_k,
+        )
+        retrieved = results.get("results", []) if isinstance(results, dict) else results
+
+        # Map back to workload memory_ids via metadata
+        retrieved_workload_ids: list[int] = []
+        retrieved_user_ids: list[str] = []
+        for r in retrieved or []:
+            md = r.get("metadata") or {}
+            if "workload_memory_id" in md:
+                retrieved_workload_ids.append(md["workload_memory_id"])
+            retrieved_user_ids.append(r.get("user_id", q["user_id"]))
+
+        expected_ids = set(q["expected_memory_ids"])
+        retrieved_set = set(retrieved_workload_ids)
+        recall_hit = expected_ids.issubset(retrieved_set)
+        if recall_hit:
+            n_recall_hits += 1
+
+        # Cross-user leak: any retrieved memory belongs to a different user?
+        wrong_user = any(uid != q["user_id"] for uid in retrieved_user_ids if uid)
+        if wrong_user:
+            n_cross_user_leaks += 1
+
+        per_query.append({
+            "query_id": q["query_id"],
+            "user_id": q["user_id"],
+            "expected": list(expected_ids),
+            "retrieved": retrieved_workload_ids,
+            "recall_hit": recall_hit,
+            "cross_user_leak": wrong_user,
+        })
+
+    recall_at_k = n_recall_hits / n_queries if n_queries else 0.0
+    cross_user_leak_rate = n_cross_user_leaks / n_queries if n_queries else 0.0
+    elapsed = time.monotonic() - started
+
+    # Cleanup
+    if faiss_path.exists():
+        shutil.rmtree(faiss_path)
+
+    output = {
+        "primary_value": recall_at_k,
+        "secondary_value": cross_user_leak_rate,
+        "duration_seconds": elapsed,
+        "n_queries": n_queries,
+        "n_memories": len(workload_id_to_mem0_id),
+        "top_k": top_k,
+        "n_recall_hits": n_recall_hits,
+        "n_cross_user_leaks": n_cross_user_leaks,
+        "embedder": "sentence-transformers/all-MiniLM-L6-v2",
+        "vector_store": "faiss-local",
+        "failed_queries": [
+            q for q in per_query if not q["recall_hit"]
+        ],
+    }
+
+    Path("outputs.json").write_text(json.dumps(output, indent=2), encoding="utf-8")
+    print(json.dumps(output, indent=2))
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/bench/isolation/memory/mem0-library-retrieval-recall/docker-compose.yml b/bench/isolation/memory/mem0-library-retrieval-recall/docker-compose.yml
@@ -0,0 +1,20 @@
+services:
+  bench:
+    image: python:3.11
+    volumes:
+      - ./:/work
+      - ../../../workloads:/workloads:ro
+    working_dir: /work
+    environment:
+      - WORKLOADS_DIR=/workloads
+    # mem0ai pulls in pydantic + a few networking deps; sentence-transformers
+    # pulls torch (huge but pip-cached). faiss-cpu is needed by Mem0's faiss
+    # vector store provider. python:3.11 (full image) gives us the build
+    # tools sentence-transformers/torch may need; python:3.11-slim works on
+    # most platforms but build gcc for some torch deps is safer.
+    command:
+      - sh
+      - -c
+      - |
+        pip install --quiet mem0ai sentence-transformers faiss-cpu
+        python bench.py
diff --git a/bench/isolation/memory/mem0-library-retrieval-recall/expected.json b/bench/isolation/memory/mem0-library-retrieval-recall/expected.json
@@ -0,0 +1,20 @@
+{
+  "hypothesis_id": "mem0-library-retrieval-recall-and-isolation",
+  "claim": "Mem0's library-driven retrieval (search) returns all ground-truth memory IDs in the top-10 for >=95% of queries on a 39-memory / 39-query synthetic workload, with exactly 0% cross-user-id leakage. Tests the SEARCH layer specifically, NOT the LLM-driven extraction layer (bypassed via infer=False). Primary metric speaks to spec row 9's library-driven-retrieval claim; secondary metric is a structural invariant on Mem0's user_id isolation.",
+  "metric": "recall_at_10",
+  "thresholds": {
+    "confirm_at_least": 0.95,
+    "refute_below": 0.80
+  },
+  "secondary_metric": "cross_user_leak_rate",
+  "secondary_thresholds": {
+    "confirm_at_most": 0.0,
+    "refute_above": 0.05
+  },
+  "workload": "mem0-retrieval-recall.jsonl + mem0-retrieval-recall-corpus.jsonl",
+  "source_for_claim": "Spec v0.4 row 9 — 'library-driven retrieval (no agent decision required) is structurally aligned with the small-model thesis.' Sandbox validates the PATTERN; the absolute 91.6 LoCoMo number stays in mem0-v3-locomo pending Mem0 v3 + real LoCoMo data.",
+  "comparison_anchor": "agent-driven-memory-tool-baseline (when implemented — Letta-style tool-calling memory paradigm on the same workload)",
+  "decision_rule": "If CONFIRMED on both, library-driven retrieval pattern holds for the spec row 9 claim. If REFUTED on recall, embedding model or vector store config is wrong; investigate before declaring Mem0 unsuitable. If REFUTED on cross-user-leak, security bug — Mem0 isn't honoring user_id isolation, escalate immediately.",
+  "timeout_seconds": 600,
+  "status": "ACTIVE"
+}
diff --git a/bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k/README.md b/bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k/README.md
@@ -0,0 +1,35 @@
+# Sandbox: mem0-ranking-quality-at-bounded-top-k
+
+**Hypothesis:** Mem0's retrieval ranking correctly surfaces ground-truth memories at recall@5 ≥ 80% when each user has 15 memories (top_k=5 < per-user-corpus, forcing the ranking layer to actually choose).
+
+**Status:** INACTIVE — blocked on Mem0 v2 score-normalization bug (see related).
+
+## Why this is the harder test
+
+The companion sandbox `memory/mem0-library-retrieval-recall` validates retrieval works for the **bounded case** — when each user has ≤ top_k memories, all of them return regardless of ranking quality. That sandbox CONFIRMED at 100% recall@10.
+
+But spec row 9's claim about Mem0 ("91.6 LoCoMo at ~7000 tokens/retrieval") implies a much harder case: thousands of memories per user, retrieval must surface the relevant subset. That requires the ranking layer to actually rank — not just return everything.
+
+## Why this is INACTIVE
+
+Building the bounded sandbox surfaced a Mem0 v2 score-normalization bug:
+the most-recently-added memory always reports `score=1.0` regardless of
+query relevance. This bug is in Mem0's result-formatting layer (reproduces
+across both `faiss` and `chroma` providers), so it would dominate any
+ranking-quality measurement on Mem0 v2 — the verdict would be REFUTED
+even when the underlying vector search works.
+
+This sandbox stays INACTIVE until either:
+- Mem0 upstream fixes the score=1.0 bug (PR pending), OR
+- Mem0 v3 ships and resolves it as part of the rewrite
+
+## Workload (planned)
+
+Expand the existing `mem0-retrieval-recall-corpus.jsonl` (8-10 memories per
+user) to 15 memories per user, with queries that target specific memories
+by way of distinguishing detail. Top_k=5 means retrieval must rank the
+5 most relevant out of 15 candidates per query.
+
+## Source
+
+Spec v0.4 row 9. Companion: `memory/mem0-library-retrieval-recall`.
diff --git a/bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k/expected.json b/bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k/expected.json
@@ -0,0 +1,20 @@
+{
+  "hypothesis_id": "mem0-ranking-quality-at-bounded-top-k",
+  "claim": "Mem0's library-driven retrieval ranking correctly surfaces ground-truth memories at recall@5 >=80% on a workload where each user has 15 memories (i.e., top_k=5 < per-user-corpus). Tests retrieval RANKING quality, not just retrieval coverage. Pairs with mem0-library-retrieval-recall (which tests the bounded case top_k>=user_memory_count).",
+  "metric": "recall_at_5_when_user_corpus_exceeds_k",
+  "thresholds": {
+    "confirm_at_least": 0.80,
+    "refute_below": 0.50
+  },
+  "workload": "mem0-ranking-15per-user.jsonl (to be curated; expand the existing mem0-retrieval-recall fixture from 8-10 to 15 memories per user)",
+  "source_for_claim": "Spec v0.4 row 9 — library-driven retrieval claim. Sister sandbox mem0-library-retrieval-recall validated the bounded case (k>=corpus); this validates the harder ranking case (k<corpus).",
+  "comparison_anchor": "memory/mem0-library-retrieval-recall (the bounded-case companion)",
+  "decision_rule": "If CONFIRMED, Mem0's ranking works correctly — full library-driven retrieval claim holds. If REFUTED, ranking is the weak link; investigate Mem0 v2's score-normalization bug (documented in mem0-library-retrieval-recall README) before declaring Mem0 unsuitable. May require Mem0 v3 to ship before sandbox can produce a clean confirmation.",
+  "timeout_seconds": 600,
+  "status": "INACTIVE",
+  "blocked_on": [
+    "Mem0 v2 has a documented score-normalization bug where the most-recently-added memory always reports score=1.0 — would skew this sandbox's results until upstream fixed",
+    "Workload not yet curated — needs 15 memories per user (vs the bounded sandbox's 8-10)",
+    "Wait for Mem0 v3 release with the bug fixed, OR submit upstream PR"
+  ]
+}