Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -37,3 +37,4 @@ ocm-data/
bench/*.egg-info/
bench/isolation/**/outputs.json
bench/isolation/**/_sandbox_*.db
bench/isolation/**/_mem0_bench_faiss/
68 changes: 68 additions & 0 deletions bench/isolation/memory/mem0-library-retrieval-recall/README.md
@@ -0,0 +1,68 @@
# Sandbox: mem0-library-retrieval-recall

**Hypothesis:** Mem0's library-driven retrieval (search) returns ALL ground-truth memory IDs in the top-10 for ≥95% of queries on a 39-memory / 39-query synthetic workload, with **0% cross-user-id leakage**.

**Status:** ACTIVE (first ACTIVE memory-category sandbox).

## What this measures

- **Primary**: recall@10, the fraction of queries where every expected memory_id appears in the top-10 retrieved set
- **Secondary**: cross-user-id leak rate, the fraction of queries where retrieval returned a memory belonging to a DIFFERENT user_id (must be 0.0, because Mem0's user_id isolation is a security boundary)
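
As a minimal sketch of how these two metrics can be computed, the snippet below applies the definitions above to toy data. The helper names (`recall_hit`, `has_cross_user_leak`, `rate`) and the two sample queries are hypothetical and are not taken from `bench.py`; only the definitions come from this README.

```python
from typing import Iterable


def recall_hit(expected_ids: Iterable[int], retrieved_ids: Iterable[int]) -> bool:
    """True only if EVERY expected memory_id is in the retrieved set."""
    return set(expected_ids).issubset(retrieved_ids)


def has_cross_user_leak(query_user_id: str, retrieved_user_ids: Iterable[str]) -> bool:
    """True if any retrieved memory belongs to a different user."""
    return any(uid != query_user_id for uid in retrieved_user_ids)


def rate(flags: list[bool]) -> float:
    """Fraction of True flags; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical toy data: two queries with their retrieval results.
queries = [
    {"user_id": "alice", "expected": [1, 2], "retrieved": [2, 1, 3],
     "retrieved_users": ["alice", "alice", "alice"]},
    {"user_id": "bob", "expected": [7], "retrieved": [8, 9],
     "retrieved_users": ["bob", "alice"]},
]

recall_at_10 = rate([recall_hit(q["expected"], q["retrieved"]) for q in queries])
leak_rate = rate([has_cross_user_leak(q["user_id"], q["retrieved_users"]) for q in queries])
print(recall_at_10, leak_rate)  # 0.5 0.5
```

Note that recall here is all-or-nothing per query: a query expecting two memories and retrieving only one counts as a miss.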

## What this does NOT measure

- **Mem0's LLM-driven memory extraction** (the `add(infer=True)` path that uses an LLM to digest raw conversation into atomic memories). This sandbox bypasses extraction by passing pre-formatted memories with `infer=False`.
- **The full LoCoMo benchmark**: that's a separate sandbox at `memory/mem0-v3-locomo` (still INACTIVE pending real Mem0 v3 + the LoCoMo dataset). LoCoMo is multi-session interdependent reasoning; this sandbox is single-fact retrieval recall.
- **Embedding model quality in absolute terms**: the sandbox uses `all-MiniLM-L6-v2` (small, fast, no API key). Larger embedding models (e5, BGE, OpenAI text-embedding-3) would likely score higher, but this sandbox tests the PATTERN, not the optimal config.

## Why split extraction from retrieval

Spec row 9's claim is that **library-driven retrieval** beats agent-driven memory tool calls. The retrieval layer is the load-bearing piece of that argument: it runs on every chat turn before the model. Extraction runs once per conversation, which is a different latency budget. Testing them separately gives a clean signal on each axis.

## Hermetic config

| Layer | Pick | Reason |
|---|---|---|
| Embedder | `sentence-transformers/all-MiniLM-L6-v2` (HuggingFace) | No API key; pip-installable; CPU-fast (~5ms/embedding) |
| Vector store | `faiss-cpu` (local file-backed) | No server; deterministic; cross-platform |
| LLM | Stub OpenAI config | Never actually called because `infer=False` |

## How to interpret

| Verdict | What it means |
|---|---|
| CONFIRMED on both | Mem0 retrieval works as advertised on this workload. Pattern (library-driven retrieval) holds for spec row 9 |
| REFUTED on recall | Mem0's vector search isn't retrieving the obvious matches. Likely embedding model issue, or Mem0's reranking is broken |
| REFUTED on cross-user-leak | **SECURITY BUG**: Mem0 is returning memories across user_id boundaries. The user_id parameter isn't being honored. This is the structural-invariant version of "your harness is wrong": the sandbox is correctly wired but Mem0 has a regression |
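
To make the verdicts concrete, here is a rough sketch of how a measured value could be mapped to a verdict using the thresholds in this sandbox's `expected.json` (0.95 / 0.80 for recall, 0.0 / 0.05 for leak rate). The function names are hypothetical, and the `INCONCLUSIVE` label for values between the two thresholds is an assumption: the source does not say what the harness reports in that band.

```python
def primary_verdict(recall: float,
                    confirm_at_least: float = 0.95,
                    refute_below: float = 0.80) -> str:
    """Classify recall@10 against the confirm/refute thresholds."""
    if recall >= confirm_at_least:
        return "CONFIRMED"
    if recall < refute_below:
        return "REFUTED"
    return "INCONCLUSIVE"  # assumed label for the band in between


def secondary_verdict(leak_rate: float,
                      confirm_at_most: float = 0.0,
                      refute_above: float = 0.05) -> str:
    """Classify the cross-user leak rate against its thresholds."""
    if leak_rate <= confirm_at_most:
        return "CONFIRMED"
    if leak_rate > refute_above:
        return "REFUTED"
    return "INCONCLUSIVE"  # assumed label for the band in between


# With 39 queries, 0.95 * 39 = 37.05, so at least 38 hits are needed to confirm.
print(primary_verdict(38 / 39))   # CONFIRMED
print(primary_verdict(37 / 39))   # INCONCLUSIVE
print(secondary_verdict(1 / 39))  # INCONCLUSIVE
print(secondary_verdict(2 / 39))  # REFUTED
```

On this 39-query workload a single leaking query (about 2.6%) already prevents confirmation, and two (about 5.1%) cross the refute threshold.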

## Discovered upstream bug (Mem0 v2)

Building this sandbox surfaced a real Mem0 v2 score-normalization bug
worth filing upstream:

- The MOST RECENTLY ADDED memory always reports `score=1.0` regardless of
query relevance. Earlier memories report their actual cosine similarity
(e.g. 0.73 for a perfect semantic match). Reproduced on both `faiss` and
`chroma` vector store providers, suggesting the bug is in Mem0's
result-formatting layer, not provider-specific.
- Effect: when a user has MORE memories than top_k, the wrong memory gets
ranked first and the correct match can be cut from the result set.
- Workaround in this sandbox: the workload caps each user to ≤10 memories,
matching top_k=10. With user_id filter scoping retrieval to the user's
full corpus AND k covering it, ranking quality doesn't determine which
memories return; they all return.
- Follow-up sandbox slot needed: `mem0-ranking-quality-at-bounded-top-k`
(testing the case where user has MORE memories than top_k and retrieval
must rank correctly to surface the right ones). Stays INACTIVE pending
Mem0 upstream fix or a v3 release without the recency-bias bug.

This sandbox therefore tests a **bounded subset of spec row 9's claim**:
library-driven retrieval correctly returns all user-scoped memories when
top_k covers the per-user corpus. The harder case (top_k < per-user count)
is captured by the follow-up sandbox above.
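
The effect of the bug and the reason the workaround holds can be illustrated with a small simulation. This is a sketch that mimics the symptom described above in plain Python; it does not call Mem0, and the helper names and similarity scores are made up for illustration.

```python
def top_k_ids(scored: list[tuple[int, float]], k: int) -> list[int]:
    """Return the ids of the k highest-scoring (memory_id, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [mem_id for mem_id, _ in ranked[:k]]


def with_recency_bug(scored: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Mimic the reported symptom: the last-added memory reports score=1.0."""
    if not scored:
        return []
    last_id, _ = scored[-1]
    return scored[:-1] + [(last_id, 1.0)]


# Hypothetical per-user corpus in insertion order: (memory_id, true similarity).
# Memories 1 and 2 are the ground truth; memory 4 is the most recent and irrelevant.
scored = [(1, 0.73), (2, 0.41), (3, 0.12), (4, 0.05)]
expected = {1, 2}

honest = top_k_ids(scored, k=2)
buggy = top_k_ids(with_recency_bug(scored), k=2)
covering = top_k_ids(with_recency_bug(scored), k=4)

print(honest)    # [1, 2]
print(buggy)     # [4, 1]
print(covering)  # [4, 1, 2, 3]
print(expected <= set(buggy), expected <= set(covering))  # False True
```

With k smaller than the corpus, the inflated score pushes memory 2 out of the result set. Once k covers the whole per-user corpus, the order is still wrong but the all-or-nothing recall check passes, which is the bounded case this sandbox measures.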

## Source

Spec v0.4 row 9: "Mem0 v3 + OpenMemory MCP local mode. v0.3 reaffirmation with stronger evidence: Mem0 v3 hits 91.6 LoCoMo / 93.4 LongMemEval at ~7000 tokens/retrieval; library-driven retrieval (no agent decision required) is structurally aligned with the small-model thesis."

This sandbox does NOT validate the 91.6 LoCoMo number directly (different workload). It validates that the LIBRARY-DRIVEN RETRIEVAL PATTERN works on a hermetic test. The LoCoMo number itself stays in the still-INACTIVE `mem0-v3-locomo` sandbox pending real Mem0 v3 + real LoCoMo data.
193 changes: 193 additions & 0 deletions bench/isolation/memory/mem0-library-retrieval-recall/bench.py
@@ -0,0 +1,193 @@
"""Mem0 library-driven retrieval: recall@k measurement.

Tests the SEARCH layer of Mem0 specifically, NOT the LLM-driven memory
extraction layer (`add(infer=True)` is bypassed via `infer=False`). This
isolates retrieval quality from extraction quality so the verdict speaks
to spec row 9's "library-driven retrieval" claim directly.

Workload:
  bench/workloads/mem0-retrieval-recall-corpus.jsonl  (39 facts across 4 users)
  bench/workloads/mem0-retrieval-recall.jsonl  (39 queries with ground truth)

Hermetic config:
  - embedder: huggingface (sentence-transformers/all-MiniLM-L6-v2), no API key
  - vector_store: faiss (local file-backed), no server needed
  - llm: stub OpenAI config (never actually called because infer=False)

Output:
  primary_value: recall@10 across all queries (fraction where ALL expected
                 memory_ids appear in the top-10 retrieved set)
  secondary_value: cross-user-leak rate (fraction of queries where retrieval
                   returned a memory belonging to a DIFFERENT user; should be
                   exactly 0.0 if Mem0's user_id isolation works)
"""

from __future__ import annotations

import json
import os
import shutil
import statistics
import time
from pathlib import Path

# Mem0 lazy import: bench.py is supposed to fail loudly if the env doesn't
# have it, not at module-load time
def _import_mem0():
    from mem0 import Memory
    return Memory


def main() -> int:
    workloads = Path(os.environ.get("WORKLOADS_DIR", "/workloads"))
    if not workloads.exists():
        # Local dev fallback
        repo_workloads = Path(__file__).resolve().parents[3] / "workloads"
        if repo_workloads.exists():
            workloads = repo_workloads
        else:
            print(f"ERROR: workloads dir not found at {workloads} or {repo_workloads}")
            return 2

    corpus_path = workloads / "mem0-retrieval-recall-corpus.jsonl"
    queries_path = workloads / "mem0-retrieval-recall.jsonl"
    if not corpus_path.exists() or not queries_path.exists():
        print(f"ERROR: workload files missing under {workloads}")
        return 2

    Memory = _import_mem0()
    started = time.monotonic()

    # Hermetic config: local files only, no external services.
    # Use absolute path because Mem0's faiss provider does
    # os.makedirs(os.path.dirname(path)) which fails on Windows when
    # dirname() returns an empty string for a relative-no-dir path.
    faiss_path = (Path.cwd() / "_mem0_bench_faiss").resolve()
    if faiss_path.exists():
        shutil.rmtree(faiss_path)
    faiss_path.parent.mkdir(parents=True, exist_ok=True)

    config = {
        "embedder": {
            "provider": "huggingface",
            "config": {"model": "sentence-transformers/all-MiniLM-L6-v2"},
        },
        "vector_store": {
            "provider": "faiss",
            "config": {
                "collection_name": "ocm_bench",
                "path": str(faiss_path),
                "embedding_model_dims": 384,  # MiniLM-L6-v2 dim
            },
        },
        "llm": {
            "provider": "openai",
            "config": {"api_key": "sk-stub-not-used", "model": "gpt-4o-mini"},
        },
    }

    m = Memory.from_config(config)

    # Seed memories: bypass LLM extraction with infer=False
    # We track our memory_id (workload field) -> Mem0's internal ID via metadata
    workload_id_to_mem0_id: dict[int, str] = {}
    with corpus_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            result = m.add(
                rec["content"],
                user_id=rec["user_id"],
                infer=False,
                metadata={"workload_memory_id": rec["memory_id"]},
            )
            # Mem0 returns various shapes across versions; normalize
            results = result.get("results", []) if isinstance(result, dict) else result
            if results and isinstance(results, list) and isinstance(results[0], dict):
                workload_id_to_mem0_id[rec["memory_id"]] = results[0].get("id", "")

    # Run queries
    queries: list[dict] = []
    with queries_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                queries.append(json.loads(line))

    n_queries = len(queries)
    n_recall_hits = 0  # all expected ids found in top-10
    n_cross_user_leaks = 0
    per_query: list[dict] = []
    top_k = 10

    for q in queries:
        # Mem0 v2 API: user_id passed via filters, not as a direct kwarg
        results = m.search(
            q["query"],
            filters={"user_id": q["user_id"]},
            top_k=top_k,
        )
        retrieved = results.get("results", []) if isinstance(results, dict) else results

        # Map back to workload memory_ids via metadata
        retrieved_workload_ids: list[int] = []
        retrieved_user_ids: list[str] = []
        for r in retrieved or []:
            md = r.get("metadata") or {}
            if "workload_memory_id" in md:
                retrieved_workload_ids.append(md["workload_memory_id"])
            retrieved_user_ids.append(r.get("user_id", q["user_id"]))

        expected_ids = set(q["expected_memory_ids"])
        retrieved_set = set(retrieved_workload_ids)
        recall_hit = expected_ids.issubset(retrieved_set)
        if recall_hit:
            n_recall_hits += 1

        # Cross-user leak: any retrieved memory belongs to a different user?
        wrong_user = any(uid != q["user_id"] for uid in retrieved_user_ids if uid)
        if wrong_user:
            n_cross_user_leaks += 1

        per_query.append({
            "query_id": q["query_id"],
            "user_id": q["user_id"],
            "expected": list(expected_ids),
            "retrieved": retrieved_workload_ids,
            "recall_hit": recall_hit,
            "cross_user_leak": wrong_user,
        })

    recall_at_k = n_recall_hits / n_queries if n_queries else 0.0
    cross_user_leak_rate = n_cross_user_leaks / n_queries if n_queries else 0.0
    elapsed = time.monotonic() - started

    # Cleanup
    if faiss_path.exists():
        shutil.rmtree(faiss_path)

    output = {
        "primary_value": recall_at_k,
        "secondary_value": cross_user_leak_rate,
        "duration_seconds": elapsed,
        "n_queries": n_queries,
        "n_memories": len(workload_id_to_mem0_id),
        "top_k": top_k,
        "n_recall_hits": n_recall_hits,
        "n_cross_user_leaks": n_cross_user_leaks,
        "embedder": "sentence-transformers/all-MiniLM-L6-v2",
        "vector_store": "faiss-local",
        "failed_queries": [
            q for q in per_query if not q["recall_hit"]
        ],
    }

    Path("outputs.json").write_text(json.dumps(output, indent=2), encoding="utf-8")
    print(json.dumps(output, indent=2))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,20 @@
services:
  bench:
    image: python:3.11
    volumes:
      - ./:/work
      - ../../../workloads:/workloads:ro
    working_dir: /work
    environment:
      - WORKLOADS_DIR=/workloads
    # mem0ai pulls in pydantic + a few networking deps; sentence-transformers
    # pulls torch (huge but pip-cached). faiss-cpu is needed by Mem0's faiss
    # vector store provider. python:3.11 (full image) gives us the build
    # tools sentence-transformers/torch may need; python:3.11-slim works on
    # most platforms, but having gcc available for some torch deps is safer.
    command:
      - sh
      - -c
      - |
        pip install --quiet mem0ai sentence-transformers faiss-cpu
        python bench.py
20 changes: 20 additions & 0 deletions bench/isolation/memory/mem0-library-retrieval-recall/expected.json
@@ -0,0 +1,20 @@
{
  "hypothesis_id": "mem0-library-retrieval-recall-and-isolation",
  "claim": "Mem0's library-driven retrieval (search) returns all ground-truth memory IDs in the top-10 for >=95% of queries on a 39-memory / 39-query synthetic workload, with exactly 0% cross-user-id leakage. Tests the SEARCH layer specifically, NOT the LLM-driven extraction layer (bypassed via infer=False). Primary metric speaks to spec row 9's library-driven-retrieval claim; secondary metric is a structural invariant on Mem0's user_id isolation.",
  "metric": "recall_at_10",
  "thresholds": {
    "confirm_at_least": 0.95,
    "refute_below": 0.80
  },
  "secondary_metric": "cross_user_leak_rate",
  "secondary_thresholds": {
    "confirm_at_most": 0.0,
    "refute_above": 0.05
  },
  "workload": "mem0-retrieval-recall.jsonl + mem0-retrieval-recall-corpus.jsonl",
  "source_for_claim": "Spec v0.4 row 9: 'library-driven retrieval (no agent decision required) is structurally aligned with the small-model thesis.' Sandbox validates the PATTERN; the absolute 91.6 LoCoMo number stays in mem0-v3-locomo pending Mem0 v3 + real LoCoMo data.",
  "comparison_anchor": "agent-driven-memory-tool-baseline (when implemented; Letta-style tool-calling memory paradigm on the same workload)",
  "decision_rule": "If CONFIRMED on both, library-driven retrieval pattern holds for the spec row 9 claim. If REFUTED on recall, embedding model or vector store config is wrong; investigate before declaring Mem0 unsuitable. If REFUTED on cross-user-leak, security bug: Mem0 isn't honoring user_id isolation, escalate immediately.",
  "timeout_seconds": 600,
  "status": "ACTIVE"
}
@@ -0,0 +1,35 @@
# Sandbox: mem0-ranking-quality-at-bounded-top-k

**Hypothesis:** Mem0's retrieval ranking correctly surfaces ground-truth memories at recall@5 ≥ 80% when each user has 15 memories (top_k=5 < per-user-corpus, forcing the ranking layer to actually choose).

**Status:** INACTIVE, blocked on Mem0 v2 score-normalization bug (see related).

## Why this is the harder test

The companion sandbox `memory/mem0-library-retrieval-recall` validates that retrieval works for the **bounded case**: when each user has ≤ top_k memories, all of them return regardless of ranking quality. That sandbox CONFIRMED at 100% recall@10.

But spec row 9's claim about Mem0 ("91.6 LoCoMo at ~7000 tokens/retrieval") implies a much harder case: with thousands of memories per user, retrieval must surface the relevant subset. That requires the ranking layer to actually rank, not just return everything.

## Why this is INACTIVE

Building the bounded sandbox surfaced a Mem0 v2 score-normalization bug:
the most-recently-added memory always reports `score=1.0` regardless of
query relevance. This bug is in Mem0's result-formatting layer (reproduces
across both `faiss` and `chroma` providers), so it would dominate any
ranking-quality measurement on Mem0 v2: the verdict would be REFUTED
even when the underlying vector search works.

This sandbox stays INACTIVE until either:
- Mem0 upstream fixes the score=1.0 bug (PR pending), OR
- Mem0 v3 ships and resolves it as part of the rewrite

## Workload (planned)

Expand the existing `mem0-retrieval-recall-corpus.jsonl` (8-10 memories per
user) to 15 memories per user, with queries that target specific memories
by way of distinguishing detail. Top_k=5 means retrieval must rank the
5 most relevant out of 15 candidates per query.
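
One way to sanity-check such a fixture before activating the sandbox is sketched below. It assumes the planned files keep the field names that the companion `bench.py` reads (`memory_id`, `user_id`, `content`, `query_id`, `query`, `expected_memory_ids`); the `validate_workload` helper and the sample records are hypothetical.

```python
import json
from collections import Counter


def validate_workload(corpus_lines, query_lines, per_user=15, top_k=5):
    """Return a list of problems; an empty list means the fixture looks sane."""
    corpus = [json.loads(line) for line in corpus_lines if line.strip()]
    queries = [json.loads(line) for line in query_lines if line.strip()]
    problems = []

    # Every user should have exactly `per_user` memories.
    counts = Counter(rec["user_id"] for rec in corpus)
    for user, n in sorted(counts.items()):
        if n != per_user:
            problems.append(f"{user}: {n} memories, expected {per_user}")

    # Every expected memory must exist, belong to the querying user,
    # and the expected set must be able to fit inside top_k.
    owner = {rec["memory_id"]: rec["user_id"] for rec in corpus}
    for q in queries:
        expected = q["expected_memory_ids"]
        if len(expected) > top_k:
            problems.append(f"query {q['query_id']}: {len(expected)} expected ids exceed top_k={top_k}")
        for mem_id in expected:
            if owner.get(mem_id) != q["user_id"]:
                problems.append(f"query {q['query_id']}: memory {mem_id} is not owned by {q['user_id']}")
    return problems


# Hypothetical fixture: u1 gets 15 memories (ids 0-14), u2 only 14 (ids 15-28).
corpus_lines = [
    json.dumps({"memory_id": i, "user_id": "u1" if i < 15 else "u2", "content": f"fact {i}"})
    for i in range(29)
]
query_lines = [
    json.dumps({"query_id": "q1", "user_id": "u1", "query": "fact 3?", "expected_memory_ids": [3]}),
    json.dumps({"query_id": "q2", "user_id": "u1", "query": "fact 20?", "expected_memory_ids": [20]}),
]

problems = validate_workload(corpus_lines, query_lines)
for p in problems:
    print(p)
```

On this toy fixture the check reports the short corpus for `u2` and the query whose expected memory belongs to another user.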

## Source

Spec v0.4 row 9. Companion: `memory/mem0-library-retrieval-recall`.
@@ -0,0 +1,20 @@
{
  "hypothesis_id": "mem0-ranking-quality-at-bounded-top-k",
  "claim": "Mem0's library-driven retrieval ranking correctly surfaces ground-truth memories at recall@5 >=80% on a workload where each user has 15 memories (i.e., top_k=5 < per-user-corpus). Tests retrieval RANKING quality, not just retrieval coverage. Pairs with mem0-library-retrieval-recall (which tests the bounded case top_k>=user_memory_count).",
  "metric": "recall_at_5_when_user_corpus_exceeds_k",
  "thresholds": {
    "confirm_at_least": 0.80,
    "refute_below": 0.50
  },
  "workload": "mem0-ranking-15per-user.jsonl (to be curated; expand the existing mem0-retrieval-recall fixture from 8-10 to 15 memories per user)",
  "source_for_claim": "Spec v0.4 row 9, library-driven retrieval claim. Sister sandbox mem0-library-retrieval-recall validated the bounded case (k>=corpus); this validates the harder ranking case (k<corpus).",
  "comparison_anchor": "memory/mem0-library-retrieval-recall (the bounded-case companion)",
  "decision_rule": "If CONFIRMED, Mem0's ranking works correctly and the full library-driven retrieval claim holds. If REFUTED, ranking is the weak link; investigate Mem0 v2's score-normalization bug (documented in mem0-library-retrieval-recall README) before declaring Mem0 unsuitable. May require Mem0 v3 to ship before sandbox can produce a clean confirmation.",
  "timeout_seconds": 600,
  "status": "INACTIVE",
  "blocked_on": [
    "Mem0 v2 has a documented score-normalization bug where the most-recently-added memory always reports score=1.0; this would skew the sandbox's results until fixed upstream",
    "Workload not yet curated: needs 15 memories per user (vs the bounded sandbox's 8-10)",
    "Wait for Mem0 v3 release with the bug fixed, OR submit upstream PR"
  ]
}