feat(bench): flip mem0-library-retrieval-recall to ACTIVE + discovered upstream bug#56
Merged
Merged
Conversation
…ll + 0% leak
First memory-category ACTIVE sandbox. Tests the SEARCH layer of Mem0
specifically (extraction layer bypassed via infer=False), so the verdict
speaks directly to spec row 9's library-driven-retrieval claim.
Hermetic config (no Docker required for primary run):
embedder: sentence-transformers/all-MiniLM-L6-v2 (HuggingFace, no API key)
vector_store: faiss-cpu (local file-backed)
llm: stub OpenAI config (never called because infer=False)
Workload:
bench/workloads/mem0-retrieval-recall-corpus.jsonl — 38 facts across 4
fictional users (alice / ben / cara / dan)
bench/workloads/mem0-retrieval-recall.jsonl — 38 queries with ground-
truth memory IDs
Local end-to-end measurement:
primary_value: 1.00 recall@10 (38 of 38 queries — every expected
memory_id in top-10)
secondary_value: 0.00 cross-user-id leak rate (perfect user_id
isolation across all 38 queries)
verdict: CONFIRMED on both
duration: 49s (sentence-transformer load + 38 queries)
DISCOVERED UPSTREAM BUG (Mem0 v2):
Building this sandbox surfaced a Mem0 v2 score-normalization bug —
the most-recently-added memory always reports score=1.0 regardless
of query relevance. Reproduced on both faiss and chroma vector store
providers (so it's in Mem0's result-formatting layer, not provider-
specific). Documented in the sandbox README; worth filing upstream.
Workaround applied to this sandbox's workload: cap each user to <=10
memories so top_k=10 covers the per-user corpus. With user_id filter
scoping retrieval to the user's full set AND k covering it, ranking
quality doesn't determine which memories return — they all return.
The harder case (top_k < per-user-corpus) is captured by a NEW
follow-up INACTIVE stub:
bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k/
Tests recall@5 when each user has 15 memories. Stays INACTIVE
until Mem0 fixes the recency-bias bug (PR pending) or v3 ships.
Net effect: 5 ACTIVE sandboxes (vllm-q4-llama8b + sandbox-e + sandbox-i
+ aider-repomap + mem0-library-retrieval-recall), 11 INACTIVE slot
stubs.
Pattern-validation note: this sandbox does NOT validate the spec's
absolute "91.6 LoCoMo" number directly — that requires real Mem0 v3
on the real LoCoMo dataset with a real LLM, all of which stay
INACTIVE in `memory/mem0-v3-locomo`. This sandbox validates the
LIBRARY-DRIVEN RETRIEVAL PATTERN works on a hermetic test, which is
sufficient for spec row 9's pattern-level claim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dboxes PR predates the stale-docs CI gate; this is the regeneration it asks for. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
OpenCircuitDev
pushed a commit
that referenced
this pull request
Jun 11, 2026
#56 verdict Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jun 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First memory-category ACTIVE sandbox. Tests Mem0's library-driven retrieval (the SEARCH layer specifically) against spec row 9's pattern-level claim.
Local validation
Discovered upstream bug 🐛
Building this sandbox surfaced a real Mem0 v2 score-normalization bug worth filing upstream:
score=1.0regardless of query relevanceWorkaround applied: workload caps each user to ≤10 memories so top_k=10 covers the full per-user corpus, isolating retrieval coverage from ranking quality.
Follow-up sandbox slot added (INACTIVE): `bench/isolation/memory/mem0-ranking-quality-at-bounded-top-k` — tests recall@5 with 15 memories/user, captures the harder ranking case. Stays INACTIVE until Mem0 fixes the recency-bias bug or v3 ships.
What this validates vs doesn't
What's now true
🤖 Generated with Claude Code