feat(bench): activate mem0-v3-locomo sandbox #71
Merged
Build the LoCoMo recall harness per the merged plan. Operator ratified
fetch-on-run (NEEDS_APPROVAL resolved 2026-06-11). All blocked_on items
cleared; status flipped INACTIVE -> ACTIVE.
Files created:
  docker-compose.yml   Python 3.11 + SHA-pinned LOCOMO_URL + extra_hosts
  bench.py             fetch (SHA256-pinned) + build_hermetic_mem0 + ingest +
                       score (per-QA recall + true flat token list) + main +
                       optional --with-llm Ollama diagnostic
  .gitignore           _locomo_workload/ + _mem0_locomo_faiss/ + outputs.json
  test_locomo_sandbox_structure.py   (PASSES locally)
  test_locomo_bench_local.py         regex PASSES; mem0 tests skip on py 3.12+
Files modified:
  expected.json        ACTIVE; blocked_on cleared; model corrected (llama3 8B Q4)
  README.md            CC BY-NC 4.0 attribution + run instructions + wall-time note
  pyproject.toml       register requires_mem0 pytest marker
  docs/coverage.md + docs/metrics.md   regenerated
Verdict pending: docker compose run --rm bench (in bench/isolation/memory/mem0-v3-locomo)
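The "fetch (SHA256-pinned)" step listed for bench.py can be illustrated with a minimal sketch. It assumes the pin is enforced by hashing the downloaded bytes and comparing against a hard-coded digest. The names `verify_sha256` and `fetch_pinned` are hypothetical and are not taken from bench.py.

```python
import hashlib
import urllib.request


def verify_sha256(data: bytes, expected_hex: str) -> None:
    """Raise ValueError unless the SHA-256 of `data` equals the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"SHA256 mismatch: expected {expected_hex}, got {actual}")


def fetch_pinned(url: str, expected_hex: str, timeout: float = 60.0) -> bytes:
    """Download `url` and return its bytes only if they match the pinned digest."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    verify_sha256(data, expected_hex)
    return data
```

Checking the digest before anything is written to the workload directory means a changed or truncated upstream file fails loudly instead of silently shifting the benchmark.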
Relative paths caused a doubling bug: cwd=sandbox_path + -f sandbox_path/compose.yml combined to double the prefix when Docker resolved the path relative to the changed cwd. All 9 existing runner tests pass.
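The doubling described above can be reproduced without Docker. The sketch below only models how a relative `-f` argument is resolved from the child process's working directory. `effective_compose_path` and the `/repo` checkout location are hypothetical, and using absolute paths is shown as one possible fix, not necessarily the one applied here.

```python
from pathlib import PurePosixPath


def effective_compose_path(cwd: str, compose_arg: str) -> PurePosixPath:
    """Path a child process sees when `-f compose_arg` is resolved from `cwd`."""
    arg = PurePosixPath(compose_arg)
    # An absolute -f argument ignores cwd; a relative one is joined onto it.
    return arg if arg.is_absolute() else PurePosixPath(cwd) / arg


sandbox = "bench/isolation/memory/mem0-v3-locomo"

# Buggy: both cwd and -f carry the same relative sandbox prefix.
buggy = effective_compose_path(sandbox, f"{sandbox}/compose.yml")

# One possible fix: make the sandbox path absolute before building either value.
repo_root = "/repo"  # hypothetical checkout location
abs_sandbox = f"{repo_root}/{sandbox}"
fixed = effective_compose_path(abs_sandbox, f"{abs_sandbox}/compose.yml")

print(buggy)
print(fixed)
```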
… harness hardening

The keystone measurement: pure-vector Mem0 (faiss, no BM25/entity fusion, no fact extraction, MiniLM embedder) scores 30.79 LoCoMo recall on the operator dev box, nowhere near the published 91.6 (their production tri-signal config). Decision-rule branch: config investigation, with an externally-evidenced fix staircase (BM25+RRF, cross-encoder rerank, Qwen3-class embedder; Hindsight reaches 89.6 with exactly this recipe).

Harness hardening from the live runs:
- pip cache volume (first run burned ~50 min re-downloading torch and died silently)
- CPU-only torch index (default Linux torch drags ~3GB of CUDA into a CPU container)
- unbuffered un-quieted output (silence is indistinguishable from a hang)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
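For readers unfamiliar with the BM25+RRF step named in the fix staircase, here is a minimal sketch of reciprocal rank fusion in its textbook form. It is not code from this harness or from Mem0. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the turn ids are illustrative assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a vector retriever and a BM25 retriever.
vector_hits = ["t3", "t1", "t7"]
bm25_hits = ["t1", "t9", "t3"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['t1', 't3', 't9', 't7']
```

Because RRF uses only ranks, the two retrievers' raw scores never need to be put on a common scale.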
CI runs py3.11 WITHOUT mem0ai installed: the version gate let the import fire and 3 tests failed. Availability gate covers both that and any future py3.12-compatible mem0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
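The difference between the two gates can be sketched as follows. This is an illustration under assumptions, not the test file's actual code: `module_available` is a hypothetical helper, and the pytest wiring shown in the comment is one plausible way to connect it to a skip marker.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """True if top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


# Version gate (the approach that failed): true on CI's py3.11 even when
# mem0 is not installed, so the import still fires and the tests error out.
VERSION_OK = sys.version_info < (3, 12)

# Availability gate: depends only on whether the package can be imported,
# so it also keeps working if a py3.12-compatible mem0 appears later.
MEM0_AVAILABLE = module_available("mem0")

# With pytest, the flag could feed a skip marker, for example:
#   requires_mem0 = pytest.mark.skipif(
#       not MEM0_AVAILABLE, reason="mem0 is not installed")
```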
Summary
- bench/isolation/memory/mem0-v3-locomo: the LoCoMo recall sandbox for Mem0 v3 library-driven retrieval (spec row 9).
- bench.py (~270 lines) + docker-compose.yml. Mirrors the mem0-library-retrieval-recall (feat(bench): flip mem0-library-retrieval-recall to ACTIVE + discovered upstream bug #56) hermetic pattern (MiniLM-L6-v2 + faiss-cpu + infer=False) plus the amnesia-ab Ollama host.docker.internal pattern for the optional --with-llm diagnostic.
- locomo10.json fetched at run time, SHA256-pinned to upstream commit. 10 conversations, 5,882 turns, 1,986 QA items.
- INACTIVE → ACTIVE; blocked_on cleared; model name corrected (Qwen3 8B → llama3 8B Q4 via Ollama).
- locomo_recall_score ≥ 88 confirm / < 80 refute.

Test plan
- test_locomo_sandbox_structure.py: structure smoke, PASSES locally
- test_locomo_bench_local.py::test_session_regex_only_matches_pure_session_keys: PASSES locally
- test_locomo_bench_local.py mem0-dependent tests: SKIP on Python 3.13 (run in Docker py 3.11); will execute in Docker via verdict run
- cd bench/isolation/memory/mem0-v3-locomo && docker compose run --rm bench
- bench/history.jsonl; regen docs/coverage.md + docs/metrics.md

Notes
- locomo10.json is not committed (.gitignored, fetch-on-run). CC BY-NC 4.0 attribution is in README.md.
- --with-llm mode is optional (off by default): it routes through Ollama llama3:latest on host.docker.internal:11434. The verdict path is retrieval-only (no Ollama needed).

🤖 Generated with Claude Code
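To make the locomo_recall_score thresholds concrete, here is a rough sketch of how a per-QA recall could be averaged and compared against them. It assumes each QA item carries a list of gold evidence turn ids, that recall is the fraction of those ids found among the retrieved turns, and that items without evidence are excluded from the mean. The function names, the sample ids, and the "undecided" label for the 80 to 88 band are assumptions, not taken from bench.py or expected.json.

```python
def qa_recall(evidence_ids, retrieved_ids):
    """Fraction of a QA item's gold evidence turns present in the retrieved set."""
    gold = set(evidence_ids)
    if not gold:
        return None  # assumption: items without annotated evidence are skipped
    return len(gold & set(retrieved_ids)) / len(gold)


def recall_score(items):
    """Mean per-QA recall on a 0-100 scale over (evidence, retrieved) pairs."""
    recalls = [r for r in (qa_recall(e, got) for e, got in items) if r is not None]
    return 100.0 * sum(recalls) / len(recalls) if recalls else 0.0


def verdict(score, confirm_at=88.0, refute_below=80.0):
    """Apply the stated thresholds: >= 88 confirms, < 80 refutes."""
    if score >= confirm_at:
        return "confirm"
    if score < refute_below:
        return "refute"
    return "undecided"  # assumption: the source does not name the middle band


# Hypothetical QA items: (gold evidence ids, retrieved ids).
items = [
    (["D1:3", "D1:4"], ["D1:3", "D2:1"]),  # recall 0.5
    (["D2:7"], ["D2:7"]),                  # recall 1.0
    ([], ["D1:1"]),                        # skipped, no evidence
]
score = recall_score(items)
print(score, verdict(score))
```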