feat(bench): activate mem0-v3-locomo sandbox by OpenCircuitDev · Pull Request #71 · OpenCircuitDev/opencircuitmodel

OpenCircuitDev · 2026-06-11T23:33:59Z

Summary

Activates bench/isolation/memory/mem0-v3-locomo — the LoCoMo recall sandbox for Mem0 v3 library-driven retrieval (spec row 9).
Operator ratified NEEDS_APPROVAL on 2026-06-11 (LoCoMo = fetch-on-run, CC BY-NC, not bundled in repo).
Harness built: bench.py (~270 lines) + docker-compose.yml. Mirrors the hermetic pattern of mem0-library-retrieval-recall (#56, "feat(bench): flip mem0-library-retrieval-recall to ACTIVE + discovered upstream bug"): MiniLM-L6-v2 + faiss-cpu + infer=False. Adds the amnesia-ab Ollama host.docker.internal pattern for the optional --with-llm diagnostic.
Dataset: locomo10.json fetched at run time, SHA256-pinned to upstream commit. 10 conversations, 5,882 turns, 1,986 QA items.
Status flipped: INACTIVE → ACTIVE; blocked_on cleared; model name corrected (Qwen3 8B → llama3 8B Q4 via Ollama).
Verdict pending: contract requires locomo_recall_score ≥ 88 confirm / < 80 refute.
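
The confirm/refute thresholds in the last item can be read as a three-way rule. A minimal sketch, assuming the score is on a 0 to 100 scale; the function name and the "inconclusive" label for the 80 to 88 band are hypothetical, since the contract as quoted here does not say what happens in that gap:

```python
def classify_verdict(locomo_recall_score: float,
                     confirm_at: float = 88.0,
                     refute_below: float = 80.0) -> str:
    """Map a LoCoMo recall score to a verdict label."""
    if locomo_recall_score >= confirm_at:
        return "confirm"
    if locomo_recall_score < refute_below:
        return "refute"
    # Assumption: the quoted contract does not name this band.
    return "inconclusive"
```

Keeping the thresholds as parameters means the rule can follow expected.json if the contract changes.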

Test plan

test_locomo_sandbox_structure.py — structure smoke: PASSES locally
test_locomo_bench_local.py::test_session_regex_only_matches_pure_session_keys — PASSES locally
test_locomo_bench_local.py mem0-dependent tests — SKIP on Python 3.13 (run in Docker py 3.11); will execute in Docker via verdict run
Docker verdict run: cd bench/isolation/memory/mem0-v3-locomo && docker compose run --rm bench
Record verdict in README + bench/history.jsonl; regen docs/coverage.md + docs/metrics.md

Notes

locomo10.json is not committed (.gitignored, fetch-on-run). CC BY-NC 4.0 attribution is in README.md.
--with-llm mode is optional (off by default) — routes through Ollama llama3:latest on host.docker.internal:11434. The verdict path is retrieval-only (no Ollama needed).
Wall-time estimate: ~6-8 min/repeat, ~18-25 min for 3 repeats.
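
For readers unfamiliar with the fetch-on-run pattern, here is a rough sketch of a SHA256-pinned download. It is not the code in bench.py: `fetch_pinned`, its parameters, and the caching behaviour are assumptions for illustration, using only the standard library.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Hex SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fetch_pinned(url: str, expected_sha256: str, dest: Path) -> Path:
    """Download url to dest; refuse to keep bytes whose SHA256 differs from the pin."""
    # Reuse a cached copy only if it still matches the pin.
    if dest.exists() and sha256_of(dest.read_bytes()) == expected_sha256:
        return dest
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    actual = sha256_of(data)
    if actual != expected_sha256:
        # Nothing is written to disk on a mismatch.
        raise ValueError(
            f"SHA256 mismatch for {url}: expected {expected_sha256}, got {actual}"
        )
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return dest
```

Verifying before writing means a changed or corrupted upstream file fails the run loudly instead of silently changing the benchmark input.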

🤖 Generated with Claude Code

Build the LoCoMo recall harness per the merged plan. Operator ratified fetch-on-run (NEEDS_APPROVAL resolved 2026-06-11). All blocked_on items cleared; status flipped INACTIVE -> ACTIVE.

Files created:
docker-compose.yml: Python 3.11 + SHA-pinned LOCOMO_URL + extra_hosts
bench.py: fetch (SHA256-pinned) + build_hermetic_mem0 + ingest + score (per-QA recall + true flat token list) + main + optional --with-llm Ollama diagnostic
.gitignore: _locomo_workload/ + _mem0_locomo_faiss/ + outputs.json
test_locomo_sandbox_structure.py: PASSES locally
test_locomo_bench_local.py: regex PASSES; mem0 tests skip on py 3.12+

Files modified:
expected.json: ACTIVE; blocked_on cleared; model corrected (llama3 8B Q4)
README.md: CC BY-NC 4.0 attribution + run instructions + wall-time note
pyproject.toml: register requires_mem0 pytest marker
docs/coverage.md + docs/metrics.md: regenerated

Verdict pending: docker compose run --rm bench (in bench/isolation/memory/mem0-v3-locomo)

Relative paths caused a doubling bug: cwd=sandbox_path + -f sandbox_path/compose.yml combined to double the prefix when Docker resolved the path relative to the changed cwd. All 9 existing runner tests pass.
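
The fix described here amounts to resolving the sandbox path once, before it is used both as the working directory and as the prefix of the compose file. A minimal sketch of that idea; `build_compose_invocation` and its default file name are hypothetical and not the actual `_execute_compose` code:

```python
from pathlib import Path


def build_compose_invocation(sandbox_path: str,
                             compose_name: str = "docker-compose.yml"):
    """Return (argv, cwd) for a compose run, with every path made absolute first."""
    # Resolve against the caller's cwd *before* any cwd change happens.
    sandbox = Path(sandbox_path).resolve()
    compose_file = sandbox / compose_name
    argv = ["docker", "compose", "-f", str(compose_file), "run", "--rm", "bench"]
    return argv, str(sandbox)
```

The result could then be passed to `subprocess.run(argv, cwd=cwd, check=True)`. Because the `-f` argument is already absolute, changing the working directory no longer affects how it is interpreted.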

… harness hardening

The keystone measurement: pure-vector Mem0 (faiss, no BM25/entity fusion, no fact extraction, MiniLM embedder) scores 30.79 LoCoMo recall on the operator dev box, nowhere near the published 91.6 (their production tri-signal config).

Decision-rule branch: config investigation, with an externally-evidenced fix staircase (BM25+RRF, cross-encoder rerank, Qwen3-class embedder; Hindsight reaches 89.6 with exactly this recipe).

Harness hardening from the live runs:
pip cache volume (first run burned ~50 min re-downloading torch and died silently)
CPU-only torch index (default Linux torch drags ~3GB of CUDA into a CPU container)
unbuffered un-quieted output (silence is indistinguishable from a hang)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
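
For context on the first step of that staircase, reciprocal rank fusion (RRF) merges several ranked lists by summing 1/(k + rank) per document. The sketch below is a generic illustration, not something this PR implements; the function name is hypothetical and k = 60 is the conventional default, not a value taken from this harness.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["b", "c", "a"]` puts `"b"` first, since it sits near the top of both lists.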

CI runs py3.11 WITHOUT mem0ai installed, so the Python-version gate let the import fire and 3 tests failed. An availability gate covers both that case and any future py3.12-compatible mem0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
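
An availability gate of this kind checks whether the module can be imported rather than which Python version is running. A minimal sketch using only the standard library; `module_available` is a hypothetical helper, and the assumption that the mem0ai package imports as `mem0` is mine, not stated in the PR:

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if `name` can be found by the import system, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for a missing parent package or a module with a broken __spec__.
        return False
```

In a test module this could back a skip marker such as `pytest.mark.skipif(not module_available("mem0"), reason="mem0ai not installed")`, so the tests skip wherever the library is absent, regardless of interpreter version.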

Brand and others added 4 commits June 11, 2026 17:32

fix(bench): resolve sandbox_path to absolute in _execute_compose

26dbd0c


OpenCircuitDev merged commit 65cc8ae into main Jun 12, 2026
1 check passed

OpenCircuitDev merged 4 commits into main from feat/mem0-v3-locomo-activation
