feat(bench): activate mem0-v3-locomo sandbox #71

Merged
OpenCircuitDev merged 4 commits into main from feat/mem0-v3-locomo-activation
Jun 12, 2026

@OpenCircuitDev

Summary

  • Activates bench/isolation/memory/mem0-v3-locomo — the LoCoMo recall sandbox for Mem0 v3 library-driven retrieval (spec row 9).
  • Operator ratified the NEEDS_APPROVAL item on 2026-06-11: LoCoMo is fetch-on-run, CC BY-NC, and not bundled in the repo.
  • Harness built: bench.py (~270 lines) + docker-compose.yml. Mirrors the hermetic pattern of mem0-library-retrieval-recall (#56, "feat(bench): flip mem0-library-retrieval-recall to ACTIVE + discovered upstream bug"): MiniLM-L6-v2 + faiss-cpu + infer=False. Adds the amnesia-ab Ollama host.docker.internal pattern for the optional --with-llm diagnostic.
  • Dataset: locomo10.json fetched at run time, SHA256-pinned to upstream commit. 10 conversations, 5,882 turns, 1,986 QA items.
  • Status flipped: INACTIVE → ACTIVE; blocked_on cleared; model name corrected (Qwen3 8B → llama3 8B Q4 via Ollama).
  • Verdict pending: contract requires locomo_recall_score ≥ 88 confirm / < 80 refute.
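
The verdict contract above reads as a threshold rule. A minimal sketch follows, assuming that scores in the band between the two thresholds count as inconclusive (the PR text does not say what happens there). The function name `classify_verdict` and the label strings are hypothetical and not taken from bench.py.

```python
CONFIRM_AT = 88.0    # from the contract: score >= 88 confirms
REFUTE_BELOW = 80.0  # from the contract: score < 80 refutes


def classify_verdict(locomo_recall_score: float) -> str:
    """Map a LoCoMo recall score to a verdict label (hypothetical helper)."""
    if locomo_recall_score >= CONFIRM_AT:
        return "confirm"
    if locomo_recall_score < REFUTE_BELOW:
        return "refute"
    # Assumption: the 80-88 band is not defined in the PR description.
    return "inconclusive"
```

Keeping the two thresholds as named constants makes it harder for the contract in expected.json and the scoring code to drift apart.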

Test plan

  • test_locomo_sandbox_structure.py — structure smoke: PASSES locally
  • test_locomo_bench_local.py::test_session_regex_only_matches_pure_session_keys — PASSES locally
  • test_locomo_bench_local.py mem0-dependent tests — SKIP on Python 3.13 (run in Docker py 3.11); will execute in Docker via verdict run
  • Docker verdict run: cd bench/isolation/memory/mem0-v3-locomo && docker compose run --rm bench
  • Record verdict in README + bench/history.jsonl; regen docs/coverage.md + docs/metrics.md
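
The regex test named in the plan guards against timestamp keys being treated as sessions. A rough sketch of that idea, assuming conversation keys of the form `session_3` (turn lists) and `session_3_date_time` (timestamps). The pattern and the helper `pure_session_keys` are illustrative, not the ones in bench.py.

```python
import re

# Assumption: only keys that are exactly "session_<number>" hold turns.
SESSION_KEY = re.compile(r"session_\d+")


def pure_session_keys(conversation: dict) -> list[str]:
    """Return session keys in numeric order, ignoring *_date_time and speaker keys."""
    # fullmatch rejects "session_2_date_time", which a prefix match would accept.
    keys = [k for k in conversation if SESSION_KEY.fullmatch(k)]
    return sorted(keys, key=lambda k: int(k.split("_")[1]))
```

Sorting on the numeric suffix rather than the raw string keeps `session_10` after `session_2`, which matters if ingestion order is meant to follow the conversation timeline.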

Notes

  • locomo10.json is not committed (.gitignored, fetch-on-run). CC BY-NC 4.0 attribution is in README.md.
  • --with-llm mode is optional (off by default) — routes through Ollama llama3:latest on host.docker.internal:11434. The verdict path is retrieval-only (no Ollama needed).
  • Wall-time estimate: ~6-8 min/repeat, ~18-25 min for 3 repeats.
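
The fetch-on-run, SHA256-pinned dataset handling described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in bench.py: the helper names `sha256_of` and `fetch_pinned` are hypothetical, and the caller supplies the URL and the expected digest.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Hex SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fetch_pinned(url: str, expected_sha256: str, dest: Path) -> Path:
    """Download url to dest, refusing any payload whose digest differs from the pin."""
    # Reuse a cached copy only if it still matches the pinned digest.
    if dest.exists() and sha256_of(dest.read_bytes()) == expected_sha256:
        return dest
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    actual = sha256_of(data)
    if actual != expected_sha256:
        raise ValueError(f"SHA256 mismatch: expected {expected_sha256}, got {actual}")
    # Write only after verification, so an unverified file never lands on disk.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return dest
```

Verifying before writing means a changed or corrupted upstream file fails the run loudly instead of silently changing the benchmark workload.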

🤖 Generated with Claude Code

Brand and others added 4 commits June 11, 2026 17:32
Build the LoCoMo recall harness per the merged plan. Operator ratified
fetch-on-run (NEEDS_APPROVAL resolved 2026-06-11). All blocked_on items
cleared; status flipped INACTIVE -> ACTIVE.

Files created:
  docker-compose.yml  Python 3.11 + SHA-pinned LOCOMO_URL + extra_hosts
  bench.py            fetch (SHA256-pinned) + build_hermetic_mem0 + ingest +
                      score (per-QA recall + true flat token list) + main +
                      optional --with-llm Ollama diagnostic
  .gitignore          _locomo_workload/ + _mem0_locomo_faiss/ + outputs.json
  test_locomo_sandbox_structure.py  (PASSES locally)
  test_locomo_bench_local.py        regex PASSES; mem0 tests skip on py 3.12+

Files modified:
  expected.json   ACTIVE; blocked_on cleared; model corrected (llama3 8B Q4)
  README.md       CC BY-NC 4.0 attribution + run instructions + wall-time note
  pyproject.toml  register requires_mem0 pytest marker
  docs/coverage.md + docs/metrics.md  regenerated

Verdict pending: docker compose run --rm bench (in bench/isolation/memory/mem0-v3-locomo)

Relative paths caused a doubling bug: with cwd=sandbox_path and -f sandbox_path/compose.yml,
Docker resolved the -f path relative to the changed cwd, so the prefix appeared twice.
All 9 existing runner tests pass.
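
One way to avoid the doubling described above is to resolve the sandbox path to an absolute path before building the command, so the working directory and the `-f` argument cannot compound. A minimal sketch, assuming the compose file is named docker-compose.yml and the service is `bench`. The helper `compose_command` is hypothetical and is not the runner's actual code.

```python
from pathlib import Path


def compose_command(sandbox_path: Path) -> tuple[list[str], str]:
    """Build a docker compose argv plus a cwd that agree on the compose file."""
    # Absolute path: a later change of cwd cannot prefix it a second time.
    sandbox = sandbox_path.resolve()
    compose_file = sandbox / "docker-compose.yml"
    argv = ["docker", "compose", "-f", str(compose_file), "run", "--rm", "bench"]
    return argv, str(sandbox)
```

The returned pair is intended for something like `subprocess.run(argv, cwd=cwd)`. With a relative `-f` value the same call would look for the file under `sandbox_path/sandbox_path/`.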
… harness hardening

The keystone measurement: pure-vector Mem0 (faiss, no BM25/entity fusion,
no fact extraction, MiniLM embedder) scores 30.79 LoCoMo recall on the
operator dev box — nowhere near the published 91.6 (their production
tri-signal config). Decision-rule branch: config investigation, with an
externally-evidenced fix staircase (BM25+RRF, cross-encoder rerank,
Qwen3-class embedder — Hindsight reaches 89.6 with exactly this recipe).
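
The first step of the fix staircase, fusing a vector ranking with a BM25 ranking via reciprocal rank fusion (RRF), can be illustrated generically. This sketch is not Mem0's API and not code from this PR. It assumes each retriever returns an ordered list of document ids and uses the commonly cited constant k=60.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]  # hypothetical faiss ranking
bm25_hits = ["b", "d", "a"]    # hypothetical BM25 ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF uses only ranks, so it needs no score normalization between the embedding similarity and the BM25 score, which is one reason it is a low-risk first step before adding a cross-encoder reranker.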

Harness hardening from the live runs: pip cache volume (first run burned
~50 min re-downloading torch and died silently), CPU-only torch index
(default Linux torch drags ~3GB of CUDA into a CPU container), unbuffered
un-quieted output (silence is indistinguishable from a hang).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

CI runs py3.11 WITHOUT mem0ai installed, so the version gate let the import
fire and 3 tests failed. An availability gate covers both that case and any
future py3.12-compatible mem0.
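
The difference between a version gate and an availability gate can be sketched as follows. This is an assumption about the shape of the fix, not the code in the test module: the helper `module_available` is hypothetical, and `mem0` is the import name of the mem0ai package.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True when the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Availability gate: depends on what is installed, not on sys.version_info.
MEM0_AVAILABLE = module_available("mem0")

# In a test module this would typically feed a skip condition, for example:
#   pytestmark = pytest.mark.skipif(not MEM0_AVAILABLE, reason="mem0ai not installed")
```

Because the check asks whether the import would succeed, it skips on a py3.11 CI runner without mem0ai and runs on any interpreter where the package is installed.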

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
OpenCircuitDev merged commit 65cc8ae into main on Jun 12, 2026
1 check passed