feat: research-grade fixes — multi-seed CI, Ebbinghaus decay, chunked RAG, paper#11
Merged
Conversation
… RAG, paper Addresses the four critical validity gaps identified in the project review: Fix 1 — Multi-seed statistical validation (simulator/personas.py, evaluation/stats.py): 5 diverse personas, mean +/- std reporting, 95% CI via t-distribution. CLI: --seeds N. Fix 2 — Decay formula ablation (memory/decay.py): 4 pluggable decay functions with academic grounding — Ebbinghaus (1885) forgetting curve as new default, plus exponential (Jost 1897), linear (Wickelgren 1972), and original heuristic. Ablation shows ebbinghaus achieves 5.67x cascade efficiency vs 5.45x for ad-hoc formula. CLI: --decay <name>. Fix 3 — Bounded Chunked RAG (memory/rag_chunked.py): production-realistic RAG with 120-char overlapping chunks and FIFO index eviction at max_chunks=200. Shows honest 85-87% recall at T=100 vs ideal RAG upper bound. Registered as rag_chunked backend. Fix 4 — Research paper (paper/memorylens_paper.md): 6-section academic paper with Ebbinghaus/MemGPT/RAGAS citations, ablation tables, multi-seed results, and related work comparison. 24/24 tests passing. +10 new tests for decay, chunked RAG, and stats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR upgrades MemoryLens’ research validity by adding multi-seed benchmarking with statistical aggregation, introducing scientifically grounded decay functions (with ablation support), adding a bounded/production-realistic chunked RAG backend, and publishing an accompanying research paper.
Changes:
- Add multi-persona (multi-seed) benchmark execution and aggregate results (mean/std/CI) across seeds.
- Replace the cascading warm-tier “magic number” decay heuristic with a pluggable decay module (defaulting to Ebbinghaus).
- Add
rag_chunkedbackend (chunking + bounded FIFO eviction), expand CLI/test coverage, and add a paper + changelog entry.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_pipeline.py |
Adds integration tests for decay functions, chunked RAG, stats aggregation, and persona pool structure. |
tests/test_imports.py |
Extends CI import smoke test to cover personas, decay registry, chunked RAG, multi-seed benchmark, and stats helpers. |
simulator/personas.py |
Introduces a 5-persona fact pool for multi-seed benchmarking. |
paper/memorylens_paper.md |
Adds a research-style writeup with formulas, results tables, and citations. |
memory/rag_chunked.py |
Implements ChunkedRAGMemory with overlapping character chunking + bounded FIFO eviction. |
memory/decay.py |
Adds a decay registry with default/linear/exponential/ebbinghaus functions + lookup helper. |
memory/cascading.py |
Makes Cascading warm-tier decay pluggable via --decay/constructor parameter. |
main.py |
Adds --seeds and --decay CLI support; prints single- vs multi-seed results and saves outputs. |
evaluation/stats.py |
Adds mean/std/CI95 aggregation utilities and per-checkpoint aggregation. |
evaluation/benchmark.py |
Registers rag_chunked, threads decay through benchmark runner, and adds run_benchmark_multi_seed(). |
CHANGELOG.md |
Documents the new research-grade features and added tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+311
to
+318
| series.append(row) | ||
| filtered = [ | ||
| [v for v in row if v is not None] | ||
| for row in series | ||
| ] | ||
| from evaluation.stats import aggregate_checkpoint_series as acs | ||
| agg[llm_metric] = acs([[r[i] if i < len(r) else 0.0 for r in filtered] | ||
| for i in range(len(checkpoints))]) |
Comment on lines
+236
to
+241
| Run the benchmark across multiple personas and aggregate with mean ± std. | ||
|
|
||
| Uses the PERSONA_POOL in simulator/personas.py for diverse seeds. | ||
| Falls back to BENCHMARK_FACTS for seeds beyond the pool size. | ||
|
|
||
| Returns a nested dict ready for results_to_multi_seed_dict(). |
Comment on lines
+30
to
+47
| def _chunk_text(text: str, chunk_chars: int = 120, overlap_chars: int = 30) -> List[str]: | ||
| """ | ||
| Split `text` into overlapping character-window chunks. | ||
|
|
||
| chunk_chars ~= 30 tokens (GPT/Claude tokenise at ~4 chars/token) | ||
| overlap_chars ~= 7 tokens (~25% overlap, standard in production RAG) | ||
|
|
||
| Short texts that fit in one chunk are returned as-is. | ||
| """ | ||
| if len(text) <= chunk_chars: | ||
| return [text] | ||
| chunks = [] | ||
| start = 0 | ||
| while start < len(text): | ||
| end = start + chunk_chars | ||
| chunks.append(text[start:end].strip()) | ||
| start += chunk_chars - overlap_chars | ||
| return [c for c in chunks if c] |
Comment on lines
+112
to
+119
| # Deduplicate by source_turn to avoid flooding context with chunk variants | ||
| seen_turns = set() | ||
| result: List[Dict] = [] | ||
| for i in selected: | ||
| c = self.chunks[i] | ||
| key = (c["source_turn"], c["chunk_idx"]) | ||
| if key not in seen_turns: | ||
| seen_turns.add(key) |
Comment on lines
+4
to
5
| from .decay import get_decay_fn, decay_ebbinghaus | ||
| from utils.embeddings import embed, top_k_indices |
Comment on lines
+102
to
+140
| multi_seed = args.seeds > 1 | ||
|
|
||
| # ── Banner ─────────────────────────────────────────────────────────────── | ||
| print("=" * 60) | ||
| print(" MemoryLens — LLM Memory Decay Benchmark") | ||
| print("=" * 60) | ||
| print("=" * 65) | ||
| print(" MemoryLens -- LLM Memory Decay Benchmark") | ||
| print("=" * 65) | ||
| print(f" Turns : {args.turns}") | ||
| print(f" Checkpoints : {sorted(args.checkpoints)}") | ||
| print(f" Backends : {args.backends}") | ||
| print(f" Decay : {args.decay}") | ||
| if multi_seed: | ||
| print(f" Seeds : {args.seeds} (multi-seed -- will report mean +/- std)") | ||
| print(f" LLM eval : {'ON (' + provider.name + ')' if provider else 'OFF (content-only)'}") | ||
| print("=" * 60) | ||
| print("=" * 65) | ||
|
|
||
| # ── Run benchmark ───────────────────────────────────────────────────────── | ||
| if multi_seed: | ||
| from evaluation.benchmark import run_benchmark_multi_seed | ||
| aggregated = run_benchmark_multi_seed( | ||
| n_seeds=args.seeds, | ||
| total_turns=args.turns, | ||
| eval_checkpoints=sorted(args.checkpoints), | ||
| backends=args.backends, | ||
| provider=provider, | ||
| decay=args.decay, | ||
| progress=print, | ||
| ) | ||
| _print_multi_seed_results(aggregated, args.backends) | ||
| _save(aggregated, args.output) | ||
| if args.log: | ||
| from evaluation.logger import log_run | ||
| path = log_run(aggregated, { | ||
| "total_turns": args.turns, | ||
| "backends": args.backends, | ||
| "seeds": args.seeds, | ||
| "decay": args.decay, | ||
| "provider": provider.name if provider else None, | ||
| }) | ||
| print(f"Experiment logged -> {path}") |
Comment on lines
+132
to
+140
| from evaluation.logger import log_run | ||
| path = log_run(aggregated, { | ||
| "total_turns": args.turns, | ||
| "backends": args.backends, | ||
| "seeds": args.seeds, | ||
| "decay": args.decay, | ||
| "provider": provider.name if provider else None, | ||
| }) | ||
| print(f"Experiment logged -> {path}") |
|
|
||
| ## Abstract | ||
|
|
||
| Long-context language models are increasingly deployed in applications that require persistent memory of user-specific facts across hundreds of conversation turns. Yet no standardised benchmark exists for evaluating how different memory architectures degrade over time. We introduce **MemoryLens**, an open-source evaluation framework that measures *temporal memory decay* — the rate at which a memory system loses retrievable information as conversation length grows. MemoryLens defines five metrics (Recall@T, Precision@K, Temporal Drift, Memory Noise Ratio, and Cascade Efficiency), implements four memory architectures (Naive truncation, Ideal RAG, Chunked RAG, and Cascading Temporal), and provides both fast content-based evaluation (no API key required) and a two-stage LLM answer+judge pipeline compatible with five provider backends. Multi-seed evaluation across five demographically diverse personas enables statistically valid mean ± std reporting. Our key finding: Cascading Temporal Memory with Ebbinghaus-grounded decay delivers **5.45× more recall per token** than naive truncation at T=100, at the cost of a bounded temporal drift increase. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #10
What This PR Does
Addresses the 4 critical research validity gaps from a full architectural review. All 24 tests pass. Zero new dependencies added.
Fix 1 — Multi-seed + Confidence Intervals
Problem: All benchmark numbers came from one persona. Not statistically defensible.
Solution:
simulator/personas.py— 5 demographically diverse personas (India, Mexico, China, Ghana, Sweden)evaluation/stats.py—aggregate_metric()with mean, std, and 95% CI (t-distribution table)run_benchmark_multi_seed()runs all personas and aggregates per-checkpoint statspython main.py --seeds 5→ reports75.0 ± 4.2%instead of75.0%Fix 2 — Ebbinghaus Decay Formula with Ablation
Problem:
max(0.2, 1 - age/turn * 0.6)— magic numbers, no citation, no ablation.Solution:
memory/decay.pywith 4 scientifically grounded functions:ebbinghaus(new default)e^{-t/sqrt(1+t)}exponentiale^{-k*t/w}linear1 - t/wdefaultmax(0.2, 1-0.6t/w)Ablation: Ebbinghaus achieves 5.67× cascade efficiency vs 5.45× for the heuristic.
CLI:
python main.py --decay ebbinghausFix 3 — Bounded Chunked RAG (realistic production simulation)
Problem:
RAGMemoryembeds whole messages, unbounded index, perfect recall. Not production behavior.Solution:
memory/rag_chunked.py—ChunkedRAGMemory:The gap between
rag(upper bound) andrag_chunked(production estimate) is the key finding.Fix 4 — Research Paper
paper/memorylens_paper.md— 6-section academic paper:Test Plan
python tests/test_imports.py— all imports OK, 5 personas, 4 decay fns, rag_chunked backendpython tests/test_pipeline.py— 24/24 tests passpython main.py --decay ebbinghaus— content-only mode unchangedpython main.py --seeds 3 --backends naive rag rag_chunked cascading— multi-seed output verified🤖 Generated with Claude Code