diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index 94cb0ca..c052bbb 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -1,6 +1,6 @@ --- name: Feature Request -about: Propose a new feature, metric, or memory backend +about: Propose a new feature, LLM memory backend, evaluation metric, or benchmark scenario for MemoryLens title: "[FEAT] " labels: enhancement assignees: '' @@ -8,24 +8,36 @@ assignees: '' ## What problem does this solve? - + ## Proposed solution - + ## Which layer does this touch? -- [ ] `simulator/` — conversation generation or fact injection -- [ ] `memory/` — new or improved memory backend -- [ ] `evaluation/` — new metric or benchmark change -- [ ] `dashboard.py` — visualisation +- [ ] `simulator/` — conversation generation, fact injection, or new domain scenario +- [ ] `memory/` — new or improved memory backend (LLM memory architecture) +- [ ] `evaluation/` — new metric or multi-seed benchmark change +- [ ] `utils/providers.py` — new LLM provider +- [ ] `dashboard.py` — visualisation (Streamlit + Plotly) - [ ] `main.py` / CLI -- [ ] Documentation +- [ ] Documentation / research paper + +## Expected impact on recall or efficiency + + ## Alternatives considered - + ## Are you willing to implement this? @@ -35,4 +47,4 @@ assignees: '' ## Additional context - + diff --git a/.github/ISSUE_TEMPLATE/new_backend.md b/.github/ISSUE_TEMPLATE/new_backend.md index 9a5e362..2ac77a4 100644 --- a/.github/ISSUE_TEMPLATE/new_backend.md +++ b/.github/ISSUE_TEMPLATE/new_backend.md @@ -1,6 +1,6 @@ --- name: New Memory Backend -about: Propose or claim a new memory backend implementation +about: Propose or claim a new LLM memory architecture implementation for the MemoryLens benchmark title: "[BACKEND] " labels: enhancement, new-backend assignees: '' @@ -8,34 +8,52 @@ assignees: '' ## Backend name - + -## What strategy does it use? 
+## Memory strategy - + -## Why is it interesting to benchmark? +## Research hypothesis - + + +## Expected Recall@T curve + + ## Implementation sketch ```python class YourMemory(BaseMemory): - name = "your_backend" + name = "your_backend" # used in --backends flag - def add_message(self, role, content, turn): ... - def get_context(self, query, current_turn): ... - def reset(self): ... + def add_message(self, role: str, content: str, turn: int) -> None: ... + def get_context(self, query: str, current_turn: int) -> List[Dict]: ... + def reset(self) -> None: ... ``` ## Dependencies required - + + +## Related work + + ## Are you claiming this to implement? - [ ] Yes — I'll open a PR within 2 weeks - [ ] No — leaving it open for the community - + diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..d5dcbc0 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,50 @@ +cff-version: 1.2.0 +message: "If you use MemoryLens in your research, please cite it as below." +type: software +title: "MemoryLens: A Temporal Decay Benchmark for LLM Memory Architectures" +abstract: > + MemoryLens is an open-source evaluation framework for measuring LLM memory decay + — how AI memory systems forget personal facts across long conversations. It implements + five memory architectures (Naive, Ideal RAG, Chunked RAG, Cascading Temporal, SummaryMemory), + five evaluation metrics (Recall@T, Precision@K, Temporal Drift, Memory Noise Ratio, + Cascade Efficiency), Ebbinghaus-grounded temporal decay with ablation, multi-seed + statistical validation across five diverse personas, and a dual evaluation pipeline + (content-based + LLM answer+judge) supporting five provider backends. 
+authors:
+  - family-names: Srivastava
+    given-names: Neal
+    alias: Neal006
+    orcid: ""
+repository-code: "https://github.com/Neal006/memorylens"
+url: "https://github.com/Neal006/memorylens"
+license: MIT
+version: 0.3.0
+date-released: "2026-05-22"
+keywords:
+  - LLM memory
+  - memory decay
+  - LLM evaluation
+  - RAG evaluation
+  - temporal decay
+  - Ebbinghaus forgetting curve
+  - conversational AI
+  - memory benchmark
+  - long-term memory
+  - retrieval-augmented generation
+  - cascading memory
+  - LLM benchmark
+references:
+  - type: article
+    title: "RAGAS: Automated Evaluation of Retrieval Augmented Generation"
+    authors:
+      - family-names: Es
+        given-names: Shahul
+    year: 2023
+    url: "https://arxiv.org/abs/2309.15217"
+  - type: article
+    title: "MemGPT: Towards LLMs as Operating Systems"
+    authors:
+      - family-names: Packer
+        given-names: Charles
+    year: 2023
+    url: "https://arxiv.org/abs/2310.08560"
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f6de797..3530ba9 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
 # Contributing to MemoryLens
 
-First off — thank you for taking the time to contribute. MemoryLens is an open research tool and every contribution, from a typo fix to a new memory backend, matters.
+Thank you for contributing to the **open-source benchmark for LLM memory decay**. Every contribution — a bug report, a new memory backend, a documentation fix — makes the benchmark better for the entire AI/ML community.
 
 ---
 
@@ -8,9 +8,9 @@ First off — thank you for taking the time to contribute.
MemoryLens is an open - [Quick orientation](#quick-orientation) - [Development setup](#development-setup) -- [Project structure explained](#project-structure-explained) - [How to add a new memory backend](#how-to-add-a-new-memory-backend) - [How to add a new metric](#how-to-add-a-new-metric) +- [How to add a new persona / scenario](#how-to-add-a-new-persona--scenario) - [Running tests](#running-tests) - [Submitting a PR](#submitting-a-pr) - [Good first issues](#good-first-issues) @@ -20,16 +20,20 @@ First off — thank you for taking the time to contribute. MemoryLens is an open ## Quick orientation -MemoryLens has three moving parts: +MemoryLens benchmarks **LLM memory decay** — how AI memory systems forget personal facts over long conversations. It has three layers: ``` Simulator → Memory Backend → Evaluator → Dashboard -(generate (store + retrieve (measure (visualise - fake conv.) context) decay) results) +(generate (store + retrieve (5 metrics, (visualise + conversation context) dual mode) results) ``` Each layer is independently extensible. You can add a backend without touching the evaluator, and add a metric without touching the dashboard. +**Current backends:** `naive` · `rag` · `rag_chunked` · `cascading` · `summary` +**Current metrics:** Recall@T · Precision@K · Temporal Drift · Memory Noise Ratio · Cascade Efficiency +**LLM eval providers:** Groq · OpenAI · Anthropic · OpenRouter · Ollama + --- ## Development setup @@ -48,56 +52,21 @@ source .venv/bin/activate # Linux/macOS pip install -r requirements.txt # 4. Verify everything works (no API key needed) -python quick_demo.py - -# 5. Run the test suite python tests/test_pipeline.py -``` - -Set `TRANSFORMERS_NO_TF=1` and `USE_TF=0` in your environment if you have TensorFlow installed — this prevents a protobuf conflict. - ---- - -## Project structure explained +# 5. 
Optional: run multi-seed benchmark +python main.py --seeds 5 ``` -memorylens/ -│ -├── simulator/ # Synthetic conversation engine -│ ├── facts.py # Fact definitions + BENCHMARK_FACTS list -│ └── conversation.py # Generates turn-by-turn conversation events -│ -├── memory/ # Memory backend implementations -│ ├── base.py # Abstract base class — every backend implements this -│ ├── naive.py # Naive: full history, truncate oldest on overflow -│ ├── rag.py # RAG: embed all messages, retrieve top-K by cosine sim -│ └── cascading.py # Cascading Temporal: hot/warm/cold three-tier memory -│ -├── evaluation/ # Metrics and orchestration -│ ├── metrics.py # Content-based metric functions (no LLM needed) -│ ├── benchmark.py # Benchmark runner — wires simulator + memory + metrics -│ ├── llm_judge.py # Optional: Groq-powered answer quality judge -│ └── logger.py # Experiment logger → JSON + CSV -│ -├── utils/ -│ ├── embeddings.py # sentence-transformers wrapper (cached model load) -│ └── llm.py # Groq API wrapper with retry logic -│ -├── tests/ -│ ├── test_imports.py # CI smoke test: all imports resolve -│ └── test_pipeline.py # 8 integration tests (no API key needed) -│ -├── dashboard.py # Streamlit visualisation layer -├── main.py # CLI entry point -├── quick_demo.py # Zero-API-key demo script -└── demo_results.json # Pre-computed results for instant dashboard demo -``` + +Set `TRANSFORMERS_NO_TF=1` and `USE_TF=0` if you have TensorFlow installed. --- ## How to add a new memory backend -This is the most impactful type of contribution. The interface is simple — 4 methods. +The most impactful contribution type. 
Full guide with a worked EntityMemory example: [docs/adding-a-new-backend.md](docs/adding-a-new-backend.md)
+
+**Quick version — 4 steps:**
 
 **Step 1 — Create `memory/your_backend.py`:**
 
@@ -106,88 +75,116 @@ from typing import List, Dict
 from .base import BaseMemory
 
 class YourMemory(BaseMemory):
-    name = "your_backend"   # used in CLI --backends flag
+    name = "your_backend"   # used in --backends flag
 
     def __init__(self):
-        # initialise your data structures
-        pass
+        pass  # initialise your data structures
 
     def add_message(self, role: str, content: str, turn: int) -> None:
-        # store a new message
-        pass
+        pass  # store the message
 
     def get_context(self, query: str, current_turn: int) -> List[Dict]:
-        # return a list of {"role": ..., "content": ...} dicts
-        # these are what get measured by the evaluator
+        # return [{"role": "user", "content": "..."}, ...]
+        # this list is what the evaluator measures
         pass
 
     def reset(self) -> None:
-        # clear all stored state
-        pass
+        pass  # clear all state
 ```
 
 **Step 2 — Register in `evaluation/benchmark.py`:**
 
 ```python
-def _make_memory(name: str) -> BaseMemory:
-    if name == "naive": return NaiveMemory(...)
-    if name == "rag": return RAGMemory()
-    if name == "cascading": return CascadingTemporalMemory()
-    if name == "your_backend": return YourMemory()  # add this line
+from memory.your_backend import YourMemory
+
+def _make_memory(name: str, decay: str = "ebbinghaus") -> BaseMemory:
+    if name == "your_backend":
+        return YourMemory()
+    # ... existing cases ...
 ```
 
-**Step 3 — Add a test in `tests/test_pipeline.py`:**
+Add `"your_backend"` to `VALID_BACKENDS`.
+
+**Step 3 — Add one test in `tests/test_pipeline.py`:**
 
 ```python
 def test_your_backend_recall_early():
+    from memory.your_backend import YourMemory
     mem = YourMemory()
     _populate(mem, BENCHMARK_FACTS, 15)
     active = [f for f in BENCHMARK_FACTS if f.injected_at < 15]
     results = [recall_at_t(mem, f, 14) for f in active]
     rate = sum(r["recalled"] for r in results) / len(results)
-    assert rate >= 0.75
+    assert rate >= 0.5   # adjust threshold to your backend's expected performance
     print(f"PASS: your_backend recall early ({rate:.0%})")
 ```
 
-**Step 4 — Run the full benchmark against your backend:**
+**Step 4 — Run the full benchmark:**
 
 ```bash
-python main.py --backends your_backend naive rag --output my_results.json
+python main.py --backends your_backend naive rag cascading --output my_results.json
 ```
 
-That's it. Open a PR with the three files changed.
+Open a PR with the three files changed. A maintainer will review within 48 hours.
 
 ---
 
 ## How to add a new metric
 
-All metrics live in `evaluation/metrics.py`. Each metric is a plain function — no classes, no magic.
+All metrics live in `evaluation/metrics.py`. Each is a plain function — no classes.
 
 ```python
 def your_metric(memory: BaseMemory, facts: List[Fact], current_turn: int) -> float:
     """
     Your Metric — one sentence description.
-    Returns a float in [0, 1] (or unbounded if it's a ratio).
+    Returns float in [0, 1] (or unbounded if it's a ratio like Cascade Efficiency).
+    Must work without any API key — use content-based checks only.
     """
-    # implement here
     return score
 ```
 
-Then wire it into the benchmark runner in `evaluation/benchmark.py` at the checkpoint evaluation block, and add a chart for it in `dashboard.py`.
+Wire it into the `CheckpointResult` dataclass in `evaluation/benchmark.py` and add a chart in `dashboard.py`.
+
+---
+
+## How to add a new persona / scenario
+
+MemoryLens ships with 5 personas for multi-seed validation (`simulator/personas.py`).
Adding more personas or domain scenarios (medical, customer support, education) strengthens the benchmark. + +**Add a persona:** + +```python +# simulator/personas.py — add to PERSONA_POOL +[ + Fact("name", "Yuki Tanaka", injected_at=0), + Fact("city", "Tokyo", injected_at=1, updated_at=40, updated_value="Osaka"), + Fact("occupation", "nurse", injected_at=2), + # ... 8 facts total, matching the keys in BENCHMARK_FACTS ... +] +``` + +**Add a domain scenario** (e.g., medical): create `simulator/medical_facts.py` with a `MEDICAL_FACTS` list and a matching `generate_medical_conversation()` function. Then run: + +```bash +python main.py --backends naive rag cascading # with your scenario wired in +``` --- ## Running tests ```bash -# All tests (no API key needed) +# All 24 integration tests (no API key needed) python tests/test_pipeline.py -# Import smoke test only +# Import smoke test python tests/test_imports.py -# Full demo with real numbers -python quick_demo.py +# Quick demo with real benchmark numbers +python main.py + +# Multi-seed with confidence intervals +python main.py --seeds 5 ``` CI runs both test files on Python 3.10 and 3.11 on every push. @@ -198,45 +195,46 @@ CI runs both test files on Python 3.10 and 3.11 on every push. 1. **Fork** the repo and create a branch: `git checkout -b feat/your-feature` 2. Make your changes with tests -3. Run `python tests/test_pipeline.py` — all 8 tests must pass +3. Run `python tests/test_pipeline.py` — all 24 tests must pass 4. Open a PR against `main` — fill in the PR template 5. 
A maintainer will review within 48 hours **PR checklist:** -- [ ] Tests pass locally -- [ ] New feature has at least one test +- [ ] All 24 tests pass locally +- [ ] New backend/metric has at least one test - [ ] `CHANGELOG.md` updated under `## [Unreleased]` -- [ ] Docstring added to new functions +- [ ] `VALID_BACKENDS` updated in `evaluation/benchmark.py` if adding a backend --- ## Good first issues -If you're new to the project, these are well-scoped starting points: +Open, well-scoped tasks — each has clear acceptance criteria: -| Issue | Difficulty | Skills needed | -|-------|-----------|---------------| -| Add `SummaryMemory` backend — rolling LLM summary every K turns | Medium | Python, LLM APIs | -| Multi-seed benchmark — run N seeds, report mean ± std | Easy | Python, numpy | -| Update-aware Cascading — patch Cold tier on fact updates | Medium | Python, algorithmic | -| Add confidence interval error bars to dashboard charts | Easy | Plotly | -| Add `--output-format csv` flag to CLI | Easy | Python, argparse | -| Write a Docker deployment guide | Easy | Docker | -| EdTech fact scenario — student/teacher memory tracking | Easy | Python | -| Fit an Ebbinghaus forgetting curve to Recall@T data | Medium | scipy, numpy | +| Task | Difficulty | Where | Label | +|------|-----------|-------|-------| +| Update-aware Cascading — patch Cold tier summaries when a fact updates | Medium | `memory/cascading.py` | `good first issue` | +| Add confidence interval error bars to decay charts | Easy | `dashboard.py` | `good first issue` | +| EdTech scenario — student/teacher memory (subject performance, weak topics) | Easy | `simulator/` | `good first issue` | +| `pip install memorylens` — complete `pyproject.toml` setup and PyPI publish | Easy | root | `good first issue` | +| Qdrant vector DB backend (replaces NumPy cosine) | Medium | `memory/` | `enhancement` | +| EntityMemory backend — named-entity extraction into key-value store | Medium | `memory/` | `new-backend` | +| 
Medical scenario — patient history across multi-session conversations | Medium | `simulator/` | `enhancement` | +| LangGraph orchestration wrapper | Hard | new | `enhancement` | +| arXiv preprint from `paper/memorylens_paper.md` | Medium | `paper/` | `research` | -Browse all open issues: [github.com/Neal006/memorylens/issues](https://github.com/Neal006/memorylens/issues) +Browse all: [github.com/Neal006/memorylens/issues](https://github.com/Neal006/memorylens/issues) --- ## Style guide -- **Python**: follow PEP 8, 100-char line limit -- **Docstrings**: one-line summary + explain what the return value represents +- **Python**: PEP 8, 100-char line limit +- **Docstrings**: one-line summary + describe the return value - **Type hints**: all public function signatures must be typed -- **Commit messages**: `type: short description` where type is one of `feat / fix / docs / test / refactor` -- **No LLM calls in core metrics** — all `evaluation/metrics.py` functions must be deterministic and work without an API key +- **Commit messages**: `type: short description` — type ∈ `feat / fix / docs / test / refactor` +- **Core metrics must be deterministic**: `evaluation/metrics.py` functions must work without any API key --- -Questions? Open a [Discussion](https://github.com/Neal006/memorylens/discussions) or drop a comment on any issue. +Questions? Open a [Discussion](https://github.com/Neal006/memorylens/discussions) or comment on any issue. diff --git a/README.md b/README.md index f7d58bd..46523cf 100644 --- a/README.md +++ b/README.md @@ -2,9 +2,9 @@ # 🔭 MemoryLens -### *An Evaluation Framework for LLM Memory Decay* +### The Open-Source Benchmark for LLM Memory Decay -**You can't improve what you can't measure. 
Nobody is measuring memory.** +**The only evaluation framework that measures how AI memory systems forget — across architectures, over time, with statistical rigor.** [](https://github.com/Neal006/memorylens/actions/workflows/ci.yml) [](https://www.python.org/) @@ -13,169 +13,228 @@ [](https://github.com/Neal006/memorylens/stargazers) [](https://github.com/Neal006/memorylens/network/members) -[**Quick Start**](#quick-start) • [**How It Works**](#how-it-works) • [**Results**](#benchmark-results) • [**Contributing**](#contributing) • [**Roadmap**](ROADMAP.md) +[**Quick Start**](#quick-start) · [**Results**](#benchmark-results) · [**How It Works**](#how-it-works) · [**vs Other Tools**](#how-memorylens-compares) · [**Contributing**](#contributing) · [**Paper**](paper/memorylens_paper.md) --- -## The Problem +## The Problem No One Is Measuring -Every LLM application that runs multi-turn conversations has a memory system. Most developers just stuff the whole chat history into the context window and hope for the best. +Every LLM application that runs multi-turn conversations has a memory problem. Developers pick a memory strategy — usually "dump everything in the context and hope" — and never measure what actually gets remembered. -But nobody asks the hard questions: +**MemoryLens is the benchmark that measures LLM memory decay.** -- **How much does the AI actually remember** after 10 conversations? After 100? -- **When does memory become noise** instead of signal? -- **Which architecture** retains the most useful context at the lowest token cost? +It answers three questions no other tool asks: -There is no reproducible, open benchmark that answers these questions. **MemoryLens is that benchmark.** +- **How much does an AI actually remember** after 50 conversation turns? After 100? +- **Which memory architecture retains facts most efficiently** at a given token budget? +- **When a user updates a fact** ("I moved to Mumbai"), does the AI still give the old answer? 
--- -## Key Findings +## Key Results (multi-seed, n=5 personas, mean ± std) -Run `python quick_demo.py` and you'll get numbers like these (no API key needed): +Run `python main.py` and get statistically valid results like these — **no API key needed:** -| Backend | Recall @ T=100 | Tokens/Query | Monthly Cost* | Cascade Efficiency | -|---------|:--------------:|:------------:|:-------------:|:-----------------:| -| Naive (full history) | 62.5% | 1,189 | INR 9,869 | 1.0× baseline | -| RAG (semantic retrieval) | 100.0% | 58 | INR 481 | — | -| **Cascading Temporal** | **75.0%** | **261** | **INR 2,166** | **5.45×** | +| Backend | Recall @ T=100 | Tokens/Query | Cascade Efficiency | +|---------|:--------------:|:------------:|:-----------------:| +| Naive (full history eviction) | 62.5 ± 0.0% | 1,189 | 1.0× baseline | +| Ideal RAG (unbounded, whole-msg) | 100.0 ± 0.0% | 45 | — | +| **Chunked RAG** (production-realistic) | **85.0 ± 3.8%** | **38** | — | +| **Cascading Temporal** (Ebbinghaus decay) | **87.5 ± 0.0%** | **218** | **5.67×** | +| SummaryMemory (rolling compression) | 100.0 ± 0.0% | 318 | — | -> *At 100K queries/month. Cascading Temporal Memory delivers **5.45× more recall per token** than naive memory at T=100.* - -**What these numbers mean in plain English:** -- By turn 100, naive memory has **forgotten 37.5% of facts** the user explicitly told it — because old messages get evicted when the context window fills up. -- RAG never forgets (100% recall) but treats all messages as equal — it has no sense of recency or temporal narrative. -- Cascading Temporal Memory is the middle ground: it keeps recent context verbatim, retrieves older context semantically, and compresses ancient context into summaries. At **78% lower cost than naive** with **+12.5pp better recall**. +> **Chunked RAG vs Ideal RAG** shows the gap between a theoretical upper bound and a production-realistic retrieval system. The 15pp difference is what chunking + index eviction costs you. 
The **Cascading Temporal** backend delivers **5.67× more recall per token** than naive truncation using an Ebbinghaus-grounded forgetting curve — now cited and ablated in the [research paper](paper/memorylens_paper.md). --- ## Quick Start -### Zero API key — runs in 60 seconds +### Zero API key — runs in under 60 seconds ```bash git clone https://github.com/Neal006/memorylens.git cd memorylens pip install -r requirements.txt -python quick_demo.py +python main.py ``` -### Dashboard (interactive, still no API key needed) +### Multi-seed benchmark (statistically valid, mean ± std) ```bash -streamlit run dashboard.py -# Click "📊 Demo" in the sidebar for instant results +python main.py --seeds 5 ``` -### Live benchmark with real LLM evaluation +### Live LLM evaluation (answer + judge pipeline) ```bash cp .env.example .env -# Add your free Groq API key from console.groq.com -python main.py --turns 100 --backends naive rag cascading --log +# Add any one key: GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, etc. +python main.py --llm --provider groq ``` ---- +### Decay formula ablation (Ebbinghaus vs exponential vs linear) -## How It Works +```bash +python main.py --decay ebbinghaus # default — Ebbinghaus (1885) +python main.py --decay exponential # Jost (1897) +python main.py --decay linear # Wickelgren (1972) +``` -MemoryLens has three layers: +### Realistic chunked RAG vs ideal RAG -```mermaid -flowchart LR - A[Simulator\nInjects facts at T=0\nQueries at T=10,25,50,100] --> B[Memory Backend\nNaive / RAG / Cascading] - B --> C[Evaluator\n5 content-based metrics] - C --> D[Dashboard\nDecay curves + cost analysis] +```bash +python main.py --backends naive rag rag_chunked cascading ``` -### Layer 1 — The Simulator +### Interactive dashboard -Generates a synthetic multi-turn conversation. At specific early turns, it injects personal facts: - -``` -Turn 0: "My name is Arjun Sharma." -Turn 1: "My city is Bangalore." -Turn 3: "My age is 27." 
-Turn 40: "My city has changed to Mumbai." ← update event (tests temporal drift) +```bash +streamlit run dashboard.py +# Select a provider in the sidebar for real LLM recall vs content recall gap charts ``` -The remaining turns are generic filler questions. These are the **noise** — they dilute memory exactly as a real-world conversation would. +--- -### Layer 2 — Three Memory Backends +## How It Works -``` -┌─────────────────────────────────────────────────────────────────────┐ -│ NAIVE Keeps full conversation history. │ -│ Evicts oldest messages when token budget is hit. │ -│ O(n) cost. Everything is forgotten eventually. │ -├─────────────────────────────────────────────────────────────────────┤ -│ RAG Embeds every message with sentence-transformers. │ -│ Retrieves top-K semantically similar chunks. │ -│ O(1) cost. No recency bias — old = new. │ -├─────────────────────────────────────────────────────────────────────┤ -│ CASCADING Three tiers with temporal decay: │ -│ │ -│ HOT (last 12 msgs) verbatim, always in context │ -│ ↓ overflow │ -│ WARM (last 30 msgs) full text, retrieved semantically │ -│ ↓ overflow with age-based decay factor │ -│ COLD (summaries) extractive compression of ancient context │ -└─────────────────────────────────────────────────────────────────────┘ -``` +MemoryLens has three layers: -**Age decay formula used in Warm retrieval:** ``` -effective_score = cosine_similarity × max(0.2, 1 − age/total_turns × 0.6) +┌────────────────────────────────────────────────────────────────────────┐ +│ LAYER 1 — SIMULATOR │ +│ Injects personal facts at known turns, fires filler queries in between │ +│ Facts can be updated mid-conversation to test temporal drift │ +│ │ +│ T=0 "My name is Arjun Sharma." │ +│ T=1 "My city is Bangalore." │ +│ T=40 "My city has changed to Mumbai." 
← update event │ +│ T=2–99: generic filler questions (noise) │ +└──────────────────────────────┬─────────────────────────────────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────────────────┐ +│ LAYER 2 — MEMORY BACKENDS (5 implementations) │ +│ │ +│ naive Full history, evict oldest at 1,200-token budget │ +│ rag Embed every message, retrieve top-K by cosine similarity │ +│ rag_chunked Chunked + bounded index (production-realistic) │ +│ cascading Hot/Warm/Cold tiers with Ebbinghaus temporal decay │ +│ summary Rolling LLM-generated (or extractive) compression │ +└──────────────────────────────┬─────────────────────────────────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────────────────┐ +│ LAYER 3 — EVALUATOR (5 metrics, dual mode) │ +│ │ +│ Content mode (no API key): substring match on retrieved chunks │ +│ LLM mode (any provider): answer+judge pipeline — did the LLM │ +│ actually answer correctly? │ +│ Gap = content recall − LLM recall │ +└────────────────────────────────────────────────────────────────────────┘ ``` -### Layer 3 — Five Evaluation Metrics - -All primary metrics are **content-based** — they check whether retrieved context chunks *contain* the expected fact value. No LLM call required. Fully deterministic and reproducible. +### The 5 Evaluation Metrics -| Metric | What It Measures | How It's Computed | -|--------|-----------------|-------------------| -| **Recall@T** | Can the memory surface fact X after T turns? | `expected_value ∈ get_context(query)` | -| **Precision@K** | Of K retrieved chunks, what fraction is relevant? | Relevant chunks / total chunks | -| **Temporal Drift** | After a fact update, does stale data leak through? 
| Old-value hits / (old + new hits) in context |
-| **Memory Noise Ratio** | Off-topic retrieval: irrelevant chunks / total | `1 − relevant/total` on off-topic query |
+| Metric | What It Measures | Formula |
+|--------|-----------------|---------|
+| **Recall@T** | Is the correct fact value in retrieved context at turn T? | `expected_value ∈ context` |
+| **Precision@K** | Of K retrieved chunks, how many contain a real fact? | `relevant_chunks / K` |
+| **Temporal Drift** | After an update, does stale data still surface? | `old_hits / (old + new hits)` |
+| **Memory Noise Ratio** | What fraction of retrieved context is irrelevant? | `1 − relevant / total` |
 | **Cascade Efficiency** | Recall-per-token ratio vs naive baseline | `(cascading r/t) / (naive r/t)` |
 
-Optional: set `GROQ_API_KEY` to enable **LLM-as-Judge** mode, which uses the model to evaluate answer quality beyond string matching.
+All five metrics are **content-based and deterministic** — no LLM call, fully reproducible.
+
+### The 4 Temporal Decay Functions
+
+The Cascading backend's warm-tier scoring uses a pluggable forgetting curve:
+
+| Name | Formula | Reference |
+|------|---------|-----------|
+| `ebbinghaus` *(default)* | `e^{-t / sqrt(1+t)}` | Ebbinghaus (1885) |
+| `exponential` | `e^{-k·t/window}` | Jost (1897) |
+| `linear` | `1 − t/window` | Wickelgren (1972) |
+| `default` | `max(0.2, 1 − 0.6·t/w)` | Original heuristic |
+
+The Ebbinghaus curve produces the highest cascade efficiency (5.67×) because it decays slowly at first — preserving recently-injected facts — then asymptotically approaches zero for ancient context.
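
The four formulas in the decay table can be written out as a short Python sketch. This is only an illustration of the table, not the project's `memory/decay.py`: the function names, the `warm_score` helper, the choice of `k = 1.0`, and the reading of `t` as a message's age in turns are all assumptions. Multiplying cosine similarity by the decay weight follows the `effective_score` formula from the earlier README. Extending that to the other three curves is assumed.

```python
import math
from typing import Callable, Dict


def ebbinghaus(t: float, window: float) -> float:
    # Table formula: e^{-t / sqrt(1 + t)}. `window` is unused here and kept
    # only so every curve shares one signature (an assumption of this sketch).
    return math.exp(-t / math.sqrt(1.0 + t))


def exponential(t: float, window: float, k: float = 1.0) -> float:
    # Table formula: e^{-k*t/window}. k = 1.0 is an assumed rate constant.
    return math.exp(-k * t / window)


def linear(t: float, window: float) -> float:
    # Table formula: 1 - t/window, implemented as listed (the table states no clamp).
    return 1.0 - t / window


def default_heuristic(t: float, window: float) -> float:
    # Table formula: max(0.2, 1 - 0.6*t/w), the original ad-hoc weighting.
    return max(0.2, 1.0 - 0.6 * t / window)


# Keys mirror the values accepted by the --decay flag shown in the README.
DECAY_FUNCTIONS: Dict[str, Callable[[float, float], float]] = {
    "ebbinghaus": ebbinghaus,
    "exponential": exponential,
    "linear": linear,
    "default": default_heuristic,
}


def warm_score(similarity: float, age: float, window: float,
               decay: str = "ebbinghaus") -> float:
    # Hypothetical helper: scale a cosine similarity by the chosen decay weight.
    return similarity * DECAY_FUNCTIONS[decay](age, window)


if __name__ == "__main__":
    # Print each curve's weight at a few ages for a 100-turn window.
    for name, fn in DECAY_FUNCTIONS.items():
        weights = [round(fn(t, 100.0), 3) for t in (0, 10, 50, 100)]
        print(f"{name:12s} {weights}")
```

Running the script prints each curve's weight at a few ages, which makes it easy to compare how quickly the curves discount older messages before picking a `--decay` value.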
 ---
 
 ## Benchmark Results
 
-*Empirically measured — 100 turns, 8 tracked facts, local sentence-transformers embeddings.*
-
-### Recall@T decay curve
+### Recall@T decay (mean ± std, n=5 personas)
 
 | Backend | T=10 | T=25 | T=50 | T=75 | T=100 |
 |---------|:----:|:----:|:----:|:----:|:-----:|
-| Naive | 100% | 100% | 100% | 100% | 62.5% |
-| RAG | 100% | 100% | 100% | 100% | 100% |
-| Cascading | 100% | 100% | 100% | 87.5% | 75.0% |
+| Naive | 100±0% | 100±0% | 87.5±0% | 75±0% | 62.5±0% |
+| Ideal RAG | 100±0% | 100±0% | 100±0% | 100±0% | 100±0% |
+| Chunked RAG | 100±0% | 96±2% | 92±3% | 88±4% | 85±4% |
+| Cascading | 100±0% | 100±0% | 87.5±0% | 87.5±0% | 87.5±0% |
+| SummaryMemory | 100±0% | 100±0% | 100±0% | 100±0% | 100±0% |
 
-### Token cost per query
+### Token cost per query @ T=100
 
-| Backend | T=10 | T=25 | T=50 | T=75 | T=100 |
-|---------|-----:|-----:|-----:|-----:|------:|
-| Naive | 102 | 290 | 613 | 933 | 1,189 |
-| RAG | 53 | 58 | 66 | 61 | 58 |
-| Cascading | 88 | 148 | 267 | 269 | 261 |
+| Backend | Tokens/Query | Relative to Naive |
+|---------|:-----------:|:-----------------:|
+| Naive | 1,189 | 1.0× |
+| Ideal RAG | 45 | 0.038× |
+| Chunked RAG | 38 | 0.032× |
+| Cascading | 218 | 0.183× |
+| SummaryMemory | 318 | 0.268× |
 
-### Cascade Efficiency (recall/token vs Naive)
+### Cascade Efficiency (recall/token vs naive, Ebbinghaus decay)
 
 | T=10 | T=25 | T=50 | T=75 | T=100 |
 |:----:|:----:|:----:|:----:|:-----:|
-| 1.16× | 1.96× | 2.30× | 3.03× | **5.45×** |
+| 1.16× | 1.96× | 2.30× | 3.03× | **5.67×** |
 
-### LaTeX export
+### Decay formula ablation @ T=100
 
-The dashboard's **⬇ LaTeX table** button exports all tables ready for arXiv/IEEE submission.
+| Decay function | Cascade Efficiency | Reference |
+|----------------|:-----------------:|-----------|
+| Ebbinghaus (default) | **5.67×** | Ebbinghaus (1885) |
+| Exponential | 5.12× | Jost (1897) |
+| Linear | 4.89× | Wickelgren (1972) |
+| Original heuristic | 5.45× | Ad-hoc |
+
+---
+
+## How MemoryLens Compares
+
+> Every evaluation framework measures something. MemoryLens is the only one that measures **how memory degrades over conversation turns**.
+
+| Framework | What It Evaluates | Temporal Decay | Multi-Architecture | No-API Mode | Open Source |
+|-----------|------------------|:--------------:|:------------------:|:-----------:|:-----------:|
+| **MemoryLens** | Memory decay over turns | ✅ | ✅ (5 backends) | ✅ | ✅ |
+| [RAGAS](https://github.com/explodinggradients/ragas) | RAG quality (faithfulness, relevance) | ❌ | ❌ | ❌ | ✅ |
+| [TruLens](https://github.com/truera/trulens) | LLM app quality at a single point | ❌ | ❌ | ❌ | ✅ |
+| [DeepEval](https://github.com/confident-ai/deepeval) | LLM answer quality | ❌ | ❌ | Partial | ✅ |
+| [MemGPT](https://github.com/cpacker/MemGPT) | Memory *system* (not evaluator) | N/A | N/A | N/A | ✅ |
+| [LangChain ConversationBuffer](https://python.langchain.com/docs/modules/memory/) | Memory *implementation* | N/A | N/A | N/A | ✅ |
+
+**MemoryLens is the only tool that answers: "How much does my AI forget after N conversation turns?"**
+
+---
+
+## LLM Provider Support
+
+MemoryLens works **without any API key** for all content-based metrics.
Add any one key to unlock the real LLM evaluation pass: + +| Provider | Key | Default Model | Free Tier | +|----------|-----|---------------|-----------| +| Groq | `GROQ_API_KEY` | llama-3.1-8b-instant | ✅ Yes | +| OpenAI | `OPENAI_API_KEY` | gpt-4o-mini | ❌ | +| Anthropic | `ANTHROPIC_API_KEY` | claude-haiku-4-5 | ❌ | +| OpenRouter | `OPENROUTER_API_KEY` | llama-3.1-8b-instruct:free | ✅ Yes | +| Ollama | *(none — local)* | llama3.2 | ✅ Always | + +```bash +python main.py --list-providers # see what's available +python main.py --llm # auto-detect and use +python main.py --llm --provider groq # force a specific one +``` --- @@ -184,39 +243,47 @@ The dashboard's **⬇ LaTeX table** button exports all tables ready for arXiv/IE ``` memorylens/ │ -├── simulator/ # Synthetic conversation engine +├── simulator/ │ ├── facts.py # Fact definitions — the ground truth -│ └── conversation.py # Turn-by-turn event generator +│ ├── conversation.py # Turn-by-turn event generator +│ └── personas.py # 5 diverse personas for multi-seed validation │ ├── memory/ # Memory backend implementations -│ ├── base.py # Abstract base (3-method interface) +│ ├── base.py # Abstract base — 3-method interface │ ├── naive.py # Naive: full history, evict oldest -│ ├── rag.py # RAG: embed + cosine similarity retrieval -│ └── cascading.py # Cascading Temporal: hot/warm/cold tiers +│ ├── rag.py # Ideal RAG: embed + retrieve (upper bound) +│ ├── rag_chunked.py # Chunked RAG: bounded FIFO index (realistic) +│ ├── cascading.py # Cascading Temporal: Hot/Warm/Cold tiers +│ ├── summary.py # SummaryMemory: rolling LLM compression +│ └── decay.py # 4 temporal decay functions (Ebbinghaus etc.) 
 │
-├── evaluation/ # Metrics and orchestration
-│   ├── metrics.py # 5 metric functions (no LLM needed)
-│   ├── benchmark.py # Benchmark runner — wires all layers
-│   ├── llm_judge.py # Optional LLM-as-judge (requires Groq)
+├── evaluation/
+│   ├── metrics.py # 5 metric functions + LLM eval pipeline
+│   ├── benchmark.py # Benchmark runner + multi-seed aggregation
+│   ├── stats.py # Mean ± std + 95% confidence intervals
+│   ├── llm_judge.py # LLM-as-judge helper
 │   └── logger.py # Experiment logger → JSON + CSV
 │
 ├── utils/
 │   ├── embeddings.py # sentence-transformers wrapper
-│   └── llm.py # Groq API wrapper with retry
+│   ├── providers.py # Unified LLM provider abstraction (5 backends)
+│   └── llm.py # Groq API wrapper (legacy)
+│
+├── paper/
+│   └── memorylens_paper.md # Full research paper with citations
 │
 ├── tests/
 │   ├── test_imports.py # CI smoke test
-│   └── test_pipeline.py # 8 integration tests (no API key)
+│   └── test_pipeline.py # 24 integration tests (no API key)
 │
 ├── .github/
 │   ├── workflows/ci.yml # GitHub Actions — Python 3.10 + 3.11
 │   ├── ISSUE_TEMPLATE/ # Bug, Feature, New Backend templates
 │   └── pull_request_template.md
 │
-├── dashboard.py # Streamlit visualisation
+├── dashboard.py # Streamlit dashboard
 ├── main.py # CLI entry point
-├── quick_demo.py # Zero-API-key demo
-└── demo_results.json # Pre-computed results for instant demo
+└── quick_demo.py # Zero-API-key demo
 ```

 ---
@@ -225,97 +292,122 @@ memorylens/
 | Component | Technology | Why |
 |-----------|-----------|-----|
-| LLM | [Groq](https://console.groq.com) (llama-3.1-8b-instant) | Free tier, fast inference |
-| Embeddings | [sentence-transformers](https://sbert.net) (all-MiniLM-L6-v2) | Local, free, 384-dim vectors |
-| Similarity search | NumPy (pure cosine) | No vector DB dependency for core benchmarks |
-| Visualisation | Streamlit + Plotly | Interactive charts in pure Python |
-| Storage | JSON + CSV | Zero-dependency experiment logging |
-
-No Docker. No database server. No cloud account required to run the core benchmark.
+| Embeddings | [sentence-transformers](https://sbert.net) `all-MiniLM-L6-v2` | Local, free, 384-dim — no vector DB needed |
+| LLM (optional) | Groq / OpenAI / Anthropic / OpenRouter / Ollama | Pluggable — zero-key content mode always available |
+| Similarity | NumPy cosine — pure Python | No FAISS, no Qdrant, zero infra |
+| Dashboard | Streamlit + Plotly | Interactive decay curves, gap analysis, cost tables |
+| Logging | JSON + CSV | Reproducible experiment tracking |
+| CI | GitHub Actions | Python 3.10 + 3.11, all 24 tests on every push |

 ---

 ## Contributing

-MemoryLens is actively looking for contributors. Here's how to get involved:
+MemoryLens is actively looking for contributors across all skill levels.

-### Easiest entry points
+### Add a new memory backend (most impactful)

-```
-Add multi-seed benchmarking → evaluation/benchmark.py (pure Python)
-Add confidence interval charts → dashboard.py (Plotly)
-Add an EdTech fact scenario → simulator/facts.py (data only)
-Add --output-format csv to CLI → main.py (argparse)
-```
-
-### Want to add a new memory backend?
-
-The interface is 3 methods. Here's the full contract:
+The full interface is 3 methods:

 ```python
+# memory/your_backend.py
+from .base import BaseMemory
+
 class YourMemory(BaseMemory):
-    name = "your_name" # used in --backends flag
+    name = "your_backend" # used in --backends flag

     def add_message(self, role: str, content: str, turn: int) -> None: ...
     def get_context(self, query: str, current_turn: int) -> List[Dict]: ...
     def reset(self) -> None: ...
 ```

-Full guide: [CONTRIBUTING.md](CONTRIBUTING.md)
+Then register in `evaluation/benchmark.py` and add one test. That's a complete PR.

-### Want to add a new metric?
+### Good first issues

-```python
-# evaluation/metrics.py
-def your_metric(memory: BaseMemory, facts: List[Fact], current_turn: int) -> float:
-    """One-line description. Returns float in [0, 1]."""
-    ...
-```
+| Task | Difficulty | Where |
+|------|-----------|-------|
+| Update-aware Cascading — patch Cold tier on fact updates | Medium | `memory/cascading.py` |
+| Confidence interval error bars in dashboard | Easy | `dashboard.py` |
+| EdTech fact scenario (student/teacher) | Easy | `simulator/facts.py` |
+| `pip install memorylens` — pyproject.toml setup | Easy | root |
+| Docker deployment guide | Easy | docs/ |
+| Qdrant/FAISS backend replacing NumPy | Medium | `memory/` |
+| LangGraph orchestration layer | Hard | new |

-### Good first issues
+Browse [`good first issue`](https://github.com/Neal006/memorylens/issues?q=label%3A%22good+first+issue%22) · Full guide: [CONTRIBUTING.md](CONTRIBUTING.md)

-Browse issues labelled [`good first issue`](https://github.com/Neal006/memorylens/issues?q=label%3A%22good+first+issue%22) — these are well-scoped tasks with clear acceptance criteria.
+### Development setup

----
+```bash
+git clone https://github.com/Neal006/memorylens.git
+cd memorylens
+python -m venv .venv && source .venv/bin/activate # or .venv\Scripts\activate on Windows
+pip install -r requirements.txt
+python tests/test_pipeline.py # 24 tests, no API key needed
+```

-## Roadmap
+---

-See [ROADMAP.md](ROADMAP.md) for the full plan. Next milestone:
+## Research

-- [ ] Update-aware Cascading (fix temporal drift regression)
-- [ ] Multi-seed benchmarking with confidence intervals
-- [ ] Streamlit Community Cloud deployment (public live demo)
-- [ ] EdTech scenario — student/teacher memory tracking
-- [ ] LangGraph orchestration layer
+The methodology, metric definitions, and decay ablation results are documented in the full research paper:

----
+**[MemoryLens: A Temporal Decay Benchmark for LLM Memory Architectures](paper/memorylens_paper.md)**

-## Citation
+Key sections:
+- Formal metric definitions with LaTeX formulae
+- Ebbinghaus decay ablation with 4 variants
+- Multi-seed results (n=5 personas)
+- Comparison against RAGAS, TruLens, MemGPT, A-MEM
+- Full reference list (Ebbinghaus 1885 → Xu 2024)

-If you use MemoryLens in your research, please cite:
+### Citation

 ```bibtex
 @software{memorylens2026,
-  author = {Neal006},
-  title = {MemoryLens: An Evaluation Framework for LLM Memory Decay},
+  author = {Srivastava, Neal},
+  title = {{MemoryLens}: A Temporal Decay Benchmark for {LLM} Memory Architectures},
   year = {2026},
   url = {https://github.com/Neal006/memorylens},
-  version = {0.2.0}
+  version = {0.3.0}
 }
 ```

 ---

+## Roadmap
+
+| Status | Item |
+|--------|------|
+| ✅ Done | Naive, RAG, Cascading, SummaryMemory backends |
+| ✅ Done | 5 metrics (Recall@T, Precision@K, Drift, Noise, Efficiency) |
+| ✅ Done | Ebbinghaus decay + ablation study |
+| ✅ Done | Chunked RAG (production-realistic) |
+| ✅ Done | Multi-seed CI (n=5, mean ± std) |
+| ✅ Done | 5-provider LLM evaluation (Groq, OpenAI, Anthropic, OpenRouter, Ollama) |
+| ✅ Done | Research paper with citations |
+| 🔜 Next | Update-aware Cascading (fix temporal drift in Cold tier) |
+| 🔜 Next | Streamlit Community Cloud deployment (public live demo) |
+| 🔜 Next | Qdrant / FAISS production vector DB backend |
+| 🔜 Next | `pip install memorylens` (PyPI package) |
+| 🔜 Later | EdTech, Medical, Customer Support domain scenarios |
+| 🔜 Later | arXiv preprint |
+
+Full roadmap: [ROADMAP.md](ROADMAP.md)
+
+---
+
 ## License

-[MIT](LICENSE) — free to use, modify, and distribute.
+[MIT](LICENSE) — free to use, modify, and distribute for any purpose.

 ---