feat: multi-provider real LLM evaluation pipeline by Neal006 · Pull Request #9 · Neal006/memorylens

Neal006 · 2026-05-22T04:05:19Z

Closes #8

Summary

Replaces the implicit assumption that "retrieved text = correct answer" with a genuine end-to-end LLM evaluation pass. Content-based metrics remain as fast, zero-cost proxies; the LLM pass is additive and activates automatically when any provider key is present.

utils/providers.py — unified LLMProvider ABC with five concrete backends (Groq, OpenAI, Anthropic, OpenRouter, Ollama), auto-detection priority chain, and _clean_messages() normalisation helper
evaluation/metrics.py — llm_recall_at_t() (answer+judge pipeline) and llm_temporal_drift() (old vs new value after update)
evaluation/benchmark.py — CheckpointResult gains llm_recall, llm_drift, has_llm_eval; run_benchmark() accepts optional provider
main.py — --llm, --provider, --list-providers flags; outputs three tables: Content Recall, LLM Recall, Gap
dashboard.py — provider selector auto-detects available keys; tabbed recall chart (Content / LLM / Gap); KPI cards show LLM recall delta
.env.example — all five provider keys documented inline

How the LLM Eval Works

For each fact at each checkpoint:

  1. ANSWER  memory.get_context(fact.query) → [context messages]
             provider.chat([system] + context + [question])
             → answer string

  2. JUDGE   provider.chat([judge_system, "Q:... Expected:... Got:..."])
             → "correct" | "wrong"

  llm_recalled = (verdict == "correct")

The gap between content recall and LLM recall is the core finding — a large positive gap means the backend retrieves the right text but the LLM still fails to extract the correct answer.

Test Plan

python tests/test_imports.py — all imports OK, provider registry validated
python tests/test_pipeline.py — all 14 tests pass (no API key needed)
python main.py --list-providers — shows provider status cleanly
python main.py — content-only mode unchanged, no regression
python main.py --llm --provider groq — requires GROQ_API_KEY

🤖 Generated with Claude Code

Add a genuine end-to-end LLM answer+judge evaluation pass on top of the existing content-based metrics. Content metrics remain as fast zero-cost proxies; LLM metrics activate when any provider key is present. Key additions: - utils/providers.py: unified LLMProvider abstraction for 5 backends (Groq, OpenAI, Anthropic, OpenRouter, Ollama) with auto-detection - evaluation/metrics.py: llm_recall_at_t() and llm_temporal_drift() using a two-stage answer->judge pipeline - evaluation/benchmark.py: optional provider param, CheckpointResult gains llm_recall/llm_drift/has_llm_eval fields - main.py: --llm, --provider, --list-providers flags; three output tables (content recall, LLM recall, gap) - dashboard.py: provider selector, tabbed Content/LLM/Gap recall charts, LLM recall KPIs with gap delta - .env.example: all five provider keys documented Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds an optional, end-to-end LLM evaluation pass (answer + judge) to complement existing content-based retrieval metrics, with support for multiple LLM providers and UI/CLI surfacing of Content vs LLM recall and the resulting “gap”.

Changes:

Introduces a unified LLMProvider abstraction with Groq/OpenAI/Anthropic/OpenRouter/Ollama backends and auto-detection.
Adds LLM-based evaluation metrics (llm_recall_at_t, llm_temporal_drift) and threads optional provider support through the benchmark runner and CLI output.
Updates the Streamlit dashboard to optionally run LLM eval and visualize Content / LLM / Gap recall.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
utils/providers.py	Adds multi-provider LLM abstraction, registry, auto-detection, and message normalization.
evaluation/metrics.py	Implements the two-stage LLM recall and LLM temporal drift metrics.
evaluation/benchmark.py	Threads optional provider through benchmark execution and JSON display output.
main.py	Adds `--llm`, `--provider`, `--list-providers` and prints Content/LLM/Gap tables.
dashboard.py	Adds provider detection/selection and tabbed recall charts for Content/LLM/Gap.
tests/test_imports.py	Extends import smoke test to include provider registry and LLM metric imports.
CHANGELOG.md	Documents the new multi-provider LLM evaluation feature.
.env.example	Documents environment variables for all supported providers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        temperature=0.0,
+    ).lower().strip()
+
+    verdict = "correct" if "correct" in verdict_raw else "wrong"


+    answer = provider.chat(messages, max_tokens=30, temperature=0.0).lower()
+


+    @classmethod
+    def is_available(cls) -> bool:
+        return bool(os.getenv("OPENAI_API_KEY"))
+
+    def _get_client(self):
+        if self._client is None:
+            from openai import OpenAI
+            self._client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        return self._client


+    @classmethod
+    def is_available(cls) -> bool:
+        return bool(os.getenv("ANTHROPIC_API_KEY"))
+
+    def _get_client(self):
+        if self._client is None:
+            import anthropic
+            self._client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
+        return self._client


+    @classmethod
+    def is_available(cls) -> bool:
+        return bool(os.getenv("OPENROUTER_API_KEY"))
+
+    def _get_client(self):
+        if self._client is None:
+            from openai import OpenAI
+            self._client = OpenAI(
+                api_key=os.getenv("OPENROUTER_API_KEY"),
+                base_url=self.BASE_URL,
+            )
+        return self._client


+            # Ollama: try a quick ping
+            import urllib.request
+            try:
+                urllib.request.urlopen("http://localhost:11434/api/tags", timeout=1)


+def llm_recall_at_t(
+    memory: BaseMemory,
+    fact: Fact,
+    current_turn: int,
+    provider: "LLMProvider",
+) -> Dict:
+    """


Copilot AI review requested due to automatic review settings May 22, 2026 04:05

Copilot started reviewing on behalf of Neal006 May 22, 2026 04:05 View session

Neal006 merged commit b06876b into main May 22, 2026
4 checks passed

Neal006 deleted the feat/multi-provider-llm-eval branch May 22, 2026 04:07

Copilot AI reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multi-provider real LLM evaluation pipeline#9

feat: multi-provider real LLM evaluation pipeline#9
Neal006 merged 1 commit into
mainfrom
feat/multi-provider-llm-eval

Neal006 commented May 22, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		answer = provider.chat(messages, max_tokens=30, temperature=0.0).lower()

Conversation

Neal006 commented May 22, 2026

Summary

How the LLM Eval Works

Test Plan

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants