feat: multi-provider real LLM evaluation pipeline#9
Merged
Conversation
Add a genuine end-to-end LLM answer+judge evaluation pass on top of the existing content-based metrics. Content metrics remain as fast zero-cost proxies; LLM metrics activate when any provider key is present. Key additions: - utils/providers.py: unified LLMProvider abstraction for 5 backends (Groq, OpenAI, Anthropic, OpenRouter, Ollama) with auto-detection - evaluation/metrics.py: llm_recall_at_t() and llm_temporal_drift() using a two-stage answer->judge pipeline - evaluation/benchmark.py: optional provider param, CheckpointResult gains llm_recall/llm_drift/has_llm_eval fields - main.py: --llm, --provider, --list-providers flags; three output tables (content recall, LLM recall, gap) - dashboard.py: provider selector, tabbed Content/LLM/Gap recall charts, LLM recall KPIs with gap delta - .env.example: all five provider keys documented Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds an optional, end-to-end LLM evaluation pass (answer + judge) to complement existing content-based retrieval metrics, with support for multiple LLM providers and UI/CLI surfacing of Content vs LLM recall and the resulting “gap”.
Changes:
- Introduces a unified
LLMProviderabstraction with Groq/OpenAI/Anthropic/OpenRouter/Ollama backends and auto-detection. - Adds LLM-based evaluation metrics (
llm_recall_at_t,llm_temporal_drift) and threads optionalprovidersupport through the benchmark runner and CLI output. - Updates the Streamlit dashboard to optionally run LLM eval and visualize Content / LLM / Gap recall.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| utils/providers.py | Adds multi-provider LLM abstraction, registry, auto-detection, and message normalization. |
| evaluation/metrics.py | Implements the two-stage LLM recall and LLM temporal drift metrics. |
| evaluation/benchmark.py | Threads optional provider through benchmark execution and JSON display output. |
| main.py | Adds --llm, --provider, --list-providers and prints Content/LLM/Gap tables. |
| dashboard.py | Adds provider detection/selection and tabbed recall charts for Content/LLM/Gap. |
| tests/test_imports.py | Extends import smoke test to include provider registry and LLM metric imports. |
| CHANGELOG.md | Documents the new multi-provider LLM evaluation feature. |
| .env.example | Documents environment variables for all supported providers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| temperature=0.0, | ||
| ).lower().strip() | ||
|
|
||
| verdict = "correct" if "correct" in verdict_raw else "wrong" |
Comment on lines
+248
to
+249
| answer = provider.chat(messages, max_tokens=30, temperature=0.0).lower() | ||
|
|
Comment on lines
+140
to
+148
| @classmethod | ||
| def is_available(cls) -> bool: | ||
| return bool(os.getenv("OPENAI_API_KEY")) | ||
|
|
||
| def _get_client(self): | ||
| if self._client is None: | ||
| from openai import OpenAI | ||
| self._client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) | ||
| return self._client |
Comment on lines
+187
to
+195
| @classmethod | ||
| def is_available(cls) -> bool: | ||
| return bool(os.getenv("ANTHROPIC_API_KEY")) | ||
|
|
||
| def _get_client(self): | ||
| if self._client is None: | ||
| import anthropic | ||
| self._client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY")) | ||
| return self._client |
Comment on lines
+243
to
+254
| @classmethod | ||
| def is_available(cls) -> bool: | ||
| return bool(os.getenv("OPENROUTER_API_KEY")) | ||
|
|
||
| def _get_client(self): | ||
| if self._client is None: | ||
| from openai import OpenAI | ||
| self._client = OpenAI( | ||
| api_key=os.getenv("OPENROUTER_API_KEY"), | ||
| base_url=self.BASE_URL, | ||
| ) | ||
| return self._client |
Comment on lines
+50
to
+53
| # Ollama: try a quick ping | ||
| import urllib.request | ||
| try: | ||
| urllib.request.urlopen("http://localhost:11434/api/tags", timeout=1) |
Comment on lines
+150
to
+156
| def llm_recall_at_t( | ||
| memory: BaseMemory, | ||
| fact: Fact, | ||
| current_turn: int, | ||
| provider: "LLMProvider", | ||
| ) -> Dict: | ||
| """ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #8
Summary
Replaces the implicit assumption that "retrieved text = correct answer" with a genuine end-to-end LLM evaluation pass. Content-based metrics remain as fast, zero-cost proxies; the LLM pass is additive and activates automatically when any provider key is present.
utils/providers.py— unifiedLLMProviderABC with five concrete backends (Groq, OpenAI, Anthropic, OpenRouter, Ollama), auto-detection priority chain, and_clean_messages()normalisation helperevaluation/metrics.py—llm_recall_at_t()(answer+judge pipeline) andllm_temporal_drift()(old vs new value after update)evaluation/benchmark.py—CheckpointResultgainsllm_recall,llm_drift,has_llm_eval;run_benchmark()accepts optionalprovidermain.py—--llm,--provider,--list-providersflags; outputs three tables: Content Recall, LLM Recall, Gapdashboard.py— provider selector auto-detects available keys; tabbed recall chart (Content / LLM / Gap); KPI cards show LLM recall delta.env.example— all five provider keys documented inlineHow the LLM Eval Works
The gap between content recall and LLM recall is the core finding — a large positive gap means the backend retrieves the right text but the LLM still fails to extract the correct answer.
Test Plan
python tests/test_imports.py— all imports OK, provider registry validatedpython tests/test_pipeline.py— all 14 tests pass (no API key needed)python main.py --list-providers— shows provider status cleanlypython main.py— content-only mode unchanged, no regressionpython main.py --llm --provider groq— requiresGROQ_API_KEY🤖 Generated with Claude Code