Skip to content

feat: multi-provider real LLM evaluation pipeline#9

Merged
Neal006 merged 1 commit into
mainfrom
feat/multi-provider-llm-eval
May 22, 2026
Merged

feat: multi-provider real LLM evaluation pipeline#9
Neal006 merged 1 commit into
mainfrom
feat/multi-provider-llm-eval

Conversation

@Neal006
Copy link
Copy Markdown
Owner

@Neal006 Neal006 commented May 22, 2026

Closes #8

Summary

Replaces the implicit assumption that "retrieved text = correct answer" with a genuine end-to-end LLM evaluation pass. Content-based metrics remain as fast, zero-cost proxies; the LLM pass is additive and activates automatically when any provider key is present.

  • utils/providers.py — unified LLMProvider ABC with five concrete backends (Groq, OpenAI, Anthropic, OpenRouter, Ollama), auto-detection priority chain, and _clean_messages() normalisation helper
  • evaluation/metrics.pyllm_recall_at_t() (answer+judge pipeline) and llm_temporal_drift() (old vs new value after update)
  • evaluation/benchmark.pyCheckpointResult gains llm_recall, llm_drift, has_llm_eval; run_benchmark() accepts optional provider
  • main.py--llm, --provider, --list-providers flags; outputs three tables: Content Recall, LLM Recall, Gap
  • dashboard.py — provider selector auto-detects available keys; tabbed recall chart (Content / LLM / Gap); KPI cards show LLM recall delta
  • .env.example — all five provider keys documented inline

How the LLM Eval Works

For each fact at each checkpoint:

  1. ANSWER  memory.get_context(fact.query) → [context messages]
             provider.chat([system] + context + [question])
             → answer string

  2. JUDGE   provider.chat([judge_system, "Q:... Expected:... Got:..."])
             → "correct" | "wrong"

  llm_recalled = (verdict == "correct")

The gap between content recall and LLM recall is the core finding — a large positive gap means the backend retrieves the right text but the LLM still fails to extract the correct answer.

Test Plan

  • python tests/test_imports.py — all imports OK, provider registry validated
  • python tests/test_pipeline.py — all 14 tests pass (no API key needed)
  • python main.py --list-providers — shows provider status cleanly
  • python main.py — content-only mode unchanged, no regression
  • python main.py --llm --provider groq — requires GROQ_API_KEY

🤖 Generated with Claude Code

Add a genuine end-to-end LLM answer+judge evaluation pass on top of the
existing content-based metrics. Content metrics remain as fast zero-cost
proxies; LLM metrics activate when any provider key is present.

Key additions:
- utils/providers.py: unified LLMProvider abstraction for 5 backends
  (Groq, OpenAI, Anthropic, OpenRouter, Ollama) with auto-detection
- evaluation/metrics.py: llm_recall_at_t() and llm_temporal_drift()
  using a two-stage answer->judge pipeline
- evaluation/benchmark.py: optional provider param, CheckpointResult
  gains llm_recall/llm_drift/has_llm_eval fields
- main.py: --llm, --provider, --list-providers flags; three output
  tables (content recall, LLM recall, gap)
- dashboard.py: provider selector, tabbed Content/LLM/Gap recall
  charts, LLM recall KPIs with gap delta
- .env.example: all five provider keys documented

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 22, 2026 04:05
@Neal006 Neal006 merged commit b06876b into main May 22, 2026
4 checks passed
@Neal006 Neal006 deleted the feat/multi-provider-llm-eval branch May 22, 2026 04:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional, end-to-end LLM evaluation pass (answer + judge) to complement existing content-based retrieval metrics, with support for multiple LLM providers and UI/CLI surfacing of Content vs LLM recall and the resulting “gap”.

Changes:

  • Introduces a unified LLMProvider abstraction with Groq/OpenAI/Anthropic/OpenRouter/Ollama backends and auto-detection.
  • Adds LLM-based evaluation metrics (llm_recall_at_t, llm_temporal_drift) and threads optional provider support through the benchmark runner and CLI output.
  • Updates the Streamlit dashboard to optionally run LLM eval and visualize Content / LLM / Gap recall.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
utils/providers.py Adds multi-provider LLM abstraction, registry, auto-detection, and message normalization.
evaluation/metrics.py Implements the two-stage LLM recall and LLM temporal drift metrics.
evaluation/benchmark.py Threads optional provider through benchmark execution and JSON display output.
main.py Adds --llm, --provider, --list-providers and prints Content/LLM/Gap tables.
dashboard.py Adds provider detection/selection and tabbed recall charts for Content/LLM/Gap.
tests/test_imports.py Extends import smoke test to include provider registry and LLM metric imports.
CHANGELOG.md Documents the new multi-provider LLM evaluation feature.
.env.example Documents environment variables for all supported providers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread evaluation/metrics.py
temperature=0.0,
).lower().strip()

verdict = "correct" if "correct" in verdict_raw else "wrong"
Comment thread evaluation/metrics.py
Comment on lines +248 to +249
answer = provider.chat(messages, max_tokens=30, temperature=0.0).lower()

Comment thread utils/providers.py
Comment on lines +140 to +148
@classmethod
def is_available(cls) -> bool:
return bool(os.getenv("OPENAI_API_KEY"))

def _get_client(self):
if self._client is None:
from openai import OpenAI
self._client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
return self._client
Comment thread utils/providers.py
Comment on lines +187 to +195
@classmethod
def is_available(cls) -> bool:
return bool(os.getenv("ANTHROPIC_API_KEY"))

def _get_client(self):
if self._client is None:
import anthropic
self._client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
return self._client
Comment thread utils/providers.py
Comment on lines +243 to +254
@classmethod
def is_available(cls) -> bool:
return bool(os.getenv("OPENROUTER_API_KEY"))

def _get_client(self):
if self._client is None:
from openai import OpenAI
self._client = OpenAI(
api_key=os.getenv("OPENROUTER_API_KEY"),
base_url=self.BASE_URL,
)
return self._client
Comment thread dashboard.py
Comment on lines +50 to +53
# Ollama: try a quick ping
import urllib.request
try:
urllib.request.urlopen("http://localhost:11434/api/tags", timeout=1)
Comment thread evaluation/metrics.py
Comment on lines +150 to +156
def llm_recall_at_t(
memory: BaseMemory,
fact: Fact,
current_turn: int,
provider: "LLMProvider",
) -> Dict:
"""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add real end-to-end LLM evaluation pipeline (multi-provider)

2 participants