LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty.
```
src/eval/
├── models.py            # EvalCase, EvalResult (Pydantic)
├── runner.py            # EvalRunner: generic, takes a Callable[[str], str]
├── judge.py             # LLMClient Protocol + semantic-similarity judge
├── report.py            # Markdown report generator
├── __main__.py          # python -m src.eval
└── adapters/
    └── azure_openai.py  # Concrete LLMClient for Azure OpenAI (optional extra)

eval/
├── golden_qa.json            # Toy smoke case, runs without LLM credentials
├── test_golden_qa.py         # Parametrised runner for the toy case
├── golden_patterns.json      # Four worked-pattern cases, require Azure OpenAI
└── test_golden_patterns.py   # Skipped unless AZURE_OPENAI_* env vars are set
```
- The runner loads `eval/golden_qa.json` into a list of `EvalCase`s.
- For each case, it calls the configured `answer_fn(question) -> str`.
- It compares the actual answer to the expected one using one of three tolerance modes:
  - `exact_match`: normalised string equality (lowercased, whitespace-collapsed).
  - `numeric_close`: extracts numbers from both sides; passes if any extracted number is within 1 % of the expected value. Year-like values (2020-2029) are filtered out so that a year mentioned in the question doesn't accidentally become the comparison target.
  - `semantic_similar`: calls an LLM judge (`src/eval/judge.py`) that scores 0.0–1.0; passes at ≥ 0.8.
- It returns a list of `EvalResult`s; `src/eval/report.py` produces a markdown summary.
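The two non-LLM tolerance modes can be sketched in a few lines. This is a minimal illustration of the behaviour described above, not the code in `src/eval/runner.py`: the function names, the regexes, and the choice to use the first number in the expected answer as the comparison target are all assumptions.

```python
import re

# Year-like values that are dropped before comparing, per the description above.
YEAR_RANGE = range(2020, 2030)
_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(actual: str, expected: str) -> bool:
    return normalise(actual) == normalise(expected)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, skipping year-like values."""
    # Drop thousands separators so "86,400" parses as 86400.
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    values = [float(m) for m in _NUMBER_RE.findall(cleaned)]
    return [v for v in values if not (v.is_integer() and int(v) in YEAR_RANGE)]


def numeric_close(actual: str, expected: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in `actual` is within rel_tol of the expected number."""
    expected_nums = extract_numbers(expected)
    if not expected_nums:
        return False
    target = expected_nums[0]  # assumption: the first expected number is the target
    return any(abs(n - target) <= rel_tol * abs(target) for n in extract_numbers(actual))
```

Under this sketch, `numeric_close("There are 86,400 seconds in a day.", "86400")` passes, while `exact_match` rejects a markdown-fenced JSON answer because the fence characters survive normalisation.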
The runner doesn't know about your agent loop. Pass any `Callable[[str], str]`:

```python
from src.eval.runner import EvalRunner


def my_agent(question: str) -> str:
    # Hit your agent loop / LLM client here.
    return ...


runner = EvalRunner(answer_fn=my_agent)
results = runner.evaluate_all()
```

For the LLM judge (`semantic_similar` cases), implement the `LLMClient` Protocol from `src/eval/judge.py`:
```python
class MyLLMAdapter:
    def complete_json(self, *, model: str, prompt: str) -> str:
        # Hit your provider, return raw JSON body.
        ...


runner = EvalRunner(
    answer_fn=my_agent,
    judge_client=MyLLMAdapter(),
    judge_model="gpt-4o-mini",
)
```

If `judge_client=None` (the default), `semantic_similar` cases pass with `score=None` and the reason "no LLM client configured": inconclusive, not a failure. That keeps the harness usable without LLM credentials.
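To make the judge path concrete, here is a rough sketch of how a `semantic_similar` check could behave with and without a client. The `LLMClient` class below is a stand-in that copies the `complete_json` signature shown above; `CannedJudge`, `judge_semantic`, the prompt text, and the `{"score": ..., "reason": ...}` response shape are hypothetical and may not match `src/eval/judge.py`.

```python
import json
from typing import Optional, Protocol


class LLMClient(Protocol):
    """Stand-in for the Protocol in src/eval/judge.py."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class CannedJudge:
    """Hypothetical fake client that always returns the same JSON body."""

    def __init__(self, score: float) -> None:
        self._body = json.dumps({"score": score, "reason": "canned"})

    def complete_json(self, *, model: str, prompt: str) -> str:
        return self._body


def judge_semantic(
    actual: str,
    expected: str,
    client: Optional[LLMClient],
    model: str = "gpt-4o-mini",
    threshold: float = 0.8,
) -> tuple[bool, Optional[float], str]:
    """Return (passed, score, reason) for a semantic_similar case."""
    if client is None:
        # Inconclusive rather than failed, mirroring the behaviour described above.
        return True, None, "no LLM client configured"
    prompt = (
        "Score from 0.0 to 1.0 how well the answer matches.\n"
        f"Expected: {expected}\nActual: {actual}"
    )
    # Assumed response shape: {"score": float, "reason": str}.
    payload = json.loads(client.complete_json(model=model, prompt=prompt))
    score = float(payload["score"])
    return score >= threshold, score, str(payload.get("reason", ""))
```

A canned client like this is also a cheap way to unit-test threshold handling without spending on a real model call.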
```
{
  "id": "unique-kebab-id",
  "question": "...",
  "category": "category-name",
  "expected_answer": "...",
  "tolerance": "exact_match" | "numeric_close" | "semantic_similar",
  "difficulty": "easy" | "medium" | "hard",
  "notes": "Why this case earns a slot."
}
```

`category` and `difficulty` default to `"general"` and `"easy"`; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully.
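The schema and its defaults can be illustrated with a small loader. This sketch uses a standard-library dataclass as a stand-in for the Pydantic `EvalCase` model so that it runs without extra dependencies; the class name `Case`, the `parse_cases` helper, the optional `notes` field, and the assumption that the golden file is a JSON array of case objects are all illustrative.

```python
import json
from dataclasses import dataclass

TOLERANCES = {"exact_match", "numeric_close", "semantic_similar"}
DIFFICULTIES = {"easy", "medium", "hard"}


@dataclass(frozen=True)
class Case:
    """Dataclass stand-in for EvalCase; field names follow the schema above."""

    id: str
    question: str
    expected_answer: str
    tolerance: str
    category: str = "general"   # documented default
    difficulty: str = "easy"    # documented default
    notes: str = ""             # assumption: notes may be omitted

    def __post_init__(self) -> None:
        if self.tolerance not in TOLERANCES:
            raise ValueError(f"{self.id}: unknown tolerance {self.tolerance!r}")
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"{self.id}: unknown difficulty {self.difficulty!r}")


def parse_cases(raw_json: str) -> list[Case]:
    """Parse a JSON array of case objects, applying the defaults."""
    return [Case(**item) for item in json.loads(raw_json)]
```

Rejecting unknown tolerance values at load time means a typo in a golden file fails before any model call is made.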
Locally:
```bash
uv run pytest eval/   # pytest runner with the marker
python -m src.eval    # CLI runner, prints the markdown report
```

The eval tests are marked `@pytest.mark.eval`, so the default `pytest tests/` run skips them.
The four cases in `eval/golden_patterns.json` are not benchmarks. They exist to demonstrate what an eval case looks like against each of the runner's tolerance modes; together they cover the four LLM-eval patterns you most often need to write:
| Case ID | Tolerance | Pattern demonstrated |
|---|---|---|
| `factual-http-200` | `exact_match` | Format-constrained factual recall. The prompt forces a single canonical token; if the model wraps the answer in prose, the case fails loudly. |
| `numeric-seconds-per-day` | `numeric_close` | Numeric reasoning with extraction tolerance. The runner pulls the first number from each side and compares within 1 %, so 86,400 and 86400 seconds both match. |
| `definitional-fastapi-depends` | `semantic_similar` | Free-form prose scored by an LLM judge at ≥ 0.8. Use for explanations and any case where wording can vary but the underlying claim is checkable. |
| `structured-json-status` | `exact_match` | Structured-output adherence. The prompt asks for raw JSON; markdown-fenced or prose-wrapped responses fail, which is the same failure mode downstream parsers hit. |
The cases all call a real Azure OpenAI deployment via the adapter at `src/eval/adapters/azure_openai.py`. When you fork the template for a real project, replace these four with cases that exercise your own product's prompts; the patterns transfer.
```bash
uv sync --extra dev --extra eval   # installs the openai SDK
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"   # or whatever you deployed
export AZURE_OPENAI_API_VERSION="2024-10-21"   # optional, this is the default
uv run pytest eval/test_golden_patterns.py -v
```

Without the env vars, `eval/test_golden_patterns.py` is skipped via `pytestmark`; `eval/test_golden_qa.py` still runs as a smoke check on the runner mechanics, so `uv run pytest eval/` always exits 0 on a fresh checkout.
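The skip condition can be thought of as a small check over the environment. The helper names below are hypothetical, and treating exactly these three variables as required (with the API version falling back to the default shown above) is an assumption drawn from the setup commands, not from the test file itself.

```python
import os
from collections.abc import Mapping

# Assumed required set: the three variables without a documented default.
REQUIRED_ENV = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)
DEFAULT_API_VERSION = "2024-10-21"


def missing_azure_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def api_version(env: Mapping[str, str] = os.environ) -> str:
    """Use the explicit API version if set, otherwise the default."""
    return env.get("AZURE_OPENAI_API_VERSION") or DEFAULT_API_VERSION


# In a test module, a check like this could feed a module-level skip marker:
#   pytestmark = pytest.mark.skipif(
#       bool(missing_azure_env()), reason="Azure OpenAI env vars not set"
#   )
```

Listing the missing names, rather than returning a bare boolean, makes it easy to put them in the skip reason when debugging a CI run.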
`src/eval/judge.py` defines `LLMClient` as a Protocol, so the eval core does not import `openai` anywhere. To target a different provider (Anthropic, vLLM, vanilla OpenAI), write a new adapter under `src/eval/adapters/` that implements `complete_json(*, model, prompt) -> str` and update the runner fixture in your test file. Nothing in `src/eval/` itself changes.
`.github/workflows/eval-nightly.yml` ships `workflow_dispatch`-only by default to avoid accidental LLM API spend. To turn on a real nightly:

- Add the Azure OpenAI secrets in repo settings: `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, and optionally `AZURE_OPENAI_API_VERSION`.
- Replace the workflow's `on:` block with:

  ```yaml
  on:
    schedule:
      - cron: "0 6 * * *"  # daily 06:00 UTC
    workflow_dispatch:
  ```

- Confirm `eval-nightly.yml` is still in `EXEMPT_WORKFLOWS` in `.github/scripts/check_required_contexts.py` (it should be, since scheduled runs never gate PRs).

That's the full opt-in. Reverting is a one-line change back to `workflow_dispatch:` only.