The eval harness

LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty.

Layout

src/eval/
├── models.py            # EvalCase, EvalResult (Pydantic)
├── runner.py            # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py             # LLMClient Protocol + semantic-similarity judge
├── report.py            # Markdown report generator
├── __main__.py          # python -m src.eval
└── adapters/
    └── azure_openai.py  # Concrete LLMClient for Azure OpenAI (optional extra)

eval/
├── golden_qa.json          # Toy smoke case — runs without LLM credentials
├── test_golden_qa.py       # Parametrised runner for the toy case
├── golden_patterns.json    # Four worked-pattern cases — require Azure OpenAI
└── test_golden_patterns.py # Skipped unless AZURE_OPENAI_* env vars are set

How it works

1. The runner loads eval/golden_qa.json into a list of EvalCases.
2. For each case, it calls the configured answer_fn(question) -> str.
3. It compares the actual answer to the expected one using one of three tolerance modes:
   - exact_match: normalised string equality (lowercased, whitespace-collapsed).
   - numeric_close: extracts numbers from both sides; passes if any extracted number is within 1 % of the expected. Filters year-like values (2020-2029) so a question about a year doesn't accidentally provide the comparison target.
   - semantic_similar: calls an LLM judge (src/eval/judge.py) that scores 0.0–1.0; passes at ≥ 0.8.
4. It returns a list of EvalResults; src/eval/report.py produces a markdown summary.
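
The two deterministic modes can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the code in src/eval/runner.py: the function names, the regular expressions, and the choice of the first expected number as the comparison target are all assumptions.

```python
import re

# Year-like values to ignore, matching the 2020-2029 filter described above.
YEAR_RANGE = range(2020, 2030)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def exact_match(expected: str, actual: str) -> bool:
    return normalise(expected) == normalise(actual)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, dropping thousands separators and year-like values."""
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)  # "86,400" -> "86400"
    values = [float(match) for match in NUMBER_RE.findall(cleaned)]
    return [v for v in values if not (v.is_integer() and int(v) in YEAR_RANGE)]


def numeric_close(expected: str, actual: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in the actual answer is within rel_tol of the expected one."""
    expected_numbers = extract_numbers(expected)
    if not expected_numbers:
        return False
    target = expected_numbers[0]  # assumption: the first expected number is the target
    return any(abs(v - target) <= rel_tol * abs(target) for v in extract_numbers(actual))
```

With these definitions, numeric_close("86,400", "There are 86400 seconds in a day") passes, while an answer that contains only a year-like number does not.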

Wiring your agent

The runner doesn't know about your agent loop. Pass any Callable[[str], str]:

from src.eval.runner import EvalRunner

def my_agent(question: str) -> str:
    # Hit your agent loop / LLM client here.
    return ...

runner = EvalRunner(answer_fn=my_agent)
results = runner.evaluate_all()

For the LLM judge (semantic_similar cases), implement the LLMClient Protocol from src/eval/judge.py:

class MyLLMAdapter:
    def complete_json(self, *, model: str, prompt: str) -> str:
        # Hit your provider, return raw JSON body.
        ...

runner = EvalRunner(
    answer_fn=my_agent,
    judge_client=MyLLMAdapter(),
    judge_model="gpt-4o-mini",
)

If judge_client=None (default), semantic_similar cases pass with score=None and reason "no LLM client configured" — inconclusive, not a failure. That keeps the harness usable without LLM credentials.
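
A minimal sketch of that decision follows, under stated assumptions: the judge is taken to reply with a JSON object carrying "score" and "reason" fields, and the prompt wording, judge_semantic, and CannedJudge are hypothetical names for illustration. The real logic lives in src/eval/judge.py and may differ.

```python
import json
from typing import Optional, Protocol, Tuple


class LLMClient(Protocol):
    """Local stand-in for the Protocol in src/eval/judge.py."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


PASS_THRESHOLD = 0.8  # semantic_similar passes at >= 0.8


def judge_semantic(
    expected: str, actual: str, client: Optional[LLMClient], model: str
) -> Tuple[bool, Optional[float], str]:
    """Return (passed, score, reason) for one semantic_similar case."""
    if client is None:
        # Inconclusive rather than failed: the harness stays usable offline.
        return True, None, "no LLM client configured"
    # Assumption: hypothetical prompt wording and reply shape.
    prompt = (
        "Score from 0.0 to 1.0 how well ACTUAL matches EXPECTED. "
        'Reply as JSON: {"score": <float>, "reason": <string>}.\n'
        f"EXPECTED: {expected}\nACTUAL: {actual}"
    )
    payload = json.loads(client.complete_json(model=model, prompt=prompt))
    score = float(payload["score"])
    return score >= PASS_THRESHOLD, score, str(payload.get("reason", ""))


class CannedJudge:
    """Test double that always returns the same score, useful for offline checks."""

    def __init__(self, score: float) -> None:
        self.score = score

    def complete_json(self, *, model: str, prompt: str) -> str:
        return json.dumps({"score": self.score, "reason": "canned"})
```

A canned client like this lets you exercise the pass/fail threshold in unit tests without spending on a real provider.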

Adding a case

{
  "id": "unique-kebab-id",
  "question": "...",
  "category": "category-name",
  "expected_answer": "...",
  "tolerance": "exact_match" | "numeric_close" | "semantic_similar",
  "difficulty": "easy" | "medium" | "hard",
  "notes": "Why this case earns a slot."
}

category and difficulty default to "general" and "easy"; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully.
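
To make the defaults and the per-category breakdown concrete, here is a small standard-library sketch. It is not the Pydantic EvalCase model or the report generator: load_case and accuracy_by are hypothetical helpers, and rejecting unknown tolerance or difficulty values is an assumption about how validation might behave.

```python
from collections import defaultdict

ALLOWED_TOLERANCES = {"exact_match", "numeric_close", "semantic_similar"}
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def load_case(raw: dict) -> dict:
    """Apply the documented defaults, then reject unknown tolerance or difficulty values."""
    case = {"category": "general", "difficulty": "easy", **raw}
    if case.get("tolerance") not in ALLOWED_TOLERANCES:
        raise ValueError(f"{case.get('id')}: unknown tolerance {case.get('tolerance')!r}")
    if case["difficulty"] not in ALLOWED_DIFFICULTIES:
        raise ValueError(f"{case.get('id')}: unknown difficulty {case['difficulty']!r}")
    return case


def accuracy_by(field: str, cases: list[dict], passed_ids: set[str]) -> dict[str, float]:
    """Fraction of passing cases per value of `field` ("category" or "difficulty")."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case[field]] += 1
        passes[case[field]] += case["id"] in passed_ids
    return {key: passes[key] / totals[key] for key in totals}
```

If every case falls back to "general" and "easy", both breakdowns collapse to a single row, which is why explicit values pay off as the suite grows.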

Running the harness

Locally:

uv run pytest eval/             # pytest runner with the marker
python -m src.eval               # CLI runner — prints the markdown report

The tests under eval/ are marked @pytest.mark.eval, so the default pytest tests/ run skips them.

Worked patterns (Azure OpenAI)

The four cases in eval/golden_patterns.json are not benchmarks. They exist to demonstrate what an eval case looks like against each of the runner's tolerance modes; together they cover the four LLM-eval patterns you most often need to write:

| Case ID | Tolerance | Pattern demonstrated |
| --- | --- | --- |
| `factual-http-200` | `exact_match` | Format-constrained factual recall. The prompt forces a single canonical token; if the model wraps the answer in prose, the case fails loudly. |
| `numeric-seconds-per-day` | `numeric_close` | Numeric reasoning with extraction tolerance. The runner pulls the first number from each side and compares within 1 %, so `86,400` and `86400 seconds` both match. |
| `definitional-fastapi-depends` | `semantic_similar` | Free-form prose scored by an LLM judge at ≥ 0.8. Use for explanations and any case where wording can vary but the underlying claim is checkable. |
| `structured-json-status` | `exact_match` | Structured-output adherence. The prompt asks for raw JSON; markdown-fenced or prose-wrapped responses fail, which is the failure mode downstream parsers also hit. |

The cases all call a real Azure OpenAI deployment via the adapter at src/eval/adapters/azure_openai.py. When you fork the template for a real project, replace these four with cases that exercise your own product's prompts; the patterns transfer.

Setup

uv sync --extra dev --extra eval        # installs the openai SDK

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"   # or whatever you deployed
export AZURE_OPENAI_API_VERSION="2024-10-21"   # optional, this is the default

uv run pytest eval/test_golden_patterns.py -v

Without the env vars, eval/test_golden_patterns.py is skipped via pytestmark — eval/test_golden_qa.py still runs as a smoke check on the runner mechanics, so uv run pytest eval/ always exits 0 on a fresh checkout.
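
The skip condition can be sketched as a small helper. This is an assumption about how eval/test_golden_patterns.py decides to skip, not a copy of it: missing_azure_vars is a hypothetical name, and the three variables are treated as required because the setup above marks only AZURE_OPENAI_API_VERSION as optional.

```python
import os
from typing import Mapping, Optional

# Assumption: these three are required; AZURE_OPENAI_API_VERSION has a default.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_azure_vars(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# In a test module, the result could drive a module-level skip, for example:
#   pytestmark = pytest.mark.skipif(
#       bool(missing_azure_vars()), reason="Azure OpenAI env vars not set"
#   )
```

Treating empty strings as missing guards against CI setups that export a secret name with no value.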

Swapping providers

src/eval/judge.py defines LLMClient as a Protocol — the eval core does not import openai anywhere. To target a different provider (Anthropic, vLLM, vanilla OpenAI), write a new adapter under src/eval/adapters/ that implements complete_json(*, model, prompt) -> str and update the runner fixture in your test file. Nothing in src/eval/ itself changes.
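
Because LLMClient is a Protocol, an adapter needs no base class; it only has to expose a matching complete_json method. The sketch below illustrates that with a local stand-in Protocol and a hypothetical CallableAdapter that wraps whatever function talks to your provider. Marking the Protocol runtime_checkable is an assumption made here so the isinstance check works; the real definition may not do so.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    """Local stand-in for the Protocol in src/eval/judge.py."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class CallableAdapter:
    """Hypothetical adapter: wraps any function that sends (model, prompt) to a provider."""

    def __init__(self, send: Callable[[str, str], str]) -> None:
        self._send = send

    def complete_json(self, *, model: str, prompt: str) -> str:
        # Return the provider's raw JSON body unchanged.
        return self._send(model, prompt)


def stub_provider(model: str, prompt: str) -> str:
    """Stand-in for a real SDK call; replace with your provider's client."""
    return '{"score": 1.0, "reason": "stub"}'


adapter = CallableAdapter(stub_provider)
# Structural typing: no inheritance from LLMClient is needed.
assert isinstance(adapter, LLMClient)
```

Keeping the provider call behind an injected function also makes the adapter easy to test without network access.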

Nightly opt-in

.github/workflows/eval-nightly.yml ships workflow_dispatch-only by default to avoid accidental LLM API spend. To turn on a real nightly:

1. Add the Azure OpenAI secrets in repo settings: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT, and optionally AZURE_OPENAI_API_VERSION.

2. Replace the workflow's on: block with:

on:
  schedule:
    - cron: "0 6 * * *"   # daily 06:00 UTC
  workflow_dispatch:

3. Confirm eval-nightly.yml is still in EXEMPT_WORKFLOWS in .github/scripts/check_required_contexts.py (it should be, since scheduled runs never gate PRs).

That's the full opt-in. Reverting is a one-line change back to workflow_dispatch: only.
