You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
* feat: eval pattern examples calling Azure OpenAI (#94)
The eval slice previously shipped one toy case (echo-hello) and a
disabled-by-default nightly. A reader expecting an LLM-eval story
found the infrastructure without conviction.
Adds four worked-pattern cases that exercise the existing three
tolerance modes against a real Azure OpenAI deployment. These are
not benchmarks — they demonstrate what an eval case *looks like* for
the four LLM-eval patterns you most often need to write:
- factual-http-200 exact_match format-constrained recall
- numeric-seconds-per-day numeric_close numeric reasoning + tolerance
- definitional-fastapi-depends semantic_similar free-form judge-scored prose
- structured-json-status exact_match structured-output adherence
When the template is forked for a real project, replace these four
with cases that exercise the project's own prompts; the patterns
transfer regardless of what product is bolted on.
Provider choice — Azure OpenAI via the openai SDK with AzureOpenAI
client — is intentionally distinct from the rest of the harness
(which uses Claude via Claude Code). Demonstrates that the LLMClient
Protocol in src/eval/judge.py does its job: the eval core never
imports openai, vendor lock-in lives only in the adapter.
Changes:
- src/eval/adapters/azure_openai.py — implements LLMClient via the
openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version
from env. Lazy-imports the SDK so the module is importable without
the optional extra installed; the adapter raises a clear
AzureOpenAIConfigError if the env or SDK is missing.
- eval/golden_patterns.json — the four cases with notes explaining
which pattern each demonstrates.
- eval/test_golden_patterns.py — separate test file gated on the
Azure env vars via pytestmark. Skipped on a stock checkout, so
`uv run pytest eval/` always exits 0. The toy test_golden_qa.py
keeps running as before.
- pyproject.toml — new optional [project.optional-dependencies] eval
extra (just `openai>=1.40.0`), mypy override for openai.* matching
the existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11
self-version bump.
- .github/workflows/eval-nightly.yml — env vars renamed from the
placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated
with the Azure setup recipe. uv sync now passes --extra eval.
- docs/EVAL_HARNESS.md — new "Worked patterns" section with the
table mapping case -> tolerance -> pattern, the local setup
recipe, and a "Swapping providers" note documenting the
Protocol-based extension path.
Local gates: mypy --strict clean on 42 source files (was 31), ruff
clean, ruff format clean, import-linter both contracts kept, 192
unit tests pass, eval/ runs 1 passed + 4 skipped without LLM env.
Closes#94
* test: add adapter unit tests + adapters README (#94 review fixes)
Addresses two gate failures on #104 surfaced by code review:
1. "Tests required" gate — feat: prefix declared a behaviour change
but tests/ had no test for the new adapter (the eval/-side test
only runs with live Azure credentials). Adds
tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases
covering _resolve_config (defaults, override, empty-string
fallback, missing-env error listing), the constructor (env
wiring, explicit API version, missing-env, missing-SDK), and the
two SDK call paths (complete_json structured-output mode,
complete user-message dispatch, null-content returns "" / "{}").
The SDK is mocked at sys.modules level so the test never hits the
network and never requires the openai extra to be installed.
2. "src/ README audit" gate — every src/ package needs a README.md
per CLAUDE.md. Adds src/eval/adapters/README.md documenting the
layer's purpose, the current adapter, a 7-step "adding a new
adapter" recipe, and why the layer lives at the top of the import
order.
Also applies the reviewer's non-blocking sentinel-string suggestion:
the magic "azure-deployment" string passed as judge_model in
eval/test_golden_patterns.py is now the named constant
_AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner
threads it through but the Azure adapter discards it.
Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on
43 source files, ruff/format/import-linter all green.
Refs #94
* docs: add Key interfaces section to adapters README (#94 review)
src/ README audit gate looks for a `## Key interfaces` (or `## Public
surface`) anchor — the existing README had purpose / table /
extension recipe / layering rationale, but no exported-names section.
Adds a `## Key interfaces` section listing the two exported names:
- AzureOpenAIClient — the LLMClient implementation with notes on
complete() vs complete_json() and the discarded `model` arg
(Azure dispatches by deployment, not model).
- AzureOpenAIConfigError — the construction-time error type,
noting that it batches every missing env var into a single
message instead of failing-and-retrying.
Both already documented in the adapter docstrings; this section
hoists them to the README anchor the audit gate enforces.
Refs #94
* chore: bump version to 0.2.12 (rebase onto develop after #103)
├── golden_qa.json # Toy smoke case — runs without LLM credentials
19
+
├── test_golden_qa.py # Parametrised runner for the toy case
20
+
├── golden_patterns.json # Four worked-pattern cases — require Azure OpenAI
21
+
└── test_golden_patterns.py # Skipped unless AZURE_OPENAI_* env vars are set
18
22
```
19
23
20
24
## How it works
@@ -86,11 +90,43 @@ python -m src.eval # CLI runner — prints the markdown report
86
90
87
91
The pytest invocation is marked `@pytest.mark.eval`, so the default `pytest tests/` skips it.
88
92
93
+
## Worked patterns (Azure OpenAI)
94
+
95
+
The four cases in `eval/golden_patterns.json` are *not* benchmarks. They exist to demonstrate what an eval case looks like against each of the runner's tolerance modes; together they cover the four LLM-eval patterns you most often need to write:
96
+
97
+
| Case ID | Tolerance | Pattern demonstrated |
98
+
|---|---|---|
99
+
|`factual-http-200`|`exact_match`| Format-constrained factual recall. The prompt forces a single canonical token; if the model wraps the answer in prose, the case fails loudly. |
100
+
|`numeric-seconds-per-day`|`numeric_close`| Numeric reasoning with extraction tolerance. The runner pulls the first number from each side and compares within 1 %, so `86,400` and `86400 seconds` both match. |
101
+
|`definitional-fastapi-depends`|`semantic_similar`| Free-form prose scored by an LLM judge at ≥ 0.8. Use for explanations and any case where wording can vary but the underlying claim is checkable. |
102
+
|`structured-json-status`|`exact_match`| Structured-output adherence. The prompt asks for raw JSON; markdown-fenced or prose-wrapped responses fail — which is the failure mode downstream parsers also hit. |
103
+
104
+
The cases all call a real Azure OpenAI deployment via the adapter at `src/eval/adapters/azure_openai.py`. When you fork the template for a real project, replace these four with cases that exercise your own product's prompts; the patterns transfer.
105
+
106
+
### Setup
107
+
108
+
```sh
109
+
uv sync --extra dev --extra eval# installs the openai SDK
export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"# or whatever you deployed
114
+
export AZURE_OPENAI_API_VERSION="2024-10-21"# optional, this is the default
115
+
116
+
uv run pytest eval/test_golden_patterns.py -v
117
+
```
118
+
119
+
Without the env vars, `eval/test_golden_patterns.py` is skipped via `pytestmark` — `eval/test_golden_qa.py` still runs as a smoke check on the runner mechanics, so `uv run pytest eval/` always exits 0 on a fresh checkout.
120
+
121
+
### Swapping providers
122
+
123
+
`src/eval/judge.py` defines `LLMClient` as a `Protocol` — the eval core does not import `openai` anywhere. To target a different provider (Anthropic, vLLM, vanilla OpenAI), write a new adapter under `src/eval/adapters/` that implements `complete_json(*, model, prompt) -> str` and update the runner fixture in your test file. Nothing in `src/eval/` itself changes.
124
+
89
125
## Nightly opt-in
90
126
91
127
`.github/workflows/eval-nightly.yml` ships `workflow_dispatch`-only by default to avoid accidental LLM API spend. To turn on a real nightly:
92
128
93
-
1. Add the LLM secrets in repo settings: `LLM_API_KEY` (required), `LLM_PROVIDER`, `LLM_BASE_URL`, `LLM_MODEL` (optional, depending on adapter).
129
+
1. Add the Azure OpenAI secrets in repo settings: `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, and optionally `AZURE_OPENAI_API_VERSION`.
"question": "What HTTP status code means OK? Respond with only the number, no prose.",
5
+
"category": "factual-recall",
6
+
"expected_answer": "200",
7
+
"tolerance": "exact_match",
8
+
"difficulty": "easy",
9
+
"notes": "Pattern: factual recall with format-constrained output. exact_match works because the prompt forces a single canonical token. If the model adds prose (\"The status code is 200.\") this fails loudly — which is the point: format adherence is part of the assertion."
10
+
},
11
+
{
12
+
"id": "numeric-seconds-per-day",
13
+
"question": "How many seconds are in 24 hours? Respond with the integer only.",
14
+
"category": "numeric-reasoning",
15
+
"expected_answer": "86400",
16
+
"tolerance": "numeric_close",
17
+
"difficulty": "easy",
18
+
"notes": "Pattern: numeric extraction with 1% tolerance. The runner pulls the first number from each side and compares ratios, so '86,400', '86400 seconds', and '86400.0' all match. Use this tolerance for math, conversions, and any case where formatting around the number is uninteresting."
19
+
},
20
+
{
21
+
"id": "definitional-fastapi-depends",
22
+
"question": "In one sentence: what does FastAPI's Depends() do?",
23
+
"category": "definitional",
24
+
"expected_answer": "Depends declares a callable that FastAPI resolves at request time and injects the result into the parameter, enabling dependency injection for things like authentication, database sessions, or settings.",
25
+
"tolerance": "semantic_similar",
26
+
"difficulty": "medium",
27
+
"notes": "Pattern: free-form prose scored by LLM judge. semantic_similar passes at score >= 0.8 via the judge in src/eval/judge.py. Use this for definitions, explanations, and any case where wording can legitimately vary but the underlying claim is checkable."
28
+
},
29
+
{
30
+
"id": "structured-json-status",
31
+
"question": "Return exactly this JSON object and nothing else (no markdown fence, no prose, no trailing newline): {\"ok\": true, \"version\": 1}",
"notes": "Pattern: format adherence on structured output. Models commonly wrap JSON in ```json``` fences or add a preamble; exact_match after normalisation (lowercase + whitespace-collapse) accepts a clean response but rejects the fenced or prose-wrapped version. This is the failure mode you want to catch — downstream parsers break the same way."
Concrete `LLMClient` adapters for the eval harness. The judge in [`src/eval/judge.py`](../judge.py) calls an `LLMClient` Protocol — never a vendor SDK directly. Each adapter in this package implements that Protocol for one provider, so the eval core stays vendor-neutral and a downstream consumer can swap providers by changing one wiring line in their test fixture.
4
+
5
+
## Key interfaces
6
+
7
+
Exported from this package:
8
+
9
+
-**`AzureOpenAIClient`** — implements `src.eval.judge.LLMClient`. Construct from env via `AzureOpenAIClient()`; call `complete(prompt)` for runner `answer_fn` use, `complete_json(*, model, prompt)` for judge use. The `model` argument on `complete_json` is accepted for Protocol conformance and discarded — Azure addresses by deployment name (set at construction time, read from `AZURE_OPENAI_DEPLOYMENT`).
10
+
-**`AzureOpenAIConfigError`** — raised at construction when required env is missing or the optional `openai` extra is not installed. Subclass of `RuntimeError`. The error message names every missing env var in one go so the caller doesn't have to fix-and-retry.
11
+
12
+
## Why this layer exists
13
+
14
+
Without the Protocol seam, swapping LLM providers would mean touching the eval core. With it, vendor lock-in is confined to one file per provider. The layer demonstrates that the harness's "provider-agnostic" claim is structural, not aspirational: the eval core has zero imports of any vendor SDK.
1. Add the SDK to `[project.optional-dependencies]` in `pyproject.toml` — either to the existing `eval` extra or a new provider-scoped one.
25
+
2. Add the SDK's top-level module to `[[tool.mypy.overrides]]` with `ignore_missing_imports = true`, matching the existing `openai.*` / `opentelemetry.*` entries. This keeps mypy clean on stock `uv sync --extra dev` checkouts.
26
+
3. Implement `complete_json(*, model: str, prompt: str) -> str` per the `LLMClient` Protocol in [`src/eval/judge.py`](../judge.py). Optionally add a `complete(prompt: str) -> str` for use as an `EvalRunner.answer_fn`.
27
+
4.**Lazy-import the SDK inside `__init__`** so the adapter module remains importable without the optional extra installed. The import error path should raise a clear, named exception (e.g. `AzureOpenAIConfigError`) telling the reader which `uv sync --extra ...` to run.
28
+
5. Read configuration from environment variables at construction time. Raise the same named exception listing every missing var when env is incomplete — fail fast, fail clear.
29
+
6. Add an offline unit test in [`tests/`](../../../tests/) that mocks the SDK at the `sys.modules` level (see `tests/test_eval_azure_openai_adapter.py` for the pattern). This keeps the unit suite credential-free; live-credential paths are exercised by [`eval/test_golden_patterns.py`](../../../eval/test_golden_patterns.py).
30
+
7. Document the env contract in this README's table above and in [`docs/EVAL_HARNESS.md`](../../../docs/EVAL_HARNESS.md)'s "Worked patterns" section.
31
+
32
+
## Why adapters live under `src/eval/`
33
+
34
+
The import-linter contract in `pyproject.toml` puts `src.eval` at the top of the layered import order:
35
+
36
+
```
37
+
api | eval -> agent -> tools -> data -> observability -> models
38
+
```
39
+
40
+
Adapters can therefore depend on anything in `src/`; nothing in `src/` depends on them. That asymmetry is exactly what the layered architecture exists to encode — vendor-specific code stays at the boundary, never leaks down into the eval primitives or the model layer.
0 commit comments