feat: eval pattern examples calling Azure OpenAI by constk · Pull Request #104 · constk/harness-python-react

constk · 2026-05-25T13:57:22Z

What & why

The eval slice previously shipped one toy case (echo-hello) and a disabled nightly. A reader expecting an LLM-eval story found the infrastructure without conviction.

This PR adds four worked-pattern cases that exercise the existing three tolerance modes against a real Azure OpenAI deployment. They are not benchmarks — they demonstrate what an eval case looks like for the four LLM-eval patterns you most often need to write:

Case ID	Tolerance	Pattern demonstrated
`factual-http-200`	`exact_match`	Format-constrained factual recall
`numeric-seconds-per-day`	`numeric_close`	Numeric reasoning with extraction tolerance
`definitional-fastapi-depends`	`semantic_similar`	Free-form prose scored by an LLM judge
`structured-json-status`	`exact_match`	Structured-output adherence (catches markdown-fenced JSON)

When the template is forked for a real project, the four cases get replaced with ones that exercise the project's own prompts; the patterns transfer regardless of what product is bolted on.

Provider choice — Azure OpenAI — is intentionally distinct from the rest of the harness (which uses Claude via Claude Code). Demonstrates that the existing LLMClient Protocol in src/eval/judge.py does its job: the eval core never imports openai, and vendor lock-in lives only in the new adapter.

Closes #94.

Changes

File	Purpose
`src/eval/adapters/azure_openai.py` (new)	`AzureOpenAIClient` implementing `LLMClient`. Lazy SDK import, env-driven config, clear `AzureOpenAIConfigError` on missing config.
`src/eval/adapters/__init__.py` (new)	Package init with rationale docstring.
`eval/golden_patterns.json` (new)	The four worked-pattern cases.
`eval/test_golden_patterns.py` (new)	Pytest runner gated on `AZURE_OPENAI_*` env via `pytestmark` — skipped on stock checkouts.
`pyproject.toml`	New `eval` optional extra (`openai>=1.40.0`), mypy override for `openai.` matching the existing `opentelemetry.` pattern, version bump 0.2.10 → 0.2.11.
`uv.lock`	Regenerated to include the `eval` extra.
`.github/workflows/eval-nightly.yml`	Env vars renamed `LLM_` → `AZURE_OPENAI_`. Header updated with the Azure recipe. `uv sync` now passes `--extra eval`.
`docs/EVAL_HARNESS.md`	New "Worked patterns" section + "Swapping providers" note documenting the Protocol-based extension path.

Test plan

Local gates (all green):

uv run --frozen mypy --strict src/ tests/ — clean on 42 source files (was 31)
uv run --frozen ruff check . — All checks passed
uv run --frozen ruff format --check . — 57 files already formatted
uv run --frozen lint-imports — both contracts kept
uv run --frozen pytest tests/ -q — 192 passed
uv run --frozen pytest eval/ -q — 1 passed, 4 skipped (pattern cases correctly skipped without Azure env)
(Future) Real Azure run via workflow_dispatch on the nightly workflow once secrets are configured

Invariants affected

None. The new adapter sits at the top of the layered import order (src.eval is the top layer); no boundary changes.

New deps / actions / external surface

openai>=1.40.0 — added as an optional extra (uv sync --extra eval). Default uv sync --extra dev doesn't pull it.
New external endpoint: Azure OpenAI (per-deployment URL). Only called from the new test file, only when env vars are set.
No new GitHub Actions; existing eval-nightly.yml action SHAs unchanged.

Linked issue

Closes #94

Addresses two gate failures on #104 surfaced by code review: 1. "Tests required" gate — feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed. 2. "src/ README audit" gate — every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order. Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it. Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green. Refs #94

The eval slice previously shipped one toy case (echo-hello) and a disabled-by-default nightly. A reader expecting an LLM-eval story found the infrastructure without conviction. Adds four worked-pattern cases that exercise the existing three tolerance modes against a real Azure OpenAI deployment. These are not benchmarks — they demonstrate what an eval case *looks like* for the four LLM-eval patterns you most often need to write: - factual-http-200 exact_match format-constrained recall - numeric-seconds-per-day numeric_close numeric reasoning + tolerance - definitional-fastapi-depends semantic_similar free-form judge-scored prose - structured-json-status exact_match structured-output adherence When the template is forked for a real project, replace these four with cases that exercise the project's own prompts; the patterns transfer regardless of what product is bolted on. Provider choice — Azure OpenAI via the openai SDK with AzureOpenAI client — is intentionally distinct from the rest of the harness (which uses Claude via Claude Code). Demonstrates that the LLMClient Protocol in src/eval/judge.py does its job: the eval core never imports openai, vendor lock-in lives only in the adapter. Changes: - src/eval/adapters/azure_openai.py — implements LLMClient via the openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version from env. Lazy-imports the SDK so the module is importable without the optional extra installed; the adapter raises a clear AzureOpenAIConfigError if the env or SDK is missing. - eval/golden_patterns.json — the four cases with notes explaining which pattern each demonstrates. - eval/test_golden_patterns.py — separate test file gated on the Azure env vars via pytestmark. Skipped on a stock checkout, so `uv run pytest eval/` always exits 0. The toy test_golden_qa.py keeps running as before. - pyproject.toml — new optional [project.optional-dependencies] eval extra (just `openai>=1.40.0`), mypy override for openai.* matching the existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11 self-version bump. - .github/workflows/eval-nightly.yml — env vars renamed from the placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated with the Azure setup recipe. uv sync now passes --extra eval. - docs/EVAL_HARNESS.md — new "Worked patterns" section with the table mapping case -> tolerance -> pattern, the local setup recipe, and a "Swapping providers" note documenting the Protocol-based extension path. Local gates: mypy --strict clean on 42 source files (was 31), ruff clean, ruff format clean, import-linter both contracts kept, 192 unit tests pass, eval/ runs 1 passed + 4 skipped without LLM env. Closes #94

Addresses two gate failures on #104 surfaced by code review: 1. "Tests required" gate — feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed. 2. "src/ README audit" gate — every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order. Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it. Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green. Refs #94

src/ README audit gate looks for a `## Key interfaces` (or `## Public surface`) anchor — the existing README had purpose / table / extension recipe / layering rationale, but no exported-names section. Adds a `## Key interfaces` section listing the two exported names: - AzureOpenAIClient — the LLMClient implementation with notes on complete() vs complete_json() and the discarded `model` arg (Azure dispatches by deployment, not model). - AzureOpenAIConfigError — the construction-time error type, noting that it batches every missing env var into a single message instead of failing-and-retrying. Both already documented in the adapter docstrings; this section hoists them to the README anchor the audit gate enforces. Refs #94

…sed post-#103/#104) main moved ahead of develop on 2026-05-25 when PR #86 was merged directly to main rather than via develop -> release flow. The divergence is one squash commit (eff5b1c) carrying: - docs/BEADS.md (optional Beads issue-queue guidance) - .github/pull_request_template.md (Beads PR-template block) - .github/scripts/check_aspirational_tickets.py (PEP 758 reformat) - .github/scripts/check_pin_freshness.py / check_tests_present.py / check_version_bump.py (touch-ups) - .gitattributes / .gitignore (.beads/ ignore, Windows renormalise) - CONTRIBUTING.md (line-ending normalisation) - tests/test_scripts_compile.py (new CI-script compile gate) - docs/DEVELOPMENT.md / docs/HARNESS.md / docs/HARNESS_PRIMER.md cross-refs - pyproject.toml + uv.lock self-version 0.2.10 -> 0.2.11 This PR was rebased after #103 (CVE fix, develop -> 0.2.11) and #104 (eval pattern examples, develop -> 0.2.12) merged. The version on main (0.2.11) is now behind develop (0.2.12); the conflict is resolved by bumping develop -> 0.2.13. After this lands, develop is at 0.2.13 and contains everything main has. Remaining in-flight PRs (#99, #100, #101, #105) need to rebase to bump 0.2.13 -> 0.2.14 (and onward sequentially as they merge). No behaviour change beyond what #86 already added to main. # Conflicts: # pyproject.toml # uv.lock

constk mentioned this pull request May 25, 2026

chore: align develop with main — backport #86 content + version #106

Merged

5 tasks

constk added 4 commits May 26, 2026 15:18

chore: bump version to 0.2.12 (rebase onto develop after #103)

1a32080

constk force-pushed the feat/94-eval-pattern-examples-azure-openai branch from 0d6531a to 1a32080 Compare May 26, 2026 05:20

constk merged commit 18b4d30 into develop May 26, 2026
21 checks passed

constk deleted the feat/94-eval-pattern-examples-azure-openai branch May 26, 2026 05:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: eval pattern examples calling Azure OpenAI#104

feat: eval pattern examples calling Azure OpenAI#104
constk merged 4 commits into
developfrom
feat/94-eval-pattern-examples-azure-openai

constk commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

constk commented May 25, 2026

What & why

Changes

Test plan

Invariants affected

New deps / actions / external surface

Linked issue

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant