feat: eval pattern examples calling Azure OpenAI #104

Merged: constk merged 4 commits into develop from feat/94-eval-pattern-examples-azure-openai on May 26, 2026

constk (Owner) commented May 25, 2026

What & why

The eval slice previously shipped one toy case (echo-hello) and a disabled nightly. A reader expecting an LLM-eval story found the infrastructure but no worked example to make it convincing.

This PR adds four worked-pattern cases that exercise the existing three tolerance modes against a real Azure OpenAI deployment. They are not benchmarks — they demonstrate what an eval case looks like for the four LLM-eval patterns you most often need to write:

| Case ID | Tolerance | Pattern demonstrated |
| --- | --- | --- |
| `factual-http-200` | `exact_match` | Format-constrained factual recall |
| `numeric-seconds-per-day` | `numeric_close` | Numeric reasoning with extraction tolerance |
| `definitional-fastapi-depends` | `semantic_similar` | Free-form prose scored by an LLM judge |
| `structured-json-status` | `exact_match` | Structured-output adherence (catches markdown-fenced JSON) |

When the template is forked for a real project, the four cases get replaced with ones that exercise the project's own prompts; the patterns transfer regardless of what product is bolted on.
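
To make the three tolerance modes concrete, here is a minimal sketch of how such checks could be written. Only the mode names (`exact_match`, `numeric_close`, `semantic_similar`) come from this PR; the function signatures, the whitespace trimming, the number-extraction regex, the default tolerance and threshold, and the `judge` callable are all assumptions made for illustration, not the harness's actual implementation.

```python
import math
import re
from collections.abc import Callable

# Hypothetical pattern: first integer or decimal, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def exact_match(actual: str, expected: str) -> bool:
    """Pass only when the trimmed strings are identical (trimming is an assumption)."""
    return actual.strip() == expected.strip()


def numeric_close(actual: str, expected: float, rel_tol: float = 1e-6) -> bool:
    """Extract the first number from free text and compare within a relative tolerance."""
    match = _NUMBER.search(actual)
    if match is None:
        return False
    value = float(match.group().replace(",", ""))
    return math.isclose(value, expected, rel_tol=rel_tol)


def semantic_similar(
    actual: str,
    expected: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.8,
) -> bool:
    """Delegate scoring to a judge callable (assumed to return a score in [0, 1])."""
    return judge(actual, expected) >= threshold


# A fenced reply fails exact_match, which is what the structured-output case relies on.
fenced_reply = '```json\n{"status": "ok"}\n```'
fenced_passes = exact_match(fenced_reply, '{"status": "ok"}')
prose_number_passes = numeric_close("There are 86,400 seconds in a day.", 86400)
```

In this sketch the strictness of `exact_match` is the point: a model that wraps otherwise correct JSON in a markdown fence fails the case, while `numeric_close` tolerates surrounding prose and separators.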

The provider choice, Azure OpenAI, is intentionally distinct from the rest of the harness (which uses Claude via Claude Code). It demonstrates that the existing LLMClient Protocol in src/eval/judge.py does its job: the eval core never imports openai, and vendor lock-in lives only in the new adapter.
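
The Protocol-based separation can be sketched as follows. The names `LLMClient`, `complete`, and `complete_json` appear in this PR, but the method signatures shown here are assumptions, and `CannedClient` and `run_case` are hypothetical stand-ins rather than code from the repository.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Structural interface the eval core depends on (signatures are assumed)."""

    def complete(self, prompt: str, model: str) -> str: ...

    def complete_json(self, prompt: str, model: str) -> str: ...


class CannedClient:
    """Hypothetical provider stand-in; a real adapter would call a vendor SDK here."""

    def __init__(self, reply: str) -> None:
        self._reply = reply

    def complete(self, prompt: str, model: str) -> str:
        # The model argument is accepted but unused, much as an adapter that
        # dispatches by deployment might discard it.
        return self._reply

    def complete_json(self, prompt: str, model: str) -> str:
        # Fall back to an empty JSON object when there is no content.
        return self._reply or "{}"


def run_case(client: LLMClient, prompt: str, model: str) -> str:
    """Eval-core style function: it sees only the Protocol, never a vendor SDK."""
    return client.complete(prompt, model).strip()


answer = run_case(CannedClient(" 200 "), "HTTP status code for OK?", "unused-model")
```

Because `LLMClient` is a structural type, `CannedClient` satisfies it without inheriting from it, so swapping providers would mean writing another class with the same two methods.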

Closes #94.

Changes

| File | Purpose |
| --- | --- |
| `src/eval/adapters/azure_openai.py` (new) | `AzureOpenAIClient` implementing `LLMClient`. Lazy SDK import, env-driven config, clear `AzureOpenAIConfigError` on missing config. |
| `src/eval/adapters/__init__.py` (new) | Package init with rationale docstring. |
| `eval/golden_patterns.json` (new) | The four worked-pattern cases. |
| `eval/test_golden_patterns.py` (new) | Pytest runner gated on `AZURE_OPENAI_*` env via `pytestmark`; skipped on stock checkouts. |
| `pyproject.toml` | New `eval` optional extra (`openai>=1.40.0`), mypy override for `openai.*` matching the existing `opentelemetry.*` pattern, version bump 0.2.10 → 0.2.11. |
| `uv.lock` | Regenerated to include the `eval` extra. |
| `.github/workflows/eval-nightly.yml` | Env vars renamed from `LLM_*` to `AZURE_OPENAI_*`. Header updated with the Azure recipe. `uv sync` now passes `--extra eval`. |
| `docs/EVAL_HARNESS.md` | New "Worked patterns" section + "Swapping providers" note documenting the Protocol-based extension path. |
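
As an illustration of env-driven config that reports every missing variable in one error, a resolver might look roughly like this. `AzureOpenAIConfigError` is named in this PR; the specific variable names beyond the `AZURE_OPENAI_` prefix, the `AzureConfig` dataclass, the `resolve_config` function, and the default API version are placeholders assumed for the sketch.

```python
from collections.abc import Mapping
from dataclasses import dataclass


class AzureOpenAIConfigError(RuntimeError):
    """Raised at construction time when required configuration is absent."""


# Assumed variable names; the PR only states the AZURE_OPENAI_* prefix.
_REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)
_DEFAULT_API_VERSION = "2024-06-01"  # placeholder default, not taken from the PR


@dataclass(frozen=True)
class AzureConfig:
    endpoint: str
    api_key: str
    deployment: str
    api_version: str


def resolve_config(env: Mapping[str, str]) -> AzureConfig:
    """Collect all missing variables first so the user fixes them in one pass."""
    # Empty strings count as missing, e.g. an unset CI secret expanding to "".
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise AzureOpenAIConfigError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return AzureConfig(
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        api_key=env["AZURE_OPENAI_API_KEY"],
        deployment=env["AZURE_OPENAI_DEPLOYMENT"],
        # An empty override falls back to the default as well.
        api_version=env.get("AZURE_OPENAI_API_VERSION") or _DEFAULT_API_VERSION,
    )
```

Passing the environment as a `Mapping` rather than reading `os.environ` inside the function is a design choice of this sketch; it keeps the resolver testable offline with a plain dict.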

Test plan

Local gates (all green):

  • uv run --frozen mypy --strict src/ tests/ — clean on 42 source files (was 31)
  • uv run --frozen ruff check . — All checks passed
  • uv run --frozen ruff format --check . — 57 files already formatted
  • uv run --frozen lint-imports — both contracts kept
  • uv run --frozen pytest tests/ -q — 192 passed
  • uv run --frozen pytest eval/ -q — 1 passed, 4 skipped (pattern cases correctly skipped without Azure env)
  • (Future) Real Azure run via workflow_dispatch on the nightly workflow once secrets are configured

Invariants affected

None. The new adapter sits at the top of the layered import order (src.eval is the top layer); no boundary changes.

New deps / actions / external surface

  • openai>=1.40.0 — added as an optional extra (uv sync --extra eval). Default uv sync --extra dev doesn't pull it.
  • New external endpoint: Azure OpenAI (per-deployment URL). Only called from the new test file, only when env vars are set.
  • No new GitHub Actions; existing eval-nightly.yml action SHAs unchanged.

Linked issue

Closes #94

constk added a commit that referenced this pull request May 25, 2026
Addresses two gate failures on #104 surfaced by code review:

1. "Tests required" gate — feat: prefix declared a behaviour change
   but tests/ had no test for the new adapter (the eval/-side test
   only runs with live Azure credentials). Adds
   tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases
   covering _resolve_config (defaults, override, empty-string
   fallback, missing-env error listing), the constructor (env
   wiring, explicit API version, missing-env, missing-SDK), and the
   two SDK call paths (complete_json structured-output mode,
   complete user-message dispatch, null-content returns "" / "{}").

   The SDK is mocked at sys.modules level so the test never hits the
   network and never requires the openai extra to be installed.

2. "src/ README audit" gate — every src/ package needs a README.md
   per CLAUDE.md. Adds src/eval/adapters/README.md documenting the
   layer's purpose, the current adapter, a 7-step "adding a new
   adapter" recipe, and why the layer lives at the top of the import
   order.

Also applies the reviewer's non-blocking sentinel-string suggestion:
the magic "azure-deployment" string passed as judge_model in
eval/test_golden_patterns.py is now the named constant
_AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner
threads it through but the Azure adapter discards it.

Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on
43 source files, ruff/format/import-linter all green.

Refs #94
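
The offline approach described in that commit, mocking the SDK at the `sys.modules` level, can be sketched with the standard library alone. The helper `load_sdk_client_class` is hypothetical and only stands in for an adapter's lazy import; the real tests in tests/test_eval_azure_openai_adapter.py may be structured differently.

```python
import importlib
import sys
from types import ModuleType
from unittest import mock


def load_sdk_client_class():
    """Lazy import: the SDK is resolved only when a client is actually needed."""
    module = importlib.import_module("openai")
    return module.AzureOpenAI


# Build a fake "openai" module so the real SDK need not be installed.
fake_openai = ModuleType("openai")
fake_openai.AzureOpenAI = mock.MagicMock(name="AzureOpenAI")

with mock.patch.dict(sys.modules, {"openai": fake_openai}):
    client_class = load_sdk_client_class()
    client = client_class(
        azure_endpoint="https://example.invalid",
        api_key="test-key",
        api_version="2024-06-01",
    )

# A None entry in sys.modules makes the import fail, simulating a missing extra.
with mock.patch.dict(sys.modules, {"openai": None}):
    try:
        load_sdk_client_class()
        sdk_missing_detected = False
    except ImportError:
        sdk_missing_detected = True
```

`mock.patch.dict` restores `sys.modules` when each block exits, so the fake module does not leak into other tests.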
constk added 4 commits May 26, 2026 15:18
The eval slice previously shipped one toy case (echo-hello) and a
disabled-by-default nightly. A reader expecting an LLM-eval story
found the infrastructure without conviction.

Adds four worked-pattern cases that exercise the existing three
tolerance modes against a real Azure OpenAI deployment. These are
not benchmarks — they demonstrate what an eval case *looks like* for
the four LLM-eval patterns you most often need to write:

  - factual-http-200             exact_match       format-constrained recall
  - numeric-seconds-per-day      numeric_close     numeric reasoning + tolerance
  - definitional-fastapi-depends semantic_similar  free-form judge-scored prose
  - structured-json-status       exact_match       structured-output adherence

When the template is forked for a real project, replace these four
with cases that exercise the project's own prompts; the patterns
transfer regardless of what product is bolted on.

Provider choice — Azure OpenAI via the openai SDK with AzureOpenAI
client — is intentionally distinct from the rest of the harness
(which uses Claude via Claude Code). Demonstrates that the LLMClient
Protocol in src/eval/judge.py does its job: the eval core never
imports openai, vendor lock-in lives only in the adapter.

Changes:

  - src/eval/adapters/azure_openai.py — implements LLMClient via the
    openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version
    from env. Lazy-imports the SDK so the module is importable without
    the optional extra installed; the adapter raises a clear
    AzureOpenAIConfigError if the env or SDK is missing.

  - eval/golden_patterns.json — the four cases with notes explaining
    which pattern each demonstrates.

  - eval/test_golden_patterns.py — separate test file gated on the
    Azure env vars via pytestmark. Skipped on a stock checkout, so
    `uv run pytest eval/` always exits 0. The toy test_golden_qa.py
    keeps running as before.

  - pyproject.toml — new optional [project.optional-dependencies] eval
    extra (just `openai>=1.40.0`), mypy override for openai.* matching
    the existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11
    self-version bump.

  - .github/workflows/eval-nightly.yml — env vars renamed from the
    placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated
    with the Azure setup recipe. uv sync now passes --extra eval.

  - docs/EVAL_HARNESS.md — new "Worked patterns" section with the
    table mapping case -> tolerance -> pattern, the local setup
    recipe, and a "Swapping providers" note documenting the
    Protocol-based extension path.

Local gates: mypy --strict clean on 42 source files (was 31), ruff
clean, ruff format clean, import-linter both contracts kept, 192
unit tests pass, eval/ runs 1 passed + 4 skipped without LLM env.

Closes #94
src/ README audit gate looks for a `## Key interfaces` (or `## Public
surface`) anchor — the existing README had purpose / table /
extension recipe / layering rationale, but no exported-names section.

Adds a `## Key interfaces` section listing the two exported names:

  - AzureOpenAIClient — the LLMClient implementation with notes on
    complete() vs complete_json() and the discarded `model` arg
    (Azure dispatches by deployment, not model).
  - AzureOpenAIConfigError — the construction-time error type,
    noting that it batches every missing env var into a single
    message instead of failing-and-retrying.

Both already documented in the adapter docstrings; this section
hoists them to the README anchor the audit gate enforces.

Refs #94
constk force-pushed the feat/94-eval-pattern-examples-azure-openai branch from 0d6531a to 1a32080 on May 26, 2026 05:20
constk merged commit 18b4d30 into develop on May 26, 2026
21 checks passed
constk deleted the feat/94-eval-pattern-examples-azure-openai branch on May 26, 2026 05:21
constk added a commit that referenced this pull request May 26, 2026
…sed post-#103/#104)

main moved ahead of develop on 2026-05-25 when PR #86 was merged
directly to main rather than via develop -> release flow. The
divergence is one squash commit (eff5b1c) carrying:

  - docs/BEADS.md (optional Beads issue-queue guidance)
  - .github/pull_request_template.md (Beads PR-template block)
  - .github/scripts/check_aspirational_tickets.py (PEP 758 reformat)
  - .github/scripts/check_pin_freshness.py / check_tests_present.py /
    check_version_bump.py (touch-ups)
  - .gitattributes / .gitignore (.beads/ ignore, Windows renormalise)
  - CONTRIBUTING.md (line-ending normalisation)
  - tests/test_scripts_compile.py (new CI-script compile gate)
  - docs/DEVELOPMENT.md / docs/HARNESS.md / docs/HARNESS_PRIMER.md
    cross-refs
  - pyproject.toml + uv.lock self-version 0.2.10 -> 0.2.11

This PR was rebased after #103 (CVE fix, develop -> 0.2.11) and
#104 (eval pattern examples, develop -> 0.2.12) merged. The version
on main (0.2.11) is now behind develop (0.2.12); the conflict is
resolved by bumping develop -> 0.2.13.

After this lands, develop is at 0.2.13 and contains everything main
has. Remaining in-flight PRs (#99, #100, #101, #105) need to rebase
to bump 0.2.13 -> 0.2.14 (and onward sequentially as they merge).

No behaviour change beyond what #86 already added to main.

# Conflicts:
#	pyproject.toml
#	uv.lock