feat: eval harness scaffold (runner, judge, report, models) + 1 example golden case + nightly workflow_dispatch

## Problem

The eval flywheel is the harness's most distinctive feature for LLM-driven systems. Documenting the pattern without scaffolding code leaves users to reinvent it.

## Proposed solution

- Port `src/eval/{__main__.py, runner.py, judge.py, report.py, models.py}` from Teller.
- Generalise the LLM client via the `LLM_PROVIDER` config from #18 (no Azure/OpenAI lock-in).
- Add `eval/golden_qa.json` with one trivial example: `{"id": "echo-hello", "question": "echo hello", "expected": "hello", "tolerance": "exact_match"}`.
- Add an `eval/test_golden_qa.py` runner using `@pytest.mark.eval`.
- Port `.github/workflows/eval-nightly.yml`, but replace `schedule` with a `workflow_dispatch`-only trigger.
- Document opt-in (adding LLM secrets, re-enabling `schedule`) in `docs/EVAL_HARNESS.md` (#25).
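As a rough illustration of how the example golden case could be evaluated without any LLM call, here is a minimal sketch. It assumes `eval/golden_qa.json` holds a JSON array of cases shaped like the one above; `GoldenCase`, `load_cases`, `echo_system`, and `passes` are hypothetical names for illustration, not the Teller implementation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden case; field names mirror the example JSON object."""
    id: str
    question: str
    expected: str
    tolerance: str


def load_cases(raw: str) -> list[GoldenCase]:
    # Assumption: the file is a JSON array of case objects.
    return [GoldenCase(**item) for item in json.loads(raw)]


def echo_system(question: str) -> str:
    # Hypothetical stand-in for the system under test:
    # "echo hello" -> "hello".
    prefix = "echo "
    return question[len(prefix):] if question.startswith(prefix) else question


def passes(case: GoldenCase, answer: str) -> bool:
    # Only `exact_match` is handled here; any other tolerance would
    # presumably be routed to the LLM judge.
    if case.tolerance == "exact_match":
        return answer.strip() == case.expected
    raise NotImplementedError(f"tolerance {case.tolerance!r} needs a judge")


# Inline copy of the example case so the sketch runs without a file on disk.
GOLDEN_QA = """
[{"id": "echo-hello", "question": "echo hello",
  "expected": "hello", "tolerance": "exact_match"}]
"""

cases = load_cases(GOLDEN_QA)
results = {case.id: passes(case, echo_system(case.question)) for case in cases}
```

In the real scaffold, `eval/test_golden_qa.py` would presumably parametrise a test marked `@pytest.mark.eval` over the loaded cases rather than building a dict.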

## Acceptance criteria

- [ ] `uv run pytest eval/` runs the example case and passes (no LLM call needed for `exact_match`).
- [ ] `eval-nightly.yml` only triggers on `workflow_dispatch`.
- [ ] Inline comment in the workflow explains how to flip on `schedule:` once secrets are configured.
- [ ] `src/eval/judge.py` LLM-judge interface is provider-agnostic (calls an `LLMClient` Protocol).
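One way the provider-agnostic criterion could look in practice is sketched below with `typing.Protocol`. The method name `complete`, the `FakeClient` stub, the prompt wording, and the PASS/FAIL convention are all assumptions made for illustration; the issue only specifies that the judge calls an `LLMClient` Protocol.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    """Structural interface: any provider wrapper with this method fits."""

    def complete(self, prompt: str) -> str:  # hypothetical method name
        ...


class FakeClient:
    """Test double that returns a canned reply; no network, no secrets."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def judge(client: LLMClient, question: str, expected: str, answer: str) -> bool:
    # Assumed convention: the judge model replies starting with PASS or FAIL.
    prompt = (
        "Reply PASS if the answer matches the expected answer, else FAIL.\n"
        f"Question: {question}\nExpected: {expected}\nAnswer: {answer}"
    )
    verdict = client.complete(prompt)
    return verdict.strip().upper().startswith("PASS")


ok = judge(FakeClient("PASS"), "echo hello", "hello", "hello")
not_ok = judge(FakeClient("FAIL: answer differs"), "echo hello", "hello", "bye")
```

Because the Protocol is structural, a concrete client selected through `LLM_PROVIDER` would not need to inherit from `LLMClient`; it only needs a compatible method.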

## Priority rationale

High: the eval harness is the hook that makes this template stand out for LLM-system engineers.

## Depends on

#18, #19

