Problem
The eval flywheel is the harness's most distinctive feature for LLM-driven systems. Documenting the pattern without scaffolding code leaves users to reinvent it.
Proposed solution
Port `src/eval/{__main__.py, runner.py, judge.py, report.py, models.py}` from Teller. Generalise the LLM client via the `LLM_PROVIDER` config from #18 (no Azure/OpenAI lock-in). Add `eval/golden_qa.json` with ONE trivial example: `{"id": "echo-hello", "question": "echo hello", "expected": "hello", "tolerance": "exact_match"}`. Add `eval/test_golden_qa.py` runner using `@pytest.mark.eval`. Port `.github/workflows/eval-nightly.yml` BUT replace `schedule` with `workflow_dispatch`-only. Document opt-in (adding LLM secrets, re-enabling schedule) in `docs/EVAL_HARNESS.md` (#25).
Acceptance criteria
- `uv run pytest eval/` runs the example case and passes (no LLM call needed for `exact_match`).
- `eval-nightly.yml` only triggers on `workflow_dispatch`.
- Docs explain how to re-enable `schedule:` once secrets are configured.
- `src/eval/judge.py` LLM-judge interface is provider-agnostic (calls an `LLMClient` Protocol).
Priority rationale
High: the eval harness is the hook that makes this template stand out for LLM-system engineers.
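For illustration, here is a minimal sketch of how the single golden case could be scored without any LLM call. It is not the Teller code: `GoldenCase`, `load_cases`, `passes`, `run`, and `echo_system` are hypothetical names, and wrapping the example in a JSON array is an assumption about the layout of `eval/golden_qa.json`.

```python
import json
from dataclasses import dataclass
from typing import Callable

# The golden case from the proposed solution, inlined so the sketch runs
# without eval/golden_qa.json on disk. The enclosing array is an assumption.
GOLDEN_QA = (
    '[{"id": "echo-hello", "question": "echo hello", '
    '"expected": "hello", "tolerance": "exact_match"}]'
)


@dataclass(frozen=True)
class GoldenCase:
    """One row of the golden set (hypothetical model, fields from the example)."""

    id: str
    question: str
    expected: str
    tolerance: str


def load_cases(raw: str) -> list[GoldenCase]:
    """Parse a JSON array of golden cases."""
    return [GoldenCase(**item) for item in json.loads(raw)]


def passes(case: GoldenCase, actual: str) -> bool:
    """Score one answer. Only exact_match is handled, so no LLM is called."""
    if case.tolerance == "exact_match":
        return actual == case.expected
    raise NotImplementedError(f"tolerance {case.tolerance!r} would need a judge")


def echo_system(question: str) -> str:
    """Hypothetical system under test: drops a leading 'echo '."""
    return question.removeprefix("echo ")


def run(cases: list[GoldenCase], system: Callable[[str], str]) -> dict[str, bool]:
    """Return a pass/fail verdict per case id."""
    return {c.id: passes(c, system(c.question)) for c in cases}
```

In `eval/test_golden_qa.py` the same check could be driven by `pytest.mark.parametrize` over the loaded cases and tagged with `@pytest.mark.eval`, provided the `eval` marker is registered in the pytest configuration.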
Depends on
#18, #19
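The provider-agnostic judge criterion could look roughly like the sketch below, which uses `typing.Protocol` so that any client with a matching method is accepted. The `complete` method name, the PASS/FAIL prompt format, and `FakeClient` are assumptions made for illustration; the real interface depends on the `LLM_PROVIDER` work in #18.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Structural interface: any object with complete() qualifies (name assumed)."""

    def complete(self, prompt: str) -> str: ...


class FakeClient:
    """Stand-in provider for tests: returns a canned reply and records prompts."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def judge(client: LLMClient, question: str, expected: str, actual: str) -> bool:
    """Ask the model for a PASS/FAIL verdict and read the first word of the reply."""
    prompt = (
        "Reply PASS or FAIL only.\n"
        f"Question: {question}\nExpected: {expected}\nActual: {actual}"
    )
    words = client.complete(prompt).strip().upper().split()
    # An empty reply counts as a failure rather than raising.
    return bool(words) and words[0] == "PASS"
```

Because `judge` depends only on the Protocol, an Azure, OpenAI, or local client can be swapped in without touching `src/eval/judge.py`, and tests can pass `FakeClient("PASS")` to avoid network calls.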