feat: eval harness scaffold (runner, judge, report, models) + 1 example golden case + nightly workflow_dispatch #24

@constk

Problem

The eval flywheel is the harness's most distinctive feature for LLM-driven systems. Documenting the pattern without scaffolding code leaves users to reinvent it.

Proposed solution

  • Port src/eval/{__main__.py, runner.py, judge.py, report.py, models.py} from Teller.
  • Generalise the LLM client via the LLM_PROVIDER config from #18 (no Azure/OpenAI lock-in).
  • Add eval/golden_qa.json with ONE trivial example: {"id": "echo-hello", "question": "echo hello", "expected": "hello", "tolerance": "exact_match"}.
  • Add an eval/test_golden_qa.py runner marked with @pytest.mark.eval.
  • Port .github/workflows/eval-nightly.yml, but replace the schedule trigger with workflow_dispatch only.
  • Document the opt-in steps (adding LLM secrets, re-enabling the schedule) in docs/EVAL_HARNESS.md (#25).
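
To make the exact_match path concrete, here is a minimal sketch of how the example golden case could be loaded and scored without any LLM call. The field names come from the example case above. Everything else is an assumption for illustration, not the Teller code: the GoldenCase dataclass, the load_cases and score helpers, the echo_system stand-in for the system under test, and the choice to store golden_qa.json as a JSON array.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden case; field names mirror the example in this issue."""
    id: str
    question: str
    expected: str
    tolerance: str


def load_cases(raw: str) -> list[GoldenCase]:
    """Parse golden cases from JSON text (assumed here to be a JSON array)."""
    return [GoldenCase(**item) for item in json.loads(raw)]


def echo_system(question: str) -> str:
    """Hypothetical system under test: answers 'echo <text>' with '<text>'."""
    prefix = "echo "
    return question[len(prefix):] if question.startswith(prefix) else question


def score(case: GoldenCase, actual: str) -> bool:
    """Score one answer; only exact_match is sketched, so no LLM is needed."""
    if case.tolerance == "exact_match":
        return actual == case.expected
    # Other tolerances would presumably be routed to the LLM judge.
    raise NotImplementedError(f"tolerance {case.tolerance!r} is not sketched here")


EXAMPLE = (
    '[{"id": "echo-hello", "question": "echo hello", '
    '"expected": "hello", "tolerance": "exact_match"}]'
)

cases = load_cases(EXAMPLE)
results = {case.id: score(case, echo_system(case.question)) for case in cases}
```

In eval/test_golden_qa.py, a loop like the last two lines would sit inside a test function marked with @pytest.mark.eval, asserting that every value in results is true.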

Acceptance criteria

  • uv run pytest eval/ runs the example case and passes (no LLM call needed for exact_match).
  • eval-nightly.yml only triggers on workflow_dispatch.
  • An inline comment in the workflow explains how to re-enable the schedule: trigger once secrets are configured.
  • The LLM-judge interface in src/eval/judge.py is provider-agnostic (it calls an LLMClient Protocol).
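
As a rough illustration of the last criterion, the sketch below shows one way a judge could depend only on a structural LLMClient Protocol. Only the LLMClient name comes from this issue. The complete method, the Verdict type, the prompt wording, the PASS/FAIL reply convention, and the FakeClient test double are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class LLMClient(Protocol):
    """Structural interface: any provider wrapper with this method qualifies."""

    def complete(self, prompt: str) -> str:
        ...


@dataclass(frozen=True)
class Verdict:
    passed: bool
    raw: str  # the unparsed judge reply, kept for the report


def judge(client: LLMClient, question: str, expected: str, actual: str) -> Verdict:
    """Ask the model to grade an answer; assumes the reply starts with PASS or FAIL."""
    prompt = (
        "Grade the answer. Reply with PASS or FAIL followed by a short reason.\n"
        f"Question: {question}\n"
        f"Expected: {expected}\n"
        f"Actual: {actual}\n"
    )
    raw = client.complete(prompt)
    return Verdict(passed=raw.strip().upper().startswith("PASS"), raw=raw)


class FakeClient:
    """Hypothetical test double; satisfies LLMClient without inheriting from it."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


verdict = judge(FakeClient("PASS: answers match"), "echo hello", "hello", "hello")
```

Because the Protocol is structural, a provider-specific client selected via LLM_PROVIDER would only need a matching complete method, and tests can pass a fake without touching the network.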

Priority rationale

High: the eval harness is the hook that makes this template stand out for LLM-system engineers.

Depends on

#18, #19

Metadata

Labels

  • enhancement: New feature or request
  • eval: Eval harness scaffolding and golden cases
  • test: Tests, QA, eval harness
