Problem
The eval slice is the thinnest part of the harness:
- eval/golden_qa.json contains a single toy case (echo-hello).
- The nightly eval workflow is disabled by default.
- docs/EVAL_HARNESS.md documents the protocol but doesn't frame the "why would I add a case?" question.
A reader expecting an LLM-eval story finds the infrastructure in place but little evidence that it is exercised in earnest.
Proposed solution
Pick one of the following:
Option A — Add real cases. Add 3–4 realistic golden cases (e.g. a Q/A pair, a code-generation prompt with a deterministic expected substring, a tool-use trace check). Wire the nightly workflow to run them by default. Update EVAL_HARNESS.md with a "how to add a case" section.
Option B — Reframe as scaffold. Add a paragraph at the top of EVAL_HARNESS.md explicitly framing the eval slice as "scaffold for your project's eval cases, not a benchmark." Document the contract for adding cases. Keep the toy case as a smoke test and explain why the nightly is opt-in.
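Both options depend on a clear contract for what a golden case looks like. The following is a minimal sketch of such a contract, not the harness's actual implementation. The field names (`id`, `prompt`, `expect_substring`), the helper functions, and the `generate` callable are all hypothetical, since the real schema of eval/golden_qa.json is not shown here.

```python
import json

# Hypothetical case schema; the real eval/golden_qa.json format may differ.
EXAMPLE_CASES = json.loads("""
[
  {"id": "echo-hello", "prompt": "Say hello", "expect_substring": "hello"},
  {"id": "codegen-add", "prompt": "Write a Python function add(a, b)", "expect_substring": "def add("}
]
""")

# Assumed contract: every case carries these three non-empty string fields.
REQUIRED_KEYS = {"id", "prompt", "expect_substring"}


def validate_case(case):
    """Return a list of contract violations for one case (empty if valid)."""
    problems = []
    missing = REQUIRED_KEYS - set(case)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in sorted(REQUIRED_KEYS & set(case)):
        if not isinstance(case[key], str) or not case[key]:
            problems.append(f"{key} must be a non-empty string")
    return problems


def check_case(case, output):
    """Pass when the expected substring appears in the output, ignoring case."""
    return case["expect_substring"].lower() in output.lower()


def run_cases(cases, generate):
    """Run each case through `generate`, a stand-in for the real model call."""
    return {case["id"]: check_case(case, generate(case["prompt"])) for case in cases}
```

A deterministic substring check keeps each case cheap and reproducible, which suits a nightly run. Under Option B the same validation could serve as the documented contract, with the toy case remaining as the smoke test.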
Acceptance criteria
Priority rationale
The eval slice is currently the weakest part of the harness story; either option closes the gap.