test: strengthen eval slice — realistic cases or explicit scaffold framing #94

@constk

Problem

The eval slice is the thinnest part of the harness:

  • eval/golden_qa.json contains a single toy case (echo-hello).
  • The nightly eval workflow is disabled by default.
  • docs/EVAL_HARNESS.md documents the protocol but doesn't frame the "why would I add a case?" question.

A reader expecting an LLM-eval story finds the infrastructure in place but no substantive cases and no stated reason to add any.

Proposed solution

Pick one of the following:

Option A — Add real cases. Add 3–4 realistic golden cases (e.g. a Q/A pair, a code-generation prompt with a deterministic expected substring, a tool-use trace check). Wire the nightly workflow to run them by default. Update EVAL_HARNESS.md with a "how to add a case" section.
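To make the three case types in Option A concrete, here is a minimal sketch of how they might be represented and checked. The schema (`id`, `kind`, `prompt`, `expected`), the `check_case` helper, and the sample prompts are all hypothetical; the actual layout of eval/golden_qa.json is not shown in this issue and may differ.

```python
import json

# Hypothetical case schema; the real format of eval/golden_qa.json may differ.
GOLDEN_CASES = json.loads("""
[
  {"id": "qa-capital", "kind": "exact",
   "prompt": "What is the capital of France?", "expected": "Paris"},
  {"id": "codegen-add", "kind": "substring",
   "prompt": "Write a Python function add(a, b).", "expected": "def add("},
  {"id": "tool-trace", "kind": "tool_trace",
   "prompt": "Look up the weather in Oslo.", "expected": ["get_weather"]}
]
""")


def check_case(case, output, tool_calls=()):
    """Return True if the model output (or tool trace) satisfies the case."""
    kind = case["kind"]
    if kind == "exact":
        # Q/A pair: compare case-insensitively, ignoring surrounding whitespace.
        return output.strip().lower() == case["expected"].strip().lower()
    if kind == "substring":
        # Code generation: require a deterministic substring in the output.
        return case["expected"] in output
    if kind == "tool_trace":
        # Tool use: require the exact ordered list of tool names.
        return list(tool_calls) == case["expected"]
    raise ValueError(f"unknown case kind: {kind!r}")
```

A nightly run would then loop over the cases, call the model, and fail if any `check_case` call returns `False`. Keeping each check deterministic is what makes the cases safe to run by default.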

Option B — Reframe as scaffold. Add a paragraph at the top of EVAL_HARNESS.md explicitly framing the eval slice as "scaffold for your project's eval cases, not a benchmark." Document the contract for adding cases. Keep the toy case as a smoke test and explain why the nightly is opt-in.
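If Option B is chosen, the "contract for adding cases" could be backed by a small validator so the documentation and the check stay in sync. The sketch below assumes a hypothetical contract of three string fields (`id`, `prompt`, `expected`) and unique ids; the field names and the `validate_suite` helper are illustrative, not taken from the repository.

```python
# Hypothetical contract: every case has these fields with these types.
REQUIRED_FIELDS = {"id": str, "prompt": str, "expected": str}


def validate_case(case):
    """Return a list of contract violations for a single case dict."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in case:
            problems.append(f"missing field: {field}")
        elif not isinstance(case[field], expected_type):
            problems.append(f"{field} must be {expected_type.__name__}")
    return problems


def validate_suite(cases):
    """Validate every case and flag duplicate ids across the suite."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"#{index}")
        problems += [f"{label}: {p}" for p in validate_case(case)]
        case_id = case.get("id")
        if isinstance(case_id, str):
            if case_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case_id)
    return problems
```

Under this assumed contract, the existing toy case would pass as a smoke test if it looked like `{"id": "echo-hello", "prompt": "...", "expected": "..."}`, and a project adding its own cases would get a readable list of violations instead of a failure deep inside the eval run.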

Acceptance criteria

  • One of Option A or Option B chosen and implemented
  • EVAL_HARNESS.md tells a reader either how to add cases or why this is a scaffold
  • Nightly workflow status (enabled / opt-in) is consistent with the chosen framing

Priority rationale

The eval slice is currently the weakest part of the harness story; either option closes the gap.

Labels

  • documentation: Improvements or additions to documentation
  • eval: Eval harness scaffolding and golden cases
