Objective
Define how one logical eval run expands into engine executions and judge passes, and how those results are reported.
Priority
P1 — Should Fix
Details
The current eval flow assumes a single engine and a single judge pass per case. The design backlog raises two further questions: whether the same case can run across multiple engines (matrix mode), and whether LLM-as-judge results need stabilization through repeated judging or majority vote. Both questions affect the same execution and reporting model, so they should be resolved together.
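To make the two questions concrete, here is a minimal sketch of how one logical run might expand and how repeated judge verdicts might be reduced. Everything in it is hypothetical and not part of the current eval flow: the names `JudgeTask`, `expand_run`, and `majority_verdict`, the assumption that an engine output is produced once per (case, engine) pair and reused across judge passes, and the strict-majority rule that reports no verdict on a tie.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JudgeTask:
    """One judge pass over the output of one (case, engine) execution (hypothetical)."""
    case_id: str
    engine: str
    judge_pass: int


def expand_run(case_ids, engines, judge_passes=1):
    """Expand one logical run into judge tasks.

    Assumption: each (case, engine) pair is executed once, and its output
    is judged `judge_passes` times.
    """
    if judge_passes < 1:
        raise ValueError("judge_passes must be at least 1")
    return [
        JudgeTask(case_id, engine, judge_pass)
        for case_id, engine, judge_pass in product(
            case_ids, engines, range(judge_passes)
        )
    ]


def majority_verdict(verdicts):
    """Return the verdict held by a strict majority, or None if there is none."""
    if not verdicts:
        return None
    verdict, count = Counter(verdicts).most_common(1)[0]
    # Require more than half of the passes to agree; ties yield None.
    return verdict if count * 2 > len(verdicts) else None
```

With the defaults (one engine, one judge pass) this reduces to the current single-execution behavior, which is one way to keep reports comparable across runs. How ties or missing majorities should appear in the report is left open here.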
Acceptance Criteria
- Matrix mode is either specified or explicitly deferred
- The default judge execution policy is defined
- If advanced execution modes are supported, the CLI and report format are defined clearly
- Report semantics remain interpretable and comparable across runs
Notes
Source: work/eval-design-discussion.md