[P1] Define advanced execution policy for eval engines and judges

## Objective
Define how one logical eval run expands into engine executions and judge passes, and how those results are reported.

## Priority
P1 — Should Fix

## Details
The current eval flow assumes a single engine and a single judge pass per case, but the design backlog also asks whether the same case can run across multiple engines and whether LLM-as-judge results need stabilization via repeated judging or majority vote. These questions affect the same execution and reporting model and should be resolved together.
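
To make the two open questions concrete, here is a minimal sketch of how one logical run could expand into per-engine executions and repeated judge passes, with a majority vote to stabilize verdicts. Every name in it (`Execution`, `expand_run`, `majority_verdict`, the `"inconclusive"` tie result) is hypothetical and used only for illustration. None of it reflects an existing implementation or a decision already made in this issue.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Execution:
    """One unit of work: a case run on one engine and judged in one pass (hypothetical)."""
    case_id: str
    engine: str
    judge_pass: int


def expand_run(case_ids, engines, judge_passes=1):
    """Expand one logical run into (case, engine, judge pass) executions.

    With a single engine and judge_passes=1 this reduces to the current
    one-engine, one-pass behaviour described above.
    """
    return [
        Execution(case_id, engine, judge_pass)
        for case_id, engine, judge_pass in product(case_ids, engines, range(judge_passes))
    ]


def majority_verdict(verdicts):
    """Return the most common verdict, or "inconclusive" on a tie for first place."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts).most_common()
    # A tie between the top two verdicts is reported explicitly (assumed policy).
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "inconclusive"
    return ranked[0][0]
```

Under this sketch, two cases across two engines with three judge passes produce twelve executions, and an odd number of passes avoids ties when the verdict is binary. Whether ties should be surfaced as their own outcome or broken some other way is one of the points this issue would need to settle.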

## Acceptance Criteria
- Matrix mode is either specified or explicitly deferred
- The default judge execution policy is defined
- If advanced execution modes are supported, the CLI and report format are defined clearly
- Report semantics remain interpretable and comparable across runs
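
One possible way to keep reports comparable, sketched under assumptions: each report records the execution policy it was produced with, and two reports are compared only when their judge settings match. The `ExecutionPolicy` fields, the `"single"` and `"majority"` aggregation labels, and both helper functions are hypothetical placeholders, not a proposed final format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPolicy:
    """Execution settings stored alongside a report (hypothetical shape)."""
    engines: tuple
    judge_passes: int
    aggregation: str  # assumed labels, e.g. "single" or "majority"


def comparable(policy_a, policy_b):
    """Treat two reports as comparable only if they were judged the same way."""
    return (
        policy_a.judge_passes == policy_b.judge_passes
        and policy_a.aggregation == policy_b.aggregation
    )


def shared_engines(policy_a, policy_b):
    """Engines present in both runs, which is where a per-engine comparison makes sense."""
    return sorted(set(policy_a.engines) & set(policy_b.engines))
```

For example, a run judged once per case and a run judged three times with majority vote would not be treated as comparable here, even if they share engines. A real design might instead allow the comparison and flag the difference in the report.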

## Notes
Source: work/eval-design-discussion.md

Issue: #31