Merged
98 changes: 98 additions & 0 deletions docs/benchmarks/multi_family_admissibility_benchmark.md
@@ -0,0 +1,98 @@
# Deterministic Multi-Family Admissibility Benchmark

## Purpose

The deterministic multi-family admissibility benchmark tracks operational admissibility degradation across fixture families registered in the manifest.

Each manifest-registered fixture family contributes one deterministic degradation curve using the same standard levels, so contributors can compare progression behavior across families without changing scoring rules or artifact shape.

## Pipeline

```mermaid
flowchart LR
A[fixtures/manifest.json]
B["DegradationCurveGenerator.fixtures_for_manifest_family(...)"]
C[AdmissibilityScorer]
D[artifacts/multi_family_admissibility_results.json]
E[Reproducibility and progression tests]

A --> B --> C --> D --> E
```

### Pipeline notes

1. `fixtures/manifest.json` is the source of truth for which fixture families participate.
2. `DegradationCurveGenerator.fixtures_for_manifest_family(...)` resolves fixtures for each family from manifest registration.
3. `AdmissibilityScorer` computes exact admissibility component outcomes for each level.
4. Results are written to `artifacts/multi_family_admissibility_results.json` in a stable deterministic JSON layout.
5. Reproducibility and progression tests validate that the committed artifact matches regenerated output and that the expected level progression holds.
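
The steps above can be sketched as a single loop. This is a minimal illustration, not the repository's implementation: `run_benchmark`, `resolve_fixtures`, and `score_fixture` are hypothetical stand-ins for the manifest lookup, `DegradationCurveGenerator.fixtures_for_manifest_family(...)`, and `AdmissibilityScorer`, whose real signatures are not shown in this document. The sketch also assumes that a level's score is the exact mean of its fixture scores and that families are emitted in sorted order.

```python
from fractions import Fraction
from typing import Callable, Iterable, Sequence

# The four standard levels, in the explicit order the benchmark uses.
LEVELS = ("baseline", "mild", "moderate", "severe")


def run_benchmark(
    families: Iterable[str],
    resolve_fixtures: Callable[[str, str], Sequence[object]],
    score_fixture: Callable[[object], Fraction],
) -> dict:
    """Build one degradation curve per family, visiting levels in fixed order.

    resolve_fixtures and score_fixture are hypothetical callables standing in
    for the real generator and scorer.
    """
    results = {}
    for family in sorted(families):  # assumption: sorted order keeps output stable
        curve = {}
        for level in LEVELS:
            fixtures = resolve_fixtures(family, level)
            if not fixtures:
                raise ValueError(f"no fixtures for {family!r} at level {level!r}")
            scores = [score_fixture(fixture) for fixture in fixtures]
            # Exact rational mean; no floating-point rounding is involved.
            curve[level] = sum(scores, Fraction(0)) / len(scores)
        results[family] = curve
    return results


def demo() -> dict:
    """Run the sketch on toy data where each 'fixture' is already its score."""
    toy = {
        "baseline": [Fraction(1), Fraction(1)],
        "mild": [Fraction(3, 4), Fraction(1, 2)],
        "moderate": [Fraction(1, 2), Fraction(1, 4)],
        "severe": [Fraction(0), Fraction(1, 4)],
    }
    return run_benchmark(
        ["demo_family"],
        resolve_fixtures=lambda family, level: toy[level],
        score_fixture=lambda fixture: fixture,
    )
```

Passing the resolver and scorer in as callables keeps the loop independent of how the real classes are constructed, which this document does not specify.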

## Current fixture families

The current multi-family benchmark includes these manifest-registered families:

- `coding_workflow_pr_review`
- `incident_response_page_triage`

## Standard degradation levels

Every included family is evaluated at exactly four standard levels in explicit order:

1. `baseline`
2. `mild`
3. `moderate`
4. `severe`

## Determinism guarantees

The benchmark is designed to remain deterministic across local runs and CI runs:

- manifest-driven family selection
- explicit level order (`baseline`, `mild`, `moderate`, `severe`)
- exact rational score aggregation
- stable JSON output structure and ordering
- no timestamps or environment-dependent fields
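
As a rough illustration of why these choices matter, the sketch below uses Python's `fractions.Fraction` for exact aggregation and `json.dumps(..., sort_keys=True)` for stable output. The function name `to_stable_json` and the JSON layout are assumptions made for the example; the actual shape of `artifacts/multi_family_admissibility_results.json` is not described here.

```python
import json
from fractions import Fraction

LEVELS = ("baseline", "mild", "moderate", "severe")


def to_stable_json(curves: dict) -> str:
    """Serialize {family: {level: Fraction}} to byte-stable JSON text.

    The layout is hypothetical. Levels are emitted as a list so their order
    is explicit rather than dependent on key sorting.
    """
    payload = {
        "families": [
            {
                "family": family,
                "levels": [
                    # Fractions are written as "numerator/denominator" strings
                    # so no precision is lost in the artifact.
                    {"level": level, "score": str(curves[family][level])}
                    for level in LEVELS
                ],
            }
            for family in sorted(curves)  # family order independent of dict order
        ]
    }
    # sort_keys fixes key order inside each object; no timestamp is added.
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"


# Floats accumulate rounding error, exact rationals do not.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
rational_sum_is_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))
```

With this approach, two runs that build the same curves in a different insertion order still produce identical text, which is what allows a byte-for-byte comparison against a committed artifact.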

## Regeneration commands

Use either command to regenerate the deterministic multi-family artifact:

```bash
python scripts/generate_multi_family_admissibility_artifact.py
```

```bash
npm run generate:multi-family-admissibility
```

## Validation commands

Run the targeted tests plus the repository-wide check entry point:

```bash
pytest tests/test_multi_family_admissibility_artifact.py -q
pytest tests/test_artifact_reproducibility.py -q
pytest tests/test_manifest_fixture_families.py -q
npm run check
```

## Regression protections

The benchmark is protected by deterministic regression checks that enforce:

- committed artifact must match regenerated output
- every family must expose all four standard levels
- baseline and severe behavior is explicitly checked
- mild and moderate behavior must be distinct
- degradation must be progressive:
  - `baseline > mild >= moderate > severe`
Contributor
medium

The progression formula should use a strict inequality (>) between mild and moderate to be consistent with the requirement stated on line 86 that these behaviors must be distinct. Using >= allows for identical scores, which contradicts the stated goal of ensuring distinct behavior across these levels.

Suggested change
- `baseline > mild >= moderate > severe`
+ `baseline > mild > moderate > severe`
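
To make the difference concrete, here is a minimal sketch of both variants of the rule. The function name `is_progressive` and the score values are hypothetical and not taken from the repository's tests. It also assumes that "distinct" behavior means distinct aggregate scores, which the document does not state outright.

```python
from fractions import Fraction


def is_progressive(curve: dict, strict_mild_moderate: bool) -> bool:
    """Check baseline > mild (>= or >) moderate > severe on exact scores."""
    baseline = curve["baseline"]
    mild = curve["mild"]
    moderate = curve["moderate"]
    severe = curve["severe"]
    if strict_mild_moderate:
        middle_ok = mild > moderate   # suggested form
    else:
        middle_ok = mild >= moderate  # form currently in the doc
    return baseline > mild and middle_ok and moderate > severe


# A curve where mild and moderate tie on score.
tied = {
    "baseline": Fraction(1),
    "mild": Fraction(1, 2),
    "moderate": Fraction(1, 2),
    "severe": Fraction(0),
}

passes_non_strict = is_progressive(tied, strict_mild_moderate=False)
passes_strict = is_progressive(tied, strict_mild_moderate=True)
```

Under these assumptions the tied curve satisfies the `>=` form but not the `>` form, so only the strict version also enforces the distinctness requirement on its own.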


## Non-goals

This benchmark intentionally excludes:

- LLM judging
- embeddings
- fuzzy semantic similarity
- runtime orchestration
- deployment/showcase dependencies