Add deterministic multi-family admissibility artifact generation #136
Code Review
This pull request introduces a new multi-family admissibility results artifact, a Python script for its generation, and a suite of tests to ensure schema stability and reproducibility. The changes also include adding a new npm script to trigger the generation process. Review feedback identifies an improvement opportunity regarding the data quality of the incident response triage family, noting that the 'moderate' and 'mild' degradation points are currently identical and suggesting that the underlying fixtures be updated to provide a more granular degradation curve.
{
  "expected_admissible": false,
  "failed_contracts": [
    "no_orphan_mitigation_steps",
    "rollback_reachable"
  ],
  "failure_labels": [
    "INVARIANT_VIOLATION",
    "RECOVERY_PATH_INVALID"
  ],
  "fixture_id": "incident_response_page_triage_moderate_v1",
  "fixture_path": "fixtures/incident_response_page_triage_moderate_v1",
  "fixture_version": "1.0.0",
  "governance_score": 1.0,
  "observed_admissible": false,
  "operational_score": 1.0,
  "overall_admissibility_score": 0.8333333333333334,
  "passed_contracts": [
    "alert_ack_before_mitigation",
    "root_cause_links_incident"
  ],
  "relational_score": 0.3333333333333333,
  "structural_score": 1.0
},
The moderate degradation point for the incident_response_page_triage family is identical to the mild point (lines 131-154) in terms of scores, failed contracts, and failure labels. For a benchmark artifact intended to represent a degradation curve, these levels should ideally show progressive loss of quality. Please check if the fixtures for this family need to be updated to provide more granular degradation steps.
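The concern raised in this comment could be caught automatically. Below is a minimal sketch, not part of the PR, that flags adjacent curve points whose scores, failed contracts, and failure labels are all identical. The field names come from the excerpt above; the helper names `point_signature` and `find_flat_steps`, and the assumption that a curve is available as a list of point dictionaries ordered by degradation level, are hypothetical.

```python
from typing import Any

# Fields compared between adjacent points; names are taken from the excerpt above.
COMPARED_FIELDS = (
    "structural_score",
    "relational_score",
    "operational_score",
    "governance_score",
    "overall_admissibility_score",
    "failed_contracts",
    "failure_labels",
)


def point_signature(point: dict[str, Any]) -> tuple:
    """Reduce a curve point to the fields expected to change as quality degrades."""
    values = (point.get(field) for field in COMPARED_FIELDS)
    # Lists are converted to tuples so signatures compare by value.
    return tuple(tuple(v) if isinstance(v, list) else v for v in values)


def find_flat_steps(points: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (earlier, later) fixture_id pairs whose signatures are identical.

    Assumes `points` is already ordered from least to most degraded.
    """
    return [
        (earlier["fixture_id"], later["fixture_id"])
        for earlier, later in zip(points, points[1:])
        if point_signature(earlier) == point_signature(later)
    ]
```

A test could assert that `find_flat_steps(points)` is empty for every family, which would fail for `incident_response_page_triage` until the mild and moderate fixtures diverge.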
Motivation
- […] `baseline`, `mild`, `moderate`, `severe` levels so multi-family comparisons can be reproduced reliably.
Description
- `scripts/generate_multi_family_admissibility_artifact.py` loads `fixtures/manifest.json`, discovers families deterministically, filters to families that include the standard levels, uses `DegradationCurveGenerator.fixtures_for_manifest_family(...)` to build a curve per family with `curve_id` set to `<family>_curve_v1`, and writes a stable JSON payload to `artifacts/multi_family_admissibility_results.json`.
- `artifacts/multi_family_admissibility_results.json` contains lexicographically sorted families and one curve per family, without timestamps or environment-dependent fields, and without modifying the existing single-family artifact `artifacts/layered_admissibility_results.json`.
- `tests/test_multi_family_admissibility_artifact.py` validates generation against the committed payload, deterministic family ordering, presence of the two current families, four points per family using manifest level order, compatibility of the `coding_workflow_pr_review` curve with the existing single-family artifact, and repeated-generation stability.
- `tests/test_artifact_reproducibility.py` now includes the new multi-family artifact, and an npm script `generate:multi-family-admissibility` wraps the Python generator.
Testing
- `python scripts/generate_multi_family_admissibility_artifact.py` produced `artifacts/multi_family_admissibility_results.json`, and `pytest tests/test_multi_family_admissibility_artifact.py -q` passed (6 tests passed).
- `pytest tests/test_artifact_reproducibility.py -q` passed and confirmed regeneration parity against the committed artifact.
- `pytest tests/test_degradation_curve_generator.py -q`, `pytest tests/test_manifest_fixture_families.py -q`, and `npm run check` completed successfully (repository tests passed and `npm run check` returned without errors).
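For readers unfamiliar with the approach, the determinism and regeneration-parity properties described in this PR can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual generator: the manifest is simplified to a hypothetical mapping from family name to its list of levels, and `select_families`, `render_payload`, and `is_in_sync` are hypothetical helper names. Only the level names and the two family names come from the PR.

```python
import json
from pathlib import Path

# Standard degradation levels named in the PR description.
STANDARD_LEVELS = ("baseline", "mild", "moderate", "severe")


def select_families(levels_by_family: dict[str, list[str]]) -> list[str]:
    """Keep families that define every standard level, in lexicographic order.

    `levels_by_family` is a hypothetical simplification of the manifest.
    """
    return sorted(
        family
        for family, levels in levels_by_family.items()
        if set(STANDARD_LEVELS) <= set(levels)
    )


def render_payload(payload: dict) -> str:
    """Serialize with sorted keys, fixed indentation, and a trailing newline.

    No timestamps or environment values are added, so the same payload
    always renders to the same text.
    """
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"


def is_in_sync(payload: dict, artifact_path: Path) -> bool:
    """True when the committed artifact text equals a fresh render."""
    return artifact_path.read_text(encoding="utf-8") == render_payload(payload)


# Example: a family missing some levels is filtered out.
families = select_families(
    {
        "incident_response_page_triage": list(STANDARD_LEVELS),
        "coding_workflow_pr_review": list(STANDARD_LEVELS),
        "partial_family": ["baseline", "mild"],  # hypothetical family
    }
)
print(families)
```

Comparing rendered text rather than parsed JSON means that key order and whitespace drift also count as a mismatch, which is the stricter reading of "regeneration parity".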