Add deterministic multi-family admissibility artifact generation #136

Merged
ProfRandom92 merged 3 commits into main from codex/add-multi-family-admissibility-artifact on May 19, 2026
@ProfRandom92
Owner

Motivation

  • Provide a deterministic, committed benchmark artifact containing one degradation curve for each manifest-registered family that exposes the standard levels (baseline, mild, moderate, severe), so that multi-family comparisons can be reproduced reliably.

Description

  • Add scripts/generate_multi_family_admissibility_artifact.py, which loads fixtures/manifest.json, discovers families in a deterministic order, and keeps only the families that include all four standard levels. For each remaining family it uses DegradationCurveGenerator.fixtures_for_manifest_family(...) to build one curve with curve_id set to <family>_curve_v1, then writes a stable JSON payload to artifacts/multi_family_admissibility_results.json.
  • Add the committed artifact artifacts/multi_family_admissibility_results.json, which contains lexicographically sorted families and one curve per family, with no timestamps or environment-dependent fields. The existing single-family artifact artifacts/layered_admissibility_results.json is left unmodified.
  • Add focused tests in tests/test_multi_family_admissibility_artifact.py that validate: the generated payload matches the committed one, family ordering is deterministic, both current families are present, each family has four points in manifest level order, the coding_workflow_pr_review curve is compatible with the existing single-family artifact, and repeated generation is stable.
  • Extend tests/test_artifact_reproducibility.py to cover the new multi-family artifact, and add an npm script, generate:multi-family-admissibility, that wraps the Python generator.
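
The generation flow described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the manifest layout (a top-level "families" mapping with a "levels" list per family), the build_curve callable standing in for DegradationCurveGenerator.fixtures_for_manifest_family(...), and the top-level payload keys are all assumptions made for the example.

```python
import json
from typing import Callable

STANDARD_LEVELS = ("baseline", "mild", "moderate", "severe")


def eligible_families(manifest: dict) -> list[str]:
    """Return family names exposing every standard level, sorted lexicographically.

    Assumes a hypothetical manifest shape: {"families": {name: {"levels": [...]}}}.
    """
    families = manifest.get("families", {})
    return sorted(
        name
        for name, entry in families.items()
        if set(STANDARD_LEVELS) <= set(entry.get("levels", []))
    )


def build_payload(manifest: dict, build_curve: Callable[[str], list]) -> dict:
    """Build one curve per eligible family; build_curve is a hypothetical stand-in."""
    names = eligible_families(manifest)
    return {
        "families": names,
        "curves": [
            {"curve_id": f"{name}_curve_v1", "family": name, "points": build_curve(name)}
            for name in names
        ],
    }


def serialize(payload: dict) -> str:
    """Serialize with sorted keys and fixed formatting so the output text is stable."""
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"
```

Sorting the family names and serializing with sort_keys=True means the output does not depend on dictionary insertion order, and leaving out timestamps keeps repeated runs identical.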

Testing

  • Ran the generator and unit tests: python scripts/generate_multi_family_admissibility_artifact.py produced artifacts/multi_family_admissibility_results.json, and pytest tests/test_multi_family_admissibility_artifact.py -q passed (6 tests).
  • Verified deterministic artifact reproducibility with pytest tests/test_artifact_reproducibility.py -q, which passed and confirmed regeneration parity against the committed artifact.
  • Ran broader validation: pytest tests/test_degradation_curve_generator.py -q, pytest tests/test_manifest_fixture_families.py -q, and npm run check all completed successfully, with the repository tests passing and npm run check returning without errors.
  • All automated checks executed during validation completed successfully.
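
The regeneration-parity check mentioned above amounts to generating the payload more than once and comparing each result with the committed file. A minimal sketch, assuming a hypothetical generate() callable that returns the serialized artifact text (the actual tests in this PR may be structured differently):

```python
from pathlib import Path
from typing import Callable


def check_reproducible(generate: Callable[[], str], committed: Path, runs: int = 3) -> bool:
    """Return True if every regeneration matches the committed artifact text."""
    expected = committed.read_text(encoding="utf-8")
    # generate is a hypothetical stand-in for the real generator entry point.
    return all(generate() == expected for _ in range(runs))
```

Comparing the full text rather than parsed JSON also catches formatting drift such as changed indentation or key order.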

Codex Task

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new multi-family admissibility results artifact, a Python script that generates it, and a suite of tests that check schema stability and reproducibility. It also adds an npm script to trigger generation. The review feedback flags a data-quality issue in the incident_response_page_triage family: its 'mild' and 'moderate' degradation points are currently identical, so the underlying fixtures may need updating to provide a more granular degradation curve.

Comment on lines +155 to +178
    {
      "expected_admissible": false,
      "failed_contracts": [
        "no_orphan_mitigation_steps",
        "rollback_reachable"
      ],
      "failure_labels": [
        "INVARIANT_VIOLATION",
        "RECOVERY_PATH_INVALID"
      ],
      "fixture_id": "incident_response_page_triage_moderate_v1",
      "fixture_path": "fixtures/incident_response_page_triage_moderate_v1",
      "fixture_version": "1.0.0",
      "governance_score": 1.0,
      "observed_admissible": false,
      "operational_score": 1.0,
      "overall_admissibility_score": 0.8333333333333334,
      "passed_contracts": [
        "alert_ack_before_mitigation",
        "root_cause_links_incident"
      ],
      "relational_score": 0.3333333333333333,
      "structural_score": 1.0
    },
medium

The moderate degradation point for the incident_response_page_triage family is identical to the mild point (lines 131-154) in scores, failed contracts, and failure labels. A benchmark artifact intended to represent a degradation curve should ideally show a progressive loss of quality from one level to the next. Please check whether the fixtures for this family need to be updated to provide more granular degradation steps.
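
One way to catch this kind of plateau automatically is to compare adjacent points on a curve after dropping the fields that only identify the fixture. The sketch below is an illustration only: the identity field names are taken from the snippet above, while the helper names and the assumption that a curve is an ordered list of point dictionaries are hypothetical.

```python
IDENTITY_FIELDS = {"fixture_id", "fixture_path", "fixture_version"}


def _signature(point: dict) -> dict:
    """Everything about a point except the fields naming its fixture."""
    return {k: v for k, v in point.items() if k not in IDENTITY_FIELDS}


def flat_steps(points: list[dict]) -> list[tuple[str, str]]:
    """Return (earlier_id, later_id) pairs whose scores, contracts, and labels match."""
    return [
        (a["fixture_id"], b["fixture_id"])
        for a, b in zip(points, points[1:])
        if _signature(a) == _signature(b)
    ]
```

A test could assert that flat_steps(points) is empty for every family, which would fail for a curve whose mild and moderate points differ only in their fixture identifiers.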

@ProfRandom92 ProfRandom92 merged commit 764aaa4 into main May 19, 2026
4 checks passed