Harden deterministic operational failure taxonomy registration and coverage#141

Merged: ProfRandom92 merged 1 commit into main from codex/harden-failure-taxonomy-for-replay-validation on May 19, 2026

@ProfRandom92
Owner

Motivation

  • Ensure every replay failure label has explicit, deterministic operational semantics (observable trigger, linked contract/invariant, severity, and non-goal) and that all labels used by fixtures and artifacts are registered and enforced by tests.

Description

  • Added a canonical in-code failure taxonomy registry at src/validation/failure_taxonomy.py that enumerates labels with operational_meaning, observable_trigger, contract_or_invariant_type, severity_class, and non_goal, plus a BANNED_FUZZY_TERMS guard.
  • Added human-readable guidance at docs/failure_taxonomy.md describing the canonical source, required fields per label, non-goals, and preferred hardened labels (including TOOL_ORDER_VIOLATION, RECOVERY_PATH_LOSS, BLOCKER_DETACHMENT, GOVERNANCE_DRIFT, DEPENDENCY_CHAIN_BREAK, EVIDENCE_SURVIVAL_LOSS, and HIGH_CRITICAL_EVIDENCE_LOSS).
  • Added deterministic tests in tests/test_failure_taxonomy.py that assert fixture and artifact-emitted labels are registered, that every registered label includes all required operational fields, and that banned fuzzy terms are not present.
  • Changed files: src/validation/failure_taxonomy.py, docs/failure_taxonomy.md, tests/test_failure_taxonomy.py.
  • Risks: taxonomy entries are curated in code, so adding new labels to fixtures or artifacts without updating the registry will break tests.
  • Next: update src/validation/failure_taxonomy.py first when introducing new failure labels, and extend the registry and tests in the same PR as any fixture or artifact changes.
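The registry shape and the two coverage checks described above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/validation/failure_taxonomy.py: the field names and the TOOL_ORDER_VIOLATION label come from the description, while REQUIRED_FIELDS, missing_fields, unregistered_labels, and every field value are hypothetical.

```python
from typing import Final

# Field names come from the PR description; the constant name is an assumption.
REQUIRED_FIELDS: Final[tuple[str, ...]] = (
    "operational_meaning",
    "observable_trigger",
    "contract_or_invariant_type",
    "severity_class",
    "non_goal",
)

# Hypothetical entry: the label exists in the PR, the field values are invented.
FAILURE_TAXONOMY: Final[dict[str, dict[str, str]]] = {
    "TOOL_ORDER_VIOLATION": {
        "operational_meaning": "Replayed tool calls occur in a different order than the recorded trace.",
        "observable_trigger": "First index at which replayed and recorded tool names differ.",
        "contract_or_invariant_type": "ordering_invariant",
        "severity_class": "high",
        "non_goal": "Does not judge whether the recorded order was itself correct.",
    },
}


def missing_fields(taxonomy: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each label to the required fields it lacks or leaves blank."""
    gaps: dict[str, list[str]] = {}
    for label, spec in taxonomy.items():
        absent = [name for name in REQUIRED_FIELDS if not spec.get(name, "").strip()]
        if absent:
            gaps[label] = absent
    return gaps


def unregistered_labels(used: set[str], taxonomy: dict[str, dict[str, str]]) -> set[str]:
    """Return labels seen in fixtures or artifacts that the registry does not define."""
    return used - set(taxonomy)
```

A test in this style would assert that both helpers return empty results, which is why adding a label to a fixture without registering it first fails the suite.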

Testing

  • Ran pytest tests/test_failure_taxonomy.py -q, which passed (4 tests).
  • Ran pytest tests/test_manifest_fixture_families.py -q, which passed (3 tests).
  • Ran pytest tests/test_multi_family_admissibility_artifact.py -q, which passed (6 tests).
  • Ran the full project checks via npm run check, which runs layout, typecheck, validate, build, and the full pytest suite; the run completed without errors, with 222 tests passed.

Codex Task

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a deterministic operational failure taxonomy for replay admissibility validation, consisting of a documentation guide, a Python registry of failure labels, and a test suite to enforce field requirements and naming conventions. Feedback focuses on resolving redundancies between similar labels (such as evidence and recovery path losses), ensuring recursive artifact discovery in tests for full coverage, and extending the validation of banned fuzzy terms to the operational definitions to maintain deterministic semantics.

    "confusion",
)

FAILURE_TAXONOMY: Final[dict[str, dict[str, str]]] = {
medium

The taxonomy contains several redundant labels that overlap in operational meaning and trigger conditions. For example, EVIDENCE_SURVIVAL_LOSS (line 52) vs EVIDENCE_LOSS (line 150), and RECOVERY_PATH_LOSS (line 24) vs RECOVERY_PATH_INVALID (line 73). To maintain a truly canonical and deterministic registry, these should be unified or the legacy versions should be explicitly marked as deprecated/aliases to avoid confusion for future fixture development.
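One way to act on this comment is an explicit alias table that maps each legacy label to its canonical replacement. The sketch below is only an assumption about how that could look: DEPRECATED_ALIASES and canonical_label are hypothetical names, and treating EVIDENCE_LOSS and RECOVERY_PATH_INVALID as the legacy side is a guess based on the description listing the other two as preferred hardened labels.

```python
from typing import Final

# Hypothetical alias table: legacy label -> canonical label.
# Which side is "legacy" is an assumption, not something the PR states.
DEPRECATED_ALIASES: Final[dict[str, str]] = {
    "EVIDENCE_LOSS": "EVIDENCE_SURVIVAL_LOSS",
    "RECOVERY_PATH_INVALID": "RECOVERY_PATH_LOSS",
}


def canonical_label(label: str) -> str:
    """Resolve a possibly deprecated label to its canonical name."""
    return DEPRECATED_ALIASES.get(label, label)


def is_deprecated(label: str) -> bool:
    """Report whether a label should no longer appear in new fixtures."""
    return label in DEPRECATED_ALIASES
```

With a table like this, existing fixtures keep resolving while a test can reject deprecated labels in newly added ones.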


def _collect_artifact_failure_labels() -> set[str]:
    labels: set[str] = set()
    for path in sorted((ROOT / "artifacts").glob("*.json")):
medium

The current glob pattern only searches the top-level artifacts directory. Since fixtures are searched recursively (line 18), artifacts should likely be searched recursively as well to ensure full coverage of all generated labels.

Suggested change
-    for path in sorted((ROOT / "artifacts").glob("*.json")):
+    for path in sorted((ROOT / "artifacts").glob("**/*.json")):
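The difference between the two patterns can be checked in isolation. The sketch below builds a throwaway directory with one top-level and one nested JSON file; the directory layout, file names, and the json_names helper are invented for the demonstration and do not reflect the repository's real artifacts.

```python
import tempfile
from pathlib import Path


def json_names(root: Path, pattern: str) -> list[str]:
    """Return sorted POSIX-style paths, relative to root, that match the pattern."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: artifacts/top.json and artifacts/nested/deep.json
    artifacts = Path(tmp) / "artifacts"
    (artifacts / "nested").mkdir(parents=True)
    (artifacts / "top.json").write_text("{}")
    (artifacts / "nested" / "deep.json").write_text("{}")

    top_only = json_names(artifacts, "*.json")      # direct children only
    recursive = json_names(artifacts, "**/*.json")  # direct children and all subdirectories

print(top_only)   # ['top.json']
print(recursive)  # ['nested/deep.json', 'top.json']
```

Because "**" also matches zero directories, the recursive pattern still picks up top-level files, so the suggested change only widens coverage.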

Comment on lines +74 to +78
def test_registered_labels_do_not_use_banned_fuzzy_terms() -> None:
    for label in FAILURE_TAXONOMY:
        normalized = label.lower()
        for banned in BANNED_FUZZY_TERMS:
            assert banned not in normalized, f"label '{label}' contains banned fuzzy term '{banned}'"
medium

This test only validates the label names. To fully enforce the project's goal of deterministic operational semantics, the operational_meaning field should also be checked for banned fuzzy terms to prevent non-deterministic language from creeping into the taxonomy definitions.

Suggested change
-def test_registered_labels_do_not_use_banned_fuzzy_terms() -> None:
-    for label in FAILURE_TAXONOMY:
-        normalized = label.lower()
-        for banned in BANNED_FUZZY_TERMS:
-            assert banned not in normalized, f"label '{label}' contains banned fuzzy term '{banned}'"
+def test_registered_labels_do_not_use_banned_fuzzy_terms() -> None:
+    for label, spec in FAILURE_TAXONOMY.items():
+        check_texts = [label.lower(), spec.get("operational_meaning", "").lower()]
+        for banned in BANNED_FUZZY_TERMS:
+            for text in check_texts:
+                assert banned not in text, f"label '{label}' contains banned fuzzy term '{banned}'"
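A variant of this idea, sketched below under stated assumptions, scans the label and every field of each entry and reports where a banned term was found, so a failure message can name the offending field rather than only the label. The fuzzy_term_violations helper and the sample entries are hypothetical; only "confusion" is visible in the diff context, so the real BANNED_FUZZY_TERMS tuple is assumed to contain more terms than shown here.

```python
from typing import Final

# Only "confusion" is visible in the reviewed diff; the real tuple is longer.
BANNED_FUZZY_TERMS: Final[tuple[str, ...]] = ("confusion",)


def fuzzy_term_violations(
    taxonomy: dict[str, dict[str, str]],
    banned: tuple[str, ...] = BANNED_FUZZY_TERMS,
) -> list[tuple[str, str, str]]:
    """Return (label, field, term) for every banned term found by substring match."""
    violations: list[tuple[str, str, str]] = []
    for label, spec in taxonomy.items():
        # Treat the label name itself as a pseudo-field called "label".
        texts = {"label": label, **spec}
        for field, text in texts.items():
            lowered = text.lower()
            for term in banned:
                if term in lowered:
                    violations.append((label, field, term))
    return violations


# Hypothetical entries used only to exercise the helper.
sample = {
    "AGENT_CONFUSION": {"operational_meaning": "Agent selects a tool outside its contract."},
    "GOVERNANCE_DRIFT": {"operational_meaning": "General confusion about policy."},
}
print(fuzzy_term_violations(sample))
```

A test built on this helper would assert that the returned list is empty for the real registry.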

@ProfRandom92 ProfRandom92 merged commit 973c0f1 into main May 19, 2026
4 checks passed