feat: add substitution contract to digital twin#3
Conversation
Self-review fixes for PR #3: - synthesize.py: type-guard legacy spec sections with _safe_dict so malformed (non-dict) operating_model/decision_policy/etc. no longer crash the compatibility renderer. Replace citation-as-because with prose reasons. Filter destructive verbs (force-push, publish, release, delete, rm, drop, truncate, deploy-to-prod) from legacy decide_alone before lifting into autonomous_authority. Introduce needs_compatibility_defaults helper that catches non-dict garbage in substitution sections. - pushback-detector.py: warn on unreadable/non-dict state file and rescan instead of crashing on setdefault. - extract-twin-spec.py: log stderr WARN on load_text/load_json failures and reject non-dict analysis JSON before mutating. - tests: add six negative tests (non-dict sections, destructive decide_alone filter, prose-only because, partial-population compat detection, corrupt state file recovery). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danielbentes
left a comment
There was a problem hiding this comment.
Self-Review Summary (Path B, single-session)
Findings: P1: 2 → fixed, P2: 7 → 5 fixed + 2 deferred, P3: 12 → 1 fixed + 11 deferred
Six parallel reviewers (security, code-quality, conventions, test-runner, error-handling, holdout-validation) + main-thread requirements review against issue #2 acceptance criteria.
P1 — Fixed in ab89d59
| # | Category | Location | Issue | Fix |
|---|---|---|---|---|
| ERR-1 | error-handling | skills/digital-twin/scripts/synthesize.py:1685-1691 |
_legacy_substitution_fields used spec.get(k) or {} which only protects against None/falsy. A legacy spec with non-dict operating_model (string, list, int) crashed downstream renderers — but the path was specifically meant to handle malformed legacy specs. |
Added _safe_dict() helper; all section reads now isinstance(...)-guard before .get(). |
| ERR-2 | error-handling | skills/digital-twin/scripts/pushback-detector.py:267-274 |
State-file json.load swallowed errors with bare pass. If the file contained a non-dict (corrupted/external edit), state.setdefault(...) crashed with AttributeError, leaving the detector permanently broken until manually reset. |
Validate isinstance(loaded, dict) after load; warn to stderr on any failure and rescan from scratch. Also type-guard inner offsets/seen_hashes. |
P2 — Fixed in ab89d59
| # | Category | Location | Issue | Fix |
|---|---|---|---|---|
| SEC-2 | security/authority-boundary | synthesize.py:1752 |
_legacy_substitution_fields lifted decision_policy.decide_alone directly into autonomous_authority and substitution.md without policy filter. A legacy spec listing destructive items (force-push, publish releases, delete branches, drop tables) silently became authorized autonomous actions. |
New _filter_destructive_authority() deny-list filter applied to both autonomous_authority and the judgment_rules preferred_action that quotes decide_alone. Patterns: force-push, publish, release, delete, rm, rm -, drop , truncate, deploy-to-prod, merge to main. |
| F1 | correctness | synthesize.py:1700,1709,1716 |
The legacy backfill assigned citation strings (e.g. "quality.md §6") to because fields, which the constitution renderer prints as "- Because: quality.md §6" — nonsensical user-facing output that conflated citation with rationale. |
because now carries prose explaining why each value matters; citations stay in the evidence field where the schema intends. |
| ERR-5 | correctness/missing-validation | synthesize.py:2541-2552 |
twin_spec_compat_defaults used not twin_spec.get(k) which doesn't flag a key present with a truthy non-dict value (e.g. "TBD" string). The renderer then crashed when reading pol.get(...) on the non-dict. |
New needs_compatibility_defaults() helper checks isinstance(value, dict) and value. normalize_twin_spec_for_rendering now overwrites non-dict sections rather than preserving garbage. |
| ERR-3 | error-handling/silent-failure | pushback-detector.py:270-271 |
Bare except: pass swallowed corrupt state-file errors with no log line. |
Bundled with ERR-2: stderr WARN on every load failure with the underlying exception text. |
| ERR-4 | error-handling/silent-failure | extract-twin-spec.py:38-52,72 |
load_text/load_json returned empty defaults on OSError/JSONDecodeError with no signal; user couldn't tell which analysis file was unreadable. build_stats_packet also called .pop() on results without type-checking. |
stderr WARN on every load failure including the path; build_stats_packet rejects non-dict results before mutating. |
| Conv-P2 | convention | PR title | Title lacked feat: prefix; squash-merge would land non-conventional history. |
Edited PR title to feat: add substitution contract to digital twin. |
P2 — Deferred
| # | Issue | Why deferred |
|---|---|---|
| SEC-1 | Compatibility-derived authority filling happens before schema validation, so a legacy spec without the four new sections always becomes a "full" substituting agent. | SEC-2 deny-list closes the concrete attack surface (destructive items in decide_alone). The remaining "should legacy specs auto-derive substitution authority at all" question is a product decision, not a defect — the existing compatibility-derived substitution defaults status banner is the documented mitigation, and a strict-mode flag is a follow-up worth its own issue. |
| Holdout-P2 | Proposal approval guard is enforced only at the LLM-instruction layer in commands/propose-rules.md:74, with no automated test. |
The guard reads as a workflow contract for the LLM running the command. Automating it would need a parser for proposal frontmatter — worth a follow-up issue. |
| F3 | category_scores aggregates twin_total/twin_max across cases with different denominators when optional concept_groups/forbidden_phrases are present. |
Today's fixture passes uniformly so the metric isn't misleading in practice; converting to a per-case ratio average is a measurement change worth its own commit with a regression case. |
| F2 | compatibility-derived status doesn't trigger on partially-populated-but-malformed new sections. |
Now partially fixed via ERR-5 (non-dict garbage → backfill). The remaining case (truthy dict with missing inner required fields) falls through to validation, which already produces a clear error message and degraded twin. |
P3 — Deferred (mostly polish or pre-existing)
SEC-3user_name sanitization — local env-var, low-impact; tracking as follow-up.SEC-4extra adversarial test coverage — partially addressed by the 6 new tests; broader negative suite is a follow-up.F4avoid_phrases/forbidden_phrasesnaming overlap;F5"majority" in concept-group and forbidden;F6pushback_trigger_hit_rateAND-gates avoidance with recovery;F7slug collision risk;F9pre-existingAPPROVAL_WORDSincludes"sounds";F10shallow-copy semantics;F8twin_spec_compat_defaultscentralization (resolved by ERR-5 helper).ERR-6--mock-response-fileerror message clarity;ERR-8–ERR-12finer-grained silent-failure surface.Conv-P3no labels on PR; branch name lacks issue id (linked viaCloses #2body).
Requirements Adherence (Issue #2 — holdout-validation skill)
All 10 acceptance criteria SUPPORTED by file evidence:
| AC | Criterion | Evidence |
|---|---|---|
| 1 | Substitution role explicit | references/subagent-template.md:18-20; twin-spec-schema.json:76-88 |
| 2 | Principles + trust behavior, not only always/never | twin-spec-schema.json:39-75,89-100 |
| 3 | Durable rules include rationale + when-to-apply | twin-spec-schema.json:229-235,249-255; synthesize.py:2114-2122 |
| 4 | Brief agent for non-trivial task | twin-spec-schema.json:259-267; fixture agent-brief-nontrivial |
| 5 | Review agent plan as user would | Fixture review-agent-plan-overbroad; agent_supervision_policy.review_actions |
| 6 | Handle ambiguous context with priors | Fixture unknown-project; project_routing.unknown_project_behavior |
| 7 | CLAUDE guidance principle-first ordering | claude-patch-template.md:17-25 |
| 8 | Proposal loop principle-oriented | pushback-detector.py:208-213 scaffold sections |
| 9 | Held-out multi-agent + not solely keyword | 9 cases, 4 multi-agent, concept_groups + forbidden_phrases |
| 10 | Doc explains delegate vs reserved authority | examples/sample-CLAUDE-md-patch.md:35-37; schema :78 |
Quality Gates
| Check | Result |
|---|---|
pytest |
38 passed (was 32; +6 negative tests) |
python3 -m py_compile on 5 changed .py files |
Pass |
python3 -m json.tool on schema + heldout fixture |
Pass |
evaluate-twin.py heldout_cases.json |
n=9, twin_win_rate=1.0, all category_scores=1.0 |
ruff check skills/digital-twin/scripts tests |
No issues |
git diff --check |
Clean |
Recommendation
Self-review complete; fix-forward commit landed. Ready for /flow:pr or merge once any external reviewer (e.g. CI) signs off.
CI unblock: - synthesize.py: annotate detail_fields with tuple[tuple[str,str], ...] so mypy accepts variable-length tuple assignments. Security: - SEC-1: new --strict-substitution flag refuses legacy backfill; legacy v1 specs missing substitution sections now emit a degraded twin instead of compatibility-derived authority when the flag is set. - SEC-3: sanitize_user_name strips control chars, caps at 64, restricts to [A-Za-z0-9 ._-]; updated XSS test to assert the stronger allow-list behavior. Correctness: - F2: needs_compatibility_defaults catches non-dict garbage; remaining partial-population case routes through validation → degraded path. - F3 + ERR-11: category_scores averages per-case ratios so fat cases with optional concept/forbidden checks do not dilute thinner cases in the same category; added pushback_trigger_avoidance_rate metric. - F4: docstring documents the role split between avoid_phrases (trigger-coupled) and forbidden_phrases (universal). - F5: replaced "majority" in concept_group so it no longer collides with the forbidden_phrases entry for the same case. - F6: removed avoidance AND-gate from pushback_trigger_hit_rate; avoidance now reported in its own rollup. - F7: proposal name now carries an 8-char content-hash suffix so two proposals starting with the same sentence (or empty replies that collapse to "rule") no longer collide. Skip empty replies. - F9: "sounds" alone no longer classifies as approval; require a two-token pattern via SOUNDS_APPROVAL_RE. - F10: deepcopy in normalize_twin_spec_for_rendering so inner-dict mutation does not leak to callers retaining the original spec. Error handling: - ERR-6: --mock-response-file now reports "path not found" with the failing path instead of generic JSON-decode failure. - ERR-8: warn to stderr when pushback-detector cannot stat or open a session file. - ERR-9: count and report corrupt JSONL lines. - ERR-10: load_schema error includes the schema path. Holdout-P2: - New proposal_ready_for_approval helper detects unfilled scaffold sections so the propose-rules guard is now testable. Tests: 13 new tests covering all of the above; pytest 49/49, mypy clean, ruff clean, eval harness clean. Conv-P3: added 'enhancement' label to PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danielbentes
left a comment
There was a problem hiding this comment.
Follow-up Review (Cycle 2)
CI failure addressed plus every deferred finding from cycle 1.
CI fix
- mypy
synthesize.py:2185,2187—detail_fieldswas assigned variable-length tuple shapes; annotated astuple[tuple[str, str], ...]with explicitall_detail_fieldssource. Localmypy skills/digital-twin/scripts tests→ No issues.
Cycle-1 deferred — all addressed in c9a... follow-up commit
| ID | Category | Resolution |
|---|---|---|
| SEC-1 | security/authority-boundary | New --strict-substitution CLI flag — legacy v1 specs missing substitution sections emit a degraded twin instead of compatibility-derived authority. Test: test_strict_substitution_flag_refuses_legacy_backfill. |
| SEC-3 | security/input-validation | sanitize_user_name strips control chars, caps at 64, restricts to [A-Za-z0-9 ._-]. Applied at parse time. Updated XSS test to assert the stronger allow-list. Tests: test_sanitize_user_name_strips_newlines_and_caps_length, test_synthesize_sanitizes_user_name_into_rendered_output. |
| SEC-4 | security/test-coverage | 6 new adversarial tests added (newline injection, oversized name, empty/symbol-only, destructive decide_alone filter, corrupt state file, missing mock file). |
| Holdout-P2 | testability | New proposal_ready_for_approval(body) helper in pushback-detector.py parses scaffold headers + detects _Fill in_ placeholders. Tests: test_proposal_ready_for_approval_rejects_unfilled_scaffold, test_proposal_ready_for_approval_accepts_filled_body, test_proposal_ready_for_approval_detects_missing_section. |
| F2 | correctness/missing-validation | Cumulative with ERR-5: needs_compatibility_defaults catches non-dict garbage; partial-population (dict with empty inner arrays) routes through schema validation → degraded twin (existing test_synthesize_invalid_twin_spec_degrades). |
| F3 / ERR-11 | correctness | category_scores now averages per-case ratios instead of summing across mixed denominators. Added pushback_trigger_avoidance_rate separate from pushback_trigger_hit_rate. Test: test_eval_category_score_normalizes_per_case_ratio. |
| F4 | naming/docs | Added docstring in score_response explaining avoid_phrases (trigger-coupled) vs forbidden_phrases (universal). |
| F5 | fixture | Removed "majority" from agent-disagreement-consolidation concept_group so it no longer collides with the forbidden entry. |
| F6 | correctness | pushback_trigger_hit_rate no longer AND-gates avoidance with recovery — recovery tracked alone, avoidance reported in pushback_trigger_avoidance_rate. Test: test_eval_pushback_trigger_avoidance_separated_from_recovery. |
| F7 | correctness | Proposal name: frontmatter now carries an 8-char content-hash suffix so identical leading sentences (or empty replies that collapse to "rule") no longer collide. Empty replies are now skipped before reaching proposal_body. Test: test_proposal_body_includes_content_hash_suffix. |
| F9 | correctness | Removed "sounds" from APPROVAL_WORDS; added SOUNDS_APPROVAL_RE requiring sounds (good|right|fine|great|reasonable|like a plan). Test: test_pushback_detector_sounds_bad_is_not_approval. |
| F10 | maintainability | normalize_twin_spec_for_rendering now copy.deepcopy() the spec — callers retaining a reference no longer see backfilled keys mutated in. |
| ERR-6 | error-handling | --mock-response-file checks Path.exists() and prints ERROR: --mock-response-file path not found: {path} before any JSON parsing. Test: test_extract_twin_spec_missing_mock_file_fails_clearly. |
| ERR-8 | error-handling | WARN: could not stat / open {fpath}: {exc} for session-file scan failures in pushback-detector. |
| ERR-9 | error-handling | corrupt_lines counter; WARN: skipped N corrupt JSONL line(s) printed when > 0. |
| ERR-10 | error-handling | load_schema error includes the schema path: {schema_path}: {error}. |
| Conv-P3 | convention | Added enhancement label via gh pr edit 3 --add-label enhancement. (Branch name is an existing slash-style topic branch — renaming would disrupt the open PR; deferred to convention guidance for new branches.) |
| ERR-12 | error-handling/pre-existing | Pre-existing single-line YAML description regex — out of scope for this PR (no new behavior dependent on it); flagged for a follow-up issue. |
Quality Gates
| Check | Result |
|---|---|
pytest |
49 passed (was 38; +11 new tests this cycle, +6 fixture-prep) |
mypy skills/digital-twin/scripts tests |
No issues found |
ruff check skills/digital-twin/scripts tests |
No issues found |
python3 -m compileall |
clean |
bash -n shell scripts |
clean |
evaluate-twin.py heldout_cases.json |
n=9, twin_win_rate=1.0, trigger_hit=1.0, trigger_avoid=1.0, all 8 category_scores=1.0 |
Previous Feedback Status
| Cycle | Finding | Priority | Claimed Status | Verified |
|---|---|---|---|---|
| 1 | ERR-1 dict guards | P1 | resolved | ✓ synthesize.py:1727 _safe_dict |
| 1 | ERR-2 state-file isinstance | P1 | resolved | ✓ pushback-detector.py:274-289 |
| 1 | SEC-2 deny-list | P2 | resolved | ✓ _filter_destructive_authority, test test_filter_destructive_authority_blocks_legacy_decide_alone_items |
| 1 | F1 because/evidence | P2 | resolved | ✓ test test_legacy_substitution_because_is_prose_not_citation |
| 1 | ERR-3/4 silent failures | P2 | resolved | ✓ stderr WARNs in extract-twin-spec.py |
| 1 | ERR-5 compat non-dict | P2 | resolved | ✓ needs_compatibility_defaults, test test_normalize_twin_spec_repairs_non_dict_sections |
| 1 | Conv-P2 PR title | P2 | resolved | ✓ title now feat: add substitution contract to digital twin |
| 1 | SEC-1 | P2 | deferred → resolved this cycle | ✓ --strict-substitution flag |
| 1 | Holdout-P2 | P2 | deferred → resolved this cycle | ✓ proposal_ready_for_approval helper + 3 tests |
| 1 | F3 category_scores | P2 | deferred → resolved this cycle | ✓ per-case ratio average |
| 1 | F2 partial population | P2 | deferred → cumulative resolution | ✓ via ERR-5 + validation path |
| 1 | All P3 (SEC-3/4, F4-F10, ERR-6/8/9/10) | P3 | deferred → resolved this cycle | ✓ |
| 1 | Conv-P3 labels | P3 | deferred → resolved this cycle | ✓ enhancement label added |
Recommendation
All P1/P2/P3 findings from cycle 1 resolved. CI mypy failure fixed. Ready for merge once new CI run passes.
Resolution log (self-review)Both
No findings escalated. Quality gates: pytest 49/49, mypy clean, ruff clean, eval harness clean, CI passing. |
Summary
Closes #2
Verification
Review notes