Enhance digital twin functionality with behavioral spec extraction #1
- Updated README.md to reflect the addition of Phase 4.6, which extracts a compact behavioral `twin-spec.json` for the operational contract of the sub-agent.
- Revised init.md and update.md to include the new Phase 4.6 in the pipeline, ensuring the behavioral spec is generated and utilized.
- Introduced new generated rule files in the output, enhancing the CLAUDE.md patch with user-level rules for preferences, workflows, verification, and recovery.
- Improved the extraction scripts to prioritize full user messages over truncated cache rows, ensuring higher quality data for analysis.
- Updated methodology documentation to clarify the purpose and output of the new behavioral twin spec extraction phase.

This commit enhances the digital twin's ability to accurately reflect user behavior and preferences, improving the overall functionality and user experience.
danielbentes left a comment
Review findings:
- P1 `commands/init.md:102-111` runs `extract-twin-spec` before Phase 5, but `extract-twin-spec.py:68-99` consumes plan, convergence, and memory inventory. First-run specs therefore omit the exact deep-source evidence the twin is supposed to encode.
- P2 `extract-corpus.py:204-215` drops every `last-prompt` row in a session if any full `user`/`human` row exists. The stated behavior is to prefer full user messages only when they represent the same turn. Count-only validation on the local corpus showed 12,503 mixed-session `last-prompt` rows, with 9,161 not matching any full user text, so this can silently discard real behavior evidence.
- P2 `extract-twin-spec.py:174-202` defines only shallow validation while the schema at `references/twin-spec-schema.json` is much stricter. A malformed nested spec still exits 0 and gets rendered as complete, which can produce incomplete policy sections without triggering degraded mode.
Verification run: py_compile passed, pytest -q passed with 18 tests, and evaluate-twin.py on heldout_cases.json reported twin_win_rate=1.0 and pushback_trigger_hit_rate=1.0.
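The P2 corpus finding above contrasts session-scope dropping with turn-level matching. The sketch below illustrates the turn-level rule under stated assumptions: the row fields (`session_id`, `type`, `text`), the function names, and the `min_prefix` threshold are hypothetical and are not taken from `extract-corpus.py`. A `last-prompt` row is dropped only when it exactly matches, or is a truncated prefix of, a full `user`/`human` message in the same session.

```python
from collections import defaultdict

# Hypothetical truncation markers; the real cache format may differ.
TRUNCATION_MARKERS = ("...", "…")


def is_duplicate_of_full(cached: str, full_texts: list[str], min_prefix: int = 20) -> bool:
    """Return True if a cached row repeats a full message (exactly or as a prefix)."""
    raw = cached.strip()
    if not raw:
        return False
    # Remove a trailing truncation marker before the prefix comparison.
    stem = raw
    for marker in TRUNCATION_MARKERS:
        if stem.endswith(marker):
            stem = stem[: -len(marker)].rstrip()
    for full in full_texts:
        if raw == full:
            return True
        # Require a minimum length so short prompts do not match by accident.
        if len(stem) >= min_prefix and full.startswith(stem):
            return True
    return False


def filter_rows(rows: list[dict]) -> list[dict]:
    """Keep every row except last-prompt rows that duplicate a full message."""
    full_by_session: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        if row["type"] in ("user", "human"):
            full_by_session[row["session_id"]].append(row["text"].strip())
    kept = []
    for row in rows:
        duplicates_full = row["type"] == "last-prompt" and is_duplicate_of_full(
            row["text"], full_by_session[row["session_id"]]
        )
        if not duplicates_full:
            kept.append(row)
    return kept
```

With this rule, a `last-prompt` row that matches no full message in its session survives as evidence, even when the session also contains full `user`/`human` rows.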
Summary
Behavioral Twin v1 turns the digital-twin output from a profile/memory dump into a compact, evidence-backed operational agent contract, then hardens the pipeline around untrusted corpus data and local artifact generation.
- `analysis/twin-spec.json` extraction phase.
- `~/.claude/agents/twin.md` primarily from that spec instead of hardcoded defaults or raw memory dumps.
- `0.3.0`.

Key Changes
Behavioral twin spec
- `skills/digital-twin/scripts/extract-twin-spec.py`.
- `references/twin-spec-schema.json` and `references/prompts/twin-spec-extraction.md`.
- `skills/digital-twin/scripts/twin_spec_validation.py`.

Corrected pipeline order
- `extract-twin-spec.py` as default for replacement-agent output. It is only skipped for profile-only fallback runs where a degraded `twin.md` is acceptable.

Compact twin agent generation
- `synthesize.py` so `twin.md` is rendered from `analysis/twin-spec.json`.
- `model: inherit` and explicit tool frontmatter.
- `twin-spec.json` is missing or invalid, synthesis still writes profile artifacts but emits an explicit incomplete-spec warning in `twin.md`.

Generated CLAUDE rules
- `rules/preferences.md`
- `rules/workflows.md`
- `rules/verification.md`
- `rules/recovery.md`
- `CLAUDE-md-patch.md` is now a short install guide that imports those rules instead of pasting a long defaults block.

Corpus signal quality
- `extract-corpus.py` now drops a `last-prompt` cache row only when it is an exact match or clear truncation-prefix duplicate of a full `user`/`human` message in the same session.
- `last-prompt` rows are preserved as behavior evidence instead of being dropped at session scope.
- `source_type`, `is_auto_wake`, and `is_human_typed`.
- `quantitative.py` computes human-typed metrics separately and filters false slash commands like `/api`, `/users`, `/tmp`, and `/month`.

Security hardening
- `interaction_style.narrative_html` before rendering `PROFILE.html`.
- `PROFILE.html`; the report is self-contained with local system fonts.
- `extract-insights.py --allow-sdk-fallback`.
- `WebFetch` from the default generated twin subagent tool list.
- `pr-comment-mining.sh` argument parsing, Python path handling, UTF-8 file I/O, and GitHub repo/PR validation.

Evaluation
- `scripts/evaluate-twin.py` as a deterministic, no-network eval harness.

Release Validation
Local checks:
- `pytest` -> 28 passed
- `ruff check .` -> passed
- `mypy skills/digital-twin/scripts tests` -> passed
- `python3 -m compileall -q skills/digital-twin/scripts skills/digital-twin/references/visualization tests` -> passed
- `bash -n skills/digital-twin/scripts/pr-comment-mining.sh` -> passed
- `git diff --check` -> passed

Real-data temp validation, without overwriting the installed agent:
- `/private/tmp/digital-twin-release-20260514-223157`.
- `analysis/twin-spec.json`.
- `PROFILE.md`, `PROFILE.html`, `twin.md`, `CLAUDE-md-patch.md`, generated rule files, and insight cards.
- `_TBD_`, no `See PROFILE.md`, no raw memory dump, no incomplete-spec warning, no external font URLs, no `WebFetch`, and no obvious script/event-handler HTML patterns.
- `PROFILE.html` with Python `HTMLParser` successfully.
- `twin.md` includes the expected operational sections: decision, delegation, verification, and recovery policies.
- `PROFILE.md` reports the real corpus counts: 9,678 prompts, 1,140 sessions, 39 projects, 144 memory files, 27 plans, and 3,550 convergence pairs.

Notes
- `~/.claude/agents/twin.md` until the user runs `/digital-twin:init` or `/digital-twin:update`.
- `gh` is unauthenticated locally; the rest of the corpus/profile/twin pipeline still completes.
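
The real-data validation above mentions parsing `PROFILE.html` with Python's `HTMLParser` and checking for external font URLs and script/event-handler patterns. The PR does not show how that check was run, so the following is only a minimal sketch of one way to do it with the standard library. The class name `ProfileAudit`, the helper `audit_html`, and the exact heuristics are assumptions, not code from this repository.

```python
from html.parser import HTMLParser


class ProfileAudit(HTMLParser):
    """Collect findings for scripts, inline event handlers, and external URLs."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.findings: list[str] = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.findings.append("script tag")
        if tag == "style":
            self._in_style = True
        for name, value in attrs:
            # Heuristic: attributes such as onclick or onload are event handlers.
            if name.startswith("on"):
                self.findings.append(f"event handler: {name}")
            # Only <link> and <script> are checked for remote resources here.
            is_remote = bool(value) and value.startswith(("http://", "https://", "//"))
            if tag in ("link", "script") and name in ("href", "src") and is_remote:
                self.findings.append(f"external URL in <{tag} {name}>")

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Catch remote fonts pulled in through @import or url(...) inside CSS.
        if self._in_style and ("http://" in data or "https://" in data):
            self.findings.append("external URL in <style>")


def audit_html(markup: str) -> list[str]:
    """Parse the markup and return the list of findings (empty means clean)."""
    parser = ProfileAudit()
    parser.feed(markup)
    parser.close()
    return parser.findings
```

An empty result from `audit_html` only means none of these simple patterns were found; it is a smoke check, not a substitute for sanitizing `interaction_style.narrative_html` before rendering.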