tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark by szjanikowski · Pull Request #54 · NoesisVision/nasde-toolkit

szjanikowski · 2026-05-22T10:02:33Z

What & why

This branch came out of a real evaluation of the public tactical-ddd skill
(ntcoding/claude-skillz) on two
domain-driven C# tasks. Along the way it surfaced a silent skill-activation bug
and a missing piece of nasde functionality (repo-specific variants). Those fixes
are the point of this PR; the benchmark changes are the supporting evidence.

Toolkit changes (the reusable bits)

feat: variant task-scoping — variant.toml can declare tasks = [...].
A repo-specific variant (e.g. a skill tuned to one repo's conventions) now runs
only against its tasks; --all-variants skips it elsewhere (SKIPPED status) and
the multi-variant confirm prompt shows the true per-variant trial count. Absent
tasks = unscoped (previous default, no breaking change).
fix: default evaluator max_turns 30 → 60 — the LLM-as-judge hit
error_max_turns on DDD-rich workspaces (many small value-object files), failing
silently (exit 1, empty stderr; the error is in the claude -p stdout JSON).
Still overridable via [evaluation] max_turns.

The skill-activation bug (worth reading)

SKILL.md with comments before the YAML frontmatter ( then
---) breaks Claude Code's frontmatter parser: the skill stages onto disk but is
never registered, so it never activates. Every "skill" run was silently a vanilla
run until we caught it (the skill was missing from the session "skills" array).
Fixed in the benchmark's skill snapshot here. A staging-time guard so nasde refuses
to stage such a file is specced separately and will land in its own PR.

Benchmark changes (`examples/ddd-architectural-challenges`)

Focused on DDD tasks: dropped java-order-dispatch (not DDD); restored
csharp-anemic-to-rich-domain as the second protagonist.
Simplified verifiers to a binary reward (compile + tests); design quality is the
LLM-as-judge's job. Sharpened rubrics in Nick Tune's own vocabulary.
Public tactical-ddd skill rewritten TS→C# verbatim, repo-agnostic (no repo
leaks). Two clearly-labelled repo-tuned variants added (anemic / weather) as
the "skill adapted to your repo" arm, each task-scoped via the new feature.

Verification

ruff check ✓ · ruff format --check ✓ · mypy ✓ (changed files) · pytest 221 ✓ (12 new)
CLI smoke ✓; --all-variants scoping verified end-to-end (weather-tuned correctly SKIPPED on the anemic task)

🤖 Generated with Claude Code

…ll eval - remove java-order-dispatch (not DDD) and csharp-anemic (different repo) - simplify verifiers to binary reward: compile + tests pass (+ Open-Meteo URL for weather) - sharpen assessment_criteria.md key checks under skill principles (rich domain language, construction guard, exhaustiveness, domain isolation, OCP) - add analyze_focus.py to aggregate trial results into _focus_report.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

My first cut leaked SOLID/typing bias (compiler-reject, exhaustiveness, OCP file-count) — these are not how Nick frames tactical DDD in his SKILL.md. Rewrote against his actual principles: - Menu test (#3): would a domain expert recognize these names? - Implicit-to-explicit (#6): is the concept named, or smuggled as a primitive? - Value object liberality (#8): is everything definable-by-attributes a value object? - Aggregate invariants (#7): what must always be true? - Anemic-model test (#4): does the domain decide, or just carry data? - Isolate domain (#1): can it be read by experts, tested without mocks? - Generic-vs-domain (#5): would this code exist in a different business? - Repository shape (#9): does the port return full domain objects? Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… DDD baseline Sonnet 4.6 produced mixed-quality DDD in smoke (anti-patterns like ask-don't-tell in skill variant despite skill's principle #4). Trying Opus 4.6 in both variants to see if a stronger base model gives a cleaner signal about skill's marginal value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…efore YAML Critical bug:  and  placed BEFORE the opening '---' meant Claude Code's frontmatter parser saw no frontmatter, so the description was missing, no trigger registered, and the skill was never activated. Trajectory inspection of recent runs confirms this — skill_listing.skills lists update-config / verify / debug / etc., but NOT tactical-ddd. All historical 'tactical-ddd' results were effectively vanilla measurements with a dead .md file lying in .claude/skills/. Also strengthen CLAUDE.md: explicitly point the agent at the skill and tell it to translate the skill's TypeScript examples into idiomatic C# (the codebase is .NET 8 with [DddValueObject] annotations). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SKILL.md: all 9 TypeScript code examples rewritten in C# using DDD-starter-dotnet idioms (readonly struct + IEquatable<T> + static Of() factory + private ctor + [DddValueObject] / [DddDomainService] annotations). The discriminated-union example now uses the same readonly-struct + enum Kind pattern as Discount.cs in the existing codebase. Checklist item #4 strengthened: "No IsApplicable() + Apply() split — give the domain object one decision method that both checks and acts." This is what the previous skill smoke got wrong on weather. Rubric (both tasks) realigned: replaced IsActiveFor/AppliesOn (which are still ask-don't-tell, contradicting principle #4) with ApplyOn(...) as the positive pattern. Added explicit Anemic-model test to weather: PrecipitationDiscount must own a single ApplyOn(Money, WeatherConditions) → Money method, not IsApplicable + DiscountPercentage exposed to the caller. Models rolled back from Opus 4.6 to Sonnet 4.6 — easier comparison with historical data + 5x cheaper for proper run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skill-variant generates DDD-rich workspaces (multiple small value objects: Precipitation, Location, WeatherConditions, WeatherBasedDiscount, ...). LLM-as-judge needs more Read/Glob turns to assess such a layout. With the default max_turns=30, evaluator hit error_max_turns 3/3 times on weather trial while vanilla (simpler layout) just barely fit. Root cause confirmed via manual claude -p run: subtype=error_max_turns, num_turns=6 with --max-turns 5, exit=1, empty stderr (error info is in stdout JSON only — separate evaluator-backend bug worth filing later). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

With Docker daemon at 31 GB and 2 parallel variants planned, 4 GB per container was tight under skill-variant load (3.4M input tokens on weather, many small VO files, .NET 8 build + test). 6 GB gives 2 GB headroom and still fits 2 parallel runs comfortably (12 GB total / 31 GB available). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CRITICAL: previous "C# rewrite" contaminated a PUBLIC skill (ntcoding's tactical-ddd) with details of the benchmarked repo. This invalidates a benchmark of that skill — the skill must be repo-agnostic. Leaks removed from SKILL.md: - References to repo files: "see Sources/Sales/Sales.Commons/Money.cs and PercentageDiscount.cs", "the pattern Discount in this codebase" - Repo annotations [DddValueObject] / [DddDomainService] / [DddAggregateRoot] (from NoesisVision.Annotations.Domain.DDD — these were never in upstream) - "this codebase" phrasing throughout - Precipitation added to the value-object candidate list (a concept straight from the weather task) The rewrite is now a VERBATIM TypeScript→C# translation of upstream: same prose, same examples (Delivery/Claim/Driver/Money), same VO candidate list. Verified: prose diff vs upstream is casing-only; VO list identical; 0 TS blocks, 9 C# blocks; zero repo identifiers. CLAUDE.md restored to the original 3-line "Approach" (matching main). Removed my injected "Available skills" section, which (a) leaked [Ddd*] annotations and (b) biased the experiment by telling the agent to use the skill — the skill self-activates via its frontmatter triggers (confirmed in session skill_listing). Rubric (assessment_criteria.md) intentionally NOT touched: it is a per-task judge instruction for THIS repo, not a public artifact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the second protagonist task (csharp-anemic-to-rich-domain, an anemic→rich refactor on a legacy ASP.NET Core repo) and a repo-tuned skill variant (claude-ntcoding-tactical-ddd-anemic-tuned) whose SKILL.md adds a "this codebase's conventions" section + a value-object example in the repo's own idiom (class + private parameterless ctor for EF + BusinessException invariants). This is the deliberate "skill adapted to your repo" arm — clearly marked in the SKILL.md header as NOT the pristine public skill. The public-skill variant remains verbatim (TS→C# translation only, no repo leaks). memory_mb bumped to 6144 for the anemic task to match the others. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "skill tuned to repo = 90" weather result came from a transient contaminated state of the public skill variant (since cleaned). That artifact no longer existed as a file — only the public (verbatim) variant remained on disk, so the blog's strongest Task-1 bar had nothing to link to. Recovered the exact SKILL.md that produced 90 from the run's sandbox artifacts (jobs/.../focus-skill-sonnet-fixed/ddd-weather-discount__z9Dgfep) and saved it as claude-ntcoding-tactical-ddd-weather-tuned. Header now clearly marks it as the repo-tuned arm (references Money/PercentageDiscount/Discount), not the public skill. Now both protagonist tasks have a symmetric public + repo-tuned variant pair on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A variant can now declare which tasks it applies to: tasks = ["csharp-anemic-to-rich-domain"] This supports repo-specific variants — e.g. a skill whose examples are tuned to one repo's conventions — so --all-variants never runs them against the wrong codebase. Scoped variants are skipped (SKIPPED status) for non-matching tasks; the scope wins even over an explicit --tasks filter. Absent/empty = unscoped (runs against every task, the previous default behavior). - runner.py: load_variant_task_scope() + scope_tasks_for_variant() - cli.py: _run_all_variants honors scope; multi-variant confirm prompt now shows per-variant task lists and the true trial count (no more over-count) - scaffold: commented `tasks` example in variant.toml template - docs: README (new section), CLAUDE.md (variant.toml ref), benchmark-creator skill - tests: 7 new cases for scope parsing + intersection - examples: the two repo-tuned ddd skill variants now declare their task scope Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The LLM-as-judge hit error_max_turns on DDD-rich workspaces (many small value object files → many Read/Glob turns before it can emit the final verdict). Observed repeatedly on the weather task: 30 turns was not enough, the assessment silently failed (exit 1, empty stderr — the error is in the claude -p stdout JSON as subtype=error_max_turns). 60 gives comfortable headroom; still overridable via [evaluation] max_turns in nasde.toml. Updates the dataclass default, the load_project_config fallback, the default assertion test, and the README/CLAUDE.md config references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rfile) The anemic-refactor task built on net6.0 while its instruction.md still claimed "ASP.NET Core 2.2", and neither the rubric nor the verifier required modern .NET idioms — so the agent reproduced the source repo's dated netcoreapp-2.x style (no records / init / required / file-scoped namespaces). That likely depressed its design scores vs the DDD-starter tasks, which start from more modern code. Changes (all in tasks/csharp-anemic-to-rich-domain/): - Dockerfile: SDK 6.0 → 8.0, upgrade target framework to net8.0 - instruction.md: states .NET 8, makes "modern C# idioms" an explicit requirement and a success criterion ("reads like modern .NET 8, not a port of 2.x") - assessment_criteria.md: adds a project-wide "Modern .NET expectation" note + a Domain-Modeling key check rewarding record/readonly record struct value objects over hand-rolled Equals/GetHashCode; dated style caps the dimension - tests/test.sh: simplified to binary reward (compile + targets net8.0); design quality and idiom modernity go to the LLM-judge, consistent with the other tasks Note: existing anemic results (55/56/62/65) are now stale and need re-running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… idiom Second anemic→rich case from a recognizable international author's "Online Theater" sample (movie rental, NHibernate), as a stronger-pedigree alternative to the existing anemic task. Author name kept only in the git clone URL, not in task names or prose. - tasks/csharp-movie-rental-anemic/: instruction (trimmed to match the other tasks' guidance level — no step-by-step business rules), rubric, Dockerfile (sdk:8.0, clones dotnet6 branch, seds net6→net8, build-only — NHibernate hits SQL Server only at runtime), verifier (compile + targets net8.0). Sanity docker build: 0 errors. - variants/claude-ntcoding-tactical-ddd-movie-tuned/: repo-tuned skill scoped to the task; conventions in modern C# (record value objects, file-scoped ns) with an explicit "do NOT add CSharpFunctionalExtensions / Result<T>" guard so the agent doesn't chase a library detour that breaks the build. - variants/claude-ntcoding-tactical-ddd-anemic-tuned/SKILL.md: rewritten from the old class-VO + bracketed-namespace idiom (which fought the modernized rubric and pulled the tuned arm DOWN) to record / readonly record struct / file-scoped — consistent with the .NET 8 modern-idiom rubric. Target stays .NET 8 (LTS): the LLM-judge reliably knows .NET 8 idioms; .NET 10 / C# 14 is past its training and would make "is this modern?" scoring a guess. Keeps the already-run .NET 8 vanilla/guided/public anemic trials valid; only the tuned arm (which used the old skill) needs re-running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nstructions Skill auto-activation on Sonnet was probabilistic and silently failed (trajectory forensics showed Skill tool_use=0 — the skill body never entered context). Added an explicit step-1 invocation to each skill variant's instruction file so the skill is actually read and applied. vanilla + guided are left untouched as the comparison baseline. Covers 6 variants: claude public + 3 tuned (CLAUDE.md), codex (AGENTS.md), gemini (GEMINI.md). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Szymon Janikowski and others added 12 commits May 21, 2026 16:30

szjanikowski force-pushed the experiment/tactical-ddd-focus branch from bdc558d to 6b59c24 Compare May 22, 2026 10:08

Szymon Janikowski and others added 3 commits May 22, 2026 14:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54

tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54
szjanikowski wants to merge 15 commits into
mainfrom
experiment/tactical-ddd-focus

szjanikowski commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

szjanikowski commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Toolkit changes (the reusable bits)

The skill-activation bug (worth reading)

Benchmark changes (examples/ddd-architectural-challenges)

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

szjanikowski commented May 22, 2026 •

edited

Loading

Benchmark changes (`examples/ddd-architectural-challenges`)