Skip to content

tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54

Open
szjanikowski wants to merge 15 commits into
mainfrom
experiment/tactical-ddd-focus
Open

tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54
szjanikowski wants to merge 15 commits into
mainfrom
experiment/tactical-ddd-focus

Conversation

@szjanikowski
Copy link
Copy Markdown
Contributor

@szjanikowski szjanikowski commented May 22, 2026

What & why

This branch came out of a real evaluation of the public tactical-ddd skill
(ntcoding/claude-skillz) on two
domain-driven C# tasks. Along the way it surfaced a silent skill-activation bug
and a missing piece of nasde functionality (repo-specific variants). Those fixes
are the point of this PR; the benchmark changes are the supporting evidence.

Toolkit changes (the reusable bits)

  • feat: variant task-scopingvariant.toml can declare tasks = [...].
    A repo-specific variant (e.g. a skill tuned to one repo's conventions) now runs
    only against its tasks; --all-variants skips it elsewhere (SKIPPED status) and
    the multi-variant confirm prompt shows the true per-variant trial count. Absent
    tasks = unscoped (previous default, no breaking change).
  • fix: default evaluator max_turns 30 → 60 — the LLM-as-judge hit
    error_max_turns on DDD-rich workspaces (many small value-object files), failing
    silently (exit 1, empty stderr; the error is in the claude -p stdout JSON).
    Still overridable via [evaluation] max_turns.

The skill-activation bug (worth reading)

SKILL.md with comments before the YAML frontmatter (<!-- Source: ... --> then
---) breaks Claude Code's frontmatter parser: the skill stages onto disk but is
never registered, so it never activates. Every "skill" run was silently a vanilla
run until we caught it (the skill was missing from the session "skills" array).
Fixed in the benchmark's skill snapshot here. A staging-time guard so nasde refuses
to stage such a file is specced separately and will land in its own PR.

Benchmark changes (examples/ddd-architectural-challenges)

  • Focused on DDD tasks: dropped java-order-dispatch (not DDD); restored
    csharp-anemic-to-rich-domain as the second protagonist.
  • Simplified verifiers to a binary reward (compile + tests); design quality is the
    LLM-as-judge's job. Sharpened rubrics in Nick Tune's own vocabulary.
  • Public tactical-ddd skill rewritten TS→C# verbatim, repo-agnostic (no repo
    leaks). Two clearly-labelled repo-tuned variants added (anemic / weather) as
    the "skill adapted to your repo" arm, each task-scoped via the new feature.

Verification

  • ruff check ✓ · ruff format --check ✓ · mypy ✓ (changed files) · pytest 221 ✓ (12 new)
  • CLI smoke ✓; --all-variants scoping verified end-to-end (weather-tuned correctly SKIPPED on the anemic task)

🤖 Generated with Claude Code

Szymon Janikowski and others added 12 commits May 21, 2026 16:30
…ll eval

- remove java-order-dispatch (not DDD) and csharp-anemic (different repo)
- simplify verifiers to binary reward: compile + tests pass (+ Open-Meteo URL for weather)
- sharpen assessment_criteria.md key checks under skill principles
  (rich domain language, construction guard, exhaustiveness, domain isolation, OCP)
- add analyze_focus.py to aggregate trial results into _focus_report.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My first cut leaked SOLID/typing bias (compiler-reject, exhaustiveness, OCP file-count) — these are not how Nick frames tactical DDD in his SKILL.md. Rewrote against his actual principles:

- Menu test (#3): would a domain expert recognize these names?
- Implicit-to-explicit (#6): is the concept named, or smuggled as a primitive?
- Value object liberality (#8): is everything definable-by-attributes a value object?
- Aggregate invariants (#7): what must always be true?
- Anemic-model test (#4): does the domain decide, or just carry data?
- Isolate domain (#1): can it be read by experts, tested without mocks?
- Generic-vs-domain (#5): would this code exist in a different business?
- Repository shape (#9): does the port return full domain objects?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… DDD baseline

Sonnet 4.6 produced mixed-quality DDD in smoke (anti-patterns like ask-don't-tell
in skill variant despite skill's principle #4). Trying Opus 4.6 in both variants
to see if a stronger base model gives a cleaner signal about skill's marginal value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efore YAML

Critical bug: <!-- Source: ... --> and <!-- Snapshot: ... --> placed BEFORE the
opening '---' meant Claude Code's frontmatter parser saw no frontmatter, so the
description was missing, no trigger registered, and the skill was never activated.

Trajectory inspection of recent runs confirms this — skill_listing.skills lists
update-config / verify / debug / etc., but NOT tactical-ddd. All historical
'tactical-ddd' results were effectively vanilla measurements with a dead .md
file lying in .claude/skills/.

Also strengthen CLAUDE.md: explicitly point the agent at the skill and tell it
to translate the skill's TypeScript examples into idiomatic C# (the codebase is
.NET 8 with [DddValueObject] annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SKILL.md: all 9 TypeScript code examples rewritten in C# using DDD-starter-dotnet
idioms (readonly struct + IEquatable<T> + static Of() factory + private ctor +
[DddValueObject] / [DddDomainService] annotations). The discriminated-union
example now uses the same readonly-struct + enum Kind pattern as Discount.cs
in the existing codebase.

Checklist item #4 strengthened: "No IsApplicable() + Apply() split — give the
domain object one decision method that both checks and acts." This is what the
previous skill smoke got wrong on weather.

Rubric (both tasks) realigned: replaced IsActiveFor/AppliesOn (which are still
ask-don't-tell, contradicting principle #4) with ApplyOn(...) as the positive
pattern. Added explicit Anemic-model test to weather: PrecipitationDiscount
must own a single ApplyOn(Money, WeatherConditions) → Money method, not
IsApplicable + DiscountPercentage exposed to the caller.

Models rolled back from Opus 4.6 to Sonnet 4.6 — easier comparison with
historical data + 5x cheaper for proper run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skill-variant generates DDD-rich workspaces (multiple small value objects:
Precipitation, Location, WeatherConditions, WeatherBasedDiscount, ...).
LLM-as-judge needs more Read/Glob turns to assess such a layout. With the
default max_turns=30, evaluator hit error_max_turns 3/3 times on weather
trial while vanilla (simpler layout) just barely fit.

Root cause confirmed via manual claude -p run: subtype=error_max_turns,
num_turns=6 with --max-turns 5, exit=1, empty stderr (error info is in
stdout JSON only — separate evaluator-backend bug worth filing later).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With Docker daemon at 31 GB and 2 parallel variants planned, 4 GB per
container was tight under skill-variant load (3.4M input tokens on weather,
many small VO files, .NET 8 build + test). 6 GB gives 2 GB headroom and
still fits 2 parallel runs comfortably (12 GB total / 31 GB available).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: previous "C# rewrite" contaminated a PUBLIC skill (ntcoding's
tactical-ddd) with details of the benchmarked repo. This invalidates a
benchmark of that skill — the skill must be repo-agnostic.

Leaks removed from SKILL.md:
- References to repo files: "see Sources/Sales/Sales.Commons/Money.cs and
  PercentageDiscount.cs", "the pattern Discount in this codebase"
- Repo annotations [DddValueObject] / [DddDomainService] / [DddAggregateRoot]
  (from NoesisVision.Annotations.Domain.DDD — these were never in upstream)
- "this codebase" phrasing throughout
- Precipitation added to the value-object candidate list (a concept straight
  from the weather task)

The rewrite is now a VERBATIM TypeScript→C# translation of upstream: same
prose, same examples (Delivery/Claim/Driver/Money), same VO candidate list.
Verified: prose diff vs upstream is casing-only; VO list identical; 0 TS
blocks, 9 C# blocks; zero repo identifiers.

CLAUDE.md restored to the original 3-line "Approach" (matching main). Removed
my injected "Available skills" section, which (a) leaked [Ddd*] annotations and
(b) biased the experiment by telling the agent to use the skill — the skill
self-activates via its frontmatter triggers (confirmed in session skill_listing).

Rubric (assessment_criteria.md) intentionally NOT touched: it is a per-task
judge instruction for THIS repo, not a public artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the second protagonist task (csharp-anemic-to-rich-domain, an anemic→rich
refactor on a legacy ASP.NET Core repo) and a repo-tuned skill variant
(claude-ntcoding-tactical-ddd-anemic-tuned) whose SKILL.md adds a
"this codebase's conventions" section + a value-object example in the repo's own
idiom (class + private parameterless ctor for EF + BusinessException invariants).

This is the deliberate "skill adapted to your repo" arm — clearly marked in the
SKILL.md header as NOT the pristine public skill. The public-skill variant remains
verbatim (TS→C# translation only, no repo leaks).

memory_mb bumped to 6144 for the anemic task to match the others.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "skill tuned to repo = 90" weather result came from a transient contaminated
state of the public skill variant (since cleaned). That artifact no longer existed
as a file — only the public (verbatim) variant remained on disk, so the blog's
strongest Task-1 bar had nothing to link to.

Recovered the exact SKILL.md that produced 90 from the run's sandbox artifacts
(jobs/.../focus-skill-sonnet-fixed/ddd-weather-discount__z9Dgfep) and saved it as
claude-ntcoding-tactical-ddd-weather-tuned. Header now clearly marks it as the
repo-tuned arm (references Money/PercentageDiscount/Discount), not the public skill.

Now both protagonist tasks have a symmetric public + repo-tuned variant pair on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A variant can now declare which tasks it applies to:

    tasks = ["csharp-anemic-to-rich-domain"]

This supports repo-specific variants — e.g. a skill whose examples are tuned to
one repo's conventions — so --all-variants never runs them against the wrong
codebase. Scoped variants are skipped (SKIPPED status) for non-matching tasks;
the scope wins even over an explicit --tasks filter. Absent/empty = unscoped
(runs against every task, the previous default behavior).

- runner.py: load_variant_task_scope() + scope_tasks_for_variant()
- cli.py: _run_all_variants honors scope; multi-variant confirm prompt now shows
  per-variant task lists and the true trial count (no more over-count)
- scaffold: commented `tasks` example in variant.toml template
- docs: README (new section), CLAUDE.md (variant.toml ref), benchmark-creator skill
- tests: 7 new cases for scope parsing + intersection
- examples: the two repo-tuned ddd skill variants now declare their task scope

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM-as-judge hit error_max_turns on DDD-rich workspaces (many small value
object files → many Read/Glob turns before it can emit the final verdict).
Observed repeatedly on the weather task: 30 turns was not enough, the assessment
silently failed (exit 1, empty stderr — the error is in the claude -p stdout JSON
as subtype=error_max_turns). 60 gives comfortable headroom; still overridable via
[evaluation] max_turns in nasde.toml.

Updates the dataclass default, the load_project_config fallback, the default
assertion test, and the README/CLAUDE.md config references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@szjanikowski szjanikowski force-pushed the experiment/tactical-ddd-focus branch from bdc558d to 6b59c24 Compare May 22, 2026 10:08
Szymon Janikowski and others added 3 commits May 22, 2026 14:33
…rfile)

The anemic-refactor task built on net6.0 while its instruction.md still claimed
"ASP.NET Core 2.2", and neither the rubric nor the verifier required modern .NET
idioms — so the agent reproduced the source repo's dated netcoreapp-2.x style
(no records / init / required / file-scoped namespaces). That likely depressed
its design scores vs the DDD-starter tasks, which start from more modern code.

Changes (all in tasks/csharp-anemic-to-rich-domain/):
- Dockerfile: SDK 6.0 → 8.0, upgrade target framework to net8.0
- instruction.md: states .NET 8, makes "modern C# idioms" an explicit requirement
  and a success criterion ("reads like modern .NET 8, not a port of 2.x")
- assessment_criteria.md: adds a project-wide "Modern .NET expectation" note +
  a Domain-Modeling key check rewarding record/readonly record struct value
  objects over hand-rolled Equals/GetHashCode; dated style caps the dimension
- tests/test.sh: simplified to binary reward (compile + targets net8.0); design
  quality and idiom modernity go to the LLM-judge, consistent with the other tasks

Note: existing anemic results (55/56/62/65) are now stale and need re-running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… idiom

Second anemic→rich case from a recognizable international author's "Online Theater"
sample (movie rental, NHibernate), as a stronger-pedigree alternative to the
existing anemic task. Author name kept only in the git clone URL, not in task
names or prose.

- tasks/csharp-movie-rental-anemic/: instruction (trimmed to match the other tasks'
  guidance level — no step-by-step business rules), rubric, Dockerfile (sdk:8.0,
  clones dotnet6 branch, seds net6→net8, build-only — NHibernate hits SQL Server
  only at runtime), verifier (compile + targets net8.0). Sanity docker build: 0 errors.
- variants/claude-ntcoding-tactical-ddd-movie-tuned/: repo-tuned skill scoped to the
  task; conventions in modern C# (record value objects, file-scoped ns) with an
  explicit "do NOT add CSharpFunctionalExtensions / Result<T>" guard so the agent
  doesn't chase a library detour that breaks the build.
- variants/claude-ntcoding-tactical-ddd-anemic-tuned/SKILL.md: rewritten from the old
  class-VO + bracketed-namespace idiom (which fought the modernized rubric and pulled
  the tuned arm DOWN) to record / readonly record struct / file-scoped — consistent
  with the .NET 8 modern-idiom rubric.

Target stays .NET 8 (LTS): the LLM-judge reliably knows .NET 8 idioms; .NET 10 / C# 14
is past its training and would make "is this modern?" scoring a guess. Keeps the
already-run .NET 8 vanilla/guided/public anemic trials valid; only the tuned arm
(which used the old skill) needs re-running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nstructions

Skill auto-activation on Sonnet was probabilistic and silently failed
(trajectory forensics showed Skill tool_use=0 — the skill body never
entered context). Added an explicit step-1 invocation to each skill
variant's instruction file so the skill is actually read and applied.
vanilla + guided are left untouched as the comparison baseline.

Covers 6 variants: claude public + 3 tuned (CLAUDE.md), codex (AGENTS.md),
gemini (GEMINI.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant