tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54
Open
szjanikowski wants to merge 15 commits into
Open
tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark#54szjanikowski wants to merge 15 commits into
szjanikowski wants to merge 15 commits into
Conversation
…ll eval - remove java-order-dispatch (not DDD) and csharp-anemic (different repo) - simplify verifiers to binary reward: compile + tests pass (+ Open-Meteo URL for weather) - sharpen assessment_criteria.md key checks under skill principles (rich domain language, construction guard, exhaustiveness, domain isolation, OCP) - add analyze_focus.py to aggregate trial results into _focus_report.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My first cut leaked SOLID/typing bias (compiler-reject, exhaustiveness, OCP file-count) — these are not how Nick frames tactical DDD in his SKILL.md. Rewrote against his actual principles: - Menu test (#3): would a domain expert recognize these names? - Implicit-to-explicit (#6): is the concept named, or smuggled as a primitive? - Value object liberality (#8): is everything definable-by-attributes a value object? - Aggregate invariants (#7): what must always be true? - Anemic-model test (#4): does the domain decide, or just carry data? - Isolate domain (#1): can it be read by experts, tested without mocks? - Generic-vs-domain (#5): would this code exist in a different business? - Repository shape (#9): does the port return full domain objects? Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… DDD baseline Sonnet 4.6 produced mixed-quality DDD in smoke (anti-patterns like ask-don't-tell in skill variant despite skill's principle #4). Trying Opus 4.6 in both variants to see if a stronger base model gives a cleaner signal about skill's marginal value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efore YAML Critical bug: <!-- Source: ... --> and <!-- Snapshot: ... --> placed BEFORE the opening '---' meant Claude Code's frontmatter parser saw no frontmatter, so the description was missing, no trigger registered, and the skill was never activated. Trajectory inspection of recent runs confirms this — skill_listing.skills lists update-config / verify / debug / etc., but NOT tactical-ddd. All historical 'tactical-ddd' results were effectively vanilla measurements with a dead .md file lying in .claude/skills/. Also strengthen CLAUDE.md: explicitly point the agent at the skill and tell it to translate the skill's TypeScript examples into idiomatic C# (the codebase is .NET 8 with [DddValueObject] annotations). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SKILL.md: all 9 TypeScript code examples rewritten in C# using DDD-starter-dotnet idioms (readonly struct + IEquatable<T> + static Of() factory + private ctor + [DddValueObject] / [DddDomainService] annotations). The discriminated-union example now uses the same readonly-struct + enum Kind pattern as Discount.cs in the existing codebase. Checklist item #4 strengthened: "No IsApplicable() + Apply() split — give the domain object one decision method that both checks and acts." This is what the previous skill smoke got wrong on weather. Rubric (both tasks) realigned: replaced IsActiveFor/AppliesOn (which are still ask-don't-tell, contradicting principle #4) with ApplyOn(...) as the positive pattern. Added explicit Anemic-model test to weather: PrecipitationDiscount must own a single ApplyOn(Money, WeatherConditions) → Money method, not IsApplicable + DiscountPercentage exposed to the caller. Models rolled back from Opus 4.6 to Sonnet 4.6 — easier comparison with historical data + 5x cheaper for proper run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skill-variant generates DDD-rich workspaces (multiple small value objects: Precipitation, Location, WeatherConditions, WeatherBasedDiscount, ...). LLM-as-judge needs more Read/Glob turns to assess such a layout. With the default max_turns=30, evaluator hit error_max_turns 3/3 times on weather trial while vanilla (simpler layout) just barely fit. Root cause confirmed via manual claude -p run: subtype=error_max_turns, num_turns=6 with --max-turns 5, exit=1, empty stderr (error info is in stdout JSON only — separate evaluator-backend bug worth filing later). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With Docker daemon at 31 GB and 2 parallel variants planned, 4 GB per container was tight under skill-variant load (3.4M input tokens on weather, many small VO files, .NET 8 build + test). 6 GB gives 2 GB headroom and still fits 2 parallel runs comfortably (12 GB total / 31 GB available). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: previous "C# rewrite" contaminated a PUBLIC skill (ntcoding's tactical-ddd) with details of the benchmarked repo. This invalidates a benchmark of that skill — the skill must be repo-agnostic. Leaks removed from SKILL.md: - References to repo files: "see Sources/Sales/Sales.Commons/Money.cs and PercentageDiscount.cs", "the pattern Discount in this codebase" - Repo annotations [DddValueObject] / [DddDomainService] / [DddAggregateRoot] (from NoesisVision.Annotations.Domain.DDD — these were never in upstream) - "this codebase" phrasing throughout - Precipitation added to the value-object candidate list (a concept straight from the weather task) The rewrite is now a VERBATIM TypeScript→C# translation of upstream: same prose, same examples (Delivery/Claim/Driver/Money), same VO candidate list. Verified: prose diff vs upstream is casing-only; VO list identical; 0 TS blocks, 9 C# blocks; zero repo identifiers. CLAUDE.md restored to the original 3-line "Approach" (matching main). Removed my injected "Available skills" section, which (a) leaked [Ddd*] annotations and (b) biased the experiment by telling the agent to use the skill — the skill self-activates via its frontmatter triggers (confirmed in session skill_listing). Rubric (assessment_criteria.md) intentionally NOT touched: it is a per-task judge instruction for THIS repo, not a public artifact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the second protagonist task (csharp-anemic-to-rich-domain, an anemic→rich refactor on a legacy ASP.NET Core repo) and a repo-tuned skill variant (claude-ntcoding-tactical-ddd-anemic-tuned) whose SKILL.md adds a "this codebase's conventions" section + a value-object example in the repo's own idiom (class + private parameterless ctor for EF + BusinessException invariants). This is the deliberate "skill adapted to your repo" arm — clearly marked in the SKILL.md header as NOT the pristine public skill. The public-skill variant remains verbatim (TS→C# translation only, no repo leaks). memory_mb bumped to 6144 for the anemic task to match the others. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "skill tuned to repo = 90" weather result came from a transient contaminated state of the public skill variant (since cleaned). That artifact no longer existed as a file — only the public (verbatim) variant remained on disk, so the blog's strongest Task-1 bar had nothing to link to. Recovered the exact SKILL.md that produced 90 from the run's sandbox artifacts (jobs/.../focus-skill-sonnet-fixed/ddd-weather-discount__z9Dgfep) and saved it as claude-ntcoding-tactical-ddd-weather-tuned. Header now clearly marks it as the repo-tuned arm (references Money/PercentageDiscount/Discount), not the public skill. Now both protagonist tasks have a symmetric public + repo-tuned variant pair on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A variant can now declare which tasks it applies to:
tasks = ["csharp-anemic-to-rich-domain"]
This supports repo-specific variants — e.g. a skill whose examples are tuned to
one repo's conventions — so --all-variants never runs them against the wrong
codebase. Scoped variants are skipped (SKIPPED status) for non-matching tasks;
the scope wins even over an explicit --tasks filter. Absent/empty = unscoped
(runs against every task, the previous default behavior).
- runner.py: load_variant_task_scope() + scope_tasks_for_variant()
- cli.py: _run_all_variants honors scope; multi-variant confirm prompt now shows
per-variant task lists and the true trial count (no more over-count)
- scaffold: commented `tasks` example in variant.toml template
- docs: README (new section), CLAUDE.md (variant.toml ref), benchmark-creator skill
- tests: 7 new cases for scope parsing + intersection
- examples: the two repo-tuned ddd skill variants now declare their task scope
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM-as-judge hit error_max_turns on DDD-rich workspaces (many small value object files → many Read/Glob turns before it can emit the final verdict). Observed repeatedly on the weather task: 30 turns was not enough, the assessment silently failed (exit 1, empty stderr — the error is in the claude -p stdout JSON as subtype=error_max_turns). 60 gives comfortable headroom; still overridable via [evaluation] max_turns in nasde.toml. Updates the dataclass default, the load_project_config fallback, the default assertion test, and the README/CLAUDE.md config references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdc558d to
6b59c24
Compare
…rfile)
The anemic-refactor task built on net6.0 while its instruction.md still claimed
"ASP.NET Core 2.2", and neither the rubric nor the verifier required modern .NET
idioms — so the agent reproduced the source repo's dated netcoreapp-2.x style
(no records / init / required / file-scoped namespaces). That likely depressed
its design scores vs the DDD-starter tasks, which start from more modern code.
Changes (all in tasks/csharp-anemic-to-rich-domain/):
- Dockerfile: SDK 6.0 → 8.0, upgrade target framework to net8.0
- instruction.md: states .NET 8, makes "modern C# idioms" an explicit requirement
and a success criterion ("reads like modern .NET 8, not a port of 2.x")
- assessment_criteria.md: adds a project-wide "Modern .NET expectation" note +
a Domain-Modeling key check rewarding record/readonly record struct value
objects over hand-rolled Equals/GetHashCode; dated style caps the dimension
- tests/test.sh: simplified to binary reward (compile + targets net8.0); design
quality and idiom modernity go to the LLM-judge, consistent with the other tasks
Note: existing anemic results (55/56/62/65) are now stale and need re-running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… idiom Second anemic→rich case from a recognizable international author's "Online Theater" sample (movie rental, NHibernate), as a stronger-pedigree alternative to the existing anemic task. Author name kept only in the git clone URL, not in task names or prose. - tasks/csharp-movie-rental-anemic/: instruction (trimmed to match the other tasks' guidance level — no step-by-step business rules), rubric, Dockerfile (sdk:8.0, clones dotnet6 branch, seds net6→net8, build-only — NHibernate hits SQL Server only at runtime), verifier (compile + targets net8.0). Sanity docker build: 0 errors. - variants/claude-ntcoding-tactical-ddd-movie-tuned/: repo-tuned skill scoped to the task; conventions in modern C# (record value objects, file-scoped ns) with an explicit "do NOT add CSharpFunctionalExtensions / Result<T>" guard so the agent doesn't chase a library detour that breaks the build. - variants/claude-ntcoding-tactical-ddd-anemic-tuned/SKILL.md: rewritten from the old class-VO + bracketed-namespace idiom (which fought the modernized rubric and pulled the tuned arm DOWN) to record / readonly record struct / file-scoped — consistent with the .NET 8 modern-idiom rubric. Target stays .NET 8 (LTS): the LLM-judge reliably knows .NET 8 idioms; .NET 10 / C# 14 is past its training and would make "is this modern?" scoring a guess. Keeps the already-run .NET 8 vanilla/guided/public anemic trials valid; only the tuned arm (which used the old skill) needs re-running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nstructions Skill auto-activation on Sonnet was probabilistic and silently failed (trajectory forensics showed Skill tool_use=0 — the skill body never entered context). Added an explicit step-1 invocation to each skill variant's instruction file so the skill is actually read and applied. vanilla + guided are left untouched as the comparison baseline. Covers 6 variants: claude public + 3 tuned (CLAUDE.md), codex (AGENTS.md), gemini (GEMINI.md). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
This branch came out of a real evaluation of the public
tactical-dddskill(ntcoding/claude-skillz) on two
domain-driven C# tasks. Along the way it surfaced a silent skill-activation bug
and a missing piece of nasde functionality (repo-specific variants). Those fixes
are the point of this PR; the benchmark changes are the supporting evidence.
Toolkit changes (the reusable bits)
feat: variant task-scoping —variant.tomlcan declaretasks = [...].A repo-specific variant (e.g. a skill tuned to one repo's conventions) now runs
only against its tasks;
--all-variantsskips it elsewhere (SKIPPED status) andthe multi-variant confirm prompt shows the true per-variant trial count. Absent
tasks= unscoped (previous default, no breaking change).fix: default evaluatormax_turns30 → 60 — the LLM-as-judge hiterror_max_turnson DDD-rich workspaces (many small value-object files), failingsilently (exit 1, empty stderr; the error is in the
claude -pstdout JSON).Still overridable via
[evaluation] max_turns.The skill-activation bug (worth reading)
SKILL.mdwith comments before the YAML frontmatter (<!-- Source: ... -->then---) breaks Claude Code's frontmatter parser: the skill stages onto disk but isnever registered, so it never activates. Every "skill" run was silently a vanilla
run until we caught it (the skill was missing from the session
"skills"array).Fixed in the benchmark's skill snapshot here. A staging-time guard so nasde refuses
to stage such a file is specced separately and will land in its own PR.
Benchmark changes (
examples/ddd-architectural-challenges)java-order-dispatch(not DDD); restoredcsharp-anemic-to-rich-domainas the second protagonist.LLM-as-judge's job. Sharpened rubrics in Nick Tune's own vocabulary.
tactical-dddskill rewritten TS→C# verbatim, repo-agnostic (no repoleaks). Two clearly-labelled repo-tuned variants added (anemic / weather) as
the "skill adapted to your repo" arm, each task-scoped via the new feature.
Verification
ruff check✓ ·ruff format --check✓ ·mypy✓ (changed files) ·pytest221 ✓ (12 new)--all-variantsscoping verified end-to-end (weather-tuned correctly SKIPPED on the anemic task)🤖 Generated with Claude Code