NoesisVision · szjanikowski · May 26, 2026
@@ -288,6 +288,21 @@ model = "google/gemini-3-flash-preview"   # Required format: google/<model-name>
 - `google/gemini-3-flash-preview` — best quality/speed ratio, daily coding tasks
 - `google/gemini-3.1-flash-lite-preview` — fastest, simple and repetitive tasks
 
+### Scoping a variant to specific tasks (optional)
+
+If a variant only makes sense for certain tasks — e.g. a skill whose examples are
+tuned to one repo's conventions — declare a `tasks` list. It restricts the variant
+to those tasks so `--all-variants` never runs it against the wrong codebase:
+
+```toml
+agent = "claude"
+model = "claude-sonnet-4-6"
+tasks = ["my-benchmark/task-a"]   # only runs against these tasks
+```
+
+Omit `tasks` for a general-purpose variant (the default, which runs against all tasks).
+When `tasks` is set, the scope wins even over an explicit `--tasks` filter.
+
 ### Claude Code variant
 
 ```

@@ -58,6 +58,7 @@ After any change to the evaluation pipeline, CLI flags, configuration schema, ag
 **Think DX-first:** For every new option or feature, ask "where will the user be when they need this?" and put the documentation there. A feature that exists only in CLAUDE.md is invisible to most users. Check every touchpoint:
 
 **Documentation:**
+- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** (Added / Changed / Fixed) for any user-visible change: new CLI flag, config field, behavior change, dependency bump. Add the `[#NN]` PR link-reference at the bottom. Easy to forget — do it as part of the change, not at release time.
 - `README.md` — user-facing documentation (CLI options table, nasde.toml config reference, explanatory text). This is where most users look first.
 - `CLAUDE.md` — agent instructions (CLI reference, nasde.toml example, architecture decisions)
 - `ARCHITECTURE.md` — system architecture with mermaid diagrams (end-to-end flow, trial lifecycle, cloud sandbox providers, assessment evaluation)

@@ -1,4 +1,5 @@
 .DS_Store
+.idea/
 __pycache__/
 *.pyc
 .venv/

@@ -9,6 +9,24 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure.
 
 ## [Unreleased]
 
+### Added
+- **Variant task-scoping: `tasks` array in `variant.toml`.** A variant may declare
+  `tasks = [...]` to restrict it to specific tasks — use it for a repo-specific
+  variant (e.g. a skill tuned to one repo's conventions) so it never runs against
+  the wrong codebase. With `--all-variants` a scoped variant runs only against its
+  declared tasks (others SKIPPED); with a single `--variant`, requesting a task
+  outside its scope aborts with a clear error. The scope wins over an explicit
+  `--tasks` filter. Absent/empty → unscoped (the default). ([#54])
+
+### Changed
+- **Evaluator default `max_turns` raised 30 → 60.** Avoids `error_max_turns` on
+  large/complex workspaces (e.g. DDD refactors). Override via `[evaluation]
+  max_turns` in `nasde.toml`. ([#54])
+
+### Fixed
+- **Bump `starlette` 1.0.0 → 1.1.0** (PYSEC-2026-161; transitive via
+  harbor/fastapi/mcp). ([#54])
+
 ## [0.4.0] — 2026-05-21
 
 ### Added
@@ -390,4 +408,5 @@ Initial release under the **nasde-toolkit** name (rebrand from
 [#50]: https://github.com/NoesisVision/nasde-toolkit/pull/50
 [#51]: https://github.com/NoesisVision/nasde-toolkit/pull/51
 [#52]: https://github.com/NoesisVision/nasde-toolkit/pull/52
+[#54]: https://github.com/NoesisVision/nasde-toolkit/pull/54
 [gh-litellm-2026-04]: https://github.com/BerriAI/litellm/security/advisories/GHSA-xqmj-j6mv-4862
@@ -166,7 +166,7 @@ build_commands = []
 backend = "claude"                            # "claude" (default) | "codex"
 model = "claude-opus-4-7"
 dimensions_file = "assessment_dimensions.json"
-# max_turns = 30                              # Max evaluator conversation turns
+# max_turns = 60                              # Max evaluator conversation turns (default 60)
 # allowed_tools = ["Read", "Glob", "Grep"]    # Override default tool whitelist
 # mcp_config = "./evaluator_mcp.json"         # MCP server config for evaluator
 # skills_dir = "./evaluator_skills"           # Skills directory for evaluator
@@ -255,13 +255,25 @@ Declares the agent type and the model the variant runs against. Every variant MU
 agent = "claude"                    # "claude" | "codex" | "gemini"
 model = "claude-sonnet-4-6"         # REQUIRED. Model appropriate for the agent family.
 
+tasks = ["my-benchmark/task-a"]     # Optional: task-scope. Restrict this variant to specific tasks.
+
 [[skill]]                           # Optional (ADR-009): skill-by-reference, Claude only.
 path = "../../../src/plugins/my-plugin/skills/my-skill"   # source skill dir (required)
 ref  = "abc1234"                    # optional git ref, same semantics as [nasde.source]
 ```
 
 **Model priority**: `--model` CLI flag > `variant.toml [model]`. Missing model in both places → SystemExit with a clear error.
 
+**`tasks` (variant task-scope)**: optional list of task names this variant is
+meant to run against. Use it for a *repo-specific* variant — e.g. a skill whose
+examples reference one repo's conventions — so it never runs against the wrong
+codebase. The scope is enforced in **both** run modes: with `--all-variants` a
+scoped variant runs only against its declared tasks (others are skipped with a
+SKIPPED status); with a single `--variant`, if none of the requested tasks fall in
+the variant's scope the run aborts with a clear error. Either way the scope wins
+over an explicit `--tasks` filter. Absent or empty → unscoped (runs against every
+task, the default).
+
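The two enforcement modes described above can be sketched like this. Function names are hypothetical, and one detail is an assumption: in single-variant mode the sketch runs whichever requested tasks are in scope and aborts only when none are, following the wording of this paragraph.

```python
def plan_all_variants(scope, tasks):
    """--all-variants: split the task list into (run, skipped) for one variant."""
    if not scope:  # None or empty means the variant is unscoped
        return list(tasks), []
    run = [t for t in tasks if t in scope]
    skipped = [t for t in tasks if t not in scope]  # reported as SKIPPED
    return run, skipped


def plan_single_variant(variant, scope, requested_tasks):
    """--variant: keep in-scope tasks and abort when none of them are in scope."""
    if not scope:
        return list(requested_tasks)
    run = [t for t in requested_tasks if t in scope]
    if not run:
        raise SystemExit(
            f"variant {variant!r} is scoped to {sorted(scope)}; "
            f"none of the requested tasks {list(requested_tasks)} are in scope"
        )
    return run


scope = {"my-benchmark/task-a"}
run, skipped = plan_all_variants(scope, ["my-benchmark/task-a", "my-benchmark/task-b"])
```

Because both functions filter by `scope` after any `--tasks` selection has produced the task list, the scope takes precedence over that filter.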
 **`[[skill]]` (skill-by-reference)**: each entry stages the **whole** skill dir
 (incl. `references/`) into `/app/.claude/skills/<name>/` from a source path —
 no copy under `variants/<v>/skills/`. Optional `ref` reads from a temp git

@@ -178,6 +178,12 @@ The criteria spell out what each score means for each dimension. Here is the ful
 
 **The insight:** the same "DDD guidance" skill helps Claude a little (+3.5) and *badly* hurts Codex (-22). The per-dimension breakdown pinpoints *where* Codex regresses — domain modeling, encapsulation, extensibility — which would be invisible without this assessment. Skill optimization is agent-specific.
 
+### Deep dive — does a public skill, and tuning it, actually help?
+
+A separate study compared four configurations of Claude Code (no skill, manual DDD hints, a public DDD skill, and a repo-tuned version of that skill) on two deliberately different tasks: a feature on a clean DDD codebase and a legacy anemic→rich refactor. The public skill is `tactical-ddd` from `ntcoding/claude-skillz`. The headline: **the effect is task-dependent**. Relative to the bare model, the repo-tuned skill adds **+0.13** quality on the clean-feature task but only **+0.06** on the legacy refactor (absolute scores across tasks aren't comparable, so only the increments are). Two lessons that generalize: judge *per dimension*, not by one aggregate; and a skill present on disk is not a skill used, so verify it activated.
+
+→ Full tables, per-dimension radars, and token/time charts in **[Benchmark Results](docs/benchmark-results.md#deep-dive--tactical-ddd-skill-public-vs-repo-tuned-claude-code)**.
+
 ### More benchmarks in the repo
 
 - **Refactoring katas (Java + Python)** — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. *Takeaway:* a candidate "refactoring skill" didn't move the score — shipping it would have been based on vibes.
@@ -402,7 +408,7 @@ append_system_prompt = "Pay special attention to SOLID principles when scoring."
 | `backend` | `claude` | Subprocess backend: `claude` or `codex` |
 | `model` | `claude-opus-4-7` | Evaluator model |
 | `dimensions_file` | `assessment_dimensions.json` | Scoring dimensions file |
-| `max_turns` | `30` | Max conversation turns |
+| `max_turns` | `60` | Max evaluator conversation turns (raise for DDD-rich workspaces with many small files) |
 | `allowed_tools` | `["Read", "Glob", "Grep"]` | Tool whitelist |
 | `mcp_config` | — | Path to MCP server config JSON |
 | `skills_dir` | — | Path to evaluator skills directory |
@@ -473,6 +479,28 @@ sandbox — no copy under `variants/`. The legacy
 `variants/<v>/skills/<name>/` copy path still works unchanged (and now also
 carries `references/`, which it previously dropped).
 
+## Scoping a variant to specific tasks (`tasks`)
+
+Some variants only make sense for certain tasks, for example a skill whose code
+examples are *tuned to a particular repo's conventions*. Running such a variant
+against a different codebase produces misleading results. To prevent that,
+declare a `tasks` scope in the variant's `variant.toml`:
+
+```toml
+agent = "claude"
+model = "claude-sonnet-4-6"
+
+# This variant's skill references one repo's value objects, so it should run
+# only against the task built on that repo.
+tasks = ["csharp-anemic-to-rich-domain"]
+```
+
+The scope is enforced either way you run: with `--all-variants` a scoped variant
+runs **only** against its declared tasks (others show as `SKIPPED`); with a single
+`--variant`, asking for a task outside its scope aborts with a clear error rather
+than running against the wrong repo. Omit `tasks` (the default) for a
+general-purpose variant that runs everywhere.
+
 ## Commands
 
 ### Core
@@ -581,7 +609,7 @@ build_commands = []
 backend = "claude"                            # "claude" (default) | "codex"
 model = "claude-opus-4-7"
 dimensions_file = "assessment_dimensions.json"
-# max_turns = 30                              # Max evaluator conversation turns
+# max_turns = 60                              # Max evaluator conversation turns (default 60)
 # allowed_tools = ["Read", "Glob", "Grep"]    # Override default tool whitelist
 # mcp_config = "./evaluator_mcp.json"         # MCP server config for evaluator
 # skills_dir = "./evaluator_skills"           # Skills directory for evaluator

@@ -30,6 +30,45 @@ Results from the three example benchmarks included in `examples/`. All scores ar
 
 **Takeaway:** Architectural guidance helps Claude (+3.5) but dramatically hurts Codex (-22.0). The same skill applied to different agents can have opposite effects — this is exactly the kind of insight NASDE is designed to surface.
 
+### Deep dive — tactical-ddd skill: public vs repo-tuned (Claude Code)
+
+A focused follow-up on the same benchmark family: we took a public DDD skill ([`tactical-ddd` from `ntcoding/claude-skillz`](https://github.com/NTCoding/claude-skillz)) and a repo-tuned version, and measured four configurations of Claude Code on two deliberately different tasks — a **feature on a clean DDD codebase** (`ddd-weather-discount`) and a **legacy anemic→rich refactor** (`csharp-movie-rental-anemic`). Each configuration was run repeatedly and each run scored repeatedly; the numbers below are medians (normalized 0–1). Skill activation was verified per run — a mounted skill the agent never invokes scores like no skill at all.
+
+| Configuration | Weather (feature) | Movie (legacy) |
+|---|:---:|:---:|
+| vanilla (no skill) | 0.79 | 0.56 |
+| guided (manual DDD hints, no skill) | 0.84 | 0.58 |
+| public skill | 0.85 | 0.60 |
+| repo-tuned skill | **0.92** | **0.62** |
+
+**The effect is task-dependent.** Absolute scores across tasks aren't comparable — task difficulty sets the baseline (movie starts lower because it's harder). What *is* comparable is the **increment over vanilla** on each task: how much each step lifts quality above the bare model.
+
+<p align="center">
+  <img src="../examples/ddd-architectural-challenges/assets/increment_vs_vanilla.png" width="520" alt="Quality gain over vanilla">
+</p>
+
+The same repo-tuned skill adds **+0.13** on the clean-feature task but only **+0.06** on the legacy refactor: roughly twice the payoff where the design space is open.
+
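The increments over vanilla can be recomputed from the medians in the table above. The short Python sketch below does that; the dictionary layout and function name are illustrative only.

```python
# Median normalized scores from the table above: (weather, movie) per configuration.
scores = {
    "vanilla": {"weather": 0.79, "movie": 0.56},
    "guided": {"weather": 0.84, "movie": 0.58},
    "public skill": {"weather": 0.85, "movie": 0.60},
    "repo-tuned skill": {"weather": 0.92, "movie": 0.62},
}


def increments_over_vanilla(table):
    """Subtract each task's vanilla baseline from every other configuration."""
    baseline = table["vanilla"]
    return {
        config: {task: round(value - baseline[task], 2) for task, value in per_task.items()}
        for config, per_task in table.items()
        if config != "vanilla"
    }


increments = increments_over_vanilla(scores)
for config, gains in increments.items():
    print(f"{config:18s} weather {gains['weather']:+.2f}  movie {gains['movie']:+.2f}")
```

Subtracting the per-task baseline is what makes the two columns comparable, since the raw scores reflect task difficulty as much as configuration quality.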
+Per-dimension radars show *where* the gains land (test quality stays flat everywhere — the skill teaches modeling, not testing):
+
+<p align="center">
+  <img src="../examples/ddd-architectural-challenges/assets/radar_weather.png" width="380" alt="Weather radar">
+  <img src="../examples/ddd-architectural-challenges/assets/radar_movie.png" width="380" alt="Movie radar">
+</p>
+
+What does the gain cost? The charts below show token usage and run time per configuration, and the answer is **not** the simple "better costs more" you might expect:
+
+<p align="center">
+  <img src="../examples/ddd-architectural-challenges/assets/ops_tokens_weather.png" width="240" alt="Weather tokens">
+  <img src="../examples/ddd-architectural-challenges/assets/ops_tokens_movie.png" width="240" alt="Movie tokens">
+  <img src="../examples/ddd-architectural-challenges/assets/ops_time_weather.png" width="240" alt="Weather time">
+  <img src="../examples/ddd-architectural-challenges/assets/ops_time_movie.png" width="240" alt="Movie time">
+</p>
+
+Cost doesn't track quality. On weather the top-scoring repo-tuned skill spends *fewer* tokens than guided or public; on movie the public skill is the cheapest of all four. The real overhead is run time on the messy refactor, where the skill arms take roughly twice as long as the bare model.
+
+**Two lessons that generalize:** (1) judge one aggregate number and you miss the story — a real per-dimension gain hides inside an averaged score, which is why we show radars, not a single bar; (2) a skill present on disk is not a skill used — always verify it activated before trusting the result.
+
 ## UC1: Project-Specific Setup — NASDE Dev Skill
 
 1 task: Add multi-attempt support to the nasde-toolkit itself. Claude only (project-specific skill, cross-agent comparison not applicable).

@@ -1 +1,2 @@
 jobs/
+_focus_report.md
@@ -13,6 +13,7 @@ build_commands = []
 model = "claude-opus-4-7"
 dimensions_file = "assessment_dimensions.json"
 skills_dir = "./evaluator_skills"
+max_turns = 60  # = current default; explicit here as a reminder that DDD-rich workspaces need headroom
 
 [reporting]
 platform = "opik"

@@ -2,6 +2,15 @@
 
 Evaluate the agent's solution across five dimensions.
 
+**Modern .NET expectation (applies throughout):** the project targets .NET 8. A
+strong solution reads like a modern .NET 8 codebase, not a port of a .NET Core 2.x
+app. Reward idiomatic modern C# where it improves the domain model — `record` /
+`readonly record struct` for value objects (free value equality), `required` /
+`init`-only properties, file-scoped namespaces, nullable reference types, pattern
+matching. Penalize value objects hand-rolled with manual `Equals`/`GetHashCode`
+boilerplate and dated `netcoreapp`-era style when a `record` would express the
+concept more clearly.
+
 ## 1. Domain Modeling (0–25)
 
 | Score | Criteria |
@@ -18,6 +27,7 @@ Evaluate the agent's solution across five dimensions.
 - Does `Company` protect its invariants (contact employee, source)?
 - Is the employee-company relationship modeled with proper invariant enforcement?
 - Are constructors/factory methods used instead of public setters?
+- **Modern idioms**: are value objects expressed as `record` / `readonly record struct` (value equality for free) rather than classes with hand-written `Equals`/`GetHashCode`? Are file-scoped namespaces, `init`/`required`, and nullable reference types used? Code that achieves rich DDD but in dated .NET Core 2.x style caps this dimension at 20 (not exemplary).
 
 ## 2. Encapsulation (0–20)
 

@@ -1,4 +1,4 @@
-FROM mcr.microsoft.com/dotnet/sdk:6.0
+FROM mcr.microsoft.com/dotnet/sdk:8.0
 
 RUN apt-get update && apt-get install -y \
     git \
@@ -10,14 +10,14 @@ RUN apt-get update && apt-get install -y \
 WORKDIR /app
 RUN git clone https://github.com/kgrzybek/refactoring-from-anemic-to-rich-domain-model-example.git . && git checkout anemic
 
-# Upgrade project from netcoreapp2.2 to net6.0
-RUN sed -i 's/netcoreapp2.2/net6.0/' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
+# Upgrade project from netcoreapp2.2 to net8.0
+RUN sed -i 's/netcoreapp2.2/net8.0/' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
     sed -i 's/<PackageReference Include="Microsoft.AspNetCore.App" \/>//' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
     sed -i 's/<PackageReference Include="Microsoft.AspNetCore.Razor.Design" Version="2.2.0" PrivateAssets="All" \/>//' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
-    sed -i 's/Version="2.2.6"/Version="6.0.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
+    sed -i 's/Version="2.2.6"/Version="8.0.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
     sed -i 's/Version="4.0.1"/Version="6.5.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj
 
-# Fix Swagger for .NET 6 (DescribeAllEnumsAsStrings removed, Info -> OpenApiInfo)
+# Fix Swagger for modern .NET (DescribeAllEnumsAsStrings removed, Info -> OpenApiInfo)
 RUN sed -i '/DescribeAllEnumsAsStrings/d' DotNetConfPl.Refactoring/Startup.cs && \
     sed -i 's/Swashbuckle.AspNetCore.Swagger.Info/Microsoft.OpenApi.Models.OpenApiInfo/g' DotNetConfPl.Refactoring/Startup.cs && \
     sed -i '1s/^/using Microsoft.OpenApi.Models;\n/' DotNetConfPl.Refactoring/Startup.cs

@@ -1,7 +1,7 @@
 # Task: Refactor Anemic Domain Model to Rich Domain Model
 
 ## Context
-You are working in `/app`, a C# ASP.NET Core 2.2 application that manages companies, persons, and employees. The codebase follows a classic anemic domain model anti-pattern: domain classes (`Person`, `Company`, `Employee`) are plain data containers with only public getters and setters, while all business logic lives in application service classes (`PersonService`, `CompanyService`, `EmployeeService`).
+You are working in `/app`, a C# ASP.NET Core application (targeting **.NET 8**) that manages companies, persons, and employees. The codebase originated on an older .NET version and still carries its dated style: domain classes (`Person`, `Company`, `Employee`) are plain data containers with only public getters and setters, while all business logic lives in application service classes (`PersonService`, `CompanyService`, `EmployeeService`) — a classic anemic domain model anti-pattern. Bring the domain model up to modern standards as you enrich it.
 
 The project structure:
 ```
@@ -34,10 +34,11 @@ Specifically:
 - Domain objects enforce their own invariants — constructors and methods should reject invalid state
 - Services contain only orchestration: load, call domain method, save
 - Value objects are immutable with equality based on attributes
-- Follow existing C# conventions in the project
+- **Use modern C# / .NET idioms** appropriate to .NET 8 — e.g. `record` / `readonly record struct` for value objects, `required` / `init`-only properties, file-scoped namespaces, nullable reference types, and pattern matching where they make the code clearer. Do not preserve the legacy `netcoreapp`-era style.
 
 ## Success Criteria
-1. The project compiles successfully
+1. The project compiles successfully on .NET 8
 2. Domain entities contain behavior methods (not just getters/setters)
 3. Service classes are thin orchestrators — no business logic in if/else blocks
 4. At least one value object has been introduced
+5. The code reads like a modern .NET 8 codebase, not a port of a .NET Core 2.x app
@@ -11,7 +11,7 @@ framework = ".NET"
 timeout_sec = 1800
 
 [environment]
-memory_mb = 4096
+memory_mb = 6144
 
 [verifier]
 timeout_sec = 300