15 changes: 15 additions & 0 deletions .claude/skills/nasde-benchmark-creator/SKILL.md
@@ -288,6 +288,21 @@ model = "google/gemini-3-flash-preview" # Required format: google/<model-name>
- `google/gemini-3-flash-preview` — best quality/speed ratio, daily coding tasks
- `google/gemini-3.1-flash-lite-preview` — fastest, simple and repetitive tasks

### Scoping a variant to specific tasks (optional)

If a variant only makes sense for certain tasks — e.g. a skill whose examples are
tuned to one repo's conventions — declare a `tasks` list. It restricts the variant
to those tasks so `--all-variants` never runs it against the wrong codebase:

```toml
agent = "claude"
model = "claude-sonnet-4-6"
tasks = ["my-benchmark/task-a"] # only runs against these tasks
```

Omit `tasks` for a general-purpose variant (the default — runs against all tasks).
The scope wins even over an explicit `--tasks` filter.

### Claude Code variant

```
1 change: 1 addition & 0 deletions .claude/skills/nasde-dev/SKILL.md
@@ -58,6 +58,7 @@ After any change to the evaluation pipeline, CLI flags, configuration schema, ag
**Think DX-first:** For every new option or feature, ask "where will the user be when they need this?" and put the documentation there. A feature that exists only in CLAUDE.md is invisible to most users. Check every touchpoint:

**Documentation:**
- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** (Added / Changed / Fixed) for any user-visible change: new CLI flag, config field, behavior change, dependency bump. Add the `[#NN]` PR link-reference at the bottom. Easy to forget — do it as part of the change, not at release time.
- `README.md` — user-facing documentation (CLI options table, nasde.toml config reference, explanatory text). This is where most users look first.
- `CLAUDE.md` — agent instructions (CLI reference, nasde.toml example, architecture decisions)
- `ARCHITECTURE.md` — system architecture with mermaid diagrams (end-to-end flow, trial lifecycle, cloud sandbox providers, assessment evaluation)
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.DS_Store
.idea/
__pycache__/
*.pyc
.venv/
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,24 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure.

## [Unreleased]

### Added
- **Variant task-scoping: `tasks` array in `variant.toml`.** A variant may declare
`tasks = [...]` to restrict it to specific tasks — use it for a repo-specific
variant (e.g. a skill tuned to one repo's conventions) so it never runs against
the wrong codebase. With `--all-variants` a scoped variant runs only against its
declared tasks (others SKIPPED); with a single `--variant`, requesting a task
outside its scope aborts with a clear error. The scope wins over an explicit
`--tasks` filter. Absent/empty → unscoped (the default). ([#54])

### Changed
- **Evaluator default `max_turns` raised 30 → 60.** Avoids `error_max_turns` on
large/complex workspaces (e.g. DDD refactors). Override via `[evaluation]
max_turns` in `nasde.toml`. ([#54])

### Fixed
- **Bump `starlette` 1.0.0 → 1.1.0** (PYSEC-2026-161; transitive via
harbor/fastapi/mcp). ([#54])

## [0.4.0] — 2026-05-21

### Added
@@ -390,4 +408,5 @@ Initial release under the **nasde-toolkit** name (rebrand from
[#50]: https://github.com/NoesisVision/nasde-toolkit/pull/50
[#51]: https://github.com/NoesisVision/nasde-toolkit/pull/51
[#52]: https://github.com/NoesisVision/nasde-toolkit/pull/52
[#54]: https://github.com/NoesisVision/nasde-toolkit/pull/54
[gh-litellm-2026-04]: https://github.com/BerriAI/litellm/security/advisories/GHSA-xqmj-j6mv-4862
14 changes: 13 additions & 1 deletion CLAUDE.md
@@ -166,7 +166,7 @@ build_commands = []
backend = "claude" # "claude" (default) | "codex"
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"
-# max_turns = 30 # Max evaluator conversation turns
+# max_turns = 60 # Max evaluator conversation turns (default 60)
# allowed_tools = ["Read", "Glob", "Grep"] # Override default tool whitelist
# mcp_config = "./evaluator_mcp.json" # MCP server config for evaluator
# skills_dir = "./evaluator_skills" # Skills directory for evaluator
@@ -255,13 +255,25 @@ Declares the agent type and the model the variant runs against. Every variant MU
agent = "claude" # "claude" | "codex" | "gemini"
model = "claude-sonnet-4-6" # REQUIRED. Model appropriate for the agent family.

tasks = ["my-benchmark/task-a"] # Optional: task-scope. Restrict this variant to specific tasks.

[[skill]] # Optional (ADR-009): skill-by-reference, Claude only.
path = "../../../src/plugins/my-plugin/skills/my-skill" # source skill dir (required)
ref = "abc1234" # optional git ref, same semantics as [nasde.source]
```

**Model priority**: `--model` CLI flag > `model` in `variant.toml`. If the model is missing in both places, the run exits with `SystemExit` and a clear error.

**`tasks` (variant task-scope)**: optional list of task names this variant is
meant to run against. Use it for a *repo-specific* variant — e.g. a skill whose
examples reference one repo's conventions — so it never runs against the wrong
codebase. The scope is enforced in **both** run modes: with `--all-variants` a
scoped variant runs only against its declared tasks (others are skipped with a
SKIPPED status); with a single `--variant`, if none of the requested tasks fall in
the variant's scope the run aborts with a clear error. Either way the scope wins
over an explicit `--tasks` filter. Absent or empty → unscoped (runs against every
task, the default).
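
As a rough sketch of the scoping rule described above (an illustration only, not the toolkit's implementation): the helper name `resolve_tasks`, its signature, the use of `SystemExit` for the abort, and the task name `my-benchmark/task-b` are all assumptions made for the example.

```python
def resolve_tasks(
    scope: list[str],
    requested: list[str],
    all_variants: bool,
) -> tuple[list[str], list[str]]:
    """Split the requested tasks into (to_run, skipped) for one variant.

    Hypothetical helper. An empty scope means the variant is unscoped.
    """
    if not scope:
        return list(requested), []
    to_run = [task for task in requested if task in scope]
    skipped = [task for task in requested if task not in scope]
    if not all_variants and not to_run:
        # Single --variant: nothing requested falls inside the scope, so abort.
        raise SystemExit(
            f"None of the requested tasks {requested} are in this variant's scope {scope}"
        )
    return to_run, skipped


requested = ["my-benchmark/task-a", "my-benchmark/task-b"]
# Scoped variant under --all-variants: task-b would be reported as SKIPPED.
print(resolve_tasks(["my-benchmark/task-a"], requested, all_variants=True))
# Unscoped variant: everything requested runs.
print(resolve_tasks([], requested, all_variants=True))
```

Because the filter is applied to whatever was requested, the scope takes precedence over an explicit `--tasks` list: a requested task outside the scope never reaches the run list.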

**`[[skill]]` (skill-by-reference)**: each entry stages the **whole** skill dir
(incl. `references/`) into `/app/.claude/skills/<name>/` from a source path —
no copy under `variants/<v>/skills/`. Optional `ref` reads from a temp git
32 changes: 30 additions & 2 deletions README.md
@@ -178,6 +178,12 @@ The criteria spell out what each score means for each dimension. Here is the ful

**The insight:** the same "DDD guidance" skill helps Claude a little (+3.5) and *badly* hurts Codex (-22). The per-dimension breakdown pinpoints *where* Codex regresses — domain modeling, encapsulation, extensibility — which would be invisible without this assessment. Skill optimization is agent-specific.

### Deep dive — does a public skill, and tuning it, actually help?

A separate study compared four configurations built around a public DDD skill (the `tactical-ddd` skill from `ntcoding/claude-skillz`) and a repo-tuned version of it, on two deliberately different tasks: a feature on a clean DDD codebase and a legacy anemic→rich refactor. The headline: **the effect is task-dependent**. The repo-tuned skill adds **+0.13** quality over the bare model on the clean-feature task but only **+0.06** on the legacy refactor (absolute scores across tasks aren't comparable, so the increment is the number to compare). Two lessons that generalize: judge *per dimension*, not one aggregate; and a skill present on disk is not a skill used, so verify it activated.

→ Full tables, per-dimension radars, and token/time charts in **[Benchmark Results](docs/benchmark-results.md#deep-dive--tactical-ddd-skill-public-vs-repo-tuned-claude-code)**.

### More benchmarks in the repo

- **Refactoring katas (Java + Python)** — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. *Takeaway:* a candidate "refactoring skill" didn't move the score — shipping it would have been based on vibes.
@@ -402,7 +408,7 @@ append_system_prompt = "Pay special attention to SOLID principles when scoring."
| `backend` | `claude` | Subprocess backend: `claude` or `codex` |
| `model` | `claude-opus-4-7` | Evaluator model |
| `dimensions_file` | `assessment_dimensions.json` | Scoring dimensions file |
-| `max_turns` | `30` | Max conversation turns |
+| `max_turns` | `60` | Max evaluator conversation turns (raise for DDD-rich workspaces with many small files) |
| `allowed_tools` | `["Read", "Glob", "Grep"]` | Tool whitelist |
| `mcp_config` | — | Path to MCP server config JSON |
| `skills_dir` | — | Path to evaluator skills directory |
@@ -473,6 +479,28 @@ sandbox — no copy under `variants/`. The legacy
`variants/<v>/skills/<name>/` copy path still works unchanged (and now also
carries `references/`, which it previously dropped).

## Scoping a variant to specific tasks (`tasks`)

Some variants only make sense for certain tasks, for example a skill whose code
examples are *tuned to a particular repo's conventions*. Running such a variant
against a different codebase produces misleading results. Declare a `tasks`
scope in the variant's `variant.toml`:

```toml
agent = "claude"
model = "claude-sonnet-4-6"

# This variant's skill references one repo's value objects, so it should only
# run against the task built on that repo.
tasks = ["csharp-anemic-to-rich-domain"]
```

The scope is enforced either way you run: with `--all-variants` a scoped variant
runs **only** against its declared tasks (others show as `SKIPPED`); with a single
`--variant`, asking for a task outside its scope aborts with a clear error rather
than running against the wrong repo. Omit `tasks` (the default) for a
general-purpose variant that runs everywhere.

## Commands

### Core
@@ -581,7 +609,7 @@ build_commands = []
backend = "claude" # "claude" (default) | "codex"
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"
-# max_turns = 30 # Max evaluator conversation turns
+# max_turns = 60 # Max evaluator conversation turns (default 60)
# allowed_tools = ["Read", "Glob", "Grep"] # Override default tool whitelist
# mcp_config = "./evaluator_mcp.json" # MCP server config for evaluator
# skills_dir = "./evaluator_skills" # Skills directory for evaluator
39 changes: 39 additions & 0 deletions docs/benchmark-results.md
@@ -30,6 +30,45 @@ Results from the three example benchmarks included in `examples/`. All scores ar

**Takeaway:** Architectural guidance helps Claude (+3.5) but dramatically hurts Codex (-22.0). The same skill applied to different agents can have opposite effects — this is exactly the kind of insight NASDE is designed to surface.

### Deep dive — tactical-ddd skill: public vs repo-tuned (Claude Code)

A focused follow-up on the same benchmark family: we took a public DDD skill ([`tactical-ddd` from `ntcoding/claude-skillz`](https://github.com/NTCoding/claude-skillz)) and a repo-tuned version, and measured four configurations of Claude Code on two deliberately different tasks — a **feature on a clean DDD codebase** (`ddd-weather-discount`) and a **legacy anemic→rich refactor** (`csharp-movie-rental-anemic`). Each configuration was run repeatedly and each run scored repeatedly; the numbers below are medians (normalized 0–1). Skill activation was verified per run — a mounted skill the agent never invokes scores like no skill at all.

| Configuration | Weather (feature) | Movie (legacy) |
|---|:---:|:---:|
| vanilla (no skill) | 0.79 | 0.56 |
| guided (manual DDD hints, no skill) | 0.84 | 0.58 |
| public skill | 0.85 | 0.60 |
| repo-tuned skill | **0.92** | **0.62** |

**The effect is task-dependent.** Absolute scores across tasks aren't comparable — task difficulty sets the baseline (movie starts lower because it's harder). What *is* comparable is the **increment over vanilla** on each task: how much each step lifts quality above the bare model.

<p align="center">
<img src="../examples/ddd-architectural-challenges/assets/increment_vs_vanilla.png" width="520" alt="Quality gain over vanilla">
</p>

The same repo-tuned skill adds **+0.13** on the clean-feature task but only **+0.06** on the legacy refactor: roughly twice the payoff where the design space is open.
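
The increments can be recomputed directly from the medians in the table above. The short Python sketch below does only that arithmetic; the dictionary layout and variable names are illustrative choices, and the scores are copied from the table, not re-measured.

```python
# Median scores (normalized 0-1) copied from the table above.
scores = {
    "vanilla": {"weather": 0.79, "movie": 0.56},
    "guided": {"weather": 0.84, "movie": 0.58},
    "public skill": {"weather": 0.85, "movie": 0.60},
    "repo-tuned skill": {"weather": 0.92, "movie": 0.62},
}

baseline = scores["vanilla"]

# Increment over vanilla, per task, rounded to the table's two decimals.
increments = {
    config: {task: round(value - baseline[task], 2) for task, value in per_task.items()}
    for config, per_task in scores.items()
    if config != "vanilla"
}

for config, gain in increments.items():
    print(f"{config:18s} weather +{gain['weather']:.2f}  movie +{gain['movie']:.2f}")
```

Subtracting the per-task baseline before comparing is what makes the two tasks commensurable: it removes the difficulty offset that separates the absolute scores.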

Per-dimension radars show *where* the gains land (test quality stays flat everywhere — the skill teaches modeling, not testing):

<p align="center">
<img src="../examples/ddd-architectural-challenges/assets/radar_weather.png" width="380" alt="Weather radar">
<img src="../examples/ddd-architectural-challenges/assets/radar_movie.png" width="380" alt="Movie radar">
</p>

What does the gain cost? Token usage and run time per configuration — and the answer is **not** the simple "better costs more" you might expect:

<p align="center">
<img src="../examples/ddd-architectural-challenges/assets/ops_tokens_weather.png" width="240" alt="Weather tokens">
<img src="../examples/ddd-architectural-challenges/assets/ops_tokens_movie.png" width="240" alt="Movie tokens">
<img src="../examples/ddd-architectural-challenges/assets/ops_time_weather.png" width="240" alt="Weather time">
<img src="../examples/ddd-architectural-challenges/assets/ops_time_movie.png" width="240" alt="Movie time">
</p>

Cost doesn't track quality. On weather the top-scoring repo-tuned skill spends *fewer* tokens than guided or public; on movie the public skill is the cheapest of all four. The real overhead is run time on the messy refactor, where the skill arms take roughly twice as long as the bare model.

**Two lessons that generalize:** (1) judge one aggregate number and you miss the story — a real per-dimension gain hides inside an averaged score, which is why we show radars, not a single bar; (2) a skill present on disk is not a skill used — always verify it activated before trusting the result.

## UC1: Project-Specific Setup — NASDE Dev Skill

1 task: Add multi-attempt support to the nasde-toolkit itself. Claude only (project-specific skill, cross-agent comparison not applicable).
1 change: 1 addition & 0 deletions examples/ddd-architectural-challenges/.gitignore
@@ -1 +1,2 @@
jobs/
_focus_report.md
1 change: 1 addition & 0 deletions examples/ddd-architectural-challenges/nasde.toml
@@ -13,6 +13,7 @@ build_commands = []
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"
skills_dir = "./evaluator_skills"
max_turns = 60 # same as the current default; set explicitly as a reminder that DDD-rich workspaces need headroom

[reporting]
platform = "opik"
@@ -2,6 +2,15 @@

Evaluate the agent's solution across five dimensions.

**Modern .NET expectation (applies throughout):** the project targets .NET 8. A
strong solution reads like a modern .NET 8 codebase, not a port of a .NET Core 2.x
app. Reward idiomatic modern C# where it improves the domain model — `record` /
`readonly record struct` for value objects (free value equality), `required` /
`init`-only properties, file-scoped namespaces, nullable reference types, pattern
matching. Penalize value objects hand-rolled with manual `Equals`/`GetHashCode`
boilerplate and dated `netcoreapp`-era style when a `record` would express the
concept more clearly.

## 1. Domain Modeling (0–25)

| Score | Criteria |
@@ -18,6 +27,7 @@ Evaluate the agent's solution across five dimensions.
- Does `Company` protect its invariants (contact employee, source)?
- Is the employee-company relationship modeled with proper invariant enforcement?
- Are constructors/factory methods used instead of public setters?
- **Modern idioms**: are value objects expressed as `record` / `readonly record struct` (value equality for free) rather than classes with hand-written `Equals`/`GetHashCode`? Are file-scoped namespaces, `init`/`required`, and nullable reference types used? Code that achieves rich DDD but in dated .NET Core 2.x style caps this dimension at 20 (not exemplary).

## 2. Encapsulation (0–20)

@@ -1,4 +1,4 @@
-FROM mcr.microsoft.com/dotnet/sdk:6.0
+FROM mcr.microsoft.com/dotnet/sdk:8.0

RUN apt-get update && apt-get install -y \
git \
@@ -10,14 +10,14 @@ RUN apt-get update && apt-get install -y \
WORKDIR /app
RUN git clone https://github.com/kgrzybek/refactoring-from-anemic-to-rich-domain-model-example.git . && git checkout anemic

-# Upgrade project from netcoreapp2.2 to net6.0
-RUN sed -i 's/netcoreapp2.2/net6.0/' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
+# Upgrade project from netcoreapp2.2 to net8.0
+RUN sed -i 's/netcoreapp2.2/net8.0/' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
sed -i 's/<PackageReference Include="Microsoft.AspNetCore.App" \/>//' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
sed -i 's/<PackageReference Include="Microsoft.AspNetCore.Razor.Design" Version="2.2.0" PrivateAssets="All" \/>//' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
-sed -i 's/Version="2.2.6"/Version="6.0.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
+sed -i 's/Version="2.2.6"/Version="8.0.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj && \
sed -i 's/Version="4.0.1"/Version="6.5.0"/g' DotNetConfPl.Refactoring/DotNetConfPl.Refactoring.csproj

-# Fix Swagger for .NET 6 (DescribeAllEnumsAsStrings removed, Info -> OpenApiInfo)
+# Fix Swagger for modern .NET (DescribeAllEnumsAsStrings removed, Info -> OpenApiInfo)
RUN sed -i '/DescribeAllEnumsAsStrings/d' DotNetConfPl.Refactoring/Startup.cs && \
sed -i 's/Swashbuckle.AspNetCore.Swagger.Info/Microsoft.OpenApi.Models.OpenApiInfo/g' DotNetConfPl.Refactoring/Startup.cs && \
sed -i '1s/^/using Microsoft.OpenApi.Models;\n/' DotNetConfPl.Refactoring/Startup.cs
@@ -1,7 +1,7 @@
# Task: Refactor Anemic Domain Model to Rich Domain Model

## Context
-You are working in `/app`, a C# ASP.NET Core 2.2 application that manages companies, persons, and employees. The codebase follows a classic anemic domain model anti-pattern: domain classes (`Person`, `Company`, `Employee`) are plain data containers with only public getters and setters, while all business logic lives in application service classes (`PersonService`, `CompanyService`, `EmployeeService`).
+You are working in `/app`, a C# ASP.NET Core application (targeting **.NET 8**) that manages companies, persons, and employees. The codebase originated on an older .NET version and still carries its dated style: domain classes (`Person`, `Company`, `Employee`) are plain data containers with only public getters and setters, while all business logic lives in application service classes (`PersonService`, `CompanyService`, `EmployeeService`) — a classic anemic domain model anti-pattern. Bring the domain model up to modern standards as you enrich it.

The project structure:
```
@@ -34,10 +34,11 @@ Specifically:
- Domain objects enforce their own invariants — constructors and methods should reject invalid state
- Services contain only orchestration: load, call domain method, save
- Value objects are immutable with equality based on attributes
-- Follow existing C# conventions in the project
+- **Use modern C# / .NET idioms** appropriate to .NET 8 — e.g. `record` / `readonly record struct` for value objects, `required` / `init`-only properties, file-scoped namespaces, nullable reference types, and pattern matching where they make the code clearer. Do not preserve the legacy `netcoreapp`-era style.

## Success Criteria
-1. The project compiles successfully
+1. The project compiles successfully on .NET 8
2. Domain entities contain behavior methods (not just getters/setters)
3. Service classes are thin orchestrators — no business logic in if/else blocks
4. At least one value object has been introduced
5. The code reads like a modern .NET 8 codebase, not a port of a .NET Core 2.x app
@@ -11,7 +11,7 @@ framework = ".NET"
timeout_sec = 1800

[environment]
-memory_mb = 4096
+memory_mb = 6144

[verifier]
timeout_sec = 300