diff --git a/README.md b/README.md
index a5fa684..9d0e5d1 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,25 @@ At work it helped significantly. But internal results have a conflict-of-interes
## The study
+The pipeline, end to end — one real commit, two arms with identical budgets, scored against the maintainer's actual test:
+
+```mermaid
+flowchart TD
+ COMMIT["real maintainer commit<br/>post-cutoff · 3 codebases"] --> WK["isolated git worktree<br/>.git link stripped"]
+ WK --> DEL["delete associated test file<br/>v2 deletion protocol"]
+ DEL --> SPLIT((" "))
+ SPLIT --> A1["A1: control<br/>Read · Grep · Glob · Bash"]
+ SPLIT --> A2["A2: treatment<br/>Read · Grep · Glob · Bash<br/>+ findtest MCP (voluntary)"]
+ A1 --> GEN1["generated test<br/># target file: declared"]
+ A2 --> GEN2["generated test<br/># target file: declared"]
+ GEN1 --> JUDGE["LLM judge<br/>blinded pairwise"]
+ GEN2 --> JUDGE
+ GEN1 --> METRICS["AST alignment · location · taste"]
+ GEN2 --> METRICS
+ JUDGE --> RESULT["win-rate · Δalignment<br/>adoption rate · per codebase"]
+ METRICS --> RESULT
+```
+
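The "blinded pairwise" step in the diagram can be sketched as follows. This is a minimal illustration, not the study's actual harness: the function names, the `"first"`/`"second"`/`"tie"` verdict vocabulary, and the choice to exclude ties from the win-rate denominator are all assumptions made for the example.

```python
import random
from collections import Counter


def blind_pair(a1_test: str, a2_test: str, rng: random.Random):
    """Shuffle the two arms so the judge cannot tell which is which.

    Returns (first, second, key), where key maps each position back to
    its arm. Hypothetical helper, not part of the study code.
    """
    arms = [("A1", a1_test), ("A2", a2_test)]
    rng.shuffle(arms)
    key = {"first": arms[0][0], "second": arms[1][0]}
    return arms[0][1], arms[1][1], key


def unblind(verdict: str, key: dict) -> str:
    """Map a judge verdict ('first', 'second', or 'tie') back to an arm."""
    return key.get(verdict, "tie")


def win_rate(winners: list[str], arm: str = "A2") -> float:
    """Share of decided comparisons won by `arm` (assumption: ties are excluded)."""
    counts = Counter(winners)
    decided = counts["A1"] + counts["A2"]
    return counts[arm] / decided if decided else 0.0
```

Only the shuffled `first` and `second` texts would be shown to the judge; the `key` stays with the harness until the verdict comes back, so position cannot leak which arm produced which test.
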
### v1: the null result (the interesting part)
I designed a rigorous open-source study. For each eval item: take a real git commit, hide the maintainer's test, have the agent regenerate it under two conditions — with and without findtest mounted — and score the output.
@@ -64,11 +83,15 @@ Voluntary adoption is itself a metric: if grep fails to find the deleted file, d
The study ran across three repos chosen to span a complexity gradient:
-| Codebase | Test infrastructure | Grep difficulty (folder + depth) |
-|----------|--------------------|----|
-| **pydantic** | Standard pytest, flat `tests/` directory | Low |
-| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | Medium |
-| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | High |
+| Codebase | Test infrastructure | Test dirs under root | Max test-file depth | Grep difficulty |
+|----------|--------------------|----|----|----|
+| **pydantic** | Standard pytest, flat `tests/` directory | 18 | 1 | Low |
+| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | 105 † | — † | Medium |
+| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | 34 | 2 | High |
+
+*"Test dirs under root" = directories containing tests beneath the repo's test root (`tests/` for pydantic/dbt-core, `test/` for SQLAlchemy). "Max test-file depth" = deepest nesting of a `test_*.py` / `*_test.py` file below that root (0 = sits directly in the root). pydantic and SQLAlchemy measured at current HEAD; full distribution in [`docs/repos.md`](docs/repos.md).*
+
+*† dbt-core's "105 test directories" is the figure recorded during the study. It is **not** re-measurable at HEAD: dbt-core's `main` has since been rewritten in Rust (no Python `tests/` tree remains), and the study config (`src/atw/config.py`) clones HEAD rather than pinning a SHA. Read the count as describing the study-era Python layout, not today's repo; max depth is left blank because it cannot be re-measured.*
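
The two layout columns can be approximated with a short script. This is a sketch under stated assumptions, not the measurement code behind the table: `layout_metrics` is a hypothetical helper, and it counts a directory only when it directly contains a matching test file (including the root itself), so it may not reproduce the figures above exactly.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Filename patterns taken from the table note above.
TEST_PATTERNS = ("test_*.py", "*_test.py")


def is_test_file(path: Path) -> bool:
    """True if the filename matches one of the test-file patterns."""
    return any(fnmatchcase(path.name, pattern) for pattern in TEST_PATTERNS)


def layout_metrics(test_root: Path) -> dict:
    """Summarise breadth and depth of test files below `test_root`.

    Depth 0 means the file sits directly in the root. Assumption: a
    "test dir" is any directory that directly holds a test file.
    """
    test_files = [
        p for p in test_root.rglob("*.py") if p.is_file() and is_test_file(p)
    ]
    depths = [len(p.relative_to(test_root).parts) - 1 for p in test_files]
    test_dirs = {p.parent for p in test_files}
    return {
        "test_files": len(test_files),
        "test_dirs": len(test_dirs),
        "max_depth": max(depths, default=0),
        "files_at_root": sum(1 for d in depths if d == 0),
    }
```

Running `layout_metrics(Path("tests"))` against a checkout pinned to a specific SHA, rather than HEAD, keeps the numbers reproducible even if the repo is later restructured.
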
---
@@ -76,7 +99,9 @@ The study ran across three repos chosen to span a complexity gradient:
### The gradient
-The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the proof that *depth alone* drives it — standard pytest, but 105 test directories was enough.
+The mechanism is **test-file discoverability**: how hard the right test is to locate through the repo's folder structure and depth. As that difficulty rises, grep fails and findtest's lift grows, from null on a flat layout to decisive on a deep/custom one. dbt-core shows that *breadth alone* can drive it: the repo uses standard pytest, but 105 test directories were enough.
+
+The three repos load that variable on *different* axes, which is why a single number doesn't capture it: pydantic is **shallow and narrow** (max depth 1, 18 dirs, 71 of 90 test files sitting directly in `tests/`), dbt-core is **broad** (105 directories), and SQLAlchemy is **moderately deep and gated** (max depth 2 with *zero* test files at the root — every test pushed at least one level down — behind a custom `sqlalchemy.testing` plugin). "Discoverability" is breadth + depth + framework idiosyncrasy, not any one of them.
| Metric | pydantic | dbt-core | SQLAlchemy |
|--------|----------|----------|------------|
diff --git a/_layouts/default.html b/_layouts/default.html
index 8f69104..bf5b474 100644
--- a/_layouts/default.html
+++ b/_layouts/default.html
@@ -39,5 +39,21 @@