stephendchu · Jun 13, 2026
diff --git a/README.md b/README.md
@@ -33,6 +33,25 @@ At work it helped significantly. But internal results have a conflict-of-interes
 
 ## The study
 
+The pipeline, end to end — one real commit, two arms with identical budgets, scored against the maintainer's actual test:
+
+```mermaid
+flowchart TD
+    COMMIT["real maintainer commit<br/>post-cutoff · 3 codebases"] --> WK["isolated git worktree<br/>.git link stripped"]
+    WK --> DEL["delete associated test file<br/>v2 deletion protocol"]
+    DEL --> SPLIT((" "))
+    SPLIT --> A1["A1 — control<br/>Read · Grep · Glob · Bash"]
+    SPLIT --> A2["A2 — treatment<br/>Read · Grep · Glob · Bash<br/>+ findtest MCP (voluntary)"]
+    A1 --> GEN1["generated test<br/># target file: declared"]
+    A2 --> GEN2["generated test<br/># target file: declared"]
+    GEN1 --> JUDGE["LLM judge<br/>blinded pairwise"]
+    GEN2 --> JUDGE
+    GEN1 --> METRICS["AST alignment · location · taste"]
+    GEN2 --> METRICS
+    JUDGE --> RESULT["win-rate · Δalignment<br/>adoption rate · per codebase"]
+    METRICS --> RESULT
+```
+
 ### v1: the null result (the interesting part)
 
 I designed a rigorous open-source study. For each eval item: take a real git commit, hide the maintainer's test, have the agent regenerate it under two conditions — with and without findtest mounted — and score the output.
@@ -64,19 +83,25 @@ Voluntary adoption is itself a metric: if grep fails to find the deleted file, d
 
 The study ran across three repos chosen to span a complexity gradient:
 
-| Codebase | Test infrastructure | Grep difficulty (folder + depth) |
-|----------|--------------------|----|
-| **pydantic** | Standard pytest, flat `tests/` directory | Low |
-| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | Medium |
-| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | High |
+| Codebase | Test infrastructure | Test dirs under root | Max test-file depth | Grep difficulty |
+|----------|--------------------|----|----|----|
+| **pydantic** | Standard pytest, flat `tests/` directory | 18 | 1 | Low |
+| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | 105 † | — † | Medium |
+| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | 34 | 2 | High |
+
+*"Test dirs under root" = directories containing tests beneath the repo's test root (`tests/` for pydantic/dbt-core, `test/` for SQLAlchemy). "Max test-file depth" = deepest nesting of a `test_*.py` / `*_test.py` file below that root (0 = sits directly in the root). pydantic and SQLAlchemy measured at current HEAD; full distribution in [`docs/repos.md`](docs/repos.md).*
+
+*† dbt-core's "105 test directories" is the figure recorded during the study. It is **not** re-measurable at HEAD: dbt-core's `main` has since been rewritten in Rust (no Python `tests/` tree remains), and the study config (`src/atw/config.py`) clones HEAD rather than pinning a SHA. The figure describes the study-era Python layout, not today's repo; max depth cannot be reconstructed, hence the dash.*
 
 ---
 
 ## Results
 
 ### The gradient
 
-The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the proof that *depth alone* drives it — standard pytest, but 105 test directories was enough.
+The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core shows that *breadth alone* can drive it: standard pytest, but 105 test directories was enough.
+
+The three repos load that variable on *different* axes, which is why a single number doesn't capture it: pydantic is **shallow and narrow** (max depth 1, 18 dirs, 71 of 90 test files sitting directly in `tests/`), dbt-core is **broad** (105 directories), and SQLAlchemy is **moderately deep and gated** (max depth 2 with *zero* test files at the root — every test pushed at least one level down — behind a custom `sqlalchemy.testing` plugin). "Discoverability" is breadth + depth + framework idiosyncrasy, not any one of them.
 
 | Metric | pydantic | dbt-core | SQLAlchemy |
 |--------|----------|----------|------------|

diff --git a/_layouts/default.html b/_layouts/default.html
@@ -39,5 +39,21 @@ <h2 class="project-tagline">{{ page.description | default: site.description | de
         <span class="site-footer-credits">This page was generated by <a href="https://pages.github.com">GitHub Pages</a>.</span>
       </footer>
     </main>
+
+    <!-- Mermaid: Jekyll/kramdown does not render ```mermaid blocks on its own.
+         kramdown + Rouge wraps them as <div class="language-mermaid ...">...<code>.
+         Rewrite each to the <pre class="mermaid"> Mermaid expects, then render. -->
+    <script type="module">
+      import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
+      document.querySelectorAll('.language-mermaid').forEach((el) => {
+        const src = el.querySelector('code') || el;
+        const pre = document.createElement('pre');
+        pre.className = 'mermaid';
+        pre.textContent = src.textContent;
+        el.replaceWith(pre);
+      });
+      mermaid.initialize({ startOnLoad: false, theme: 'default' });
+      await mermaid.run({ querySelector: 'pre.mermaid' });
+    </script>
   </body>
 </html>
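
The depth metric that the README footnote and `docs/repos.md` both define (directory levels below the test root, with 0 meaning a file sits directly in the root) can be sketched in Python. This is a minimal sketch, not the study's own script: the source only quotes a `find` one-liner, and `depth_distribution` and `TEST_PATTERNS` are hypothetical names introduced here for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import Path

# Filename patterns taken from the `find` command quoted in docs/repos.md.
TEST_PATTERNS = ("test_*.py", "*_test.py")


def depth_distribution(root):
    """Map depth below `root` -> number of test files (0 = directly in root).

    Hypothetical helper: a sketch of the bucketing described in the docs.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.py"):
        if not path.is_file():
            continue  # skip directories whose names happen to end in .py
        if any(fnmatchcase(path.name, pat) for pat in TEST_PATTERNS):
            # A file directly in root has one relative path part, so depth 0.
            depth = len(path.relative_to(root).parts) - 1
            counts[depth] += 1
    return dict(sorted(counts.items()))
```

Calling `depth_distribution("tests")` inside a checkout returns the per-depth counts, and `max()` over its keys gives the "Max test-file depth" column. The numbers depend on which commit is checked out, which is the same reproducibility caveat the dbt-core footnote raises.
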
diff --git a/docs/repos.md b/docs/repos.md
@@ -7,6 +7,46 @@ importantly — **rich, idiosyncratic test infrastructure**, because that is
 exactly where generic grep struggles and semantic tools should shine. A repo
 with trivial tests will show no gap no matter how good the tooling.
 
+## Test-infrastructure depth (the independent variable)
+
+Test-file discoverability is the variable the gradient is built on, so it is
+worth stating concretely. "Root" is the repo's test directory (`tests/` for
+pydantic and dbt-core, `test/` for SQLAlchemy). "Depth" counts directory levels
+*below* that root, so depth 0 means a test file lives directly in the root.
+
+| Repo | Test root | Dirs under root | Max test-file depth | Test files | Depth distribution |
+|------|-----------|----------------|---------------------|-----------|--------------------|
+| **pydantic** | `tests/` | 18 (17 subdirs) | 1 | 90 | depth 0: 71 · depth 1: 19 |
+| **dbt-core** | `tests/` | 105 † | — † | — † | — † |
+| **SQLAlchemy** | `test/` | 34 (33 subdirs) | 2 | 225 | depth 1: 152 · depth 2: 73 |
+
+Reading the rows:
+
+- **pydantic** — shallow *and* narrow. 79% of test files sit directly in
+  `tests/`; nothing is more than one level down. An agent can infer where a test
+  belongs from sibling files alone, which is exactly why grep is sufficient and
+  findtest goes unused (0% adoption).
+- **SQLAlchemy** — moderately deep and, tellingly, **no test files at the root
+  at all**: every test is pushed at least one level down (depth 1–2), behind the
+  custom `sqlalchemy.testing` plugin (`@testing.combinations`, `assert_compile`).
+  Depth + framework idiosyncrasy is what breaks grep here.
+- **dbt-core** — the breadth case: standard pytest, but the tests fan out across
+  105 directories. Sheer directory count, not depth, is enough to make the right
+  location hard to grep for.
+
+† **dbt-core is not re-measurable at HEAD.** The "105 test directories" figure is
+what was recorded during the study. Three things block a re-measurement: dbt-core's
+`main` has since been rewritten in Rust (`crates/`, `lib/`), so no Python `tests/`
+tree remains; `src/atw/config.py` clones HEAD rather than pinning a commit SHA; and
+the study's `data/` (including `data/commits/dbt-core/`) is git-ignored and was not
+retained. The depth and file-count figures therefore can't be reconstructed without
+the study-era SHA. This is a reproducibility gap, disclosed rather than papered
+over; pinning a SHA per repo in `config.py` is the fix for any re-run.
+
+*Measured at current HEAD for pydantic and SQLAlchemy via
+`find <root> -type f \( -name 'test_*.py' -o -name '*_test.py' \)`, bucketed by path
+depth below the root.*
+
 ## v1 default (set in `config.py`)
 
 - **dbt-core** (dbt Labs) — company-backed, serious pytest culture, complex