Merged
37 changes: 31 additions & 6 deletions README.md
@@ -33,6 +33,25 @@ At work it helped significantly. But internal results have a conflict-of-interes

## The study

The pipeline, end to end — one real commit, two arms with identical budgets, scored against the maintainer's actual test:

```mermaid
flowchart TD
COMMIT["real maintainer commit<br/>post-cutoff · 3 codebases"] --> WK["isolated git worktree<br/>.git link stripped"]
WK --> DEL["delete associated test file<br/>v2 deletion protocol"]
DEL --> SPLIT((" "))
SPLIT --> A1["A1 — control<br/>Read · Grep · Glob · Bash"]
SPLIT --> A2["A2 — treatment<br/>Read · Grep · Glob · Bash<br/>+ findtest MCP (voluntary)"]
A1 --> GEN1["generated test<br/># target file: declared"]
A2 --> GEN2["generated test<br/># target file: declared"]
GEN1 --> JUDGE["LLM judge<br/>blinded pairwise"]
GEN2 --> JUDGE
GEN1 --> METRICS["AST alignment · location · taste"]
GEN2 --> METRICS
JUDGE --> RESULT["win-rate · Δalignment<br/>adoption rate · per codebase"]
METRICS --> RESULT
```
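
As a rough illustration of the last node in the diagram, here is a minimal sketch of how per-item outcomes could be rolled up into win-rate, Δalignment, and adoption rate per codebase. The `ItemResult` fields, the `summarize` helper, and the tie handling (ties are excluded from the win-rate denominator) are assumptions made for this example, not the study's actual code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ItemResult:
    """One eval item scored in both arms (hypothetical schema)."""
    codebase: str
    judge_winner: str      # "A1", "A2", or "tie" from the blinded pairwise judge
    align_a1: float        # AST alignment of the control arm's test
    align_a2: float        # AST alignment of the treatment arm's test
    used_findtest: bool    # did the A2 agent voluntarily call findtest?


def summarize(results: Iterable[ItemResult]) -> Dict[str, Dict[str, Optional[float]]]:
    """Aggregate per-codebase metrics from per-item results."""
    by_repo: Dict[str, List[ItemResult]] = {}
    for r in results:
        by_repo.setdefault(r.codebase, []).append(r)

    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for repo, items in by_repo.items():
        # Assumption: ties carry no preference, so they are dropped from the denominator.
        decided = [r for r in items if r.judge_winner != "tie"]
        wins = sum(r.judge_winner == "A2" for r in decided)
        summary[repo] = {
            "n": float(len(items)),
            "a2_win_rate": wins / len(decided) if decided else None,
            "delta_alignment": mean(r.align_a2 - r.align_a1 for r in items),
            "adoption_rate": sum(r.used_findtest for r in items) / len(items),
        }
    return summary
```

Keeping adoption rate next to win-rate matters for reading the numbers: a low win-rate with low adoption says the tool went unused, which is a different finding from a tool that was used and did not help.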

### v1: the null result (the interesting part)

I designed a rigorous open-source study. For each eval item: take a real git commit, hide the maintainer's test, have the agent regenerate it under two conditions — with and without findtest mounted — and score the output.
@@ -64,19 +83,25 @@ Voluntary adoption is itself a metric: if grep fails to find the deleted file, d

The study ran across three repos chosen to span a complexity gradient:

| Codebase | Test infrastructure | Grep difficulty (folder + depth) |
|----------|--------------------|----|
| **pydantic** | Standard pytest, flat `tests/` directory | Low |
| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | Medium |
| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | High |
| Codebase | Test infrastructure | Test dirs under root | Max test-file depth | Grep difficulty |
|----------|--------------------|----|----|----|
| **pydantic** | Standard pytest, flat `tests/` directory | 18 | 1 | Low |
| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | 105 † | — † | Medium |
| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | 34 | 2 | High |

*"Test dirs under root" = directories containing tests beneath the repo's test root (`tests/` for pydantic/dbt-core, `test/` for SQLAlchemy). "Max test-file depth" = deepest nesting of a `test_*.py` / `*_test.py` file below that root (0 = sits directly in the root). pydantic and SQLAlchemy measured at current HEAD; full distribution in [`docs/repos.md`](docs/repos.md).*

*† dbt-core's "105 test directories" is the figure recorded during the study. It is **not** re-measurable at HEAD: dbt-core's `main` has since been rewritten in Rust (no Python `tests/` tree remains), and the study config (`src/atw/config.py`) clones HEAD rather than pinning a SHA. The count therefore describes the study-era Python layout, not today's repo, and the depth figures cannot be reconstructed.*

---

## Results

### The gradient

The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the proof that *depth alone* drives it — standard pytest, but 105 test directories was enough.
The mechanism is **test-file discoverability**: how hard the right test is to locate through the repo's folder structure and depth. As that difficulty rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the evidence that *breadth alone* can drive it: standard pytest, but 105 test directories were enough.

The three repos load that variable on *different* axes, which is why a single number doesn't capture it: pydantic is **shallow and narrow** (max depth 1, 18 dirs, 71 of 90 test files sitting directly in `tests/`), dbt-core is **broad** (105 directories), and SQLAlchemy is **moderately deep and gated** (max depth 2 with *zero* test files at the root — every test pushed at least one level down — behind a custom `sqlalchemy.testing` plugin). "Discoverability" is breadth + depth + framework idiosyncrasy, not any one of them.

| Metric | pydantic | dbt-core | SQLAlchemy |
|--------|----------|----------|------------|
16 changes: 16 additions & 0 deletions _layouts/default.html
@@ -39,5 +39,21 @@ <h2 class="project-tagline">{{ page.description | default: site.description | de
<span class="site-footer-credits">This page was generated by <a href="https://pages.github.com">GitHub Pages</a>.</span>
</footer>
</main>

<!-- Mermaid: Jekyll/kramdown does not render ```mermaid blocks on its own.
kramdown + Rouge wraps them as <div class="language-mermaid ...">...<code>.
Rewrite each to the <pre class="mermaid"> Mermaid expects, then render. -->
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
document.querySelectorAll('.language-mermaid, code.language-mermaid').forEach((el) => {
const src = el.querySelector('code') || el;
const pre = document.createElement('pre');
pre.className = 'mermaid';
pre.textContent = src.textContent;
el.replaceWith(pre);
});
mermaid.initialize({ startOnLoad: false, theme: 'default' });
await mermaid.run({ querySelector: 'pre.mermaid' });
</script>
</body>
</html>
40 changes: 40 additions & 0 deletions docs/repos.md
@@ -7,6 +7,46 @@ importantly — **rich, idiosyncratic test infrastructure**, because that is
exactly where generic grep struggles and semantic tools should shine. A repo
with trivial tests will show no gap no matter how good the tooling.

## Test-infrastructure depth (the independent variable)

Test-file discoverability is the variable the gradient is built on, so it is
worth stating concretely. "Root" is the repo's test directory (`tests/` for
pydantic and dbt-core, `test/` for SQLAlchemy). "Depth" counts directory levels
*below* that root, so depth 0 means a test file lives directly in the root.

| Repo | Test root | Dirs under root | Max test-file depth | Test files | Depth distribution |
|------|-----------|----------------|---------------------|-----------|--------------------|
| **pydantic** | `tests/` | 18 (17 subdirs) | 1 | 90 | depth 0: 71 · depth 1: 19 |
| **dbt-core** | `tests/` | 105 † | — † | — † | — † |
| **SQLAlchemy** | `test/` | 34 (33 subdirs) | 2 | 225 | depth 1: 152 · depth 2: 73 |

Reading the rows:

- **pydantic** — shallow *and* narrow. 79% of test files sit directly in
`tests/`; nothing is more than one level down. An agent can infer where a test
belongs from sibling files alone, which is exactly why grep is sufficient and
findtest goes unused (0% adoption).
- **SQLAlchemy** — moderately deep and, tellingly, **no test files at the root
at all**: every test is pushed at least one level down (depth 1–2), behind the
custom `sqlalchemy.testing` plugin (`@testing.combinations`, `assert_compile`).
Depth + framework idiosyncrasy is what breaks grep here.
- **dbt-core** — the breadth case: standard pytest, but the tests fan out across
105 directories. Sheer directory count, not depth, is enough to make the right
location hard to grep for.

† **dbt-core is not re-measurable at HEAD.** The "105 test directories" figure is
what was recorded during the study. Three things block reconstruction:
dbt-core's `main` has since been rewritten in Rust (`crates/`, `lib/`), so no
Python `tests/` tree remains; `src/atw/config.py` clones HEAD rather than pinning
a commit SHA; and the study's `data/` (including `data/commits/dbt-core/`) is
git-ignored and was not retained. The depth and file figures therefore can't be
reconstructed without the study-era SHA. This is a reproducibility gap,
disclosed rather than papered over; pinning a SHA per repo in `config.py` is the
fix for any re-run.
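
To make the proposed fix concrete, here is a minimal sketch of what a SHA-pinned
repo entry could look like. `PinnedRepo` and `checkout_commands` are hypothetical
names, not the contents of `src/atw/config.py`, and the URL and SHA in the usage
note are placeholders. The sketch only builds the git command lines. Fetching a
bare SHA also assumes the remote allows it, which not every server does.

```python
from dataclasses import dataclass
from typing import List

_HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class PinnedRepo:
    """A study repo pinned to one commit (hypothetical config entry)."""
    name: str
    url: str
    sha: str  # full 40-character lowercase commit SHA, never a branch name

    def __post_init__(self) -> None:
        # Reject branch names and abbreviated SHAs so a re-run cannot drift to HEAD.
        if len(self.sha) != 40 or not set(self.sha) <= _HEX:
            raise ValueError(f"{self.name}: expected a full 40-char hex SHA, got {self.sha!r}")


def checkout_commands(repo: PinnedRepo, dest: str) -> List[List[str]]:
    """Return git invocations that materialize exactly repo.sha in dest."""
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo.url],
        # Shallow-fetch only the pinned commit (assumes the server permits SHA fetches).
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", repo.sha],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]
```

For example, `checkout_commands(PinnedRepo("demo", "https://example.com/org/repo.git", "0" * 40), "work/demo")`
yields four commands that can be passed to `subprocess.run` one at a time.
Recording the SHA alongside the results would let the directory and depth
figures be re-measured later even if the default branch changes.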

*Measured at current HEAD for pydantic and SQLAlchemy via
`find <root> -type f \( -name 'test_*.py' -o -name '*_test.py' \)`, bucketed by
path depth below the root. The parentheses make `-type f` apply to both name
patterns.*
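
The bucketing step can be sketched in a few lines of Python. This is an
illustrative reimplementation under the definitions above (depth 0 = file sits
directly in the test root), not the script used for the table; the function
names are made up for the example.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import Path, PurePosixPath
from typing import Iterable

TEST_PATTERNS = ("test_*.py", "*_test.py")


def is_test_file(name: str) -> bool:
    """True if a bare file name matches either test-file pattern."""
    return any(fnmatchcase(name, pattern) for pattern in TEST_PATTERNS)


def depth_below_root(rel_path: str) -> int:
    """Directory levels between the test root and the file (0 = in the root)."""
    return len(PurePosixPath(rel_path).parts) - 1


def depth_distribution(rel_paths: Iterable[str]) -> Counter:
    """Count test files per depth, given POSIX paths relative to the test root."""
    return Counter(
        depth_below_root(p) for p in rel_paths if is_test_file(PurePosixPath(p).name)
    )


def share_at_root(dist: Counter) -> float:
    """Fraction of test files at depth 0."""
    total = sum(dist.values())
    return dist[0] / total if total else 0.0


def scan(root: str) -> Counter:
    """Walk a checkout's test root on disk and bucket its test files by depth."""
    base = Path(root)
    rels = (p.relative_to(base).as_posix() for p in base.rglob("*.py") if p.is_file())
    return depth_distribution(rels)
```

Applied to the pydantic row, `share_at_root(Counter({0: 71, 1: 19}))` is 71/90,
which is the 79% quoted in the reading notes.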

## v1 default (set in `config.py`)

- **dbt-core** (dbt Labs) — company-backed, serious pytest culture, complex