diff --git a/README.md b/README.md index a5fa684..9d0e5d1 100644 --- a/README.md +++ b/README.md @@ -33,6 +33,25 @@ At work it helped significantly. But internal results have a conflict-of-interes ## The study +The pipeline, end to end — one real commit, two arms with identical budgets, scored against the maintainer's actual test: + +```mermaid +flowchart TD + COMMIT["real maintainer commit
post-cutoff · 3 codebases"] --> WK["isolated git worktree
.git link stripped"] + WK --> DEL["delete associated test file
v2 deletion protocol"] + DEL --> SPLIT((" ")) + SPLIT --> A1["A1 — control
Read · Grep · Glob · Bash"] + SPLIT --> A2["A2 — treatment
Read · Grep · Glob · Bash
+ findtest MCP (voluntary)"] + A1 --> GEN1["generated test
# target file: declared"] + A2 --> GEN2["generated test
# target file: declared"] + GEN1 --> JUDGE["LLM judge
blinded pairwise"] + GEN2 --> JUDGE + GEN1 --> METRICS["AST alignment · location · taste"] + GEN2 --> METRICS + JUDGE --> RESULT["win-rate · Δalignment
adoption rate · per codebase"] + METRICS --> RESULT +``` + ### v1: the null result (the interesting part) I designed a rigorous open-source study. For each eval item: take a real git commit, hide the maintainer's test, have the agent regenerate it under two conditions — with and without findtest mounted — and score the output. @@ -64,11 +83,15 @@ Voluntary adoption is itself a metric: if grep fails to find the deleted file, d The study ran across three repos chosen to span a complexity gradient: -| Codebase | Test infrastructure | Grep difficulty (folder + depth) | -|----------|--------------------|----| -| **pydantic** | Standard pytest, flat `tests/` directory | Low | -| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | Medium | -| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | High | +| Codebase | Test infrastructure | Test dirs under root | Max test-file depth | Grep difficulty | +|----------|--------------------|----|----|----| +| **pydantic** | Standard pytest, flat `tests/` directory | 18 | 1 | Low | +| **dbt-core** | Standard pytest, 105 test directories, custom fixtures | 105 † | — † | Medium | +| **SQLAlchemy** | Custom `sqlalchemy.testing` plugin, `@testing.combinations`, `assert_compile` | 34 | 2 | High | + +*"Test dirs under root" = directories containing tests beneath the repo's test root (`tests/` for pydantic/dbt-core, `test/` for SQLAlchemy). "Max test-file depth" = deepest nesting of a `test_*.py` / `*_test.py` file below that root (0 = sits directly in the root). pydantic and SQLAlchemy measured at current HEAD; full distribution in [`docs/repos.md`](docs/repos.md).* + +*† dbt-core's "105 test directories" is the figure recorded during the study. It is **not** re-measurable at HEAD: dbt-core's `main` has since been rewritten in Rust (no Python `tests/` tree remains), and the study config (`src/atw/config.py`) clones HEAD rather than pinning a SHA. 
That count describes the study-era Python layout, not today's repo, and no depth figure can be reconstructed for it.* --- @@ -76,7 +99,9 @@ The study ran across three repos chosen to span a complexity gradient: ### The gradient -The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the proof that *depth alone* drives it — standard pytest, but 105 test directories was enough. +The mechanism is **test-file discoverability** — how hard the right test is to locate through the repo's folder structure and depth. As that rises, grep fails and findtest's lift grows: null on a flat layout, decisive on a deep/custom one. dbt-core is the proof that *breadth alone* drives it — standard pytest, but 105 test directories was enough. + +The three repos load that variable on *different* axes, which is why a single number doesn't capture it: pydantic is **shallow and narrow** (max depth 1, 18 dirs, 71 of 90 test files sitting directly in `tests/`), dbt-core is **broad** (105 directories), and SQLAlchemy is **moderately deep and gated** (max depth 2 with *zero* test files at the root — every test pushed at least one level down — behind a custom `sqlalchemy.testing` plugin). "Discoverability" is breadth + depth + framework idiosyncrasy, not any one of them. | Metric | pydantic | dbt-core | SQLAlchemy | |--------|----------|----------|------------| diff --git a/_layouts/default.html b/_layouts/default.html index 8f69104..bf5b474 100644 --- a/_layouts/default.html +++ b/_layouts/default.html @@ -39,5 +39,21 @@

{{ page.description | default: site.description | de This page was generated by GitHub Pages. + + + diff --git a/docs/repos.md b/docs/repos.md index 4badf32..1d33d18 100644 --- a/docs/repos.md +++ b/docs/repos.md @@ -7,6 +7,46 @@ importantly — **rich, idiosyncratic test infrastructure**, because that is exactly where generic grep struggles and semantic tools should shine. A repo with trivial tests will show no gap no matter how good the tooling. +## Test-infrastructure depth (the independent variable) + +Test-file discoverability is the variable the gradient is built on, so it is +worth stating concretely. "Root" is the repo's test directory (`tests/` for +pydantic and dbt-core, `test/` for SQLAlchemy). "Depth" counts directory levels +*below* that root, so depth 0 means a test file lives directly in the root. + +| Repo | Test root | Dirs under root | Max test-file depth | Test files | Depth distribution | +|------|-----------|----------------|---------------------|-----------|--------------------| +| **pydantic** | `tests/` | 18 (17 subdirs) | 1 | 90 | depth 0: 71 · depth 1: 19 | +| **dbt-core** | `tests/` | 105 † | — † | — † | — † | +| **SQLAlchemy** | `test/` | 34 (33 subdirs) | 2 | 225 | depth 1: 152 · depth 2: 73 | + +Reading the rows: + +- **pydantic** — shallow *and* narrow. 79% of test files sit directly in + `tests/`; nothing is more than one level down. An agent can infer where a test + belongs from sibling files alone, which is exactly why grep is sufficient and + findtest goes unused (0% adoption). +- **SQLAlchemy** — moderately deep and, tellingly, **no test files at the root + at all**: every test is pushed at least one level down (depth 1–2), behind the + custom `sqlalchemy.testing` plugin (`@testing.combinations`, `assert_compile`). + Depth + framework idiosyncrasy is what breaks grep here. +- **dbt-core** — the breadth case: standard pytest, but the tests fan out across + 105 directories. 
Sheer directory count, not depth, is enough to make the right + location hard to grep for. + +† **dbt-core is not re-measurable at HEAD.** The "105 test directories" figure is +what was recorded during the study. Since then dbt-core's `main` has been +rewritten in Rust (`crates/`, `lib/`), so no Python `tests/` tree remains. In addition, +`src/atw/config.py` clones HEAD rather than pinning a commit SHA, and the study's +`data/` (including `data/commits/dbt-core/`) is git-ignored and not retained. The +depth/file figures therefore can't be reconstructed without the study-era SHA. +This is a reproducibility gap, disclosed rather than papered over; pinning a SHA +per repo in `config.py` is the fix for any re-run. + +*Measured at current HEAD for pydantic and SQLAlchemy via +`find -type f \( -name 'test_*.py' -o -name '*_test.py' \)`, bucketed by path +depth below the root.* + ## v1 default (set in `config.py`) - **dbt-core** (dbt Labs) — company-backed, serious pytest culture, complex