209 changes: 209 additions & 0 deletions docs/development/2026-06-24-eql-v3-pr-loc-analysis.md
# EQL v3 PR — LOC Analysis: Generated vs. Base Implementation

**Date:** 2026-06-24
**Branch:** `eql_v3` vs `main`
**Merge base:** `80a7a2bc21a04ed7af5916d252605d576dcbc21a`
**Method:** `git diff --numstat <merge-base> HEAD`, cross-checked by four parallel
analysis agents (one per independent domain) reading file contents and the
generation/verification mechanisms.

---

## TL;DR

The PR reports as **~100k LOC** (498 files, **102,511 insertions / 32,897
deletions**), but that number is dominated by machine-produced artifacts:

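- **~59,142 lines (≈58% of insertions) are committed generated/snapshot
artifacts**: `cargo expand` macro snapshots and golden codegen reference SQL.
Apart from the two `README.md` files, no human authored them. The reference SQL
is deterministically regenerable from a small Rust catalog and CI-gated to
byte-for-byte parity; the macro snapshot is regenerated from the test macro and
guarded by a non-blocking drift check.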
- A **further ~26,810 lines of generated scalar SQL** is materialized to disk by
the codegen but **gitignored**, so it is absent from the diff entirely.
- The **genuine hand-authored base implementation is ~11–13k LOC** (~8k Rust
catalog/codegen/macros/types + ~3.3k hand-written SQL + bespoke test suites).
- The leverage point for *all* of the generated output (committed golden refs,
gitignored SQL, and the expanded test matrix) is the **~5k-LOC `eql-scalars`
catalog + `eql-codegen` renderers**.

**Review implication:** the generated buckets do not need line-by-line human
review — the parity/inventory tests verify them. Reviewer attention belongs on
the ~5k catalog + codegen and the bespoke JSONB/SEM/ORE test suites.

---

## Top-level breakdown (insertions)

| Path | Added | Deleted | Class |
|---|---:|---:|---|
| `tests/sqlx/snapshots/` | 32,196 | 0 | **Generated** (macro-expand snapshot + matrix baselines) |
| `tests/codegen/reference/` | 26,946 | 0 | **Generated** (golden codegen reference SQL) |
| `tests/sqlx/src/` + `tests/sqlx/tests/` | ~17,213 | 0 | Hand-written test harness + suites |
| `crates/` | 12,383 | 0 | Hand-written Rust (+ JSON/TS fixtures) |
| `Cargo.lock` | 5,590 | 0 | Generated lockfile |
| `src/` | 3,336 | 7,505 | Hand-written SQL (v3) / legacy removal |
| `docs/` | 1,644 | 5,365 | Docs |
| `tasks/` | 902 | 442 | Build/CI tooling |
| `.github/` | 808 | 74 | CI workflows |
| `mise.toml` | 606 | 5 | Task config |
| Other (`DEVELOPMENT.md`, `SUPABASE.md`, `CLAUDE.md`, `README`, …) | ~700 | ~500 | Docs/meta |

**Generated/snapshot subtotal in the committed diff: ≈ 59,142 (58% of insertions).**
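
The subtotal and its share follow directly from the two generated rows of the table. A minimal sketch of that arithmetic, where `generated_share` is a hypothetical helper name and the inputs are the figures quoted above:

```bash
# Hypothetical helper: sum the two generated buckets and express the sum
# as a percentage of all insertions. Arguments: bucket_a bucket_b total.
generated_share() {
  awk -v a="$1" -v b="$2" -v total="$3" \
    'BEGIN { printf "%d %.1f\n", a + b, 100 * (a + b) / total }'
}

# snapshots, reference SQL, total insertions in the PR
generated_share 32196 26946 102511   # prints: 59142 57.7
```

The 57.7% result is what the report rounds to 58%.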

---

## 1. Generated artifacts (committed, not authored)

### 1a. `tests/sqlx/snapshots/` — 32,196 LOC

| File | LOC | What it is |
|---|---:|---|
| `int4_expanded.rs` | **31,260** | `cargo expand` macro-expansion snapshot of the `int4` matrix suite (220 `#[rustc_test_marker]` blocks). Pure rustc test-harness boilerplate. |
| `matrix_tests_text.txt` | 306 | token-normalized matrix baseline (text shape) |
| `matrix_tests.txt` | 220 | token-normalized matrix baseline (canonical) |
| `README.md` | 206 | docs (only authored file in the dir) |
| `v3_jsonb_tests.txt` | 76 | pinned test-name set (jsonb) |
| `matrix_jsonb_entry_tests.txt` | 55 | pinned test-name set (jsonb SteVec entry) |
| `matrix_tests_eq_only.txt` | 54 | derived matrix baseline (eq-only) |
| `matrix_tests_storage_only.txt` | 19 | matrix baseline (storage-only) |

- **Provenance:** `int4_expanded.rs` line 1 — `` `eql_v3_int4` matrix suite — generated by `scalar_types!` ``. The macro source is `tests/sqlx/src/matrix.rs`.
- **Regeneration:** `mise run test:matrix:expand` (`cargo +nightly-2026-05-01 expand --test encrypted_domain scalars::int4`), with pinned nightly + `cargo-expand 1.0.122` so the snapshot only moves when the macro moves.
- **Verification:** `.github/workflows/macro-expand-eql.yml` regenerates and runs `git diff --exit-code` (non-blocking drift backstop). The `.txt` baselines are gated by `mise run test:matrix:inventory` in the `matrix-coverage` CI job.
- **Note:** the `.rs` lives under `snapshots/` (not `tests/`) so Cargo never compiles it as a test target.

**Verdict:** generated snapshot. High confidence.
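
The pinned `.txt` baselines amount to a set comparison between the test names that exist now and the names that were committed. The sketch below shows one way such a gate could work. It is not the implementation behind `mise run test:matrix:inventory`, it skips the token normalization the real baselines use, and `check_inventory` is a hypothetical name.

```bash
# Hypothetical inventory gate: read current test names (one per line) on
# stdin and compare them with a pinned baseline file given as $1.
# Returns non-zero and reports the differing names when the sets diverge.
check_inventory() {
  baseline=$1
  current=$(mktemp)
  sort -u > "$current"
  # comm -3 keeps only names that appear in exactly one of the two inputs.
  drift=$(sort -u "$baseline" | comm -3 - "$current")
  rm -f "$current"
  if [ -n "$drift" ]; then
    printf 'inventory drift:\n%s\n' "$drift" >&2
    return 1
  fi
}
```

A caller would pipe a list of test names into it, for example `list_tests | check_inventory matrix_tests.txt`, where `list_tests` stands in for whatever command enumerates the suite.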

### 1b. `tests/codegen/reference/` — 26,946 LOC, 108 files

| Type | Files | | Type | Files |
|---|---:|---|---|---:|
| `text` | 16 | | `int8` | 11 |
| `date` | 11 | | `numeric` | 11 |
| `float4` | 11 | | `timestamptz` | 11 |
| `float8` | 11 | | `bool` | 3 |
| `int2` | 11 | | `README.md` | 1 |
| `int4` | 11 | | | |

- **Provenance:** every reference SQL file opens with the one-line provenance header `-- REFERENCE: hand-maintained parity baseline for crates/eql-codegen`, followed by the generated body's own `-- AUTOMATICALLY GENERATED FILE.` banner.
- **Generation:** `cargo run -p eql-codegen` renders the SQL from `eql_scalars::CATALOG` + minijinja templates; the body is copied verbatim into the reference tree with a one-line provenance header.
- **Verification (three layers):**
  - `tasks/codegen-parity.sh`: strips the provenance line and `diff`s byte-for-byte against generated output; also asserts the reference dir set equals the catalog token set.
  - `crates/eql-codegen/tests/parity.rs`: `reference_dirs_match_catalog_tokens`, `rust_generator_matches_reference_files`, `generate_all_is_deterministic_across_runs`.
  - In-crate reference tests in `crates/eql-codegen/src/generate.rs`.

**Verdict:** generated golden files, deterministically reproducible. High confidence (95%+).
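
The strip-and-compare step is small enough to sketch. This is not the contents of `tasks/codegen-parity.sh`; it is an illustration that assumes a reference file is the generated file plus exactly one leading provenance line, and `parity_check` is a hypothetical name.

```bash
# Sketch of a byte-for-byte parity check (assumption: the reference file
# equals the generated file with one extra provenance line at the top).
# Usage: parity_check <reference-file> <generated-file>
parity_check() {
  reference=$1
  generated=$2
  # Drop line 1 (the provenance header), then compare the remaining bytes.
  tail -n +2 "$reference" | cmp -s - "$generated"
}
```

The function returns 0 on parity and non-zero on any byte difference, so it can gate a CI step directly, for example `parity_check ref.sql out.sql || exit 1`.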

### 1c. Gitignored generated SQL — 26,810 LOC (NOT in the diff)

The codegen materializes the scalar SQL surface into `src/v3/scalars/<T>/`
(`*_types.sql` / `*_functions.sql` / `*_operators.sql` / `*_aggregates.sql`),
all excluded by `.gitignore` (lines 234–240). `git ls-files src/v3/scalars`
lists exactly one tracked file, the hand-written `functions.sql`. This is ~27k
lines of real generated code that never appears in the LOC count.

---

## 2. Hand-written base implementation

### 2a. `crates/` — 12,383 LOC

| Crate | Added | Breakdown | Role |
|---|---:|---|---|
| **eql-scalars** | 2,446 | 2,425 `.rs`, 21 `.toml` | **The catalog — source of truth.** `CATALOG` of `ScalarSpec` rows; `Term` capabilities (`Hm`=eq, `Ore`=eq+ord) fixed in impls. Includes ~1,237 LOC unit tests + proptest invariants. Std-only, zero-dep. |
| **eql-codegen** | 2,532 | 2,401 `.rs`, 108 `.j2`, 23 `.toml` | **The renderers — codegen.** Reads `CATALOG`, renders SQL into `src/v3/scalars/<T>/`. Key files: `generate.rs` (683), `operator_surface.rs` (619), `context.rs` (380), `writer.rs` (272). Binary exposes `list-types` / `dump-catalog`. |
| **eql-tests-macros** | 774 | 757 `.rs`, 17 `.toml` | **Test-wiring proc-macros.** Expands one `scalar_types!` list into per-type SQLx-matrix wiring across the three test compilation contexts. |
| **eql-types** | 6,631 | 3,189 `.json`, 2,229 `.rs`, 1,097 `.ts`, 95 `.md`, 19 `.toml`, 2 `.gitignore` | **Mixed — mostly data.** v3 type models + conformance/catalog-parity tests in Rust; the bulk is committed JSON schema fixtures (`schema/v3/*.json`) and TS, not authored logic. |

**Genuine hand-authored Rust:** ~7,812 LOC (2,425 + 2,401 + 757 + 2,229), a
large share of which is tests. ~3,189 LOC is JSON schema data; ~1,097 is TS.

### 2b. `src/v3/` — 3,377 added (legacy: −7,505)

| Subdir | Added | Content |
|---|---:|---|
| `src/v3/jsonb/` | 1,780 | jsonb SteVec surface (types, functions, operators, aggregates, blockers, test) |
| `src/v3/sem/` | 1,018 | Hand-written SEM index-term types: `hmac_256`, `ore_block_256`, `ore_cllw`, `bloom_filter` |
| `src/v3/lint/` | 355 | `lints.sql` structural lint rules |
| `src/v3/` (root) | 164 | forked `crypto.sql` / `common.sql`, `schema.sql`, `version.template` |
| `src/v3/scalars/` | 60 | `functions.sql` — the sole committed scalar SQL (shared blocker) |

The **−7,505 deletions** are the old `eql_v2` surface removed in 3.0.0. The new
v3 implementation is entirely additive: genuinely new files under `src/v3/`
account for +3,324 / −0. The subdir table sums to 3,377 because it counts the
full body of `crypto.sql`; git records that file as a rename of
`src/crypto.sql`, so it contributes only +12 net, which gives the top-level
`src/` total of 3,336 (3,324 + 12).
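
The three `src/` figures can be reconciled with a few lines of shell arithmetic. The variable names are illustrative; the numbers are the ones quoted in this section.

```bash
# Figures from this section: genuinely new src/v3 files, the net lines git
# attributes to the renamed crypto.sql, and the subdir table rows.
new_files=3324
rename_net=12
table_sum=$((1780 + 1018 + 355 + 164 + 60))

echo "top-level src/ insertions: $((new_files + rename_net))"              # 3336
echo "subdir table sum:          $table_sum"                               # 3377
echo "table minus top-level:     $((table_sum - new_files - rename_net))"  # 41
```

The 41-line gap is the part of the table total that the rename accounting removes from the top-level count.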

### 2c. `tests/sqlx/src/` + `tests/sqlx/tests/` — ~17,213 LOC

**`tests/sqlx/src/` (9,137)** — reusable harness, near-zero per-type cost:

| File | LOC | Character |
|---|---:|---|
| `matrix.rs` | 3,572 | **Macro engine.** ~40 chained `macro_rules!` that fan out the cartesian product (category × domain × operator × pivot) + EXPLAIN-plan helpers. Emits the bulk of the suite's *expanded* test count from near-zero source. |
| `scalar_domains.rs` | 1,754 | Declarative per-type trait wiring (`ScalarType`/`OrderedScalar`/…), materialized via local macros (`int_values!`, `temporal_values!`). |
| `fixtures/` subtree | 2,948 | Real-ciphertext fixture generation: `driver.rs` (548), `eql_plaintext.rs` (509), `spec.rs` (413), `cipherstash.rs` (412, ZeroKMS path), `scalar_fixture.rs` (283), validation. |
| `property.rs` | 519 | **All-pairs oracle engine** — `assert_eq_oracle`/`assert_ord_oracle` over every ordered pair, function-double + extractor oracles, proptest bridging. |

**`tests/sqlx/tests/` (8,076)** — mostly bespoke assertions:

| Subtree | LOC | Character |
|---|---:|---|
| `encrypted_domain/` | 4,420 | `family/` structural SQL-catalog suites (sem, mutations, inlinability, support) + `property/` oracle drivers (thin row-sourcing over the shared engine) |
| `tests/` (root) | 3,656 | `v3_jsonb_tests.rs` (1,590 — 33 hand-authored SteVec/JSONB tests that can't fit the scalar matrix), `v3_jsonb_operator_surface_tests.rs` (474), `ore_block_comparator_tests.rs` (474), `ore_cllw_v3_opclass_tests.rs` (466), `text/text_match.rs` (398) |

**Key structural finding:** per-type wiring is **~10 lines total** —
`scalar_types.rs` lists `int4 => i32, int2 => i16, …` and `scalars/mod.rs` is a
single `scalar_types!(matrix_suites);` invocation. The large *expanded* test
count massively overstates hand-authored effort; the irreducible bespoke logic
is the JSONB/SteVec suite (~2,500 lines) plus the SEM/ORE/family structural checks.

---

## 3. Synthesis

| Class | LOC | Share of insertions |
|---|---:|---:|
| Committed generated/snapshot artifacts | ~59,142 | ~58% |
| Generated lockfile (`Cargo.lock`) | 5,590 | ~5% |
| Committed JSON/TS schema data (`eql-types`) | ~4,286 | ~4% |
| **Hand-authored base implementation** | **~11–13k** | **~11–13%** |
| Docs / tooling / CI | ~4,500 | ~4% |
| Other test wiring/harness (counted above in 13k) | — | — |

**Outside the diff:** ~26,810 lines of gitignored generated scalar SQL.

### Bottom line

The PR's apparent size is dominated by deterministically-regenerable, CI-gated
artifacts — not new logic. The **true hand-written base implementation is on the
order of 11–13k LOC**, and *all* generated output (the 59k committed + 27k
gitignored + the expanded test matrix) is single-sourced from the **~5k-LOC
`eql-scalars` catalog + `eql-codegen` renderers**. Adding a new scalar type is
one catalog row plus ~one line of test wiring.

---

## Reproduce

```bash
base=$(git merge-base HEAD origin/main) # 80a7a2bc...

# Total
git diff --stat $base HEAD | tail -1

# By top-level path
git diff --numstat $base HEAD \
| awk '{split($3,a,"/"); add[a[1]]+=$1; del[a[1]]+=$2}
END{for(t in add) printf "%10d + %8d - %s\n", add[t], del[t], t}' \
| sort -rn

# Generated buckets
git diff --numstat $base HEAD -- tests/sqlx/snapshots tests/codegen/reference \
| awk '{a+=$1} END{print a}'

# Gitignored generated SQL on disk (after a build)
find src/v3/scalars -name '*.sql' \
| grep -E '_(types|functions|operators|aggregates)\.sql$' \
| xargs wc -l | tail -1
```
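
To go from per-path totals to the per-class view used in the synthesis table, the same `--numstat` stream can be bucketed by path prefix. This is a rough sketch and not a script shipped in the repo. The path-to-class rules are assumptions taken from the breakdown table, and `classify_numstat` is a hypothetical name. It is deliberately coarse: everything under `crates/` counts as hand-written, including the JSON/TS fixtures.

```bash
# Hypothetical classifier: reads `git diff --numstat` output on stdin and
# prints total insertions per class, one "class<TAB>count" line each.
classify_numstat() {
  awk -F'\t' '
    $1 == "-" { next }    # binary files report "-" instead of line counts
    {
      class = "other"
      if ($3 ~ /^tests\/sqlx\/snapshots\// || $3 ~ /^tests\/codegen\/reference\//)
        class = "generated"
      else if ($3 == "Cargo.lock")
        class = "lockfile"
      else if ($3 ~ /^crates\// || $3 ~ /^src\// || $3 ~ /^tests\/sqlx\//)
        class = "hand-written"
      add[class] += $1
    }
    END { for (c in add) printf "%s\t%d\n", c, add[c] }
  ' | sort
}
```

It would be fed the same way as the commands above: `git diff --numstat "$base" HEAD | classify_numstat`.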