cipherstash · tobyhede · Jun 24, 2026 · Jun 24, 2026
diff --git a/docs/development/2026-06-24-eql-v3-pr-loc-analysis.md b/docs/development/2026-06-24-eql-v3-pr-loc-analysis.md
@@ -0,0 +1,209 @@
+# EQL v3 PR — LOC Analysis: Generated vs. Base Implementation
+
+**Date:** 2026-06-24
+**Branch:** `eql_v3` vs `main`
+**Merge base:** `80a7a2bc21a04ed7af5916d252605d576dcbc21a`
+**Method:** `git diff --numstat <merge-base> HEAD`, cross-checked by four parallel
+analysis agents (one per independent domain) reading file contents and the
+generation/verification mechanisms.
+
+---
+
+## TL;DR
+
+The PR reports as **~100k LOC** (498 files, **102,511 insertions / 32,897
+deletions**), but that number is dominated by machine-produced artifacts:
+
+- **~59,142 lines (≈58% of insertions) are committed generated/snapshot
+  artifacts**: `cargo expand` macro snapshots and golden codegen reference SQL.
+  Apart from the directories' README files, no human authored them; they
+  regenerate deterministically from a small Rust source and are diffed in CI.
+- A **further ~26,810 lines of generated scalar SQL** is materialized to disk by
+  the codegen but **gitignored**, so it is absent from the diff entirely.
+- The **genuine hand-authored base implementation is ~11–13k LOC** (~8k Rust
+  catalog/codegen/macros/types + ~3.3k hand-written SQL + bespoke test suites).
+- The leverage point for *all* of the generated output (committed golden refs,
+  gitignored SQL, and the expanded test matrix) is the **~5k-LOC `eql-scalars`
+  catalog + `eql-codegen` renderers**.
+
+**Review implication:** the generated buckets do not need line-by-line human
+review — the parity/inventory tests verify them. Reviewer attention belongs on
+the ~5k catalog + codegen and the bespoke JSONB/SEM/ORE test suites.
+
+---
+
+## Top-level breakdown (insertions)
+
+| Path | Added | Deleted | Class |
+|---|---:|---:|---|
+| `tests/sqlx/snapshots/` | 32,196 | 0 | **Generated** (macro-expand snapshot + matrix baselines) |
+| `tests/codegen/reference/` | 26,946 | 0 | **Generated** (golden codegen reference SQL) |
+| `tests/sqlx/src/` + `tests/sqlx/tests/` | ~17,213 | 0 | Hand-written test harness + suites |
+| `crates/` | 12,383 | 0 | Hand-written Rust (+ JSON/TS fixtures) |
+| `Cargo.lock` | 5,590 | 0 | Generated lockfile |
+| `src/` | 3,336 | 7,505 | Hand-written SQL (v3) / legacy removal |
+| `docs/` | 1,644 | 5,365 | Docs |
+| `tasks/` | 902 | 442 | Build/CI tooling |
+| `.github/` | 808 | 74 | CI workflows |
+| `mise.toml` | 606 | 5 | Task config |
+| Other (`DEVELOPMENT.md`, `SUPABASE.md`, `CLAUDE.md`, `README`, …) | ~700 | ~500 | Docs/meta |
+
+**Generated/snapshot subtotal in the committed diff: ≈ 59,142 (58% of insertions).**
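+
+As a quick arithmetic check, the subtotal and its share can be recomputed from
+the figures in the table above. This is a minimal sketch that hard-codes those
+figures; the variable names are illustrative and it does not read the repository.
+
+```bash
+# Figures copied from the table above (insertions only).
+snapshots=32196     # tests/sqlx/snapshots/
+reference=26946     # tests/codegen/reference/
+total=102511        # all insertions in the PR
+
+generated=$((snapshots + reference))
+
+# awk handles the floating-point division that shell arithmetic lacks.
+share=$(awk -v g="$generated" -v t="$total" 'BEGIN { printf "%.1f", 100 * g / t }')
+
+echo "generated=$generated share=${share}%"
+```
+
+The share comes out just under 58%, which is where the rounded ≈58% comes from.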
+
+---
+
+## 1. Generated artifacts (committed, not authored)
+
+### 1a. `tests/sqlx/snapshots/` — 32,196 LOC
+
+| File | LOC | What it is |
+|---|---:|---|
+| `int4_expanded.rs` | **31,260** | `cargo expand` macro-expansion snapshot of the `int4` matrix suite (220 `#[rustc_test_marker]` blocks). Pure rustc test-harness boilerplate. |
+| `matrix_tests_text.txt` | 306 | token-normalized matrix baseline (text shape) |
+| `matrix_tests.txt` | 220 | token-normalized matrix baseline (canonical) |
+| `README.md` | 206 | docs (only authored file in the dir) |
+| `v3_jsonb_tests.txt` | 76 | pinned test-name set (jsonb) |
+| `matrix_jsonb_entry_tests.txt` | 55 | pinned test-name set (jsonb SteVec entry) |
+| `matrix_tests_eq_only.txt` | 54 | derived matrix baseline (eq-only) |
+| `matrix_tests_storage_only.txt` | 19 | matrix baseline (storage-only) |
+
+- **Provenance:** `int4_expanded.rs` line 1 — `` `eql_v3_int4` matrix suite — generated by `scalar_types!` ``. The macro source is `tests/sqlx/src/matrix.rs`.
+- **Regeneration:** `mise run test:matrix:expand` (`cargo +nightly-2026-05-01 expand --test encrypted_domain scalars::int4`), with pinned nightly + `cargo-expand 1.0.122` so the snapshot only moves when the macro moves.
+- **Verification:** `.github/workflows/macro-expand-eql.yml` regenerates and runs `git diff --exit-code` (non-blocking drift backstop). The `.txt` baselines are gated by `mise run test:matrix:inventory` in the `matrix-coverage` CI job.
+- **Note:** the `.rs` lives under `snapshots/` (not `tests/`) so Cargo never compiles it as a test target.
+
+**Verdict:** generated snapshot. High confidence.
+
+### 1b. `tests/codegen/reference/` — 26,946 LOC, 108 files
+
+| Type | Files | | Type | Files |
+|---|---:|---|---|---:|
+| `text` | 16 | | `int8` | 11 |
+| `date` | 11 | | `numeric` | 11 |
+| `float4` | 11 | | `timestamptz` | 11 |
+| `float8` | 11 | | `bool` | 3 |
+| `int2` | 11 | | `README.md` | 1 |
+| `int4` | 11 | | | |
+
+- **Provenance:** every file carries `-- REFERENCE: hand-maintained parity baseline for crates/eql-codegen` followed by `-- AUTOMATICALLY GENERATED FILE.`
+- **Generation:** `cargo run -p eql-codegen` renders the SQL from `eql_scalars::CATALOG` + minijinja templates; the body is copied verbatim into the reference tree with a one-line provenance header.
+- **Verification (three layers):**
+  - `tasks/codegen-parity.sh` — strips the provenance line and `diff`s byte-for-byte against generated output; also asserts the reference dir set equals the catalog token set.
+  - `crates/eql-codegen/tests/parity.rs` — `reference_dirs_match_catalog_tokens`, `rust_generator_matches_reference_files`, `generate_all_is_deterministic_across_runs`.
+  - In-crate reference tests in `crates/eql-codegen/src/generate.rs`.
+
+**Verdict:** generated golden files, deterministically reproducible. High confidence (95%+).
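+
+The first verification layer (strip the provenance line, then compare
+byte-for-byte) can be sketched as below. The file names and SQL are hypothetical
+stand-ins, and this is not the content of `tasks/codegen-parity.sh`; it only
+illustrates the technique under the assumption of a one-line header.
+
+```bash
+set -eu
+
+work=$(mktemp -d)
+trap 'rm -rf "$work"' EXIT
+
+# Hypothetical generator output and its committed reference copy.
+printf 'CREATE DOMAIN example AS jsonb;\n' > "$work/generated.sql"
+printf -- '-- REFERENCE: provenance header\nCREATE DOMAIN example AS jsonb;\n' \
+  > "$work/reference.sql"
+
+# Drop the one-line provenance header, then compare byte-for-byte.
+if tail -n +2 "$work/reference.sql" | cmp -s - "$work/generated.sql"; then
+  parity="ok"
+else
+  parity="drift"
+fi
+
+echo "parity: $parity"
+```
+
+`cmp -s` is used rather than `diff` because only the exit status matters here; a
+real gate would likely print the diff so that drift is easy to inspect.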
+
+### 1c. Gitignored generated SQL — 26,810 LOC (NOT in the diff)
+
+The codegen materializes the scalar SQL surface into `src/v3/scalars/<T>/`
+(`*_types.sql` / `*_functions.sql` / `*_operators.sql` / `*_aggregates.sql`),
+all excluded by `.gitignore` (lines 234–240). `git ls-files src/v3/scalars`
+lists exactly one tracked file, the hand-written `functions.sql`. The result is
+~27k lines of real generated code that never appear in the LOC count.
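+
+The tracked/ignored split described above can be checked with `git ls-files`.
+The sketch below builds a throwaway repository; the ignore pattern and the
+`int4_types.sql` file are assumptions for illustration, not the project's
+actual `.gitignore` entries.
+
+```bash
+set -eu
+
+repo=$(mktemp -d)
+trap 'rm -rf "$repo"' EXIT
+cd "$repo"
+git init -q .
+
+# Hypothetical layout: one hand-written file plus one generated file.
+mkdir -p src/v3/scalars/int4
+printf 'src/v3/scalars/*/*.sql\n' > .gitignore
+: > src/v3/scalars/functions.sql
+: > src/v3/scalars/int4/int4_types.sql
+git add .
+
+# Files git tracks under the scalar tree.
+tracked=$(git ls-files src/v3/scalars)
+
+# Files present on disk but excluded by ignore rules.
+ignored=$(git ls-files --others --ignored --exclude-standard src/v3/scalars)
+
+echo "tracked: $tracked"
+echo "ignored: $ignored"
+```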
+
+---
+
+## 2. Hand-written base implementation
+
+### 2a. `crates/` — 12,383 LOC
+
+| Crate | Added | Breakdown | Role |
+|---|---:|---|---|
+| **eql-scalars** | 2,446 | 2,425 `.rs`, 21 `.toml` | **The catalog — source of truth.** `CATALOG` of `ScalarSpec` rows; `Term` capabilities (`Hm`=eq, `Ore`=eq+ord) fixed in impls. Includes ~1,237 LOC unit tests + proptest invariants. Std-only, zero-dep. |
+| **eql-codegen** | 2,532 | 2,401 `.rs`, 108 `.j2`, 23 `.toml` | **The renderers — codegen.** Reads `CATALOG`, renders SQL into `src/v3/scalars/<T>/`. Key files: `generate.rs` (683), `operator_surface.rs` (619), `context.rs` (380), `writer.rs` (272). Binary exposes `list-types` / `dump-catalog`. |
+| **eql-tests-macros** | 774 | 757 `.rs`, 17 `.toml` | **Test-wiring proc-macros.** Expands one `scalar_types!` list into per-type SQLx-matrix wiring across the three test compilation contexts. |
+| **eql-types** | 6,631 | 3,189 `.json`, 2,229 `.rs`, 1,097 `.ts`, 95 `.md`, 19 `.toml`, 2 `.gitignore` | **Mixed — mostly data.** v3 type models + conformance/catalog-parity tests in Rust; the bulk is committed JSON schema fixtures (`schema/v3/*.json`) and TS, not authored logic. |
+
+**Genuine hand-authored Rust:** ~7,812 LOC (2,425 + 2,401 + 757 + 2,229), a
+large share of which is tests. ~3,189 LOC is JSON schema data; ~1,097 is TS.
+
+### 2b. `src/v3/` — 3,377 added (legacy: −7,505)
+
+| Subdir | Added | Content |
+|---|---:|---|
+| `src/v3/jsonb/` | 1,780 | jsonb SteVec surface (types, functions, operators, aggregates, blockers, test) |
+| `src/v3/sem/` | 1,018 | Hand-written SEM index-term types: `hmac_256`, `ore_block_256`, `ore_cllw`, `bloom_filter` |
+| `src/v3/lint/` | 355 | `lints.sql` structural lint rules |
+| `src/v3/` (root) | 164 | forked `crypto.sql` / `common.sql`, `schema.sql`, `version.template` |
+| `src/v3/scalars/` | 60 | `functions.sql` — the sole committed scalar SQL (shared blocker) |
+
+The **−7,505 deletions** are the old `eql_v2` surface removed in 3.0.0. The new
+v3 implementation is entirely additive: genuinely new files under `src/v3/` come
+to +3,324 / −0. The subdir table sums to 3,377 because it counts the full body of
+`crypto.sql`; git records that file as a rename of `src/crypto.sql` and credits it
+with only +12, which gives the top-level `src/` total of 3,336 (3,324 + 12).
+
+### 2c. `tests/sqlx/src/` + `tests/sqlx/tests/` — ~17,213 LOC
+
+**`tests/sqlx/src/` (9,137)** — reusable harness, near-zero per-type cost:
+
+| File | LOC | Character |
+|---|---:|---|
+| `matrix.rs` | 3,572 | **Macro engine.** ~40 chained `macro_rules!` macros that fan out the Cartesian product (category × domain × operator × pivot), plus EXPLAIN-plan helpers. Emits the bulk of the suite's *expanded* test count from near-zero per-type source. |
+| `scalar_domains.rs` | 1,754 | Declarative per-type trait wiring (`ScalarType`/`OrderedScalar`/…), materialized via local macros (`int_values!`, `temporal_values!`). |
+| `fixtures/` subtree | 2,948 | Real-ciphertext fixture generation: `driver.rs` (548), `eql_plaintext.rs` (509), `spec.rs` (413), `cipherstash.rs` (412, ZeroKMS path), `scalar_fixture.rs` (283), validation. |
+| `property.rs` | 519 | **All-pairs oracle engine** — `assert_eq_oracle`/`assert_ord_oracle` over every ordered pair, function-double + extractor oracles, proptest bridging. |
+
+**`tests/sqlx/tests/` (8,076)** — mostly bespoke assertions:
+
+| Subtree | LOC | Character |
+|---|---:|---|
+| `encrypted_domain/` | 4,420 | `family/` structural SQL-catalog suites (sem, mutations, inlinability, support) + `property/` oracle drivers (thin row-sourcing over the shared engine) |
+| `tests/` (root) | 3,656 | `v3_jsonb_tests.rs` (1,590 — 33 hand-authored SteVec/JSONB tests that can't fit the scalar matrix), `v3_jsonb_operator_surface_tests.rs` (474), `ore_block_comparator_tests.rs` (474), `ore_cllw_v3_opclass_tests.rs` (466), `text/text_match.rs` (398) |
+
+**Key structural finding:** per-type wiring is **~10 lines total** —
+`scalar_types.rs` lists `int4 => i32, int2 => i16, …` and `scalars/mod.rs` is a
+single `scalar_types!(matrix_suites);` invocation. The large *expanded* test
+count massively overstates hand-authored effort; the irreducible bespoke logic
+is the JSONB/SteVec suite (~2,500 lines) plus the SEM/ORE/family structural checks.
+
+---
+
+## 3. Synthesis
+
+| Class | LOC | Share of insertions |
+|---|---:|---:|
+| Committed generated/snapshot artifacts | ~59,142 | ~58% |
+| Generated lockfile (`Cargo.lock`) | 5,590 | ~5% |
+| Committed JSON/TS schema data (`eql-types`) | ~4,286 | ~4% |
+| **Hand-authored base implementation** | **~11–13k** | **~11–13%** |
+| Docs / tooling / CI | ~4,500 | ~4% |
+| Test harness + suites (`tests/sqlx/src/` + `tests/sqlx/tests/`, section 2c; bespoke suites overlap the row above) | ~17,213 | ~17% |
+
+**Outside the diff:** ~26,810 lines of gitignored generated scalar SQL.
+
+### Bottom line
+
+The PR's apparent size is dominated by deterministically-regenerable, CI-gated
+artifacts — not new logic. The **true hand-written base implementation is on the
+order of 11–13k LOC**, and *all* generated output (the 59k committed + 27k
+gitignored + the expanded test matrix) is single-sourced from the **~5k-LOC
+`eql-scalars` catalog + `eql-codegen` renderers**. Adding a new scalar type is
+one catalog row plus ~one line of test wiring.
+
+---
+
+## Reproduce
+
+```bash
+base=$(git merge-base HEAD origin/main)   # 80a7a2bc...
+
+# Total
+git diff --stat $base HEAD | tail -1
+
+# By top-level path
+git diff --numstat $base HEAD \
+  | awk '{split($3,a,"/"); add[a[1]]+=$1; del[a[1]]+=$2}
+         END{for(t in add) printf "%10d + %8d -  %s\n", add[t], del[t], t}' \
+  | sort -rn
+
+# Generated buckets
+git diff --numstat $base HEAD -- tests/sqlx/snapshots tests/codegen/reference \
+  | awk '{a+=$1} END{print a}'
+
+# Gitignored generated SQL on disk (after a build)
+find src/v3/scalars -name '*.sql' \
+  | grep -E '_(types|functions|operators|aggregates)\.sql$' \
+  | xargs wc -l | tail -1
+```