
Investigate DuckDB for CSV ingestion in the ETL #43

Summary

Investigate replacing the hand-rolled ZIP + CSV parsing and row-by-row upsert path in esiid-etl with DuckDB as the ingestion engine.

Today the ETL:

  1. Downloads a ZIP from B2 to local disk
  2. Opens the ZIP, extracts the CSV with the zip crate
  3. Parses rows with csv in flexible mode (see #29: CSV parser flexible mode silently discards malformed rows, and #32: CSV header layout is hardcoded and never validated against actual file headers)
  4. Batches 1,000 rows at a time into a Postgres UNNEST bulk upsert (sketched after this list)
  5. For FUL files, loads the full set of incoming ESIIDs into memory to drive deactivation (see #25: Fix FUL deactivation memory scaling)
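
For context, the step 4 upsert looks roughly like the sketch below (shown with the runtime query API for brevity; the real code goes through the sqlx macros, per the note under Risks). The table name, column names, and the conditional WHERE here are illustrative, not the actual schema.

```rust
use sqlx::PgPool;

/// Simplified sketch of the current bulk upsert: one UNNEST statement per
/// 1,000-row batch. Identifiers and the WHERE condition are illustrative.
async fn upsert_batch(
    pool: &PgPool,
    esiids: Vec<String>,
    tdsps: Vec<String>,
) -> Result<(), sqlx::Error> {
    sqlx::query(
        r#"
        INSERT INTO esiids (esiid, tdsp, last_seen_at, is_active)
        SELECT esiid, tdsp, now(), true
        FROM UNNEST($1::text[], $2::text[]) AS t(esiid, tdsp)
        ON CONFLICT (esiid) DO UPDATE
            SET last_seen_at = EXCLUDED.last_seen_at,
                is_active    = true
            WHERE esiids.last_seen_at < EXCLUDED.last_seen_at
        "#,
    )
    .bind(&esiids)
    .bind(&tdsps)
    .execute(pool)
    .await?;
    Ok(())
}
```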

DuckDB can potentially collapse steps 2–5 by:

  • Replacing the hand-rolled parsing: DuckDB's CSV sniffer (read_csv_auto) handles header detection, validation, and type coercion natively. Note that it reads gzip/zstd-compressed files (compression='gzip'|'zstd') but not ZIP archives, so ERCOT ZIPs would still need extraction first
  • Using the Postgres extension to push rows into Postgres directly via INSERT INTO postgres.esiids SELECT ... FROM read_csv(...), replacing the manual UNNEST batching
  • Streaming the full file without materializing the ESIID set in Rust memory: FUL deactivation can be expressed as a SQL ANTI JOIN between the staging scan and the existing esiids rows for that TDSP (a combined sketch of this and the previous bullet follows the list)
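
The last two bullets could look like the following minimal sketch via the duckdb crate, assuming the postgres extension loads in embedded mode (exactly the open question under Risks). The connection string, schema and table names, the 'ONCOR' TDSP filter, and the CSV column names are all placeholders.

```rust
use duckdb::{Connection, Result};

/// Sketch: ingest one extracted CSV straight into Postgres through DuckDB.
/// All identifiers below are placeholders, not the real schema.
fn ingest_csv(csv_path: &str) -> Result<()> {
    let conn = Connection::open_in_memory()?;

    conn.execute_batch(
        "INSTALL postgres;
         LOAD postgres;
         ATTACH 'host=localhost dbname=esiid' AS pg (TYPE postgres);",
    )?;

    // Replaces the manual UNNEST batching: the sniffer infers headers and
    // types, and DuckDB streams the scan into the Postgres writer.
    conn.execute(
        &format!(
            "INSERT INTO pg.public.esiids_staging
             SELECT * FROM read_csv_auto('{csv_path}')"
        ),
        [],
    )?;

    // FUL deactivation as an ANTI JOIN: rows active in Postgres for this
    // TDSP but absent from the incoming file, never materialized in Rust.
    conn.execute_batch(&format!(
        "CREATE TEMP TABLE to_deactivate AS
         SELECT e.esiid
         FROM pg.public.esiids AS e
         ANTI JOIN read_csv_auto('{csv_path}') AS f USING (esiid)
         WHERE e.tdsp = 'ONCOR' AND e.is_active;"
    ))?;

    Ok(())
}
```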

Why

The current path re-implements work a query engine already does well: CSV sniffing, type coercion, batching, and set operations. Moving ingestion into DuckDB would retire the fragile flexible-mode parsing behind #29 and #32, eliminate the in-memory ESIID set behind #25, and shrink the ETL's hand-written hot path to a few SQL statements.

Risks / open questions

  • Adds a native dependency (DuckDB via duckdb crate or FFI) to the ETL binary — needs to build cleanly on the deployment target (Linux x86_64/arm64)
  • Postgres extension must be available in the DuckDB build — the duckdb Rust crate ships extensions differently than the CLI; verify the postgres scanner works in embedded mode
  • Upsert semantics: we currently use ON CONFLICT (esiid) DO UPDATE ... WHERE to set last_seen_at / is_active. Need to confirm we can express this via the Postgres extension without losing the conditional update behaviour; this may require a COPY into a temp table plus a single INSERT ... ON CONFLICT statement (a sketch follows this list)
  • Transaction boundaries: the document lifecycle transitions (see #31: Document lifecycle state transitions are not atomic, no transaction wrapping) would need a transaction spanning both the DuckDB-driven load and the Postgres state update
  • If we drop sqlx::query_as! on the ingest hot path, the .sqlx/ offline cache impact is minimal (the SQL becomes a plain execute), but worth confirming
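
If the conditional upsert can't be pushed through the extension directly, the temp-table route could look like the sketch below: postgres_execute runs the quoted SQL verbatim inside Postgres, so the ON CONFLICT ... WHERE stays a single Postgres statement. This assumes the attached database and staging table from the previous sketch, and the WHERE condition is again illustrative.

```rust
use duckdb::{Connection, Result};

/// Sketch: after read_csv_auto has filled esiids_staging (previous sketch),
/// run the conditional upsert as a single statement inside Postgres.
fn upsert_from_staging(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CALL postgres_execute('pg', '
            INSERT INTO esiids (esiid, tdsp, last_seen_at, is_active)
            SELECT esiid, tdsp, now(), true FROM esiids_staging
            ON CONFLICT (esiid) DO UPDATE
                SET last_seen_at = EXCLUDED.last_seen_at,
                    is_active    = true
                WHERE esiids.last_seen_at < EXCLUDED.last_seen_at
        ');
        CALL postgres_execute('pg', 'TRUNCATE esiids_staging');",
    )
}
```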

Proposal

Prototype this as a second ingestion backend behind a trait in esiid-etl, keeping the current implementation as the fallback, so we can A/B it against real files and benchmark end-to-end throughput and memory.
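
A possible shape for that trait, purely as a sketch (every name here is invented for illustration, not existing esiid-etl code):

```rust
use std::path::Path;

/// Hypothetical backend abstraction for the A/B comparison.
trait IngestBackend {
    fn name(&self) -> &'static str;
    fn ingest(&self, csv: &Path, tdsp: &str)
        -> Result<IngestStats, Box<dyn std::error::Error>>;
}

/// Counters the end-to-end benchmark compares across backends.
struct IngestStats {
    rows_upserted: u64,
    rows_deactivated: u64,
}

struct CurrentCsvBackend; // today's zip + csv + UNNEST path
struct DuckDbBackend;     // the DuckDB prototype
```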

Relates to

Adopting DuckDB would reshape the current backlog. Once a prototype lands, each of the issues referenced above (#25, #29, #31, #32) should be re-evaluated and either closed as obsolete or rescoped against the new ingestion path.

Metadata

Labels: enhancement (New feature or request), performance (Performance improvement)
