
Investigate DuckDB for CSV ingestion in the ETL #43

Summary

Investigate replacing the hand-rolled ZIP + CSV parsing and row-by-row upsert path in esiid-etl with DuckDB as the ingestion engine.

Today the ETL:

  1. Downloads a ZIP from B2 to local disk
  2. Opens the ZIP, extracts the CSV with the zip crate
  3. Parses rows with csv in flexible mode (see #29: CSV parser flexible mode silently discards malformed rows, and #32: CSV header layout is hardcoded and never validated against actual file headers)
  4. Batches 1,000 rows at a time into a Postgres UNNEST bulk upsert (sketched after this list)
  5. For FUL files, loads the full set of incoming ESIIDs into memory to drive deactivation (see #25: Fix FUL deactivation memory scaling)
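
For context, the step 4 upsert looks roughly like the sketch below (shown with the runtime query API for brevity; the real code goes through the sqlx macros, per the note under Risks). The table name, column names, and the conditional WHERE here are illustrative, not the actual schema.

```rust
use sqlx::PgPool;

/// Simplified sketch of the current bulk upsert: one UNNEST statement per
/// 1,000-row batch. Identifiers and the WHERE condition are illustrative.
async fn upsert_batch(
    pool: &PgPool,
    esiids: Vec<String>,
    tdsps: Vec<String>,
) -> Result<(), sqlx::Error> {
    sqlx::query(
        r#"
        INSERT INTO esiids (esiid, tdsp, last_seen_at, is_active)
        SELECT esiid, tdsp, now(), true
        FROM UNNEST($1::text[], $2::text[]) AS t(esiid, tdsp)
        ON CONFLICT (esiid) DO UPDATE
            SET last_seen_at = EXCLUDED.last_seen_at,
                is_active    = true
            WHERE esiids.last_seen_at < EXCLUDED.last_seen_at
        "#,
    )
    .bind(&esiids)
    .bind(&tdsps)
    .execute(pool)
    .await?;
    Ok(())
}
```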

DuckDB can potentially collapse steps 2–5 by:

  • Replacing the hand-rolled parsing: DuckDB's CSV sniffer (read_csv_auto) handles header detection, validation, and type coercion natively. Note that it reads gzip/zstd-compressed files (compression='gzip'|'zstd') but not ZIP archives, so ERCOT ZIPs would still need extraction first
  • Using the Postgres extension to push rows into Postgres directly via INSERT INTO postgres.esiids SELECT ... FROM read_csv(...), replacing the manual UNNEST batching
  • Streaming the full file without materializing the ESIID set in Rust memory: FUL deactivation can be expressed as a SQL ANTI JOIN between the staging scan and the existing esiids rows for that TDSP (a combined sketch of this and the previous bullet follows the list)
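
The last two bullets could look like the following minimal sketch via the duckdb crate, assuming the postgres extension loads in embedded mode (exactly the open question under Risks). The connection string, schema and table names, the 'ONCOR' TDSP filter, and the CSV column names are all placeholders.

```rust
use duckdb::{Connection, Result};

/// Sketch: ingest one extracted CSV straight into Postgres through DuckDB.
/// All identifiers below are placeholders, not the real schema.
fn ingest_csv(csv_path: &str) -> Result<()> {
    let conn = Connection::open_in_memory()?;

    conn.execute_batch(
        "INSTALL postgres;
         LOAD postgres;
         ATTACH 'host=localhost dbname=esiid' AS pg (TYPE postgres);",
    )?;

    // Replaces the manual UNNEST batching: the sniffer infers headers and
    // types, and DuckDB streams the scan into the Postgres writer.
    conn.execute(
        &format!(
            "INSERT INTO pg.public.esiids_staging
             SELECT * FROM read_csv_auto('{csv_path}')"
        ),
        [],
    )?;

    // FUL deactivation as an ANTI JOIN: rows active in Postgres for this
    // TDSP but absent from the incoming file, never materialized in Rust.
    conn.execute_batch(&format!(
        "CREATE TEMP TABLE to_deactivate AS
         SELECT e.esiid
         FROM pg.public.esiids AS e
         ANTI JOIN read_csv_auto('{csv_path}') AS f USING (esiid)
         WHERE e.tdsp = 'ONCOR' AND e.is_active;"
    ))?;

    Ok(())
}
```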

Why

The current path re-implements work a query engine already does well: CSV sniffing, type coercion, batching, and set operations. Moving ingestion into DuckDB would retire the fragile flexible-mode parsing behind #29 and #32, eliminate the in-memory ESIID set behind #25, and shrink the ETL's hand-written hot path to a few SQL statements.

Risks / open questions

  • Adds a native dependency (DuckDB via duckdb crate or FFI) to the ETL binary — needs to build cleanly on the deployment target (Linux x86_64/arm64)
  • Postgres extension must be available in the DuckDB build — the duckdb Rust crate ships extensions differently than the CLI; verify the postgres scanner works in embedded mode
  • Upsert semantics: we currently use ON CONFLICT (esiid) DO UPDATE ... WHERE to set last_seen_at / is_active. Need to confirm we can express this via the Postgres extension without losing the conditional update behaviour; this may require a COPY into a temp table plus a single INSERT ... ON CONFLICT statement (a sketch follows this list)
  • Transaction boundaries: the document lifecycle transitions (see #31: Document lifecycle state transitions are not atomic, no transaction wrapping) would need a transaction spanning both the DuckDB-driven load and the Postgres state update
  • If we drop sqlx::query_as! on the ingest hot path, the .sqlx/ offline cache impact is minimal (the SQL becomes a plain execute), but worth confirming
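
If the conditional upsert can't be pushed through the extension directly, the temp-table route could look like the sketch below: postgres_execute runs the quoted SQL verbatim inside Postgres, so the ON CONFLICT ... WHERE stays a single Postgres statement. This assumes the attached database and staging table from the previous sketch, and the WHERE condition is again illustrative.

```rust
use duckdb::{Connection, Result};

/// Sketch: after read_csv_auto has filled esiids_staging (previous sketch),
/// run the conditional upsert as a single statement inside Postgres.
fn upsert_from_staging(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CALL postgres_execute('pg', '
            INSERT INTO esiids (esiid, tdsp, last_seen_at, is_active)
            SELECT esiid, tdsp, now(), true FROM esiids_staging
            ON CONFLICT (esiid) DO UPDATE
                SET last_seen_at = EXCLUDED.last_seen_at,
                    is_active    = true
                WHERE esiids.last_seen_at < EXCLUDED.last_seen_at
        ');
        CALL postgres_execute('pg', 'TRUNCATE esiids_staging');",
    )
}
```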

Proposal

Prototype this as a second ingestion backend behind a trait in esiid-etl, keeping the current implementation as the fallback, so we can A/B it against real files and benchmark end-to-end throughput and memory.
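
A possible shape for that trait, purely as a sketch (every name here is invented for illustration, not existing esiid-etl code):

```rust
use std::path::Path;

/// Hypothetical backend abstraction for the A/B comparison.
trait IngestBackend {
    fn name(&self) -> &'static str;
    fn ingest(&self, csv: &Path, tdsp: &str)
        -> Result<IngestStats, Box<dyn std::error::Error>>;
}

/// Counters the end-to-end benchmark compares across backends.
struct IngestStats {
    rows_upserted: u64,
    rows_deactivated: u64,
}

struct CurrentCsvBackend; // today's zip + csv + UNNEST path
struct DuckDbBackend;     // the DuckDB prototype
```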

Relates to

Adopting DuckDB would reshape the current backlog. Once a prototype lands, each of the issues referenced above (#25, #29, #31, #32) should be re-evaluated and either closed as obsolete or rescoped against the new ingestion path.

Metadata

Labels: enhancement (New feature or request), performance (Performance improvement)
