Summary

Investigate replacing the hand-rolled ZIP + CSV parsing and row-by-row upsert path in esiid-etl with DuckDB as the ingestion engine. Today the ETL:

1. downloads the ERCOT ZIP exports and extracts them with the zip crate
2. parses the CSVs with the csv crate in flexible mode (see #29 "CSV parser flexible mode silently discards malformed rows" and #32 "CSV header layout is hardcoded and never validated against actual file headers")
3. validates and maps rows by hand against a hardcoded header layout
4. builds the FUL deactivation set in Rust memory
5. bulk-upserts the rows into Postgres via UNNEST batches

DuckDB can potentially collapse steps 2–5 by:
Reading ZIP-compressed CSVs directly (read_csv_auto with compression='zstd'|'gzip' — ERCOT ZIPs would still need extraction first, but DuckDB's CSV sniffer handles the parsing, header validation, and type coercion natively)
Using the Postgres extension to push rows into Postgres directly via INSERT INTO postgres.esiids SELECT ... FROM read_csv(...), replacing the manual UNNEST batching
Streaming the full file without materializing the ESIID set in Rust memory — FUL deactivation can be expressed as a SQL ANTI JOIN between the staging scan and the existing esiids rows for that TDSP (a combined sketch of all three steps follows this list)
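A minimal sketch of what those three steps could look like through the duckdb Rust crate, assuming its rusqlite-style Connection API and a working postgres extension in embedded mode (both flagged as open questions below). Table, column, and file names (pg.public.esiids, esiid, tdsp, address, the DSN) are illustrative, not our actual schema:

```rust
use duckdb::Connection;

/// Hypothetical one-shot ingest of a single extracted ERCOT CSV.
fn ingest_with_duckdb(csv_path: &str, pg_dsn: &str, tdsp: &str) -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;
    // NB: string interpolation is for sketch readability only;
    // a real loader should bind parameters instead.
    conn.execute_batch(&format!(
        r#"
        INSTALL postgres;
        LOAD postgres;
        ATTACH '{pg_dsn}' AS pg (TYPE postgres);

        -- Sniffed, vectorized parse of the already-extracted CSV; also where
        -- normalized_address could be computed once at load time, replacing
        -- the generated column.
        CREATE VIEW staging AS
            SELECT *, lower(trim(address)) AS normalized_address
            FROM read_csv_auto('{csv_path}', header = true);

        -- FUL deactivation as an ANTI JOIN instead of an in-memory ESIID set:
        -- ESIIDs Postgres knows for this TDSP that are absent from the file.
        -- Computed before the insert so the staged rows don't mask themselves.
        CREATE TABLE to_deactivate AS
            SELECT e.esiid
            FROM pg.public.esiids e
            ANTI JOIN staging s USING (esiid)
            WHERE e.tdsp = '{tdsp}';

        -- Naive load; the real upsert semantics are an open question (see Risks).
        INSERT INTO pg.public.esiids SELECT * FROM staging;
        "#
    ))
}
```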
Why
Simpler code — removes the bespoke parser + batcher + deactivation-set-builder in favour of declarative SQL
Robustness — malformed rows fail the load by default rather than being silently discarded, with skipping available as an explicit opt-in (ignore_errors)
Normalization — computes normalized_address once at load time rather than relying on a generated column
Performance — vectorized CSV parsing is dramatically faster than row-by-row csv crate parsing at 3–5M rows
Schema validation — DuckDB enforces column count and types at load, catching upstream ERCOT format drift early
Risks / open questions
Adds a native dependency (DuckDB via duckdb crate or FFI) to the ETL binary — needs to build cleanly on the deployment target (Linux x86_64/arm64)
Postgres extension must be available in the DuckDB build — the duckdb Rust crate ships extensions differently than the CLI; verify the postgres scanner works in embedded mode
Upsert semantics: we currently use ON CONFLICT (esiid) DO UPDATE ... WHERE to set last_seen_at / is_active. Need to confirm we can express this via the Postgres extension without losing the conditional-update behaviour — it may require a COPY into a temp table followed by a single INSERT ... ON CONFLICT statement (sketched after this list)
If we drop sqlx::query_as! on the ingest hot path, the .sqlx/ offline cache impact is minimal (the SQL becomes a plain execute), but worth confirming
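One possible shape for the temp-table route, continuing the earlier sketch: stage the rows into a scratch table on the Postgres side through the attached catalog (CREATE TABLE ... AS stands in for the COPY step here), then run the existing conditional upsert as a single statement via the extension's postgres_execute passthrough — assuming that passthrough is available in embedded mode, which is exactly the open question above. Column names and the WHERE condition are placeholders for the real conditional-update logic:

```rust
use duckdb::Connection;

/// Hypothetical conditional upsert via a Postgres-side staging table.
fn upsert_via_staging(conn: &Connection) -> duckdb::Result<()> {
    conn.execute_batch(
        r#"
        -- Materialize the staged rows inside Postgres first.
        CREATE TABLE pg.public.esiids_staging AS SELECT * FROM staging;

        -- Run the conditional upsert as one statement on the Postgres side
        -- (postgres_execute is the extension's raw-SQL passthrough; verify
        -- it works in the embedded build).
        CALL postgres_execute('pg', '
            INSERT INTO esiids (esiid, tdsp, first_seen_at, last_seen_at, is_active)
            SELECT esiid, tdsp, now(), now(), true
            FROM esiids_staging
            ON CONFLICT (esiid) DO UPDATE
                SET last_seen_at = EXCLUDED.last_seen_at,
                    is_active    = true
                WHERE esiids.last_seen_at IS DISTINCT FROM EXCLUDED.last_seen_at
        ');

        DROP TABLE pg.public.esiids_staging;
        "#,
    )
}
```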
Proposal
Prototype this as a second ingestion backend behind a trait in esiid-etl, keeping the current implementation as the fallback, so we can A/B it against real files and benchmark end-to-end throughput and memory.
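A rough sketch of that seam, with hypothetical names throughout (only UpsertStats is taken from the existing code, and its fields here are illustrative):

```rust
use std::{error::Error, path::Path};

/// Stats the current path already reports; fields illustrative.
pub struct UpsertStats {
    pub inserted: u64,
    pub updated: u64,
    pub deactivated: u64,
}

/// Seam between the two backends: same input file, same observable effects.
pub trait IngestBackend {
    /// Ingest one extracted ERCOT CSV for a TDSP and reconcile Postgres.
    fn ingest(&self, csv: &Path, tdsp: &str) -> Result<UpsertStats, Box<dyn Error>>;
}

/// Current zip + csv + UNNEST implementation, kept as the fallback.
pub struct RowByRowBackend;

/// DuckDB prototype described in this issue.
pub struct DuckDbBackend;
```

Putting both implementations behind the trait lets the benchmark harness run the same real files through each and compare UpsertStats, end-to-end throughput, and peak memory directly.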
Relates to
Adopting DuckDB would reshape the current backlog — once a prototype lands, each of these should be re-evaluated and either closed as obsolete or rescoped against the new ingestion path:
normalized_address can be written at load time rather than via a generated column
the block_in_place / block_on pattern goes away if the loader is expressed in SQL
FUL deactivation becomes an ANTI JOIN rather than an in-memory set
the upsert semantics (UpsertStats, DAILY re-activation, first_seen_at) apply regardless of the parsing engine
pipeline_runs bookkeeping is orthogonal