feat(extract): stream sources with bounded memory by NagaYu · Pull Request #10 · NagaYu/audiencesync

NagaYu · 2026-06-29T04:08:33Z

Resolves #2.

What

Replaces the load-everything-then-upload pipeline with a streaming one so peak memory stays flat regardless of audience size.

- extractor: streamCustomers() async-generates rows via Postgres/MySQL server-side cursors (pg-query-stream / mysql2 row streams) and Stripe lazy pagination. New batchAsync() groups the stream into bounded batches; extractCustomers() remains as an array-draining convenience for callers/tests.
- sync: new createSyncSession() with send()/finalize(). Each hashed batch fans out to both platforms (re-chunked to each one's own limit). Google's offline job is created lazily on the first batch and run on finalize. A failure in one platform is captured per-platform without aborting the other or the source stream.
- index: runSync loops extract → hash → upload one batch at a time; the full audience is never materialized.
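
For reviewers who want the shape of the pipeline without opening the diff, here is a minimal sketch of the three pieces described above. The names batchAsync, createSyncSession, send, finalize and runSync come from this PR; every signature, the Platform interface and the report shape are assumptions made for illustration, not the merged code. Hashing, per-platform re-chunking and Google's lazily created offline job are left out.

```typescript
// Sketch only: names mirror the PR description, bodies and types are assumed.

// Groups an async stream into arrays of at most `size` items.
// Being a generator, the size check runs on the first next() call.
async function* batchAsync<T>(
  source: AsyncIterable<T>,
  size: number,
): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError(`batch size must be a positive integer, got ${size}`);
  }
  let batch: T[] = [];
  for await (const row of source) {
    batch.push(row);
    if (batch.length === size) {
      yield batch; // suspends here until the consumer asks for more
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // trailing partial batch
}

// Hypothetical uploader shape standing in for the two ad platforms.
interface Platform<T> {
  name: string;
  upload(batch: T[]): Promise<void>;
}

// Fans each batch out to every platform and records failures per platform
// instead of rethrowing, so one failing platform does not stop the stream.
function createSyncSession<T>(platforms: Platform<T>[]) {
  const failures: { platform: string; message: string }[] = [];
  let rows = 0;
  return {
    async send(batch: T[]): Promise<void> {
      await Promise.all(
        platforms.map(async (p) => {
          try {
            await p.upload(batch);
          } catch (err) {
            const message = err instanceof Error ? err.message : String(err);
            failures.push({ platform: p.name, message });
          }
        }),
      );
      rows += batch.length; // rows offered to the platforms, not rows accepted
    },
    finalize() {
      return { rows, failures };
    },
  };
}

// One batch in flight at a time; the real pipeline hashes each batch before send().
async function runSync<T>(
  source: AsyncIterable<T>,
  platforms: Platform<T>[],
  batchSize: number,
) {
  const session = createSyncSession(platforms);
  for await (const batch of batchAsync(source, batchSize)) {
    await session.send(batch);
  }
  return session.finalize();
}
```

Because send() is awaited inside the for await loop, the source is not asked for the next batch until the current one has been handed to every platform, which is what keeps the buffer bounded in this sketch.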

Why

The previous implementation buffered the entire result set (pool.query) and the entire hashed array in memory — fine for small lists, but it undercut the "lightweight" positioning for multi-million-row audiences. (See #2.)

Tests

- New test/extractor.test.ts covers batchAsync grouping, the empty stream, size validation, and laziness (only one batch is pulled ahead, so buffering stays bounded).
- 39/39 tests pass; typecheck, lint, format:check, and build are all green locally.
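
To make the laziness claim concrete, here is one way such a check could look. This is not the contents of test/extractor.test.ts: the batchAsync body is an assumed stand-in, and pulledAfterFirstBatch and counting are hypothetical helper names. The idea is to count how many rows the source has produced after the consumer has taken a single batch.

```typescript
// Assumed stand-in for the PR's batchAsync(); the merged version may differ.
async function* batchAsync<T>(
  source: AsyncIterable<T>,
  size: number,
): AsyncGenerator<T[]> {
  let batch: T[] = [];
  for await (const row of source) {
    batch.push(row);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch;
}

// Hypothetical helper: reports how many rows the source produced
// after exactly one batch was requested from a very long stream.
async function pulledAfterFirstBatch(batchSize: number): Promise<number> {
  let pulled = 0;
  async function* counting(): AsyncGenerator<number> {
    for (let i = 0; i < 1000000; i++) {
      pulled++;
      yield i;
    }
  }
  const batches = batchAsync(counting(), batchSize);
  await batches.next(); // take a single batch
  await batches.return(undefined); // stop early; also closes the source generator
  return pulled;
}
```

With this stand-in, pulledAfterFirstBatch(100) resolves to 100 rather than 1000000, because the generator is suspended at its yield until the consumer asks again. An implementation that prefetches one batch would report up to twice the batch size, which is still bounded.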

Note

The DB streaming code paths require a live Postgres/MySQL to integration-test and aren't exercised by CI (no DB in the runner). The pure batching utility is unit-tested; the driver wiring is type-checked and built.

Replace the load-everything-then-upload pipeline with a streaming one:

- extractor: streamCustomers() async-generates rows via Postgres/MySQL server-side cursors (pg-query-stream / mysql2 row streams) and Stripe lazy pagination. Add batchAsync() to group the stream into bounded batches, and keep extractCustomers() as an array-draining convenience.
- sync: introduce createSyncSession() with send()/finalize(). Each hashed batch is fanned out to both platforms (re-chunked to each one's limit); Google's offline job is created lazily on the first batch and run on finalize. Per-platform failures are captured without aborting the other.
- index: runSync now loops extract → hash → upload one batch at a time, so the full audience is never materialized in memory.

Adds bounded-memory streaming tests for batchAsync. Documents the memory characteristics in the README and CHANGELOG.

Closes #2

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NagaYu merged commit 0b0a159 into main Jun 29, 2026
0 of 3 checks passed

NagaYu deleted the feat/streaming-extraction branch June 29, 2026 04:09

NagaYu mentioned this pull request Jun 29, 2026

fix(lint): disable require-await for test async-generator fixtures #11

Merged
