runtime: reactor spools journal appends to /mnt/local without bound or backpressure, fills disk and hard-fails all shards #3095

Description

@jwhartley

1. Priority

Medium-High - no data loss, but a single task can fill a reactor's local disk through the unbounded journal-append spool and hard-fail every shard on that reactor (a collateral outage), including the data plane's L1 stats rollup. Each incident is recovered by a reactor restart; the defect is latent on every data plane.

2. Scope and prevalence

The reactor (flowctl-go) spools journal appends to /mnt/local before persisting fragments to the fragment store. When appends arrive faster than they persist, or a single append/checkpoint is very large, the local spool grows with no cap and no backpressure. Theoretical reach: every data plane. Observed prevalence: a private AWS plane during a large Postgres backfill; the same class of reactor disk-fill has been seen before on another source-postgres backfill, so this is a recurring pattern, not a one-off.

3. Trigger (testable)

A high-volume capture (here source-amazon-rds-postgres) running a large backfill emits a checkpoint per backfill chunk, and each checkpoint is written to the reactor's gazette append buffer on /mnt/local. With a large backfill_chunk_size and large documents, a single checkpoint reaches tens of GB. Combined with high sustained append volume into a few hot collections, the spool grows until /mnt/local is exhausted, at which point every shard on the reactor fails with disk-full.

4. Root cause (confirmed)

Confirmed - no cap or backpressure on the local append/fragment spool. lsof +L1 showed flowctl-go (the reactor process) holding ~124 GB of deleted-but-open spool files: hundreds of /mnt/local/reactor/gazette-append* (~70 MB each) plus one anonymous #<inode> buffer of 71.6 GB (a single backfill-chunk checkpoint). The files were unlinked but held open, so du could not see them while the blocks stayed allocated.
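The du/df disagreement follows from ordinary POSIX unlink semantics, and a minimal sketch can show it in isolation. The function name and the `gazette-append-demo-` prefix below are illustrative only; nothing here is reactor code.

```python
import os
import tempfile


def unlinked_but_open_demo(directory: str, size_bytes: int) -> dict:
    """Write a file, unlink it while the descriptor stays open, and report what the kernel still sees."""
    # Hypothetical file name; it only mimics the spool naming for readability.
    fd, path = tempfile.mkstemp(dir=directory, prefix="gazette-append-demo-")
    try:
        os.write(fd, b"\0" * size_bytes)
        os.unlink(path)    # the name is gone, so a directory walk (du) cannot find it
        st = os.fstat(fd)  # the inode lives on for as long as a descriptor is open
        return {
            "path_exists": os.path.exists(path),
            "link_count": st.st_nlink,
            "size_bytes": st.st_size,
        }
    finally:
        os.close(fd)       # only after the last close can the filesystem free the blocks
```

The link count drops to zero while the size is unchanged, which is the state lsof +L1 lists.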

5. Investigation steps / reproduction (testable)

  • df -h /mnt/local: 97% used, 0 free at peak. du -x /mnt/local: ~1.3 GB. podman system df: <1 GB. The ~124 GB gap was unaccounted.
  • lsof +L1 | grep mnt/local: holder flowctl-go, entries gazette-append* plus a 71.6 GB anonymous deleted file.
  • Every shard on the reactor failed disk-full. The L1 catalog-stats derivation failed with:
      runTransactions: txnStartCommit: app.FinalizeTxn: h2 protocol error: error reading a body from connection
  • sudo systemctl restart reactor.service released the held fds; /mnt/local recovered. It refilled on each subsequent large backfill chunk until the capture's backfill_chunk_size was lowered and the reactor disk was enlarged.
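The lsof +L1 check above could also be scripted for monitoring. The following is a rough sketch, assuming a Linux host with /proc; `deleted_open_bytes` and its parameters are hypothetical names, not an existing tool. It sums apparent file sizes (st_size), which can overstate allocated blocks for sparse files.

```python
import os


def deleted_open_bytes(pid: int, mount_prefix: str) -> int:
    """Sum sizes of unlinked files under mount_prefix that pid still holds open (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    seen = set()  # (device, inode) pairs, so a file open on several fds counts once
    total = 0
    for name in os.listdir(fd_dir):
        fd_path = os.path.join(fd_dir, name)
        try:
            target = os.readlink(fd_path)  # e.g. "/mnt/local/... (deleted)"
            st = os.stat(fd_path)          # follows the link to the open file itself
        except OSError:
            continue                       # descriptor closed while we were scanning
        if not target.startswith(mount_prefix):
            continue                       # sockets, pipes, files on other mounts
        if st.st_nlink != 0:
            continue                       # still has a name, so du already counts it
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue
        seen.add(key)
        total += st.st_size
    return total
```

Run against the reactor's pid with `/mnt/local` as the prefix (reading another process's descriptors generally needs root), the result approximates the df-minus-du gap attributable to that process.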

6. Possible fixes

  1. Operational (used here): restart the reactor to release held spool; lower the capture's backfill_chunk_size (smaller checkpoints); enlarge reactor /mnt/local.
  2. Platform: cap the local append/fragment spool and apply backpressure (stall or slow appends) before /mnt/local is exhausted, instead of spooling unbounded and hard-failing every shard on the reactor.
  3. Investigate why a single 71.6 GB checkpoint buffer never persisted to the fragment store (was persistence keeping up with the append rate?).
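To make fix 2 concrete, here is a minimal sketch of a byte budget that stalls appenders until persisted bytes are released. It is an illustration of the backpressure idea only, written in Python for brevity; `SpoolBudget`, `reserve`, and `release` are hypothetical names and do not correspond to any existing gazette or reactor API.

```python
import threading
from typing import Optional


class SpoolBudget:
    """Byte budget for a local spool: writers block in reserve() until persisted bytes are released."""

    def __init__(self, cap_bytes: int):
        if cap_bytes <= 0:
            raise ValueError("cap_bytes must be positive")
        self.cap_bytes = cap_bytes
        self.used_bytes = 0
        self._cond = threading.Condition()

    def reserve(self, n: int, timeout: Optional[float] = None) -> bool:
        """Block until n bytes fit under the cap; return False if the timeout expires first."""
        if n > self.cap_bytes:
            # An append that can never fit must fail fast instead of stalling forever.
            raise ValueError("append larger than the whole spool budget")
        with self._cond:
            fits = self._cond.wait_for(
                lambda: self.used_bytes + n <= self.cap_bytes, timeout
            )
            if fits:
                self.used_bytes += n
            return fits

    def release(self, n: int) -> None:
        """Return n bytes to the budget once the fragment is persisted and its spool file closed."""
        with self._cond:
            self.used_bytes = max(0, self.used_bytes - n)
            self._cond.notify_all()
```

An appender would call `reserve(len(data))` before writing to the spool and the persister would call `release` afterwards. The sketch rejects an append larger than the cap outright, so a single oversized checkpoint like the one in fix 3 would still need its own handling, for example splitting or streaming.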

7. References
