runtime: reactor spools journal appends to /mnt/local without bound or backpressure, fills disk and hard-fails all shards #3095

## 1. Priority
Medium-High - no data loss, but a single task can fill a reactor's local disk via unbounded journal-append spool and hard-fail every shard on that reactor (collateral outage), including the data plane's L1 stats rollup. Recovered per-incident by a reactor restart; latent on every data plane.

## 2. Scope and prevalence
The reactor (`flowctl-go`) spools journal appends to `/mnt/local` before persisting fragments to the fragment store. When appends arrive faster than they persist, or a single append/checkpoint is very large, the local spool grows with no cap and no backpressure. Theoretical reach: every data plane. Observed prevalence: a private AWS plane during a large Postgres backfill; the same class of reactor disk-fill has been seen before on another source-postgres backfill, so this is a recurring pattern, not a one-off.

## 3. Trigger (testable)
A high-volume capture (here `source-amazon-rds-postgres`) running a large backfill emits a checkpoint per backfill chunk, and each checkpoint is written to the reactor's gazette append buffer on `/mnt/local`. With a large `backfill_chunk_size` and large documents, a single checkpoint can reach tens of GB (71.6 GB observed here). Combined with high sustained append volume into a few hot collections, the spool grows until `/mnt/local` is exhausted, at which point every shard on the reactor fails with disk-full.

## 4. Root cause (confirmed)
Confirmed - no cap or backpressure on the local append/fragment spool. `lsof +L1` showed `flowctl-go` (the reactor process) holding ~124 GB of deleted-but-open spool files: hundreds of `/mnt/local/reactor/gazette-append*` (~70 MB each) plus one anonymous `#<inode>` buffer of 71.6 GB (a single backfill-chunk checkpoint). The files were unlinked but held open, so `du`, which walks directory entries, could not see them, while `df` still counted their blocks as allocated.
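
The `du` versus `df` discrepancy can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX filesystem; the function name and the `spool-demo-` prefix are made up for illustration and have nothing to do with the reactor's actual spool code.

```python
import os
import tempfile


def unlinked_file_demo(size_bytes: int = 1 << 20) -> dict:
    """Write a file, unlink it while it is still open, and report what remains visible."""
    fd, path = tempfile.mkstemp(prefix="spool-demo-")  # hypothetical name
    try:
        os.write(fd, b"\0" * size_bytes)
        os.unlink(path)    # the directory entry is gone, so a du-style walk misses it
        st = os.fstat(fd)  # the inode is still reachable through the open descriptor
        return {
            "path_exists": os.path.exists(path),
            "link_count": st.st_nlink,
            "size_via_fd": st.st_size,
        }
    finally:
        os.close(fd)       # only after the last close can the kernel free the blocks


if __name__ == "__main__":
    print(unlinked_file_demo())
```

The space is returned only when the last descriptor closes, which is consistent with the reactor restart releasing `/mnt/local`.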

## 5. Investigation steps / reproduction (testable)
- `df -h /mnt/local`: 97% used, 0 free at peak. `du -x /mnt/local`: ~1.3 GB. `podman system df`: <1 GB. The ~124 GB gap was unaccounted for.
- `lsof +L1 | grep mnt/local`: holder `flowctl-go`, entries `gazette-append*` plus a 71.6 GB anonymous deleted file.
- Every shard on the reactor failed disk-full. The L1 `catalog-stats` derivation failed:
```
runTransactions: txnStartCommit: app.FinalizeTxn: h2 protocol error: error reading a body from connection
```
- `sudo systemctl restart reactor.service` released the held fds; `/mnt/local` recovered. It refilled on each subsequent large backfill chunk until the capture's `backfill_chunk_size` was lowered and the reactor disk was enlarged.
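
The `lsof +L1` check above can also be approximated by reading `/proc` directly, which is handy for a periodic check or alert. This is a rough sketch, assuming Linux and permission to read the target process's `/proc/<pid>/fd` (typically root or the same user); `deleted_open_files` is a hypothetical helper, not an existing tool.

```python
import os


def deleted_open_files(pid: int, prefix: str = "/mnt/local") -> list[tuple[str, int]]:
    """Return (target, size_bytes) for unlinked files under `prefix` still held open by `pid`."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            # The kernel appends " (deleted)" to the link target of an unlinked file.
            if target.startswith(prefix) and target.endswith(" (deleted)"):
                # stat() follows the link to the open file itself, so the size is still readable.
                found.append((target, os.stat(link).st_size))
        except OSError:
            continue  # descriptor was closed between listdir and readlink/stat
    return found


def held_bytes(pid: int, prefix: str = "/mnt/local") -> int:
    """Total apparent size of deleted-but-open files under `prefix`."""
    return sum(size for _, size in deleted_open_files(pid, prefix))
```

Comparing `held_bytes(<reactor pid>)` against the `df` minus `du` gap would show how much of the missing space the process accounts for. Note that `st_size` is the apparent size, so sparse files can overstate the blocks actually allocated.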

## 6. Possible fixes
1. Operational (used here): restart the reactor to release held spool; lower the capture's `backfill_chunk_size` (smaller checkpoints); enlarge reactor `/mnt/local`.
2. Platform: cap the local append/fragment spool and apply backpressure (stall or slow appends) before `/mnt/local` is exhausted, instead of spooling unbounded and hard-failing every shard on the reactor.
3. Investigate why a single 71.6 GB checkpoint buffer never persisted to the fragment store (was persistence keeping up with the append rate?).
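
To make fix 2 concrete, here is one possible shape for a capped spool: a byte budget that appenders must acquire before spooling and that the persist path releases once a fragment is safely in the fragment store. It is a sketch only, written in Python for brevity rather than in the reactor's own language; `SpoolBudget` and its methods are hypothetical and do not correspond to any existing reactor or gazette API.

```python
import threading
from typing import Optional


class SpoolBudget:
    """Byte budget for a local spool: acquire() blocks (backpressure) when the cap is reached."""

    def __init__(self, capacity_bytes: int) -> None:
        self._capacity = capacity_bytes
        self._in_use = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes: int, timeout: Optional[float] = None) -> bool:
        """Reserve nbytes before spooling; returns False if the wait timed out."""
        if nbytes > self._capacity:
            # A single append larger than the whole cap can never fit; fail fast and loudly.
            raise ValueError("append exceeds the entire spool budget")
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._in_use + nbytes <= self._capacity, timeout
            )
            if ok:
                self._in_use += nbytes
            return ok

    def release(self, nbytes: int) -> None:
        """Return nbytes to the budget once the data has been persisted."""
        with self._cond:
            self._in_use -= nbytes
            self._cond.notify_all()
```

With a cap in place, a hot producer stalls instead of exhausting the disk for every shard. The `ValueError` branch is where a single oversized checkpoint like the 71.6 GB buffer would surface, so a real design would also need to decide how to split or reject such appends.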

## 7. References
- Companion issues: estuary/connectors#4728 (source-postgres backfill checkpoint size), estuary/est-dry-dock#334 (reactor disk sizing)
- Internal #integrations escalation: repost pending (original retracted while root cause was corrected)

