feat(pgbackrest): scale WAL drop + spool ceilings by volume size #76

Merged: paulocsanz merged 1 commit into main from pc/scale-wal-thresholds-by-volume on May 14, 2026

@paulocsanz
Contributor

Summary

Both archive-failure thresholds were absolute constants: WAL_DROP_THRESHOLD_MB=500 for pg_wal/ hard-failure drops and archive-push-queue-max=5GiB for the async spool. Both were tuned for mid-size volumes. On a 1 GiB Hobby volume the 5 GiB spool ceiling was unreachable and 500 MiB of pg_wal/ was half the disk. This PR scales them DOWN proportionally, never above the existing 500 MiB / 5 GiB caps:

  • wal-drop: min(500 MiB, ~10% of volume), floor 64 MiB
  • queue-max: min(5 GiB, ~50% of volume), floor 128 MiB
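The two rules above can be sketched as a pair of clamp calls. This is a minimal illustration, not the code from wrapper.sh: the function names (clamp, wal_drop_mb, queue_max_mb) are hypothetical, and "~10%" / "~50%" are taken as integer division of the volume size in MiB.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not taken from wrapper.sh.

# clamp VALUE FLOOR CAP: print VALUE limited to the range [FLOOR, CAP].
clamp() {
  v=$1
  [ "$v" -lt "$2" ] && v=$2
  [ "$v" -gt "$3" ] && v=$3
  echo "$v"
}

# wal-drop: ~10% of the volume, floor 64 MiB, cap 500 MiB.
wal_drop_mb() { clamp $(( $1 / 10 )) 64 500; }

# queue-max: ~50% of the volume, floor 128 MiB, cap 5 GiB (5120 MiB).
queue_max_mb() { clamp $(( $1 / 2 )) 128 5120; }

# Volume sizes in MiB: a tiny volume (floors apply), 1 GiB, and 25 GiB (caps apply).
for vol_mb in 256 1024 25600; do
  echo "volume=${vol_mb} MiB wal-drop=$(wal_drop_mb "$vol_mb") MiB queue-max=$(queue_max_mb "$vol_mb") MiB"
done
```

Under these assumptions a 1024 MiB volume yields 102 / 512 MiB and a 25600 MiB volume yields 500 / 5120 MiB, matching the values listed in the test plan.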

The wide spread between the two budgets is kept: 10× once both caps apply, about 5× while both scale by percentage (102 MiB vs 512 MiB on a 1 GiB volume). Hard failures still bail fast, transient stalls still absorb generously. On volumes ≥25 GiB both ceilings hold (= today's behavior).

wrapper.sh computes thresholds at boot via df -Pk $RAILWAY_VOLUME_MOUNT_PATH, exports WAL_DROP_THRESHOLD_MB (only if not operator-set) so the archive_command wrapper inherits it, and threads the queue-max value into the rendered pgbackrest.conf.
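A rough sketch of that boot-time flow, assuming the total filesystem size (column 2 of df -Pk) is what gets scaled. The helper names, the fallback to / when RAILWAY_VOLUME_MOUNT_PATH is unset, and the one-option conf are all assumptions for illustration. Floors and caps are left out here to keep the focus on the df parsing and the "only if not operator-set" export.

```shell
#!/bin/sh
# Sketch only: names other than RAILWAY_VOLUME_MOUNT_PATH and
# WAL_DROP_THRESHOLD_MB are hypothetical.

# Total size in MiB of the filesystem holding the given path.
# df -P prints a header line, then one line per filesystem; with -k,
# column 2 is the total size in 1024-byte blocks.
volume_size_mb() {
  df -Pk "$1" | awk 'NR == 2 { print int($2 / 1024) }'
}

mount_path="${RAILWAY_VOLUME_MOUNT_PATH:-/}"   # fallback to / is an assumption
vol_mb=$(volume_size_mb "$mount_path")

# Percent-of-volume values; the floor/cap clamping is omitted in this sketch.
computed_wal_drop_mb=$(( vol_mb / 10 ))
computed_queue_max_mb=$(( vol_mb / 2 ))

# An operator-set value wins; otherwise export the computed default so that
# child processes (such as the archive_command wrapper) inherit it.
export WAL_DROP_THRESHOLD_MB="${WAL_DROP_THRESHOLD_MB:-$computed_wal_drop_mb}"

# Render the queue-max line; a real pgbackrest.conf carries more options.
render_conf() {
  printf '[global]\narchive-push-queue-max=%sMiB\n' "$1"
}

render_conf "$computed_queue_max_mb"
```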

Compatibility

  • Operator-set WAL_DROP_THRESHOLD_MB is respected (env var pre-existing wins over the computed default).
  • Operator-set PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX is respected via pgBackRest's env-var > config precedence.
  • Existing e2e tests (t_s3_unreachable_pg_stays_up, t_queue_max_5gib_trips) both rely on these override paths and continue to work.
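To illustrate the second override path: pgBackRest reads options from environment variables named PGBACKREST_ plus the option name in upper case with dashes turned into underscores, and those take precedence over the config file. The snippet below only models that rule in shell. opt_to_env and effective_option are hypothetical helpers, not part of pgBackRest or wrapper.sh.

```shell
#!/bin/sh
# Model of the env > config precedence; helper names are hypothetical.

# Map a pgBackRest option name to its environment variable name.
opt_to_env() {
  printf 'PGBACKREST_%s\n' "$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
}

# effective_option OPTION CONF_VALUE: print the env value if one is set,
# otherwise the value from the rendered conf.
effective_option() {
  env_name=$(opt_to_env "$1")
  eval "env_val=\${$env_name:-}"
  if [ -n "$env_val" ]; then
    echo "$env_val"
  else
    echo "$2"
  fi
}

# With no override, the rendered conf value applies.
unset PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX
effective_option archive-push-queue-max 512MiB

# With an operator override, the env value wins over the conf.
export PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX=5GiB
effective_option archive-push-queue-max 512MiB
```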

Test plan

  • Boot on a fresh ≥25 GiB volume → logs wal-drop=500 MiB queue-max=5120 MiB, behaves identically to current main
  • Boot on a 1 GiB Hobby volume → logs wal-drop=102 MiB queue-max=512 MiB, conf has archive-push-queue-max=512MiB
  • e2e suite still green (override paths unchanged)

Both archive-failure thresholds were absolute constants (500 MiB pg_wal drop,
5 GiB archive-push-queue-max), tuned for mid-size volumes. On a 1 GiB Hobby
volume the 5 GiB spool was unreachable and 500 MiB of pg_wal was half the
disk. On a 1 TB volume both were needlessly tight. Scale them DOWN
proportionally — never above the existing 500 MiB / 5 GiB caps:

- wal-drop: min(500 MiB, ~10% of volume), floor 64 MiB
- queue-max: min(5 GiB,  ~50% of volume), floor 128 MiB

The wide spread between the two budgets is kept: 10x once both caps apply,
about 5x while both scale by percentage. Hard failures still bail fast,
transient stalls still absorb generously.
On volumes >=25 GiB both ceilings hold (= today's behavior).

wrapper.sh computes thresholds at boot, exports WAL_DROP_THRESHOLD_MB (if
not operator-set) so the archive_command wrapper inherits it, and threads
the queue-max value into the rendered pgbackrest.conf. Operator overrides
keep working: WAL_DROP_THRESHOLD_MB is respected when pre-set, and pgBackRest
env-var precedence (env > config) means PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX
still wins over the rendered conf value.
@paulocsanz paulocsanz merged commit cf3aebe into main May 14, 2026
1 check passed