fix(pitr): drop WAL immediately on NoSuchBucket instead of accumulating by paulocsanz · Pull Request #77 · railwayapp-templates/postgres-ssl

paulocsanz · 2026-05-14T20:58:49Z

Summary

Captures pgBackRest stderr so we can inspect the failure reason
When the error contains NoSuchBucket (S3's bucket-not-found code), drops the WAL segment immediately instead of accumulating up to the threshold
Touches .pgbackrest_gap_pending so the backup watcher takes a fresh full if archiving recovers

Why

When a user deletes the PITR bucket, every archive-push fails with S3 NoSuchBucket. The old behavior let WAL accumulate in pg_wal up to the threshold (~10% of volume, capped at 500 MiB) before finally dropping. Since there's no recovery path without operator action (a deleted bucket won't come back on its own), accumulating WAL is pure waste.

NoSuchBucket is the canonical S3 XML error code returned by AWS, R2, MinIO, and GCS S3-interop when the bucket doesn't exist — distinct from transient errors, bad creds, or network blips — so it's safe to drop without waiting.

Test plan

Verify normal archive-push (success path) still exits 0 — pgbackrest output is captured and re-emitted to stderr unchanged
Verify transient failures (non-zero, no NoSuchBucket) still accumulate normally and trip the threshold
Verify bucket-deleted scenario (non-zero + NoSuchBucket in output) drops immediately and logs the right message

1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files), so postgres tried to start from uninitialized PGDATA and died. Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from pgbackrest-init.sh after PGDATA is initialized (fresh-volume path), or from validate_wal_archive_bucket when PG_VERSION already exists (restart path). 2. catalog_has_backup() used `|| return 1` which conflated pgbackrest exit non-zero (S3 unreachable, 403, stanza not yet created) with "conclusively no backup." A transient auth failure during the verify interval would clear last_full_at and trigger a spurious full backup attempt. Fix: rename to catalog_check_backup() with three return codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only clears state on exit 1. 3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to cause archive-push failures. Disabling produces InvalidAccessKeyId, which the archive-push wrapper instant-drops (exit 0 since #77/#78) to avoid WAL accumulation on deleted buckets — keeping failed_count=0 and defeating the test. Fix: switch the user to read-only policy so PutObject fails with AccessDenied (not in the instant-drop list), causing failed_count to grow as the test expects.

…er exit-55 self-heal (#81) * test(e2e): fix t_watcher_gap_recovery_full race `cleared gap marker` and `backup --type=full completed` are emitted back-to-back by run_backup() in the same shell function, but docker's stdout flush window can split them across separate `docker logs` snapshots. The old loop broke on seeing the new "completed" count and then re-queried for "cleared gap marker" — racing the flush. Capture `docker logs` once per iteration and require BOTH signals before declaring success. The "marker file is gone" assertion stays after the loop since it reads the filesystem, not stdout. * feat(pitr): refuse to enable archiving when WAL_ARCHIVE_BUCKET is junk If Railway's variable resolver can't bind ${{<bucket-id>.BUCKET}} to a live bucket (the bucket got tombstoned upstream, the env was forked without re-resolving, …), the literal template-ref string lands in the container's env. Today that string flows straight into pgbackrest.conf's repo1-s3-bucket, pgBackRest hard-fails every archive_command, and pgbackrest-archive-push-wrapper.sh's 500 MiB pg_wal threshold eventually drops segments — turning an upstream wiring bug into a real, unrecoverable PITR coverage gap. Validate up front: - unresolved template ref (contains ${{ or }}) - bucket-id UUID shape (8-4-4-4-12 hex) - whitespace / control chars When invalid, log, drop $PGDATA/.pgbackrest_invalid_bucket sentinel, unset the WAL_ARCHIVE_* vars so every downstream gate treats archiving as off (and clear_pgbackrest_state_if_disabled wipes any stale config from a previous valid bucket). Postgres boots clean; the dashboard surfaces the distinct invalid-bucket state via the sentinel + the existing monitor. * feat(pitr): periodic S3 catalog verification in backup watcher Adds a `catalog_has_backup()` check that runs `pgbackrest info --stanza=main --repo=1 --output=json` once per hour (WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS, default 3600). When local state says a full backup was taken but the catalog shows none, clears last_full_at so NEEDS_INITIAL_BACKUP fires on the next poll. Catches divergence between watcher state and S3 reality: - backup command returned exit 0 but catalog metadata was never committed (S3 partial write, stanza-create race at promotion time) - volume survived a redeployment with stale state pointing at a different sysid/stanza path on a fresh cluster Non-zero pgbackrest exit (stanza not yet created, S3 unreachable, auth failure) is treated as inconclusive — local state is not cleared — so transient S3 hiccups don't burn extra full backups. Mirrors the postgres-ha backup_watcher.rs change (same env knob, same logic). * fix(pitr): fix three test/runtime bugs in catalog-verify + bucket-guard 1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files), so postgres tried to start from uninitialized PGDATA and died. Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from pgbackrest-init.sh after PGDATA is initialized (fresh-volume path), or from validate_wal_archive_bucket when PG_VERSION already exists (restart path). 2. catalog_has_backup() used `|| return 1` which conflated pgbackrest exit non-zero (S3 unreachable, 403, stanza not yet created) with "conclusively no backup." A transient auth failure during the verify interval would clear last_full_at and trigger a spurious full backup attempt. Fix: rename to catalog_check_backup() with three return codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only clears state on exit 1. 3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to cause archive-push failures. Disabling produces InvalidAccessKeyId, which the archive-push wrapper instant-drops (exit 0 since #77/#78) to avoid WAL accumulation on deleted buckets — keeping failed_count=0 and defeating the test. Fix: switch the user to read-only policy so PutObject fails with AccessDenied (not in the instant-drop list), causing failed_count to grow as the test expects. * fix: retry stanza-create up to 5 times on transient failure If the bucket isn't ready immediately after provisioning (timing race, transient network error), the single stanza-create attempt fails and the container is permanently stuck — every subsequent pgbackrest backup exits 55 (FileMissingError) and the watcher retries forever without self-healing. Retry up to 5 times with a 30s gap. 5 × 30s covers the typical bucket- ready window while keeping total latency under 3 minutes. Permanent failures (bad credentials, sysid mismatch) still exhaust all attempts and fall back to the same "retry on next boot" path. * fix: retry stanza-create continuously until success * fix: stanza-create recovery in bootstrap (bounded) and watcher (exit-55) Bootstrap (bootstrap_pgbackrest_stanza): cap at 5 attempts with 30s gaps so transient bucket-provisioning races don't spin forever on permanent failures (e.g. node demoted mid-run). Watcher (run_backup): detect exit 55 (FileMissingError / backup.info absent), run stanza-create inline, then retry the backup once. The watcher poll loop handles further retries, so no deeper nesting is needed here. Together these ensure a stanza-create failure at boot is recovered on the next watcher cycle rather than requiring a manual SSH trigger. * fix: stanza-create bootstrap retries indefinitely (standalone has no demotion) Reverts the 5-attempt cap. Standalone postgres has no failover concept so there is no demotion scenario — the node is always primary while running. Infinite retry ensures stanza-create always completes despite transient bucket-provisioning races or auth blips on first boot. The watcher exit-55 path is a secondary recovery layer for any edge case that bootstrap misses.

fix(pitr): drop WAL immediately on NoSuchBucket instead of accumulating

d1f8b60

paulocsanz merged commit 3b7846d into main May 14, 2026
1 check passed

paulocsanz mentioned this pull request May 14, 2026

fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked #78

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pitr): drop WAL immediately on NoSuchBucket instead of accumulating#77

fix(pitr): drop WAL immediately on NoSuchBucket instead of accumulating#77
paulocsanz merged 1 commit into
mainfrom
pc/bucket-deleted-drop-wal

paulocsanz commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paulocsanz commented May 14, 2026

Summary

Why

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant