fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked by paulocsanz · Pull Request #78 · railwayapp-templates/postgres-ssl

paulocsanz · 2026-05-14T21:11:35Z

Summary

Extends the bucket-deleted immediate-drop logic (landed in #77) to also match InvalidAccessKeyId.

Why InvalidAccessKeyId and not just NoSuchBucket

Direct API probing against Tigris showed that the two S3 error codes apply to different request types:

GET to a non-existent bucket (any auth): 404 NoSuchBucket — Tigris checks bucket existence before credentials on reads.
PUT to a non-existent bucket with invalid/revoked credentials: 403 InvalidAccessKeyId — Tigris validates credentials before bucket existence on writes.

Since archive-push is a PUT, and Railway revokes the bucket credentials when the bucket is deleted, pgBackRest would see InvalidAccessKeyId in practice — not NoSuchBucket. This PR adds that pattern to the grep so both are caught.

What happens when the bucket comes back / a redeploy points at a new bucket

Both recovery paths work the same way:

Bucket re-created or credentials restored: pgBackRest's next archive-push succeeds. The wrapper returns 0 normally — the NoSuchBucket/InvalidAccessKeyId branch is never taken. The .pgbackrest_gap_pending sentinel left by the drop path is still on disk; the backup watcher picks it up on its next poll and takes a fresh full backup. That full backup becomes the new recovery base, so the PITR window restarts cleanly from the point archiving recovered.
Redeploy with a new bucket (WAL_ARCHIVE_BUCKET updated): wrapper.sh rewrites /etc/pgbackrest/pgbackrest.conf with the new bucket on boot and runs stanza-create. First archive-push to the new bucket succeeds. Same as above — gap_pending is already set, the watcher takes a full, PITR window starts fresh.

In both cases the gap itself (the segments dropped while the bucket was gone) is unrestorable — that's inherent to dropping them — but everything from the first successful archive after recovery is fully covered.

Edge case: credential rotation redeploy

InvalidAccessKeyId would also fire for a few seconds if the old key is revoked before the new container is live during a normal credential rotation. That creates a tiny archiving gap (~seconds of WAL vs. the previous 500 MiB accumulation). It's a better tradeoff in Railway's model where bucket deletion is permanent and immediate drops are the right response.

…entials are revoked

1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files), so postgres tried to start from uninitialized PGDATA and died. Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from pgbackrest-init.sh after PGDATA is initialized (fresh-volume path), or from validate_wal_archive_bucket when PG_VERSION already exists (restart path). 2. catalog_has_backup() used `|| return 1` which conflated pgbackrest exit non-zero (S3 unreachable, 403, stanza not yet created) with "conclusively no backup." A transient auth failure during the verify interval would clear last_full_at and trigger a spurious full backup attempt. Fix: rename to catalog_check_backup() with three return codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only clears state on exit 1. 3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to cause archive-push failures. Disabling produces InvalidAccessKeyId, which the archive-push wrapper instant-drops (exit 0 since #77/#78) to avoid WAL accumulation on deleted buckets — keeping failed_count=0 and defeating the test. Fix: switch the user to read-only policy so PutObject fails with AccessDenied (not in the instant-drop list), causing failed_count to grow as the test expects.

…er exit-55 self-heal (#81) * test(e2e): fix t_watcher_gap_recovery_full race `cleared gap marker` and `backup --type=full completed` are emitted back-to-back by run_backup() in the same shell function, but docker's stdout flush window can split them across separate `docker logs` snapshots. The old loop broke on seeing the new "completed" count and then re-queried for "cleared gap marker" — racing the flush. Capture `docker logs` once per iteration and require BOTH signals before declaring success. The "marker file is gone" assertion stays after the loop since it reads the filesystem, not stdout. * feat(pitr): refuse to enable archiving when WAL_ARCHIVE_BUCKET is junk If Railway's variable resolver can't bind ${{<bucket-id>.BUCKET}} to a live bucket (the bucket got tombstoned upstream, the env was forked without re-resolving, …), the literal template-ref string lands in the container's env. Today that string flows straight into pgbackrest.conf's repo1-s3-bucket, pgBackRest hard-fails every archive_command, and pgbackrest-archive-push-wrapper.sh's 500 MiB pg_wal threshold eventually drops segments — turning an upstream wiring bug into a real, unrecoverable PITR coverage gap. Validate up front: - unresolved template ref (contains ${{ or }}) - bucket-id UUID shape (8-4-4-4-12 hex) - whitespace / control chars When invalid, log, drop $PGDATA/.pgbackrest_invalid_bucket sentinel, unset the WAL_ARCHIVE_* vars so every downstream gate treats archiving as off (and clear_pgbackrest_state_if_disabled wipes any stale config from a previous valid bucket). Postgres boots clean; the dashboard surfaces the distinct invalid-bucket state via the sentinel + the existing monitor. * feat(pitr): periodic S3 catalog verification in backup watcher Adds a `catalog_has_backup()` check that runs `pgbackrest info --stanza=main --repo=1 --output=json` once per hour (WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS, default 3600). When local state says a full backup was taken but the catalog shows none, clears last_full_at so NEEDS_INITIAL_BACKUP fires on the next poll. Catches divergence between watcher state and S3 reality: - backup command returned exit 0 but catalog metadata was never committed (S3 partial write, stanza-create race at promotion time) - volume survived a redeployment with stale state pointing at a different sysid/stanza path on a fresh cluster Non-zero pgbackrest exit (stanza not yet created, S3 unreachable, auth failure) is treated as inconclusive — local state is not cleared — so transient S3 hiccups don't burn extra full backups. Mirrors the postgres-ha backup_watcher.rs change (same env knob, same logic). * fix(pitr): fix three test/runtime bugs in catalog-verify + bucket-guard 1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files), so postgres tried to start from uninitialized PGDATA and died. Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from pgbackrest-init.sh after PGDATA is initialized (fresh-volume path), or from validate_wal_archive_bucket when PG_VERSION already exists (restart path). 2. catalog_has_backup() used `|| return 1` which conflated pgbackrest exit non-zero (S3 unreachable, 403, stanza not yet created) with "conclusively no backup." A transient auth failure during the verify interval would clear last_full_at and trigger a spurious full backup attempt. Fix: rename to catalog_check_backup() with three return codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only clears state on exit 1. 3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to cause archive-push failures. Disabling produces InvalidAccessKeyId, which the archive-push wrapper instant-drops (exit 0 since #77/#78) to avoid WAL accumulation on deleted buckets — keeping failed_count=0 and defeating the test. Fix: switch the user to read-only policy so PutObject fails with AccessDenied (not in the instant-drop list), causing failed_count to grow as the test expects. * fix: retry stanza-create up to 5 times on transient failure If the bucket isn't ready immediately after provisioning (timing race, transient network error), the single stanza-create attempt fails and the container is permanently stuck — every subsequent pgbackrest backup exits 55 (FileMissingError) and the watcher retries forever without self-healing. Retry up to 5 times with a 30s gap. 5 × 30s covers the typical bucket- ready window while keeping total latency under 3 minutes. Permanent failures (bad credentials, sysid mismatch) still exhaust all attempts and fall back to the same "retry on next boot" path. * fix: retry stanza-create continuously until success * fix: stanza-create recovery in bootstrap (bounded) and watcher (exit-55) Bootstrap (bootstrap_pgbackrest_stanza): cap at 5 attempts with 30s gaps so transient bucket-provisioning races don't spin forever on permanent failures (e.g. node demoted mid-run). Watcher (run_backup): detect exit 55 (FileMissingError / backup.info absent), run stanza-create inline, then retry the backup once. The watcher poll loop handles further retries, so no deeper nesting is needed here. Together these ensure a stanza-create failure at boot is recovered on the next watcher cycle rather than requiring a manual SSH trigger. * fix: stanza-create bootstrap retries indefinitely (standalone has no demotion) Reverts the 5-attempt cap. Standalone postgres has no failover concept so there is no demotion scenario — the node is always primary while running. Infinite retry ensures stanza-create always completes despite transient bucket-provisioning races or auth blips on first boot. The watcher exit-55 path is a secondary recovery layer for any edge case that bootstrap misses.

fix(pitr): also match InvalidAccessKeyId on PUT path when bucket cred…

ad25e8d

…entials are revoked

paulocsanz merged commit 5bc03eb into main May 14, 2026
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked#78

fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked#78
paulocsanz merged 1 commit into
mainfrom
pc/bucket-drop-invalid-key

paulocsanz commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paulocsanz commented May 14, 2026

Summary

Why InvalidAccessKeyId and not just NoSuchBucket

What happens when the bucket comes back / a redeploy points at a new bucket

Edge case: credential rotation redeploy

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant