Skip to content

fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked#78

Merged
paulocsanz merged 1 commit into
mainfrom
pc/bucket-drop-invalid-key
May 14, 2026
Merged

fix(pitr): also match InvalidAccessKeyId when bucket credentials are revoked#78
paulocsanz merged 1 commit into
mainfrom
pc/bucket-drop-invalid-key

Conversation

@paulocsanz
Copy link
Copy Markdown
Contributor

Summary

Extends the bucket-deleted immediate-drop logic (landed in #77) to also match InvalidAccessKeyId.

Why InvalidAccessKeyId and not just NoSuchBucket

Direct API probing against Tigris showed that the two S3 error codes apply to different request types:

  • GET to a non-existent bucket (any auth): 404 NoSuchBucket — Tigris checks bucket existence before credentials on reads.
  • PUT to a non-existent bucket with invalid/revoked credentials: 403 InvalidAccessKeyId — Tigris validates credentials before bucket existence on writes.

Since archive-push is a PUT, and Railway revokes the bucket credentials when the bucket is deleted, pgBackRest would see InvalidAccessKeyId in practice — not NoSuchBucket. This PR adds that pattern to the grep so both are caught.

What happens when the bucket comes back / a redeploy points at a new bucket

Both recovery paths work the same way:

  1. Bucket re-created or credentials restored: pgBackRest's next archive-push succeeds. The wrapper returns 0 normally — the NoSuchBucket/InvalidAccessKeyId branch is never taken. The .pgbackrest_gap_pending sentinel left by the drop path is still on disk; the backup watcher picks it up on its next poll and takes a fresh full backup. That full backup becomes the new recovery base, so the PITR window restarts cleanly from the point archiving recovered.

  2. Redeploy with a new bucket (WAL_ARCHIVE_BUCKET updated): wrapper.sh rewrites /etc/pgbackrest/pgbackrest.conf with the new bucket on boot and runs stanza-create. First archive-push to the new bucket succeeds. Same as above — gap_pending is already set, the watcher takes a full, PITR window starts fresh.

In both cases the gap itself (the segments dropped while the bucket was gone) is unrestorable — that's inherent to dropping them — but everything from the first successful archive after recovery is fully covered.

Edge case: credential rotation redeploy

InvalidAccessKeyId would also fire for a few seconds if the old key is revoked before the new container is live during a normal credential rotation. That creates a tiny archiving gap (~seconds of WAL vs. the previous 500 MiB accumulation). It's a better tradeoff in Railway's model where bucket deletion is permanent and immediate drops are the right response.

@paulocsanz paulocsanz merged commit 5bc03eb into main May 14, 2026
1 check failed
paulocsanz added a commit that referenced this pull request May 15, 2026
1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into
   PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh
   skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files),
   so postgres tried to start from uninitialized PGDATA and died.
   Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from
   pgbackrest-init.sh after PGDATA is initialized (fresh-volume path),
   or from validate_wal_archive_bucket when PG_VERSION already exists
   (restart path).

2. catalog_has_backup() used `|| return 1` which conflated pgbackrest
   exit non-zero (S3 unreachable, 403, stanza not yet created) with
   "conclusively no backup." A transient auth failure during the verify
   interval would clear last_full_at and trigger a spurious full backup
   attempt. Fix: rename to catalog_check_backup() with three return
   codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only
   clears state on exit 1.

3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to
   cause archive-push failures. Disabling produces InvalidAccessKeyId,
   which the archive-push wrapper instant-drops (exit 0 since #77/#78)
   to avoid WAL accumulation on deleted buckets — keeping failed_count=0
   and defeating the test. Fix: switch the user to read-only policy so
   PutObject fails with AccessDenied (not in the instant-drop list),
   causing failed_count to grow as the test expects.
paulocsanz added a commit that referenced this pull request May 18, 2026
…er exit-55 self-heal (#81)

* test(e2e): fix t_watcher_gap_recovery_full race

`cleared gap marker` and `backup --type=full completed` are emitted
back-to-back by run_backup() in the same shell function, but docker's
stdout flush window can split them across separate `docker logs`
snapshots. The old loop broke on seeing the new "completed" count and
then re-queried for "cleared gap marker" — racing the flush.

Capture `docker logs` once per iteration and require BOTH signals
before declaring success. The "marker file is gone" assertion stays
after the loop since it reads the filesystem, not stdout.

* feat(pitr): refuse to enable archiving when WAL_ARCHIVE_BUCKET is junk

If Railway's variable resolver can't bind ${{<bucket-id>.BUCKET}} to a
live bucket (the bucket got tombstoned upstream, the env was forked
without re-resolving, …), the literal template-ref string lands in
the container's env. Today that string flows straight into
pgbackrest.conf's repo1-s3-bucket, pgBackRest hard-fails every
archive_command, and pgbackrest-archive-push-wrapper.sh's 500 MiB
pg_wal threshold eventually drops segments — turning an upstream
wiring bug into a real, unrecoverable PITR coverage gap.

Validate up front:
  - unresolved template ref (contains ${{ or }})
  - bucket-id UUID shape (8-4-4-4-12 hex)
  - whitespace / control chars

When invalid, log, drop $PGDATA/.pgbackrest_invalid_bucket sentinel,
unset the WAL_ARCHIVE_* vars so every downstream gate treats archiving
as off (and clear_pgbackrest_state_if_disabled wipes any stale config
from a previous valid bucket). Postgres boots clean; the dashboard
surfaces the distinct invalid-bucket state via the sentinel + the
existing monitor.

* feat(pitr): periodic S3 catalog verification in backup watcher

Adds a `catalog_has_backup()` check that runs `pgbackrest info --stanza=main
--repo=1 --output=json` once per hour (WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS,
default 3600). When local state says a full backup was taken but the catalog
shows none, clears last_full_at so NEEDS_INITIAL_BACKUP fires on the next poll.

Catches divergence between watcher state and S3 reality:
- backup command returned exit 0 but catalog metadata was never committed
  (S3 partial write, stanza-create race at promotion time)
- volume survived a redeployment with stale state pointing at a different
  sysid/stanza path on a fresh cluster

Non-zero pgbackrest exit (stanza not yet created, S3 unreachable, auth
failure) is treated as inconclusive — local state is not cleared — so
transient S3 hiccups don't burn extra full backups.

Mirrors the postgres-ha backup_watcher.rs change (same env knob, same logic).

* fix(pitr): fix three test/runtime bugs in catalog-verify + bucket-guard

1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into
   PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh
   skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files),
   so postgres tried to start from uninitialized PGDATA and died.
   Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from
   pgbackrest-init.sh after PGDATA is initialized (fresh-volume path),
   or from validate_wal_archive_bucket when PG_VERSION already exists
   (restart path).

2. catalog_has_backup() used `|| return 1` which conflated pgbackrest
   exit non-zero (S3 unreachable, 403, stanza not yet created) with
   "conclusively no backup." A transient auth failure during the verify
   interval would clear last_full_at and trigger a spurious full backup
   attempt. Fix: rename to catalog_check_backup() with three return
   codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only
   clears state on exit 1.

3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to
   cause archive-push failures. Disabling produces InvalidAccessKeyId,
   which the archive-push wrapper instant-drops (exit 0 since #77/#78)
   to avoid WAL accumulation on deleted buckets — keeping failed_count=0
   and defeating the test. Fix: switch the user to read-only policy so
   PutObject fails with AccessDenied (not in the instant-drop list),
   causing failed_count to grow as the test expects.

* fix: retry stanza-create up to 5 times on transient failure

If the bucket isn't ready immediately after provisioning (timing race,
transient network error), the single stanza-create attempt fails and the
container is permanently stuck — every subsequent pgbackrest backup exits
55 (FileMissingError) and the watcher retries forever without self-healing.

Retry up to 5 times with a 30s gap. 5 × 30s covers the typical bucket-
ready window while keeping total latency under 3 minutes. Permanent
failures (bad credentials, sysid mismatch) still exhaust all attempts and
fall back to the same "retry on next boot" path.

* fix: retry stanza-create continuously until success

* fix: stanza-create recovery in bootstrap (bounded) and watcher (exit-55)

Bootstrap (bootstrap_pgbackrest_stanza): cap at 5 attempts with 30s gaps
so transient bucket-provisioning races don't spin forever on permanent
failures (e.g. node demoted mid-run).

Watcher (run_backup): detect exit 55 (FileMissingError / backup.info
absent), run stanza-create inline, then retry the backup once. The watcher
poll loop handles further retries, so no deeper nesting is needed here.

Together these ensure a stanza-create failure at boot is recovered on the
next watcher cycle rather than requiring a manual SSH trigger.

* fix: stanza-create bootstrap retries indefinitely (standalone has no demotion)

Reverts the 5-attempt cap. Standalone postgres has no failover concept so
there is no demotion scenario — the node is always primary while running.
Infinite retry ensures stanza-create always completes despite transient
bucket-provisioning races or auth blips on first boot.

The watcher exit-55 path is a secondary recovery layer for any edge case
that bootstrap misses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant