feat(pitr): periodic S3 catalog verification in backup watcher #80
Closed
paulocsanz wants to merge 4 commits into
`cleared gap marker` and `backup --type=full completed` are emitted back-to-back by run_backup() in the same shell function, but docker's stdout flush window can split them across separate `docker logs` snapshots. The old loop broke on seeing the new "completed" count and then re-queried for "cleared gap marker" — racing the flush. Capture `docker logs` once per iteration and require BOTH signals before declaring success. The "marker file is gone" assertion stays after the loop since it reads the filesystem, not stdout.
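A minimal sketch of the "one snapshot per iteration, both signals required" loop described above. The helper name `wait_for_backup_signals`, its arguments, and the defaults are hypothetical and not taken from the test suite; the log-fetching command is passed in so the loop can be exercised without Docker.

```shell
# Hypothetical helper, for illustration only.
#   $1: command that prints the current log snapshot (e.g. a function wrapping `docker logs`)
#   $2: number of "completed" lines seen before the backup was triggered
#   $3: maximum attempts (default 30)   $4: delay between attempts in seconds (default 1)
wait_for_backup_signals() {
  local fetch_logs="$1" baseline="$2" attempts="${3:-30}" delay="${4:-1}"
  local logs completed i=0
  while [ "$i" -lt "$attempts" ]; do
    # Take ONE snapshot; both checks below read the same text, so a flush
    # that lands between two separate log queries cannot split the signals.
    logs=$($fetch_logs 2>&1) || logs=""
    completed=$(printf '%s\n' "$logs" | grep -c 'backup --type=full completed')
    if [ "$completed" -gt "$baseline" ] &&
       printf '%s\n' "$logs" | grep -q 'cleared gap marker'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

If a snapshot shows the new "completed" line but not yet the marker line, the loop simply polls again instead of declaring failure. Filesystem assertions such as "marker file is gone" would still run after the loop returns.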
If Railway's variable resolver can't bind ${{<bucket-id>.BUCKET}} to a
live bucket (the bucket got tombstoned upstream, the env was forked
without re-resolving, …), the literal template-ref string lands in
the container's env. Today that string flows straight into
pgbackrest.conf's repo1-s3-bucket, pgBackRest hard-fails every
archive_command, and pgbackrest-archive-push-wrapper.sh's 500 MiB
pg_wal threshold eventually drops segments — turning an upstream
wiring bug into a real, unrecoverable PITR coverage gap.
Validate up front:
- unresolved template ref (contains ${{ or }})
- bucket-id UUID shape (8-4-4-4-12 hex)
- whitespace / control chars
When invalid, log, drop $PGDATA/.pgbackrest_invalid_bucket sentinel,
unset the WAL_ARCHIVE_* vars so every downstream gate treats archiving
as off (and clear_pgbackrest_state_if_disabled wipes any stale config
from a previous valid bucket). Postgres boots clean; the dashboard
surfaces the distinct invalid-bucket state via the sentinel + the
existing monitor.
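The three checks listed above could be sketched roughly as follows. This is not the PR's `validate_wal_archive_bucket`; the function name `bucket_invalid_reason` and the reason strings are hypothetical. It also assumes that a value shaped like a bucket-id UUID counts as invalid (the id landed where the bucket name was expected), which the commit message implies but does not state outright.

```shell
# Hypothetical sketch: prints a reason and returns 0 when the value is INVALID,
# returns 1 (printing nothing) when none of the three checks trips.
bucket_invalid_reason() {
  case "$1" in
    # Unresolved template ref: the literal ${{ or }} survived variable resolution.
    *'${{'*|*'}}'*) echo "unresolved template ref"; return 0 ;;
  esac
  case "$1" in
    # Whitespace or control characters anywhere in the value.
    *[[:space:][:cntrl:]]*) echo "whitespace or control characters"; return 0 ;;
  esac
  # Assumption: an 8-4-4-4-12 hex string is a bucket id, not a bucket name.
  if printf '%s' "$1" |
     grep -Eq '^[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$'; then
    echo "bucket-id UUID shape"
    return 0
  fi
  return 1
}
```

A caller might capture the reason with `reason=$(bucket_invalid_reason "$bucket")` and, when that succeeds, log it and unset the WAL_ARCHIVE_* variables as the commit message describes.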
Adds a `catalog_has_backup()` check that runs `pgbackrest info --stanza=main --repo=1 --output=json` once per hour (WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS, default 3600). When local state says a full backup was taken but the catalog shows none, clears last_full_at so NEEDS_INITIAL_BACKUP fires on the next poll.
Catches divergence between watcher state and S3 reality:
- backup command returned exit 0 but catalog metadata was never committed (S3 partial write, stanza-create race at promotion time)
- volume survived a redeployment with stale state pointing at a different sysid/stanza path on a fresh cluster
Non-zero pgbackrest exit (stanza not yet created, S3 unreachable, auth failure) is treated as inconclusive: local state is not cleared, so transient S3 hiccups don't burn extra full backups.
Mirrors the postgres-ha backup_watcher.rs change (same env knob, same logic).
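The hourly gate can be illustrated with a small sketch. Only the environment variable name and its 3600-second default come from the commit message; the function name `catalog_verify_due` and the idea of passing timestamps as arguments are assumptions made to keep the example self-contained.

```shell
# Hypothetical sketch: succeeds when the catalog verification is due.
#   $1: current time in epoch seconds
#   $2: epoch seconds of the last verification (0 if never verified)
catalog_verify_due() {
  local now="$1" last="${2:-0}"
  local interval="${WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS:-3600}"
  [ $((now - last)) -ge "$interval" ]
}
```

A watcher loop might call `catalog_verify_due "$(date +%s)" "$last_verify_at"` on each poll and only then run the more expensive `pgbackrest info` query.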
1. validate_wal_archive_bucket wrote the invalid-bucket sentinel into PGDATA before docker-entrypoint.sh ran initdb. docker-entrypoint.sh skips initdb when `ls -A "$PGDATA"` is non-empty (even hidden files), so postgres tried to start from uninitialized PGDATA and died.
   Fix: export PGBACKREST_BUCKET_INVALID_REASON; write sentinel from pgbackrest-init.sh after PGDATA is initialized (fresh-volume path), or from validate_wal_archive_bucket when PG_VERSION already exists (restart path).
2. catalog_has_backup() used `|| return 1` which conflated pgbackrest exit non-zero (S3 unreachable, 403, stanza not yet created) with "conclusively no backup." A transient auth failure during the verify interval would clear last_full_at and trigger a spurious full backup attempt.
   Fix: rename to catalog_check_backup() with three return codes (0=has backup, 1=no backup, 2=inconclusive); decide_action only clears state on exit 1.
3. t_watcher_gap_recovery_failed_count_path disabled the MinIO user to cause archive-push failures. Disabling produces InvalidAccessKeyId, which the archive-push wrapper instant-drops (exit 0 since #77/#78) to avoid WAL accumulation on deleted buckets, keeping failed_count=0 and defeating the test.
   Fix: switch the user to read-only policy so PutObject fails with AccessDenied (not in the instant-drop list), causing failed_count to grow as the test expects.
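The three-way result from fix 2 might look roughly like the sketch below. It is not the PR's implementation: the use of `jq`, the JSON counting expression, the helper `reconcile_last_full_at`, and the state-file argument are all assumptions for illustration. Only the `pgbackrest info` invocation and the 0/1/2 return-code meanings come from the text above.

```shell
# Hypothetical sketch. Return codes follow the commit message:
#   0 = catalog has a backup, 1 = catalog has none, 2 = inconclusive.
catalog_check_backup() {
  local json count
  # Non-zero pgbackrest exit (S3 unreachable, 403, ...) is inconclusive.
  json=$(pgbackrest info --stanza=main --repo=1 --output=json 2>/dev/null) || return 2
  # Assumption: jq is available; count backup entries across returned stanzas.
  count=$(printf '%s' "$json" | jq '[.[].backup[]?] | length' 2>/dev/null) || return 2
  case "$count" in
    ''|*[!0-9]*) return 2 ;;   # unparseable output is also inconclusive
  esac
  [ "$count" -gt 0 ] && return 0
  return 1
}

# Clear local state only on a conclusive "no backup" (exit 1).
#   $1: hypothetical path to the file that stores last_full_at
reconcile_last_full_at() {
  local state_file="$1" rc=0
  catalog_check_backup || rc=$?
  if [ "$rc" -eq 1 ]; then
    : > "$state_file"   # truncate so the next poll sees no recorded full backup
  fi
  return 0
}
```

Keeping "no backup" and "could not tell" on separate exit codes means the caller never has to inspect output to decide whether clearing state is safe.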
Adds a periodic catalog check that runs `pgbackrest info --stanza=main --repo=1 --output=json` every hour (WAL_BACKUP_CATALOG_VERIFY_INTERVAL_SECONDS, default 3600). When local state claims a full backup was taken but the catalog is empty, `last_full_at` is cleared so `NEEDS_INITIAL_BACKUP` fires on the next poll.
- `pgbackrest backup` returned exit 0 but catalog metadata was never committed (S3 partial write, stanza-create race at failover promotion)