fix(pitr): fall back to --type=immediate when target_time predates earliest backup by paulocsanz · Pull Request #70 · railwayapp-templates/postgres-ssl

paulocsanz · 2026-05-10T21:39:13Z

Summary

Customer flow this fixes: cluster runs → commits some rows → user enables PITR → first backup taken AFTER the commits → user clicks restore before any more writes. Frontend pins target to lastCommittedTxnAt, which predates the earliest backup. pgbackrest's default `--type=time` selection rule requires backup_stop ≤ target; no backup qualifies → `[075]: unable to find backup set with stop time less than '<target>'` → crash-loop.

That rule is correct in general — pgbackrest needs a backup whose state predates the target so WAL can be replayed forward. But on an idle source, the latest backup's contents already include everything ≤ the user's last commit (no writes since), so the right answer is "use the latest backup, stop at the consistent point" — `--type=immediate`.

Change

wrapper.sh: before pgbackrest restore, probe `pgbackrest info`, extract earliest backup's stop_time. If `target_time < earliest_stop`, swap `--type=time` for `--type=immediate`. Plain-text parse (no jq/python on the image).
Existing `_XID` priority unchanged — picker's xid path still wins.
Adds `t_pitr_target_predates_earliest_backup_uses_immediate_fallback` covering wrapper diagnostic, `--type=immediate` flag, absence of [075], and end-to-end row presence on the restored cluster.
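
A minimal sketch of the probe-and-swap logic described above, not the PR's actual `wrapper.sh` code. The function names are hypothetical, the parsing assumes the `timestamp start/stop: <start> / <stop>` line format with backups listed oldest first, and it assumes `awk` and GNU `date -d` are available on the image.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not taken from wrapper.sh.

# Print the stop time of the first (earliest) backup found on stdin.
# Assumes one "timestamp start/stop: <start> / <stop>" line per backup,
# listed in chronological order.
earliest_backup_stop() {
  awk -F' / ' '/timestamp start\/stop:/ { print $2; exit }'
}

# Exit 0 when the target time is strictly earlier than the earliest stop time.
# Exit 2 when either timestamp cannot be parsed (assumes GNU date).
target_predates_earliest() {
  local target="$1" earliest="$2" target_epoch earliest_epoch
  target_epoch=$(date -u -d "$target" +%s) || return 2
  earliest_epoch=$(date -u -d "$earliest" +%s) || return 2
  [ "$target_epoch" -lt "$earliest_epoch" ]
}

# Print "immediate" when the target predates the earliest backup, else "time".
# Missing or unparseable backup info falls through to "time", the default.
restore_type() {
  local target="$1" info_text="$2" earliest
  earliest=$(printf '%s\n' "$info_text" | earliest_backup_stop)
  if [ -n "$earliest" ] && target_predates_earliest "$target" "$earliest"; then
    echo immediate
  else
    echo time
  fi
}
```

A caller would pass the target time and the captured text output of `pgbackrest info`, then use the printed value as the `--type` argument. As the closing comment in this thread points out, this condition alone cannot tell an idle source apart from a retention-culled backup window, so the sketch illustrates the proposal rather than a safe fix.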

Why image-only, no mono change

The customer-reachable failure mode is pgbackrest [075] at the image layer. Fix lives where the failure is. No env var contract, no cross-repo forward-compat dance.

Test plan

CI: existing 30+ image-level tests + new one pass on PG 17/18
After merge + deploy, un-SKIP `idleRestore` in test-postgres-pitr/e2e/run-test.ts and confirm prod cron suite goes 8 PASS / 1 SKIP

…rliest backup

Customer scenario: cluster runs, user commits some rows, user enables PITR later. First base backup is taken AFTER the commits. User clicks restore before doing any more writes.

Frontend pins target to lastCommittedTxnAt (last real commit), which predates the earliest backup. Backend doesn't clamp (target == ceiling), no xid emitted, image runs `pgbackrest restore --type=time --target=<lastCommitTime>`.

pgbackrest's --type=time selection rule requires backup_stop ≤ target so WAL can be replayed forward to target. That rule is correct in general, but here NO backup qualifies (every backup was taken AFTER the user's last commit) → pgbackrest aborts with `[075]: unable to find backup set with stop time less than '<target>'` and the restored container crash-loops.

The data IS in the bucket: latest backup's contents = state at backup_begin_lsn, which on an idle source already includes everything ≤ the user's last commit. --type=immediate tells pgbackrest to take the latest backup and tells postgres to stop at the consistent point (= backup_end_lsn). Net effect: customer gets their data, no recovery_target match needed.

Wrapper change: before launching pgbackrest restore, probe `pgbackrest info`, extract the earliest backup's stop_time, and if target_time predates it, swap --type=time for --type=immediate. Plain-text output is parsed (no jq/python dep; pgbackrest emits a deterministic `timestamp start/stop: <start> / <stop>` line per backup in chronological order).

Self-contained in the image: no mono picker change, no new env var, no forward-compat dance. Existing _XID path still wins priority (idle-source target_xid case already handled separately for the test-harness flow that bypasses the frontend pin).

Adds t_pitr_target_predates_earliest_backup_uses_immediate_fallback, which pins the wrapper diagnostic, pgbackrest --type=immediate flag, absence of [075], and end-to-end row presence (validates the "data lives in the snapshot" claim).

paulocsanz · 2026-05-11T16:55:06Z

Closing — the wrapper-only fallback is unsafe. The condition target_time < earliest_backup_stop conflates two scenarios that need opposite resolutions:

| Scenario | target | earliest_backup | Right answer |
| --- | --- | --- | --- |
| Idle-since-PITR-enable | `lastCommittedTxnAt` | a few sec later | `--type=immediate` (data is in the snapshot) |
| Retention culled old backups | some old T | newer T+days | refuse `[075]` (post-target commits exist; restoring silently corrupts user intent) |

The wrapper can't tell these apart from pgbackrest info alone — it'd need to know "are there commits between target and the snapshot?" Only the picker has that signal (it has lastCommittedTxnAt from the source probe).

Caught by CI: t_pitr_target_before_retention_window_refuses started silently succeeding instead of refusing (run/25683479496 — restored container stayed running on --type=immediate instead of exiting non-zero).

Reopening the picker-side approach (mono #28878 / image #69) with a tighter trigger: emit POSTGRES_RECOVERY_TARGET_IMMEDIATE=1 only when target >= lastCommittedTxnAt AND lastCommittedTxnAt <= latestBackupAt. That handles the idle-source flow without ambiguity vs retention.
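
The tighter trigger can be sketched as a single predicate. This is an illustration of the condition only, written in shell to match the wrapper; the function name and the epoch-second arguments are assumptions, and the real picker lives in the mono repo and is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper, all three arguments are epoch seconds.
# Exit 0 when POSTGRES_RECOVERY_TARGET_IMMEDIATE=1 should be emitted:
#   target >= lastCommittedTxnAt AND lastCommittedTxnAt <= latestBackupAt
should_emit_immediate() {
  local target="$1" last_committed="$2" latest_backup="$3"
  [ "$target" -ge "$last_committed" ] && [ "$last_committed" -le "$latest_backup" ]
}

# Idle since PITR enable: target pinned to the last commit, backup taken 5s later.
if should_emit_immediate 1000 1000 1005; then
  echo "POSTGRES_RECOVERY_TARGET_IMMEDIATE=1"
fi
```

With illustrative values for the retention case (an old target of 500 and a last commit at 9000), the first clause fails because commits exist after the target, so nothing is emitted and the `[075]` refusal stays in place.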

paulocsanz force-pushed the pc/pitr-restore-fallback-to-immediate-when-target-pre-backup branch from 289007b to 45f8f27 on May 11, 2026 16:36

paulocsanz closed this May 11, 2026

paulocsanz wants to merge 1 commit into main from pc/pitr-restore-fallback-to-immediate-when-target-pre-backup
