fix(pitr): fall back to --type=immediate when target_time predates earliest backup #70

Closed
paulocsanz wants to merge 1 commit into main from pc/pitr-restore-fallback-to-immediate-when-target-pre-backup

@paulocsanz
Contributor

Summary

Customer flow this fixes: cluster runs → commits some rows → user enables PITR → first backup taken AFTER the commits → user clicks restore before any more writes. Frontend pins target to lastCommittedTxnAt, which predates the earliest backup. pgbackrest's default `--type=time` selection rule requires backup_stop ≤ target; no backup qualifies → `[075]: unable to find backup set with stop time less than '<target>'` → crash-loop.

That rule is correct in general — pgbackrest needs a backup whose state predates the target so WAL can be replayed forward. But on an idle source, the latest backup's contents already include everything ≤ the user's last commit (no writes since), so the right answer is "use the latest backup, stop at the consistent point" — `--type=immediate`.

Change

  • wrapper.sh: before pgbackrest restore, probe `pgbackrest info`, extract earliest backup's stop_time. If `target_time < earliest_stop`, swap `--type=time` for `--type=immediate`. Plain-text parse (no jq/python on the image).
  • Existing `_XID` priority unchanged — picker's xid path still wins.
  • Adds `t_pitr_target_predates_earliest_backup_uses_immediate_fallback` covering wrapper diagnostic, `--type=immediate` flag, absence of [075], and end-to-end row presence on the restored cluster.
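
The probe described in the bullets above could look roughly like the following. This is a minimal sketch, not the PR's actual wrapper.sh code: the function names `earliest_backup_stop` and `choose_restore_type` are hypothetical, it assumes the first `timestamp start/stop:` line in the text output belongs to the earliest backup, and it assumes GNU `date -d` is available on the image to parse timestamps. As the closing comment in this thread notes, this comparison alone cannot tell the idle-source case apart from the retention case.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not taken from wrapper.sh.

# Read `pgbackrest info` text output on stdin and print the stop time of the
# first backup listed (assumed to be the earliest, given chronological order).
earliest_backup_stop() {
  sed -n 's|^[[:space:]]*timestamp start/stop: .* / \(.*\)$|\1|p' | head -n 1
}

# Print "immediate" if the target time ($1) is before the earliest backup's
# stop time ($2), otherwise "time".
# Assumption: GNU `date -d` is available to convert timestamps to epoch seconds.
choose_restore_type() (
  target_epoch=$(date -u -d "$1" +%s) || exit 1
  stop_epoch=$(date -u -d "$2" +%s) || exit 1
  if [ "$target_epoch" -lt "$stop_epoch" ]; then
    echo immediate
  else
    echo time
  fi
)
```

In a wrapper, the first function would be fed from something like `pgbackrest --stanza=<stanza> info | earliest_backup_stop`, and the result passed to `choose_restore_type` together with the requested target time.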

Why image-only, no mono change

The customer-reachable failure mode is pgbackrest [075] at the image layer. Fix lives where the failure is. No env var contract, no cross-repo forward-compat dance.

Test plan

  • CI: existing 30+ image-level tests + new one pass on PG 17/18
  • After merge + deploy, un-SKIP `idleRestore` in test-postgres-pitr/e2e/run-test.ts and confirm prod cron suite goes 8 PASS / 1 SKIP

…rliest backup

Customer scenario: cluster runs, user commits some rows, user enables PITR
later. First base backup is taken AFTER the commits. User clicks restore
before doing any more writes. Frontend pins target to lastCommittedTxnAt
(last real commit), which predates the earliest backup. Backend doesn't
clamp (target == ceiling), no xid emitted, image runs `pgbackrest restore
--type=time --target=<lastCommitTime>`.

pgbackrest's --type=time selection rule requires backup_stop ≤ target so
WAL can be replayed forward to target. That rule is correct in general,
but here NO backup qualifies (every backup was taken AFTER the user's
last commit) → pgbackrest aborts with `[075]: unable to find backup set
with stop time less than '<target>'` and the restored container
crash-loops.

The data IS in the bucket: latest backup's contents = state at
backup_begin_lsn, which on an idle source already includes everything ≤
the user's last commit. --type=immediate tells pgbackrest to take the
latest backup and tells postgres to stop at the consistent point
(= backup_end_lsn). Net effect: customer gets their data, no
recovery_target match needed.
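
To make the difference between the two restore modes concrete, here is a small sketch of how the type-specific options might be assembled. The helper name `restore_type_args` is hypothetical; only the `--type=time`, `--target=` and `--type=immediate` options come from the description above.

```shell
#!/bin/sh
# Sketch only: `restore_type_args` is a hypothetical helper.

# Print one pgbackrest restore option per line for the chosen recovery type.
# $1 = "time" or "immediate"; $2 = target timestamp (used only for "time").
restore_type_args() {
  case "$1" in
    time)      printf '%s\n' "--type=time" "--target=$2" ;;
    immediate) printf '%s\n' "--type=immediate" ;;
    *)         echo "unknown restore type: $1" >&2; return 1 ;;
  esac
}
```

With `immediate`, no target is passed at all, which is why no recovery_target match is needed.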

Wrapper change: before launching pgbackrest restore, probe `pgbackrest
info`, extract the earliest backup's stop_time, and if target_time
predates it, swap --type=time for --type=immediate. Plain-text output is
parsed (no jq/python dep — pgbackrest emits a deterministic
`timestamp start/stop: <start> / <stop>` line per backup in chronological
order).

Self-contained in the image: no mono picker change, no new env var, no
forward-compat dance. Existing _XID path still wins priority (idle-source
target_xid case already handled separately for the test-harness flow
that bypasses the frontend pin).

Adds t_pitr_target_predates_earliest_backup_uses_immediate_fallback —
pins the wrapper diagnostic, pgbackrest --type=immediate flag, absence
of [075], and end-to-end row presence (validates the "data lives in
the snapshot" claim).
paulocsanz force-pushed the pc/pitr-restore-fallback-to-immediate-when-target-pre-backup branch from 289007b to 45f8f27 on May 11, 2026 16:36
@paulocsanz
Contributor Author

Closing — the wrapper-only fallback is unsafe. The condition target_time < earliest_backup_stop conflates two scenarios that need opposite resolutions:

| Scenario | target | earliest_backup | Right answer |
|---|---|---|---|
| Idle-since-PITR-enable | lastCommittedTxnAt | a few sec later | --type=immediate (data is in the snapshot) |
| Retention culled old backups | some old T | newer T+days | refuse [075] (post-target commits exist; restoring silently corrupts user intent) |

The wrapper can't tell these apart from pgbackrest info alone — it'd need to know "are there commits between target and the snapshot?" Only the picker has that signal (it has lastCommittedTxnAt from the source probe).

Caught by CI: t_pitr_target_before_retention_window_refuses started silently succeeding instead of refusing (run/25683479496 — restored container stayed running on --type=immediate instead of exiting non-zero).

Reopening the picker-side approach (mono #28878 / image #69) with a tighter trigger: emit POSTGRES_RECOVERY_TARGET_IMMEDIATE=1 only when target >= lastCommittedTxnAt AND lastCommittedTxnAt <= latestBackupAt. That handles the idle-source flow without ambiguity vs retention.
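
The tighter trigger can be written as a two-part predicate. The sketch below is illustrative only: the function name `should_emit_immediate` and the use of Unix epoch seconds as inputs are assumptions, and the real picker lives in the mono repo rather than in a shell script.

```shell
#!/bin/sh
# Sketch only: the function name and epoch-second inputs are assumptions.

# Succeed (exit 0) when POSTGRES_RECOVERY_TARGET_IMMEDIATE=1 should be emitted.
# $1 = target, $2 = lastCommittedTxnAt, $3 = latestBackupAt (epoch seconds).
should_emit_immediate() {
  # target >= lastCommittedTxnAt: no commits exist after the requested target.
  # lastCommittedTxnAt <= latestBackupAt: the latest backup already holds them.
  [ "$1" -ge "$2" ] && [ "$2" -le "$3" ]
}
```

In the retention scenario the source has commits after the old target, so the first comparison fails and the restore still refuses with [075]; in the idle-source scenario both comparisons hold.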

paulocsanz closed this on May 11, 2026