fix(pitr): fall back to --type=immediate when target_time predates earliest backup #70
…rliest backup

Customer scenario: the cluster runs, the user commits some rows, and the user enables PITR later. The first base backup is taken AFTER the commits. The user clicks restore before doing any more writes. The frontend pins the target to lastCommittedTxnAt (the last real commit), which predates the earliest backup. The backend doesn't clamp (target == ceiling), no xid is emitted, and the image runs `pgbackrest restore --type=time --target=<lastCommitTime>`.

pgbackrest's --type=time selection rule requires backup_stop ≤ target so WAL can be replayed forward to the target. That rule is correct in general, but here NO backup qualifies (every backup was taken AFTER the user's last commit) → pgbackrest aborts with `[075]: unable to find backup set with stop time less than '<target>'` and the restored container crash-loops.

The data IS in the bucket: the latest backup's contents = state at backup_begin_lsn, which on an idle source already includes everything ≤ the user's last commit. --type=immediate tells pgbackrest to take the latest backup and tells postgres to stop at the consistent point (= backup_end_lsn). Net effect: the customer gets their data, and no recovery_target match is needed.

Wrapper change: before launching pgbackrest restore, probe `pgbackrest info`, extract the earliest backup's stop_time, and if target_time predates it, swap --type=time for --type=immediate. Plain-text output is parsed (no jq/python dep; pgbackrest emits a deterministic `timestamp start/stop: <start> / <stop>` line per backup in chronological order).

Self-contained in the image: no mono picker change, no new env var, no forward-compat dance. The existing _XID path still wins priority (the idle-source target_xid case is already handled separately for the test-harness flow that bypasses the frontend pin).

Adds t_pitr_target_predates_earliest_backup_uses_immediate_fallback, which pins the wrapper diagnostic, the pgbackrest --type=immediate flag, the absence of [075], and end-to-end row presence (validates the "data lives in the snapshot" claim).
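The probe described above could look roughly like the following. This is a minimal sketch, not the PR's actual wrapper.sh: the function names `earliest_backup_stop` and `choose_restore_type`, the sample `pgbackrest info` text, and the placeholder epoch values are all hypothetical. It relies on the commit message's statement that each backup has one `timestamp start/stop: <start> / <stop>` line and that backups are listed in chronological order. It covers only the time comparison and does not check whether the source was idle, which the argument for --type=immediate depends on.

```shell
#!/bin/sh
# Sketch only: names and sample data are illustrative assumptions.

# Read `pgbackrest info` text output on stdin and print the stop time of the
# first backup listed. Exits non-zero when no timestamp line is found.
earliest_backup_stop() {
  awk '
    /timestamp start\/stop:/ {
      sub(/^.*timestamp start\/stop:[ \t]*/, "")  # drop the label
      n = index($0, " / ")                        # start/stop separator
      if (n > 0) { print substr($0, n + 3); found = 1; exit }
    }
    END { exit found ? 0 : 1 }
  '
}

# Pick the restore type from two epoch-second values:
#   $1 = target time, $2 = earliest backup stop time.
choose_restore_type() {
  if [ "$1" -lt "$2" ]; then
    echo immediate   # target predates every backup
  else
    echo time        # at least the earliest backup can serve the target
  fi
}

# Hypothetical sample of the plain-text output, trimmed to the relevant lines.
sample_info='stanza: demo
    status: ok

    db (current)
        full backup: 20240102-030000F
            timestamp start/stop: 2024-01-02 03:00:00+00 / 2024-01-02 03:00:12+00
        incr backup: 20240102-030000F_20240103-030000I
            timestamp start/stop: 2024-01-03 03:00:00+00 / 2024-01-03 03:00:05+00'

stop=$(printf '%s\n' "$sample_info" | earliest_backup_stop) || exit 1
echo "earliest backup stop: $stop"

# Placeholder epoch values standing in for the converted timestamps.
echo "restore type: $(choose_restore_type 1704100000 1704164412)"
```

In a real wrapper the two timestamps would first have to be converted to a comparable form, for example epoch seconds via GNU `date -d "$ts" +%s`. Whether GNU date is available on the image is an assumption.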
289007b to 45f8f27
Closing — the wrapper-only fallback is unsafe. The condition
The wrapper can't tell these apart from Caught by CI: Reopening the picker-side approach (mono #28878 / image #69) with a tighter trigger: emit
Summary
Customer flow this fixes: cluster runs → commits some rows → user enables PITR → first backup taken AFTER the commits → user clicks restore before any more writes. Frontend pins target to lastCommittedTxnAt, which predates the earliest backup. pgbackrest's default `--type=time` selection rule requires backup_stop ≤ target; no backup qualifies → `[075]: unable to find backup set with stop time less than '<target>'` → crash-loop.

That rule is correct in general: pgbackrest needs a backup whose state predates the target so WAL can be replayed forward. But on an idle source, the latest backup's contents already include everything ≤ the user's last commit (no writes since), so the right answer is "use the latest backup, stop at the consistent point", i.e. `--type=immediate`.
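To make the failure concrete, here is a simplified model of the selection rule as this PR describes it (backup_stop ≤ target). It is not pgbackrest's implementation. The function name `select_backup_for_time_target` and the small integer timestamps are hypothetical stand-ins for epoch seconds.

```shell
#!/bin/sh
# Simplified model of the rule "backup_stop <= target"; illustrative only.

# $1 = target time, remaining args = backup stop times (all integers).
# Prints the newest stop time that is <= target; returns 1 if none qualifies.
select_backup_for_time_target() {
  target=$1; shift
  best=""
  for s in "$@"; do
    if [ "$s" -le "$target" ]; then
      if [ -z "$best" ] || [ "$s" -gt "$best" ]; then best=$s; fi
    fi
  done
  [ -n "$best" ] && echo "$best"
}

# Last commit at t=1000, backups finished at t=2000 and t=3000:
# every backup stops after the target, so nothing qualifies.
if chosen=$(select_backup_for_time_target 1000 2000 3000); then
  echo "restore from backup stopped at $chosen"
else
  echo "no backup qualifies for target 1000"
fi
```

With a target of 2500 the same function would pick the backup that stopped at 2000, which is the normal case the rule is designed for.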
Change
wrapper.sh: before pgbackrest restore, probe `pgbackrest info`, extract earliest backup's stop_time. If `target_time < earliest_stop`, swap `--type=time` for `--type=immediate`. Plain-text parse (no jq/python on the image).
Why image-only, no mono change
The customer-reachable failure mode is pgbackrest [075] at the image layer. Fix lives where the failure is. No env var contract, no cross-repo forward-compat dance.
Test plan