Add preparing_backup restore phase with LSN-derived progress#232
Open
alexole wants to merge 1 commit into
Open
Add preparing_backup restore phase with LSN-derived progress#232alexole wants to merge 1 commit into
alexole wants to merge 1 commit into
Conversation
8b36fae to
60ff004
Compare
60ff004 to
20e0822
Compare
alexole
added a commit
that referenced
this pull request
May 13, 2026
Three independent correctness issues in the prepare-progress plumbing: 1. Stale phase/progress between required-backup iterations. In an incremental restore, restore_basebackup() calls _run_basebackup_restore_operation() once per required backup. The first prepare ends at pct=100 with phase=preparing_backup, then the next iteration starts xbstream extract (minutes on big backups) with those values still in state — /status/restore reports the wrong phase and a stale 100%. Reset phase + pct at the top of every _run_basebackup_restore_operation() call so the invariant is enforced per-iteration instead of once per restore. 2. prepare_progress_callback fired on every LSN-advancing scan line rather than only when the integer pct changed. InnoDB emits scan lines much more often than the derived pct crosses a 1% bucket, so non-coordinator callback consumers would see duplicates and the coordinator still had to take its state lock for every dup (even though update_state() itself is a no-op on unchanged values). Track the last emitted pct on BasebackupRestoreOperation and route all three call sites through a single _emit_prepare_progress() helper that enforces the documented contract. 3. /status/restore exposed the stored basebackup_prepare_progress unconditionally. Failure paths (DiskFullError → Phase.failed, repeated errors → Phase.failed_basebackup) do not clear the stored pct, so the API could surface a stale 42% / 100% long after prepare was gone. Gate the exposed value on the live coordinator phase in web_server — single choke point, robust to any future failure path that forgets to clear the field. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 13, 2026
Four additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. Adding the basebackup_prepare_progress key to DEFAULT state without a migration is not safe across upgrades. StateManager.read_state() does `state.clear(); state.update(json)`, so any default the on-disk file doesn't carry vanishes; update_state() then trips its `assert name in state` check. Snapshot the defaults before handing the dict to the state manager and setdefault() any missing keys after load. This also transparently fixes the same pre-existing hazard for backup_xtrabackup_version added in an earlier commit. 3. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 4. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Also add a regression test for the state-file backfill so an older state JSON without basebackup_prepare_progress loads, backfills, and lets update_state() write the key without crashing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 13, 2026
Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 19, 2026
Three independent correctness issues in the prepare-progress plumbing: 1. Stale phase/progress between required-backup iterations. In an incremental restore, restore_basebackup() calls _run_basebackup_restore_operation() once per required backup. The first prepare ends at pct=100 with phase=preparing_backup, then the next iteration starts xbstream extract (minutes on big backups) with those values still in state — /status/restore reports the wrong phase and a stale 100%. Reset phase + pct at the top of every _run_basebackup_restore_operation() call so the invariant is enforced per-iteration instead of once per restore. 2. prepare_progress_callback fired on every LSN-advancing scan line rather than only when the integer pct changed. InnoDB emits scan lines much more often than the derived pct crosses a 1% bucket, so non-coordinator callback consumers would see duplicates and the coordinator still had to take its state lock for every dup (even though update_state() itself is a no-op on unchanged values). Track the last emitted pct on BasebackupRestoreOperation and route all three call sites through a single _emit_prepare_progress() helper that enforces the documented contract. 3. /status/restore exposed the stored basebackup_prepare_progress unconditionally. Failure paths (DiskFullError → Phase.failed, repeated errors → Phase.failed_basebackup) do not clear the stored pct, so the API could surface a stale 42% / 100% long after prepare was gone. Gate the exposed value on the live coordinator phase in web_server — single choke point, robust to any future failure path that forgets to clear the field. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 19, 2026
Four additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. Adding the basebackup_prepare_progress key to DEFAULT state without a migration is not safe across upgrades. StateManager.read_state() does `state.clear(); state.update(json)`, so any default the on-disk file doesn't carry vanishes; update_state() then trips its `assert name in state` check. Snapshot the defaults before handing the dict to the state manager and setdefault() any missing keys after load. This also transparently fixes the same pre-existing hazard for backup_xtrabackup_version added in an earlier commit. 3. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 4. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Also add a regression test for the state-file backfill so an older state JSON without basebackup_prepare_progress loads, backfills, and lets update_state() write the key without crashing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 19, 2026
Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
e15d7bc to
ac8c044
Compare
alexole
added a commit
that referenced
this pull request
May 20, 2026
Three additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 3. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole
added a commit
that referenced
this pull request
May 20, 2026
Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ac8c044 to
3643d61
Compare
alexole
added a commit
that referenced
this pull request
May 20, 2026
As requested in review: prepare_progress_callback would be better as a private attribute set via the constructor instead of a public field assigned post-construction. It has a single producer (the coordinator) and the post-construction reassignment was a code smell that suggested external rebinding was supported when nothing actually relies on it. Move it to a constructor kwarg, store as self._prepare_progress_callback (private), and drop the post-construction assignment in RestoreCoordinator._run_basebackup_restore_operation. Tests pass the callback in via _make_restore_op's new parameter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9edb17c to
c1f9f63
Compare
On multi-TB MySQL restores `xtrabackup --prepare` can run for hours between the basebackup download finishing and the database becoming available. From the outside the restore looks stuck — the bytes- downloaded gauge is pinned at 100% and there's no other signal. Add a `RestoreCoordinator.Phase.preparing_backup` entered during `restore_basebackup()` once `xbstream` extract is done and `xtrabackup --prepare` (followed by the short `--move-back`) starts running. Surface LSN-derived progress for that phase. The pct is computed from the InnoDB scan-line LSN advancing across the window (first observed scan LSN, `last_lsn` from `xtrabackup_checkpoints`). `last_lsn` is the redo stop LSN; using `to_lsn` would hit 100% before the redo tail is applied. The lower bound is the first observed scan LSN rather than `from_lsn` because `from_lsn` is `0` on full backups, which would pin the bar near the top of the range. `BasebackupRestoreOperation` parses `xtrabackup` output, derives the pct, and fires a callback when the integer pct changes. Scan-line LSNs advance much more often than the truncated pct, so the callback is deduped at the pct level — at most ~101 emissions per prepare run. The coordinator's callback updates `state["basebackup_prepare_progress"]` and sets `Phase.preparing_backup`; the first call carries `pct=None` and is what flips the phase from `restoring_basebackup`. `/status/restore` exposes `basebackup_prepare_progress`, gated on the live phase so failure paths (`Phase.failed`, `Phase.failed_basebackup`) don't leak a stale pct. `Controller.is_safe_to_reload` now also blocks reload during `preparing_backup`; neither it nor `restoring_basebackup` is independently resumable, so a reload would discard hours of work. A `myhoard.basebackup_restore.xtrabackup_prepare_progress` gauge is emitted alongside the callback so dashboards can show in-flight progress without polling `/status/restore`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c1f9f63 to
5ee41ea
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
On multi-TB MySQL restores
xtrabackup --preparecan run for hours between the basebackup download finishing and the database becoming available. From the outside the restore looks stuck — the bytes-downloaded gauge is pinned at 100% and there's no other signal.This PR adds a
RestoreCoordinator.Phase.preparing_backupentered duringrestore_basebackup()oncexbstreamextract is done andxtrabackup --prepare(followed by the short--move-back) starts running. The pct is derived from the InnoDB scan-line LSN advancing across[first observed scan LSN, last_lsn from xtrabackup_checkpoints].last_lsnis the redo stop LSN — usingto_lsnwould hit 100% before the redo tail is applied; the lower bound is the first observed scan LSN rather thanfrom_lsn, which is0on full backups and would pinthe bar near the top of the range.
BasebackupRestoreOperationparsesxtrabackupoutput, derives the pct, and fires a callback when the integer pct changes (deduped at the pct level — at most ~101 emissions per run). The coordinator's callback updatesstate["basebackup_prepare_progress"]and setsPhase.preparing_backup; the first call carriespct=Noneand is what flips the phase fromrestoring_basebackup./status/restoreexposesbasebackup_prepare_progress, gated on the live phase so failure paths (Phase.failed,Phase.failed_basebackup) don't leak a stale pct.Controller.is_safe_to_reloadnow also blocks reload duringpreparing_backup; neither it norrestoring_basebackupis independently resumable, so a reload would discard hours of work. Amyhoard.basebackup_restore.xtrabackup_prepare_progressgauge is emitted alongside the callback so dashboards can show in-flight progress without polling/status/restore.