Add preparing_backup restore phase with LSN-derived progress by alexole · Pull Request #232 · Aiven-Open/myhoard

alexole · 2026-05-11T13:18:44Z

On multi-TB MySQL restores xtrabackup --prepare can run for hours between the basebackup download finishing and the database becoming available. From the outside the restore looks stuck — the bytes-downloaded gauge is pinned at 100% and there's no other signal.

This PR adds a RestoreCoordinator.Phase.preparing_backup entered during restore_basebackup() once xbstream extract is done and xtrabackup --prepare (followed by the short --move-back) starts running. The pct is derived from the InnoDB scan-line LSN advancing across
[first observed scan LSN, last_lsn from xtrabackup_checkpoints]. last_lsn is the redo stop LSN — using to_lsn would hit 100% before the redo tail is applied; the lower bound is the first observed scan LSN rather than from_lsn, which is 0 on full backups and would pin
the bar near the top of the range.

BasebackupRestoreOperation parses xtrabackup output, derives the pct, and fires a callback when the integer pct changes (deduped at the pct level — at most ~101 emissions per run). The coordinator's callback updates state["basebackup_prepare_progress"] and sets
Phase.preparing_backup; the first call carries pct=None and is what flips the phase from restoring_basebackup.

/status/restore exposes basebackup_prepare_progress, gated on the live phase so failure paths (Phase.failed, Phase.failed_basebackup) don't leak a stale pct. Controller.is_safe_to_reload now also blocks reload during preparing_backup; neither it nor
restoring_basebackup is independently resumable, so a reload would discard hours of work. A myhoard.basebackup_restore.xtrabackup_prepare_progress gauge is emitted alongside the callback so dashboards can show in-flight progress without polling /status/restore.

VitaliVinahradski

LGTM

Three independent correctness issues in the prepare-progress plumbing: 1. Stale phase/progress between required-backup iterations. In an incremental restore, restore_basebackup() calls _run_basebackup_restore_operation() once per required backup. The first prepare ends at pct=100 with phase=preparing_backup, then the next iteration starts xbstream extract (minutes on big backups) with those values still in state — /status/restore reports the wrong phase and a stale 100%. Reset phase + pct at the top of every _run_basebackup_restore_operation() call so the invariant is enforced per-iteration instead of once per restore. 2. prepare_progress_callback fired on every LSN-advancing scan line rather than only when the integer pct changed. InnoDB emits scan lines much more often than the derived pct crosses a 1% bucket, so non-coordinator callback consumers would see duplicates and the coordinator still had to take its state lock for every dup (even though update_state() itself is a no-op on unchanged values). Track the last emitted pct on BasebackupRestoreOperation and route all three call sites through a single _emit_prepare_progress() helper that enforces the documented contract. 3. /status/restore exposed the stored basebackup_prepare_progress unconditionally. Failure paths (DiskFullError → Phase.failed, repeated errors → Phase.failed_basebackup) do not clear the stored pct, so the API could surface a stale 42% / 100% long after prepare was gone. Gate the exposed value on the live coordinator phase in web_server — single choke point, robust to any future failure path that forgets to clear the field. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Four additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. Adding the basebackup_prepare_progress key to DEFAULT state without a migration is not safe across upgrades. StateManager.read_state() does `state.clear(); state.update(json)`, so any default the on-disk file doesn't carry vanishes; update_state() then trips its `assert name in state` check. Snapshot the defaults before handing the dict to the state manager and setdefault() any missing keys after load. This also transparently fixes the same pre-existing hazard for backup_xtrabackup_version added in an earlier commit. 3. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 4. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Also add a regression test for the state-file backfill so an older state JSON without basebackup_prepare_progress loads, backfills, and lets update_state() write the key without crashing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

VitaliVinahradski

LGTM

Three independent correctness issues in the prepare-progress plumbing: 1. Stale phase/progress between required-backup iterations. In an incremental restore, restore_basebackup() calls _run_basebackup_restore_operation() once per required backup. The first prepare ends at pct=100 with phase=preparing_backup, then the next iteration starts xbstream extract (minutes on big backups) with those values still in state — /status/restore reports the wrong phase and a stale 100%. Reset phase + pct at the top of every _run_basebackup_restore_operation() call so the invariant is enforced per-iteration instead of once per restore. 2. prepare_progress_callback fired on every LSN-advancing scan line rather than only when the integer pct changed. InnoDB emits scan lines much more often than the derived pct crosses a 1% bucket, so non-coordinator callback consumers would see duplicates and the coordinator still had to take its state lock for every dup (even though update_state() itself is a no-op on unchanged values). Track the last emitted pct on BasebackupRestoreOperation and route all three call sites through a single _emit_prepare_progress() helper that enforces the documented contract. 3. /status/restore exposed the stored basebackup_prepare_progress unconditionally. Failure paths (DiskFullError → Phase.failed, repeated errors → Phase.failed_basebackup) do not clear the stored pct, so the API could surface a stale 42% / 100% long after prepare was gone. Gate the exposed value on the live coordinator phase in web_server — single choke point, robust to any future failure path that forgets to clear the field. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Four additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. Adding the basebackup_prepare_progress key to DEFAULT state without a migration is not safe across upgrades. StateManager.read_state() does `state.clear(); state.update(json)`, so any default the on-disk file doesn't carry vanishes; update_state() then trips its `assert name in state` check. Snapshot the defaults before handing the dict to the state manager and setdefault() any missing keys after load. This also transparently fixes the same pre-existing hazard for backup_xtrabackup_version added in an earlier commit. 3. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 4. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Also add a regression test for the state-file backfill so an older state JSON without basebackup_prepare_progress loads, backfills, and lets update_state() write the key without crashing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three additional issues flagged after the first fixup commit: 1. Controller.is_safe_to_reload only blocked reload during restoring_basebackup. A SIGHUP received while the new preparing_backup phase was running (hours on multi-TB restores) would tear the controller down and force a full rewind of xbstream extract + prepare + move-back. Extend the unsafe set to cover both restoring_basebackup and preparing_backup. 2. BasebackupRestoreOperation.prepare_progress_callback docstring claimed the callback runs on "the subprocess reader thread" and that the coordinator handles synchronisation. Untrue: _process_output_loop drains stdout/stderr in-line on the thread that called prepare_backup() (the coordinator thread). Update the docstring and note the implication that long callbacks stall the output parser. 3. _make_restore_op helper in the parser tests documented a non-existent short-circuit: the previous "parser only runs when a callback is attached" optimization was removed when _emit_prepare_progress became the single gate. Update the docstring and drop the unused default no-op callback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two doc gaps and one test gap: 1. README /status/restore response schema did not mention the new basebackup_prepare_progress field. Add it to the sample JSON and a dedicated section explaining its range, null semantics, and behaviour across incremental restore iterations. 2. README restore-phase list went straight from restoring_basebackup to rebuilding_tables and said restoring_basebackup includes prepare. Both are now wrong: restoring_basebackup is download + xbstream extract only, and preparing_backup is its own phase. Insert preparing_backup in the list and scope restoring_basebackup to download/extract. 3. The preparing_backup reload-safety rule was covered only incidentally by integration tests. Add a focused TestIsSafeToReload class exercising the rule deterministically — including both unsafe restore phases and a handful of definitely-safe ones — so a future change can't silently drop preparing_backup from the unsafe set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

As requested in review: prepare_progress_callback would be better as a private attribute set via the constructor instead of a public field assigned post-construction. It has a single producer (the coordinator) and the post-construction reassignment was a code smell that suggested external rebinding was supported when nothing actually relies on it. Move it to a constructor kwarg, store as self._prepare_progress_callback (private), and drop the post-construction assignment in RestoreCoordinator._run_basebackup_restore_operation. Tests pass the callback in via _make_restore_op's new parameter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

On multi-TB MySQL restores `xtrabackup --prepare` can run for hours between the basebackup download finishing and the database becoming available. From the outside the restore looks stuck — the bytes- downloaded gauge is pinned at 100% and there's no other signal. Add a `RestoreCoordinator.Phase.preparing_backup` entered during `restore_basebackup()` once `xbstream` extract is done and `xtrabackup --prepare` (followed by the short `--move-back`) starts running. Surface LSN-derived progress for that phase. The pct is computed from the InnoDB scan-line LSN advancing across the window (first observed scan LSN, `last_lsn` from `xtrabackup_checkpoints`). `last_lsn` is the redo stop LSN; using `to_lsn` would hit 100% before the redo tail is applied. The lower bound is the first observed scan LSN rather than `from_lsn` because `from_lsn` is `0` on full backups, which would pin the bar near the top of the range. `BasebackupRestoreOperation` parses `xtrabackup` output, derives the pct, and fires a callback when the integer pct changes. Scan-line LSNs advance much more often than the truncated pct, so the callback is deduped at the pct level — at most ~101 emissions per prepare run. The coordinator's callback updates `state["basebackup_prepare_progress"]` and sets `Phase.preparing_backup`; the first call carries `pct=None` and is what flips the phase from `restoring_basebackup`. `/status/restore` exposes `basebackup_prepare_progress`, gated on the live phase so failure paths (`Phase.failed`, `Phase.failed_basebackup`) don't leak a stale pct. `Controller.is_safe_to_reload` now also blocks reload during `preparing_backup`; neither it nor `restoring_basebackup` is independently resumable, so a reload would discard hours of work. A `myhoard.basebackup_restore.xtrabackup_prepare_progress` gauge is emitted alongside the callback so dashboards can show in-flight progress without polling `/status/restore`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

danielblasina-tech

LGTM

alexole force-pushed the alex-add-preparation-phase branch from 8b36fae to 60ff004 Compare May 13, 2026 10:06

alexole marked this pull request as ready for review May 13, 2026 12:19

alexole requested a review from VitaliVinahradski May 13, 2026 12:21

VitaliVinahradski requested a review from Copilot May 13, 2026 12:22

Copilot started reviewing on behalf of VitaliVinahradski May 13, 2026 12:23 View session

alexole force-pushed the alex-add-preparation-phase branch from 60ff004 to 20e0822 Compare May 13, 2026 12:26

This comment was marked as resolved.

Sign in to view

VitaliVinahradski previously approved these changes May 13, 2026

View reviewed changes

VitaliVinahradski self-requested a review May 13, 2026 12:38

alexole dismissed VitaliVinahradski’s stale review via 4e467bd May 13, 2026 12:40

alexole requested a review from Copilot May 13, 2026 12:41

Copilot started reviewing on behalf of alexole May 13, 2026 12:42 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread myhoard/restore_coordinator.py Outdated

Comment thread myhoard/restore_coordinator.py

Comment thread test/test_basebackup_restore_operation.py Outdated

Comment thread myhoard/basebackup_restore_operation.py Outdated

alexole requested a review from Copilot May 13, 2026 12:59

Copilot started reviewing on behalf of alexole May 13, 2026 12:59 View session

This comment was marked as outdated.

Sign in to view

alexole requested a review from Copilot May 13, 2026 13:11

Copilot started reviewing on behalf of alexole May 13, 2026 13:12 View session

This comment was marked as outdated.

Sign in to view

VitaliVinahradski previously approved these changes May 13, 2026

View reviewed changes

VitaliVinahradski marked this pull request as draft May 13, 2026 13:16

alexole dismissed VitaliVinahradski’s stale review via ac8c044 May 19, 2026 12:43

alexole force-pushed the alex-add-preparation-phase branch from e15d7bc to ac8c044 Compare May 19, 2026 12:43

alexole marked this pull request as ready for review May 20, 2026 12:51

alexole force-pushed the alex-add-preparation-phase branch from ac8c044 to 3643d61 Compare May 20, 2026 13:21