Skip to content

Add preparing_backup restore phase with LSN-derived progress#232

Open
alexole wants to merge 1 commit into
masterfrom
alex-add-preparation-phase
Open

Add preparing_backup restore phase with LSN-derived progress#232
alexole wants to merge 1 commit into
masterfrom
alex-add-preparation-phase

Conversation

@alexole
Copy link
Copy Markdown
Contributor

@alexole alexole commented May 11, 2026

On multi-TB MySQL restores xtrabackup --prepare can run for hours between the basebackup download finishing and the database becoming available. From the outside the restore looks stuck — the bytes-downloaded gauge is pinned at 100% and there's no other signal.

This PR adds a RestoreCoordinator.Phase.preparing_backup entered during restore_basebackup() once xbstream extract is done and xtrabackup --prepare (followed by the short --move-back) starts running. The pct is derived from the InnoDB scan-line LSN advancing across
[first observed scan LSN, last_lsn from xtrabackup_checkpoints]. last_lsn is the redo stop LSN — using to_lsn would hit 100% before the redo tail is applied; the lower bound is the first observed scan LSN rather than from_lsn, which is 0 on full backups and would pin
the bar near the top of the range.

BasebackupRestoreOperation parses xtrabackup output, derives the pct, and fires a callback when the integer pct changes (deduped at the pct level — at most ~101 emissions per run). The coordinator's callback updates state["basebackup_prepare_progress"] and sets
Phase.preparing_backup; the first call carries pct=None and is what flips the phase from restoring_basebackup.

/status/restore exposes basebackup_prepare_progress, gated on the live phase so failure paths (Phase.failed, Phase.failed_basebackup) don't leak a stale pct. Controller.is_safe_to_reload now also blocks reload during preparing_backup; neither it nor
restoring_basebackup is independently resumable, so a reload would discard hours of work. A myhoard.basebackup_restore.xtrabackup_prepare_progress gauge is emitted alongside the callback so dashboards can show in-flight progress without polling /status/restore.

@alexole alexole force-pushed the alex-add-preparation-phase branch from 8b36fae to 60ff004 Compare May 13, 2026 10:06
@alexole alexole marked this pull request as ready for review May 13, 2026 12:19
@alexole alexole requested a review from VitaliVinahradski May 13, 2026 12:21
@VitaliVinahradski VitaliVinahradski requested a review from Copilot May 13, 2026 12:22
@alexole alexole force-pushed the alex-add-preparation-phase branch from 60ff004 to 20e0822 Compare May 13, 2026 12:26

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@VitaliVinahradski VitaliVinahradski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@VitaliVinahradski VitaliVinahradski self-requested a review May 13, 2026 12:38
alexole added a commit that referenced this pull request May 13, 2026
Three independent correctness issues in the prepare-progress plumbing:

1. Stale phase/progress between required-backup iterations. In an
   incremental restore, restore_basebackup() calls
   _run_basebackup_restore_operation() once per required backup. The
   first prepare ends at pct=100 with phase=preparing_backup, then the
   next iteration starts xbstream extract (minutes on big backups) with
   those values still in state — /status/restore reports the wrong
   phase and a stale 100%. Reset phase + pct at the top of every
   _run_basebackup_restore_operation() call so the invariant is
   enforced per-iteration instead of once per restore.

2. prepare_progress_callback fired on every LSN-advancing scan line
   rather than only when the integer pct changed. InnoDB emits scan
   lines much more often than the derived pct crosses a 1% bucket, so
   non-coordinator callback consumers would see duplicates and the
   coordinator still had to take its state lock for every dup (even
   though update_state() itself is a no-op on unchanged values). Track
   the last emitted pct on BasebackupRestoreOperation and route all
   three call sites through a single _emit_prepare_progress() helper
   that enforces the documented contract.

3. /status/restore exposed the stored basebackup_prepare_progress
   unconditionally. Failure paths (DiskFullError → Phase.failed,
   repeated errors → Phase.failed_basebackup) do not clear the stored
   pct, so the API could surface a stale 42% / 100% long after prepare
   was gone. Gate the exposed value on the live coordinator phase in
   web_server — single choke point, robust to any future failure path
   that forgets to clear the field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole requested a review from Copilot May 13, 2026 12:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread myhoard/restore_coordinator.py Outdated
Comment thread myhoard/restore_coordinator.py
Comment thread test/test_basebackup_restore_operation.py Outdated
Comment thread myhoard/basebackup_restore_operation.py Outdated
alexole added a commit that referenced this pull request May 13, 2026
Four additional issues flagged after the first fixup commit:

1. Controller.is_safe_to_reload only blocked reload during
   restoring_basebackup. A SIGHUP received while the new
   preparing_backup phase was running (hours on multi-TB restores)
   would tear the controller down and force a full rewind of xbstream
   extract + prepare + move-back. Extend the unsafe set to cover both
   restoring_basebackup and preparing_backup.

2. Adding the basebackup_prepare_progress key to DEFAULT state
   without a migration is not safe across upgrades.
   StateManager.read_state() does `state.clear(); state.update(json)`,
   so any default the on-disk file doesn't carry vanishes; update_state()
   then trips its `assert name in state` check. Snapshot the defaults
   before handing the dict to the state manager and setdefault() any
   missing keys after load. This also transparently fixes the same
   pre-existing hazard for backup_xtrabackup_version added in an
   earlier commit.

3. BasebackupRestoreOperation.prepare_progress_callback docstring
   claimed the callback runs on "the subprocess reader thread" and
   that the coordinator handles synchronisation. Untrue:
   _process_output_loop drains stdout/stderr in-line on the thread
   that called prepare_backup() (the coordinator thread). Update the
   docstring and note the implication that long callbacks stall the
   output parser.

4. _make_restore_op helper in the parser tests documented a
   non-existent short-circuit: the previous "parser only runs when a
   callback is attached" optimization was removed when
   _emit_prepare_progress became the single gate. Update the
   docstring and drop the unused default no-op callback.

Also add a regression test for the state-file backfill so an older
state JSON without basebackup_prepare_progress loads, backfills, and
lets update_state() write the key without crashing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole requested a review from Copilot May 13, 2026 12:59

This comment was marked as outdated.

alexole added a commit that referenced this pull request May 13, 2026
Two doc gaps and one test gap:

1. README /status/restore response schema did not mention the new
   basebackup_prepare_progress field. Add it to the sample JSON and a
   dedicated section explaining its range, null semantics, and
   behaviour across incremental restore iterations.

2. README restore-phase list went straight from restoring_basebackup
   to rebuilding_tables and said restoring_basebackup includes
   prepare. Both are now wrong: restoring_basebackup is download +
   xbstream extract only, and preparing_backup is its own phase.
   Insert preparing_backup in the list and scope restoring_basebackup
   to download/extract.

3. The preparing_backup reload-safety rule was covered only
   incidentally by integration tests. Add a focused
   TestIsSafeToReload class exercising the rule deterministically —
   including both unsafe restore phases and a handful of
   definitely-safe ones — so a future change can't silently drop
   preparing_backup from the unsafe set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole requested a review from Copilot May 13, 2026 13:11

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

@VitaliVinahradski VitaliVinahradski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@VitaliVinahradski VitaliVinahradski marked this pull request as draft May 13, 2026 13:16
alexole added a commit that referenced this pull request May 19, 2026
Three independent correctness issues in the prepare-progress plumbing:

1. Stale phase/progress between required-backup iterations. In an
   incremental restore, restore_basebackup() calls
   _run_basebackup_restore_operation() once per required backup. The
   first prepare ends at pct=100 with phase=preparing_backup, then the
   next iteration starts xbstream extract (minutes on big backups) with
   those values still in state — /status/restore reports the wrong
   phase and a stale 100%. Reset phase + pct at the top of every
   _run_basebackup_restore_operation() call so the invariant is
   enforced per-iteration instead of once per restore.

2. prepare_progress_callback fired on every LSN-advancing scan line
   rather than only when the integer pct changed. InnoDB emits scan
   lines much more often than the derived pct crosses a 1% bucket, so
   non-coordinator callback consumers would see duplicates and the
   coordinator still had to take its state lock for every dup (even
   though update_state() itself is a no-op on unchanged values). Track
   the last emitted pct on BasebackupRestoreOperation and route all
   three call sites through a single _emit_prepare_progress() helper
   that enforces the documented contract.

3. /status/restore exposed the stored basebackup_prepare_progress
   unconditionally. Failure paths (DiskFullError → Phase.failed,
   repeated errors → Phase.failed_basebackup) do not clear the stored
   pct, so the API could surface a stale 42% / 100% long after prepare
   was gone. Gate the exposed value on the live coordinator phase in
   web_server — single choke point, robust to any future failure path
   that forgets to clear the field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole added a commit that referenced this pull request May 19, 2026
Four additional issues flagged after the first fixup commit:

1. Controller.is_safe_to_reload only blocked reload during
   restoring_basebackup. A SIGHUP received while the new
   preparing_backup phase was running (hours on multi-TB restores)
   would tear the controller down and force a full rewind of xbstream
   extract + prepare + move-back. Extend the unsafe set to cover both
   restoring_basebackup and preparing_backup.

2. Adding the basebackup_prepare_progress key to DEFAULT state
   without a migration is not safe across upgrades.
   StateManager.read_state() does `state.clear(); state.update(json)`,
   so any default the on-disk file doesn't carry vanishes; update_state()
   then trips its `assert name in state` check. Snapshot the defaults
   before handing the dict to the state manager and setdefault() any
   missing keys after load. This also transparently fixes the same
   pre-existing hazard for backup_xtrabackup_version added in an
   earlier commit.

3. BasebackupRestoreOperation.prepare_progress_callback docstring
   claimed the callback runs on "the subprocess reader thread" and
   that the coordinator handles synchronisation. Untrue:
   _process_output_loop drains stdout/stderr in-line on the thread
   that called prepare_backup() (the coordinator thread). Update the
   docstring and note the implication that long callbacks stall the
   output parser.

4. _make_restore_op helper in the parser tests documented a
   non-existent short-circuit: the previous "parser only runs when a
   callback is attached" optimization was removed when
   _emit_prepare_progress became the single gate. Update the
   docstring and drop the unused default no-op callback.

Also add a regression test for the state-file backfill so an older
state JSON without basebackup_prepare_progress loads, backfills, and
lets update_state() write the key without crashing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole added a commit that referenced this pull request May 19, 2026
Two doc gaps and one test gap:

1. README /status/restore response schema did not mention the new
   basebackup_prepare_progress field. Add it to the sample JSON and a
   dedicated section explaining its range, null semantics, and
   behaviour across incremental restore iterations.

2. README restore-phase list went straight from restoring_basebackup
   to rebuilding_tables and said restoring_basebackup includes
   prepare. Both are now wrong: restoring_basebackup is download +
   xbstream extract only, and preparing_backup is its own phase.
   Insert preparing_backup in the list and scope restoring_basebackup
   to download/extract.

3. The preparing_backup reload-safety rule was covered only
   incidentally by integration tests. Add a focused
   TestIsSafeToReload class exercising the rule deterministically —
   including both unsafe restore phases and a handful of
   definitely-safe ones — so a future change can't silently drop
   preparing_backup from the unsafe set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole force-pushed the alex-add-preparation-phase branch from e15d7bc to ac8c044 Compare May 19, 2026 12:43
@alexole alexole marked this pull request as ready for review May 20, 2026 12:51
alexole added a commit that referenced this pull request May 20, 2026
Three additional issues flagged after the first fixup commit:

1. Controller.is_safe_to_reload only blocked reload during
   restoring_basebackup. A SIGHUP received while the new
   preparing_backup phase was running (hours on multi-TB restores)
   would tear the controller down and force a full rewind of xbstream
   extract + prepare + move-back. Extend the unsafe set to cover both
   restoring_basebackup and preparing_backup.

2. BasebackupRestoreOperation.prepare_progress_callback docstring
   claimed the callback runs on "the subprocess reader thread" and
   that the coordinator handles synchronisation. Untrue:
   _process_output_loop drains stdout/stderr in-line on the thread
   that called prepare_backup() (the coordinator thread). Update the
   docstring and note the implication that long callbacks stall the
   output parser.

3. _make_restore_op helper in the parser tests documented a
   non-existent short-circuit: the previous "parser only runs when a
   callback is attached" optimization was removed when
   _emit_prepare_progress became the single gate. Update the
   docstring and drop the unused default no-op callback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexole added a commit that referenced this pull request May 20, 2026
Two doc gaps and one test gap:

1. README /status/restore response schema did not mention the new
   basebackup_prepare_progress field. Add it to the sample JSON and a
   dedicated section explaining its range, null semantics, and
   behaviour across incremental restore iterations.

2. README restore-phase list went straight from restoring_basebackup
   to rebuilding_tables and said restoring_basebackup includes
   prepare. Both are now wrong: restoring_basebackup is download +
   xbstream extract only, and preparing_backup is its own phase.
   Insert preparing_backup in the list and scope restoring_basebackup
   to download/extract.

3. The preparing_backup reload-safety rule was covered only
   incidentally by integration tests. Add a focused
   TestIsSafeToReload class exercising the rule deterministically —
   including both unsafe restore phases and a handful of
   definitely-safe ones — so a future change can't silently drop
   preparing_backup from the unsafe set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole force-pushed the alex-add-preparation-phase branch from ac8c044 to 3643d61 Compare May 20, 2026 13:21
Comment thread myhoard/basebackup_restore_operation.py Outdated
Comment thread myhoard/basebackup_restore_operation.py Outdated
alexole added a commit that referenced this pull request May 20, 2026
As requested in review: prepare_progress_callback would be better as a
private attribute set via the constructor instead of a public field
assigned post-construction. It has a single producer (the coordinator)
and the post-construction reassignment was a code smell that suggested
external rebinding was supported when nothing actually relies on it.

Move it to a constructor kwarg, store as self._prepare_progress_callback
(private), and drop the post-construction assignment in
RestoreCoordinator._run_basebackup_restore_operation. Tests pass the
callback in via _make_restore_op's new parameter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread myhoard/basebackup_restore_operation.py Outdated
Comment thread myhoard/basebackup_restore_operation.py Outdated
Comment thread myhoard/basebackup_restore_operation.py Outdated
@alexole alexole force-pushed the alex-add-preparation-phase branch 7 times, most recently from 9edb17c to c1f9f63 Compare May 21, 2026 12:32
On multi-TB MySQL restores `xtrabackup --prepare` can run for hours
between the basebackup download finishing and the database becoming
available. From the outside the restore looks stuck — the bytes-
downloaded gauge is pinned at 100% and there's no other signal.

Add a `RestoreCoordinator.Phase.preparing_backup` entered during
`restore_basebackup()` once `xbstream` extract is done and
`xtrabackup --prepare` (followed by the short `--move-back`) starts
running. Surface LSN-derived progress for that phase.

The pct is computed from the InnoDB scan-line LSN advancing across the
window (first observed scan LSN, `last_lsn` from `xtrabackup_checkpoints`).
`last_lsn` is the redo stop LSN; using `to_lsn` would hit 100% before
the redo tail is applied. The lower bound is the first observed scan
LSN rather than `from_lsn` because `from_lsn` is `0` on full backups,
which would pin the bar near the top of the range.

`BasebackupRestoreOperation` parses `xtrabackup` output, derives the
pct, and fires a callback when the integer pct changes. Scan-line LSNs
advance much more often than the truncated pct, so the callback is
deduped at the pct level — at most ~101 emissions per prepare run.
The coordinator's callback updates `state["basebackup_prepare_progress"]`
and sets `Phase.preparing_backup`; the first call carries `pct=None`
and is what flips the phase from `restoring_basebackup`.

`/status/restore` exposes `basebackup_prepare_progress`, gated on the
live phase so failure paths (`Phase.failed`, `Phase.failed_basebackup`)
don't leak a stale pct. `Controller.is_safe_to_reload` now also blocks
reload during `preparing_backup`; neither it nor `restoring_basebackup`
is independently resumable, so a reload would discard hours of work.

A `myhoard.basebackup_restore.xtrabackup_prepare_progress` gauge is
emitted alongside the callback so dashboards can show in-flight
progress without polling `/status/restore`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexole alexole force-pushed the alex-add-preparation-phase branch from c1f9f63 to 5ee41ea Compare May 21, 2026 12:41
Copy link
Copy Markdown
Contributor

@danielblasina-tech danielblasina-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants