CNTRLPLANE-2685, CNTRLPLANE-2707: integrate HCPEtcdBackup lifecycle into OADP backup flow#238
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jparrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds HCPEtcdBackup-based etcd snapshot orchestration: new orchestrator creates/monitors HCPEtcdBackup CRs, maps Velero BSL to storage, copies/remaps credentials, injects snapshot URLs for restores, removes legacy config keys, updates types/scheme, and includes docs and tests. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
pkg/core/validation/backup_test.go (1)
24-27: Cover the new config keys in this table
ValidatePluginConfignow special-casesetcdBackupMethodandhoNamespace, but this suite only exercisesmigration. Adding a happy-path entry for those keys would keep them from regressing back to the unknown-config path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/validation/backup_test.go` around lines 24 - 27, Add table-driven test entries to the existing backup_test.go table that exercise the new special-cased keys so they don't regress: add a happy-path case with config key "etcdBackupMethod" set to a valid value and another with "hoNamespace" set to a valid value, and assert ValidatePluginConfig accepts them (same expectations as the existing "migration" case). Locate the test that calls ValidatePluginConfig in this file (the table-driven cases around the "migration" entry) and append two entries mirroring that structure to cover those keys.docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (2)
39-67: Consider adding language specifier to the ASCII diagram code block.Static analysis flagged this as missing a language specifier. Using
textorplaintextwould satisfy the linter while preserving the diagram formatting.📝 Suggested fix
-``` +```text OADP Plugin (BackupPlugin)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 39 - 67, The ASCII diagram code block is missing a language specifier; update the opening triple-backtick that begins the diagram (the block containing the OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text specifier such as ```text or ```plaintext so the linter stops flagging it and the diagram formatting is preserved.
218-225: Add language specifier to storage path code blocks.The storage layout examples at lines 218-225 are flagged by markdownlint for missing language specifiers. Using
textwould resolve this.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 218 - 225, Update the two fenced code blocks showing the S3 storage layout / example so they include a language specifier (use "text") after the opening backticks; locate the blocks containing the lines that start with "s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db" and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and change the opening "```" to "```text" for both blocks to satisfy markdownlint.pkg/core/backup.go (1)
275-293: Consider caching completion state to avoid redundant polling.
waitForEtcdBackupCompletionmay be called from both HCP and HC cases. The second call will re-poll theHCPEtcdBackupCR even though it already succeeded. While this works correctly (the poll returns immediately), adding acompletionWaited boolflag would eliminate the redundant API call.♻️ Optional: Add completion state tracking
type BackupPlugin struct { // ... etcdOrchestrator *etcdbackup.Orchestrator + etcdBackupDone bool // ... } func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error { - if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() { + if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() || p.etcdBackupDone { return nil } snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx) if err != nil { return fmt.Errorf("HCPEtcdBackup failed: %v", err) } p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL) + p.etcdBackupDone = true // ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 275 - 293, Add a cached completion flag on the BackupPlugin (e.g., a bool field completionWaited) and update waitForEtcdBackupCompletion to return early if that flag is true; when calling etcdOrchestrator.WaitForCompletion and it succeeds, set completionWaited = true and still call CleanupCredentialSecret as before. If waitForEtcdBackupCompletion can be invoked concurrently, protect completionWaited with a mutex or use an atomic to avoid races; reference the BackupPlugin struct, waitForEtcdBackupCompletion, etcdOrchestrator.WaitForCompletion and etdcOrchestrator.CleanupCredentialSecret when locating where to add the flag and guard.pkg/common/types.go (1)
47-51: Remove the unusedEtcdBackupSucceededconstant.While
BackupInProgressReason(line 49) andBackupRejectedReason(line 50) are actively used in the orchestrator logic,EtcdBackupSucceededis defined but never referenced. Removing this unused constant will reduce clutter.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/common/types.go` around lines 47 - 51, Remove the unused EtcdBackupSucceeded constant from the local constants block so only the actively used BackupInProgressReason and BackupRejectedReason remain; delete the line defining EtcdBackupSucceeded and ensure no other code references that symbol (search for "EtcdBackupSucceeded") and remove or update any usages if found, leaving the TODO comment and the two used constants intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/common/utils.go`:
- Around line 174-175: The code currently swallows errors from the Kubernetes
client List call (c.List(ctx, hcList, crclient.InNamespace(ns))) by continuing
on any error, which hides API/RBAC/transient failures; change the behavior so
that if c.List returns a non-nil error you return/propagate that error (wrap
with context mentioning the namespace) instead of continuing, while keeping the
existing logic that continues only when the list is empty; apply the same change
to the other identical c.List occurrence at the referenced location (both places
should return the error rather than falling through to nil,nil).
---
Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 39-67: The ASCII diagram code block is missing a language
specifier; update the opening triple-backtick that begins the diagram (the block
containing the OADP Plugin / HCPEtcdBackup ASCII art) to include a plain text
specifier such as ```text or ```plaintext so the linter stops flagging it and
the diagram formatting is preserved.
- Around line 218-225: Update the two fenced code blocks showing the S3 storage
layout / example so they include a language specifier (use "text") after the
opening backticks; locate the blocks containing the lines that start with
"s3://<bucket>/<bsl-prefix>/backups/<backup-name>/etcd-backup/<timestamp>.db"
and the subsequent "Example:" block in HCPEtcdBackup-implementation.md and
change the opening "```" to "```text" for both blocks to satisfy markdownlint.
In `@pkg/common/types.go`:
- Around line 47-51: Remove the unused EtcdBackupSucceeded constant from the
local constants block so only the actively used BackupInProgressReason and
BackupRejectedReason remain; delete the line defining EtcdBackupSucceeded and
ensure no other code references that symbol (search for "EtcdBackupSucceeded")
and remove or update any usages if found, leaving the TODO comment and the two
used constants intact.
In `@pkg/core/backup.go`:
- Around line 275-293: Add a cached completion flag on the BackupPlugin (e.g., a
bool field completionWaited) and update waitForEtcdBackupCompletion to return
early if that flag is true; when calling etcdOrchestrator.WaitForCompletion and
it succeeds, set completionWaited = true and still call CleanupCredentialSecret
as before. If waitForEtcdBackupCompletion can be invoked concurrently, protect
completionWaited with a mutex or use an atomic to avoid races; reference the
BackupPlugin struct, waitForEtcdBackupCompletion,
etcdOrchestrator.WaitForCompletion and etdcOrchestrator.CleanupCredentialSecret
when locating where to add the flag and guard.
In `@pkg/core/validation/backup_test.go`:
- Around line 24-27: Add table-driven test entries to the existing
backup_test.go table that exercise the new special-cased keys so they don't
regress: add a happy-path case with config key "etcdBackupMethod" set to a valid
value and another with "hoNamespace" set to a valid value, and assert
ValidatePluginConfig accepts them (same expectations as the existing "migration"
case). Locate the test that calls ValidatePluginConfig in this file (the
table-driven cases around the "migration" entry) and append two entries
mirroring that structure to cover those keys.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 58ba492b-4bde-4887-b6d3-61a9323151c4
⛔ Files ignored due to path filters (25)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/api/config/v1/types_apiserver.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_authentication.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_infrastructure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_insights.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (16)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdgo.modpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
|
/hold until Dep PRs got merged |
|
/label tide/merge-method-squash |
9a0a887 to
4052ecd
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/core/backup.go (1)
145-151: Consider guarding against nilp.hcpbefore callingcreateEtcdBackup.If the
p.hcplookup at lines 133-143 returnsnil(e.g., onIsNotFound), the code returns early. However, between the early return and line 147, there's no explicit check thatp.hcpis non-nil. Currently the flow is safe because the early return handles this, but the logic would be clearer and more defensive if thecreateEtcdBackupcall were inside thep.hcp != nilblock or had an explicit guard.This is a minor readability point given the current flow is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 145 - 151, Move or guard the etcd backup creation so it only runs when p.hcp is non-nil: ensure the call to p.createEtcdBackup(ctx, backup) is executed inside the same block that verifies p.hcp (or add an explicit if p.hcp != nil check) before checking p.etcdBackupMethod == common.EtcdBackupMethodEtcdSnapshot; this keeps the logic defensive and makes it clear createEtcdBackup is only invoked when p.hcp is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 109-131: In VerifyInProgress (used by pollCondition) replace the
incorrect local constant check common.BackupInProgressReason with the vendored
hyperv1.BackupAlreadyInProgressReason so the switch will match the controller's
"BackupAlreadyInProgress" reason; also audit the case referencing
common.BackupRejectedReason (which may not exist in the vendored HyperShift
constants) and either remove that case or replace it with the correct vendored
constant if the controller actually emits a rejection reason, ensuring all
switch cases use vendored hyperv1 reason constants.
---
Nitpick comments:
In `@pkg/core/backup.go`:
- Around line 145-151: Move or guard the etcd backup creation so it only runs
when p.hcp is non-nil: ensure the call to p.createEtcdBackup(ctx, backup) is
executed inside the same block that verifies p.hcp (or add an explicit if p.hcp
!= nil check) before checking p.etcdBackupMethod ==
common.EtcdBackupMethodEtcdSnapshot; this keeps the logic defensive and makes it
clear createEtcdBackup is only invoked when p.hcp is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce55502d-e414-402b-b509-59247817656b
📒 Files selected for processing (15)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (7)
- pkg/common/scheme.go
- pkg/core/validation/backup_test.go
- pkg/core/restore.go
- pkg/core/restore_test.go
- pkg/common/utils_test.go
- pkg/common/types.go
- pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/core/validation/backup.go
- pkg/core/types/types.go
- pkg/common/utils.go
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
4052ecd to
3cf9dfe
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)
40-40: Consider adding language identifiers to fenced code blocks.While not impacting functionality, adding language identifiers would resolve linter warnings and improve syntax highlighting:
- Line 40: ASCII diagram (consider
textor no language)- Lines 219, 224: Path structures (consider
textorplaintext)- Line 456: JSON output example (consider
json)Also applies to: 219-219, 224-224, 456-456
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` at line 40, Add explicit language identifiers to the fenced code blocks: mark the ASCII diagram block as ```text (or leave blank) to satisfy the linter, mark the two path-structure blocks as ```text or ```plaintext, and mark the JSON output example as ```json so syntax highlighting and linter warnings are resolved; update the corresponding fenced code block delimiters for the ASCII diagram, the path structure examples, and the JSON output example accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Line 40: Add explicit language identifiers to the fenced code blocks: mark the
ASCII diagram block as ```text (or leave blank) to satisfy the linter, mark the
two path-structure blocks as ```text or ```plaintext, and mark the JSON output
example as ```json so syntax highlighting and linter warnings are resolved;
update the corresponding fenced code block delimiters for the ASCII diagram, the
path structure examples, and the JSON output example accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ab1e9bd0-2870-462b-83b6-33b72b9cd059
📒 Files selected for processing (1)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md
3cf9dfe to
d96df42
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/etcdbackup/orchestrator.go (1)
173-190: Credential Secret cleanup should occur on failure paths too.Based on the integration in
pkg/core/backup.go,CleanupCredentialSecretis only called afterWaitForCompletionsucceeds. If the backup fails or times out, the copied credential Secret remains in the HO namespace. While the Secret contains an STS IAM Role ARN (not rotatable keys), accumulating orphaned Secrets is undesirable.Consider calling cleanup in a
deferor ensuring it's called on all exit paths.♻️ Suggested approach in backup.go
// In waitForEtcdBackupCompletion, defer cleanup regardless of outcome: func (p *BackupPlugin) waitForEtcdBackupCompletion(ctx context.Context) error { if p.etcdOrchestrator == nil || !p.etcdOrchestrator.IsCreated() { return nil } + // Always attempt cleanup, even on failure + defer func() { + if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { + p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr) + } + }() if p.etcdSnapshotURL != "" { return nil } snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx) if err != nil { return fmt.Errorf("HCPEtcdBackup failed: %v", err) } p.etcdSnapshotURL = snapshotURL p.log.Infof("HCPEtcdBackup completed, snapshotURL: %s", snapshotURL) - if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { - p.log.Warnf("Failed to cleanup etcd backup credential Secret: %v", cleanupErr) - } return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 173 - 190, CleanupCredentialSecret is only invoked after WaitForCompletion succeeds, leaving orphaned Secrets on failure/timeouts; update waitForEtcdBackupCompletion (method on BackupPlugin) to always attempt cleanup by deferring a call to etcdOrchestrator.CleanupCredentialSecret(ctx) right after verifying etcdOrchestrator.IsCreated(), so the Secret is removed on all exit paths (success, error, timeout); ensure the deferred call logs any cleanup error (use p.log.Warnf or similar) and continues to return the original error from WaitForCompletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 173-190: CleanupCredentialSecret is only invoked after
WaitForCompletion succeeds, leaving orphaned Secrets on failure/timeouts; update
waitForEtcdBackupCompletion (method on BackupPlugin) to always attempt cleanup
by deferring a call to etcdOrchestrator.CleanupCredentialSecret(ctx) right after
verifying etcdOrchestrator.IsCreated(), so the Secret is removed on all exit
paths (success, error, timeout); ensure the deferred call logs any cleanup error
(use p.log.Warnf or similar) and continues to return the original error from
WaitForCompletion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d116e2dc-774f-4faa-9b51-39b2dd034bae
📒 Files selected for processing (13)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (4)
- pkg/common/utils_test.go
- pkg/core/validation/backup_test.go
- pkg/common/types.go
- pkg/etcdbackup/orchestrator_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/common/scheme.go
- pkg/core/types/types.go
- pkg/common/utils.go
- pkg/core/validation/restore.go
- pkg/core/validation/backup.go
- pkg/core/backup.go
d96df42 to
eb942f0
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
pkg/etcdbackup/orchestrator.go (1)
115-126:⚠️ Potential issue | 🟠 MajorPoll the vendored HyperShift reasons, not local placeholders.
The current switches still look for
common.BackupInProgressReason/common.BackupRejectedReason, while the vendored API exposesBackupAlreadyInProgress,BackupFailed,BackupSucceeded, andEtcdUnhealthy.WaitForCompletion()also dropsEtcdUnhealthyReason, so those states can sit in the poll loop until timeout.Run this to compare the vendored constants with the current polling branches:
#!/bin/bash set -euo pipefail echo "=== vendored HCPEtcdBackup condition reasons ===" sed -n '1,20p' vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.go | nl -ba echo echo "=== current polling branches ===" sed -n '109,165p' pkg/etcdbackup/orchestrator.go | nl -ba echo echo "=== local reason constants referenced here ===" rg -n 'BackupInProgressReason|BackupRejectedReason|BackupAlreadyInProgressReason|EtcdUnhealthyReason' pkg/common/types.go pkg/etcdbackup/orchestrator.goAlso applies to: 155-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 115 - 126, The switch over cond.Reason in the poll loop is using local placeholders (common.BackupInProgressReason, common.BackupRejectedReason) instead of the vendored HyperShift condition reason constants; update the cases to use the vendored names (hyperv1.BackupAlreadyInProgress, hyperv1.BackupSucceeded, hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the poll loop and WaitForCompletion() observe the same semantics, and remove/replace any references to common.BackupInProgressReason/common.BackupRejectedReason (also adjust the similar branch around lines 155-160) so all checks consistently reference hyperv1 constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 117-127: Update the paragraph to say the restore reads the etcd
snapshot URL from the HostedCluster annotation
hypershift.openshift.io/etcd-snapshot-url (not
status.lastSuccessfulEtcdBackupURL); explain that pkg/core/backup.go persists
the URL into that annotation because Velero strips status on restore, and update
the note about unstructured access to clarify we read the annotation until the
HyperShift API vendor adds a formal status field.
In `@pkg/core/backup.go`:
- Around line 146-151: The current logic calls p.createEtcdBackup as soon as any
HCP resolves (guarded only by p.etcdBackupMethod), which can start HCPEtcdBackup
before platform validation and without a guaranteed wait/cleanup path; change
this so HCPEtcdBackup is only created after platform validation and in a branch
that implements completion/cleanup (move or gate the call into the
HostedControlPlane/HostedCluster handling after validation succeeds), or add an
explicit validation check and registration of a wait/cleanup handler before
invoking p.createEtcdBackup; reference p.etcdBackupMethod, p.createEtcdBackup,
HCPEtcdBackup, HostedControlPlane and HostedCluster to locate and update the
code paths accordingly.
- Around line 284-289: When CreateEtcdBackup or VerifyInProgress fails after
copying credentials, ensure you delete the remapped BSL Secret in the host
operator namespace before returning; update the error paths after
p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc) and
p.etcdOrchestrator.VerifyInProgress(ctx) so they call the same cleanup that
waitForEtcdBackupCompletion() uses (or explicitly delete the remapped secret by
name in hoNamespace/p.hcp.Namespace) and then return the original error; make
the deletion idempotent (ignore NotFound) and log any deletion failure but still
return the original create/verify error.
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 83-104: The code generates a random crName causing non-idempotent
creates; instead derive crName deterministically from the Velero backup (e.g.,
use backup.Name or backup.UID to build the CR name) and before calling
o.client.Create check for an existing HCPEtcdBackup by that deterministic name
via o.client.Get (or list by owner/label) so you reuse the existing CR on
re-entry; update crName, use o.client.Get/Create accordingly, and set
o.BackupName = crName when reusing or after successful create (referencing
variables crName, backup.Name, backup.UID, o.client.Get/Create, and
o.BackupName).
---
Duplicate comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 115-126: The switch over cond.Reason in the poll loop is using
local placeholders (common.BackupInProgressReason, common.BackupRejectedReason)
instead of the vendored HyperShift condition reason constants; update the cases
to use the vendored names (hyperv1.BackupAlreadyInProgress,
hyperv1.BackupSucceeded, hyperv1.BackupFailed, hyperv1.EtcdUnhealthy) so the
poll loop and WaitForCompletion() observe the same semantics, and remove/replace
any references to common.BackupInProgressReason/common.BackupRejectedReason
(also adjust the similar branch around lines 155-160) so all checks consistently
reference hyperv1 constants.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6bca3fb-03c5-479b-9731-d1f5ead6e10a
📒 Files selected for processing (13)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (3)
- pkg/core/validation/backup_test.go
- pkg/common/utils_test.go
- pkg/common/types.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/core/types/types.go
- pkg/common/utils.go
- pkg/core/validation/restore.go
- pkg/common/scheme.go
- pkg/core/validation/backup.go
| // If the destination Secret already exists, it is reused. The credential data | ||
| // contains an STS IAM Role ARN (not rotatable keys), so it is safe to reuse. | ||
| func (o *Orchestrator) copyCredentialSecret(ctx context.Context, bsl *velerov1.BackupStorageLocation, fromNS, toNS, backupName string) (string, error) { | ||
| if bsl.Spec.Credential == nil { |
There was a problem hiding this comment.
ARO is installing velero through the CLI and this field does not get set. I can add a patch in the install job but I think a better approach might be to set the secret file in config. It is typically a secret cloud-credentials with key cloud
There was a problem hiding this comment.
There was a problem hiding this comment.
Thanks for flagging this, Tony! This PR targets self-managed HyperShift deployments. ARO and ROSA support (including credential handling differences like BSL without spec.credential) is tracked separately in CNTRLPLANE-3167.
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
eb942f0 to
3042537
Compare
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. This pull request references CNTRLPLANE-2707 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
3042537 to
47d531e
Compare
mgencur
left a comment
There was a problem hiding this comment.
Thanks, in general looks good. Just a few questions.
| // Migration is a flag to indicate if the backup is for migration purposes. | ||
| Migration bool | ||
| // Readopt Nodes is a flag to indicate if the nodes should be reprovisioned or not during restore. | ||
| ReadoptNodes bool |
There was a problem hiding this comment.
Is this related to Etcd backups? I don't think so. Can you provide some info about this change, why is this done and why in this PR?
There was a problem hiding this comment.
No this change is dead code, I was using this for not unpause the nodes after backup. But this is not being used anymore
| status = map[string]interface{}{} | ||
| unstructuredContent["status"] = status | ||
| } | ||
| status["lastSuccessfulEtcdBackupURL"] = p.etcdSnapshotURL |
There was a problem hiding this comment.
Alright. So this etcdSnashotURL in status will be there temporarily for informational purposes, and after Restore it will be gone. Correct?
There was a problem hiding this comment.
Yes, this field is purely informative and only useful during the backup for the plugin.
During the backup this URL is grabbed and set as annotation, this is due to Velero clear all the status once restores an object.
The restore plugin will grab the annotation URL, create a pre-sign URL and set it as hostedcluster.spec.etcd.managed.storage.restoreSnapshotURL. That will make the HO to restore the snapshot during the deployment
| switch p.etcdBackupMethod { | ||
| case common.EtcdBackupMethodEtcdSnapshot: | ||
| // Skip etcd pods entirely, snapshot is handled by HCPEtcdBackup. | ||
| // This prevents both FSBackup and CSI VolumeSnapshots of etcd volumes. |
There was a problem hiding this comment.
Is this etcd Pod supposed to be re-created by control plane operator after restore, instead of using Restore for it? Just for my understanding
There was a problem hiding this comment.
Yes, during the restoration we recreate the HC and HCP, that will make, at some point during the reconciliation, the etcd statefulSet to be recreated. One of the etcd init cotainers will download the new snapshot and perform an etcdutl snapshot restore + etcdutl snapshot status, with a log entry similar to:
Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 45.1M 100 45.1M 0 0 39.7M 0 0:00:01 0:00:01 --:--:-- 39.7M
/bin/sh: line 13: file: command not found
INFO: using etcdutl (etcd 3.6+)
+----------+----------+------------+------------+---------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE | VERSION |
+----------+----------+------------+------------+---------+
| a447b018 | 43299 | 5245 | 47 MB | 3.6.0 |
+----------+----------+------------+------------+---------+
2026-04-09T21:53:35Z info snapshot/v3_snapshot.go:305 restoring snapshot {"path": "/tmp/snapshot", "wal-dir": "/var/lib/data/member/wal", "data-dir": "/var/lib/data", "snap-dir": "/var/lib/data/member/snap", "initial-memory-map-size": 10737418240}
2026-04-09T21:53:35Z info bbolt backend/backend.go:203 Opening db file (/var/lib/data/member/snap/db) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0xc000120138}
2026-04-09T21:53:35Z info bbolt bbolt@v1.4.3/db.go:322 Opening bbolt db (/var/lib/data/member/snap/db) successfully
2026-04-09T21:53:35Z info schema/membership.go:138 Trimming membership information from the backend...
2026-04-09T21:53:35Z info bbolt backend/backend.go:203 Opening db file (/var/lib/data/member/snap/db) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0xc0001201d8}
2026-04-09T21:53:35Z info bbolt bbolt@v1.4.3/db.go:322 Opening bbolt db (/var/lib/data/member/snap/db) successfully
2026-04-09T21:53:35Z info membership/cluster.go:424 added member {"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"], "added-peer-is-learner": false}
2026-04-09T21:53:35Z info bbolt backend/backend.go:203 Opening db file (/var/lib/data/member/snap/db) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0xc000120288}
2026-04-09T21:53:35Z info bbolt bbolt@v1.4.3/db.go:322 Opening bbolt db (/var/lib/data/member/snap/db) successfully
2026-04-09T21:53:35Z info snapshot/v3_snapshot.go:333 restored snapshot {"path": "/tmp/snapshot", "wal-dir": "/var/lib/data/member/wal", "data-dir": "/var/lib/data", "snap-dir": "/var/lib/data/member/snap", "initial-memory-map-size": 10737418240}
| } | ||
|
|
||
| // VerifyInProgress polls the HCPEtcdBackup until the controller acknowledges it. | ||
| func (o *Orchestrator) VerifyInProgress(ctx context.Context) error { |
There was a problem hiding this comment.
Nit: This function is called VerifyInProgress but it returns success also when the backup succeeds.
|
/lgtm |
|
@jparrill thanks for the comments |
Update github.com/openshift/hypershift/api to v0.0.0-20260406110001-bcf6adaf131f. This brings in the HCPEtcdBackup CRD types, HCPEtcdBackupConfig in ManagedEtcdSpec.Backup, and related condition/reason constants needed for CNTRLPLANE-2685 integration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
47d531e to
a7c6a91
Compare
Add etcdSnapshot backup method that creates and monitors HCPEtcdBackup CRs during Velero backup. When etcdBackupMethod=etcdSnapshot is configured in the plugin ConfigMap, the plugin: - Creates an HCPEtcdBackup CR in the HCP namespace using BSL storage config - Copies BSL credentials to the HO namespace (remapping key for controller) - Polls the CR until backup completes or fails - Excludes etcd pods and PVCs from Velero backup (no CSI/FS backup needed) - Stores the etcd snapshot alongside the Velero backup data in the BSL The default method remains volumeSnapshot (unchanged behavior). Also cleans up dead config parameters (readoptNodes, managedServices, awsRegenPrivateLink) and registers apiextensionsv1 in the scheme for CRD existence checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Document the full HCPEtcdBackup integration including architecture, backup/restore flows, configuration, credential handling, storage layout, dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945), and troubleshooting guide. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…alidation Cover Execute paths for all item kinds, orchestrator lifecycle (CreateEtcdBackup, VerifyInProgress, WaitForCompletion), and platform validation for both backup and restore validators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
a7c6a91 to
3711f8f
Compare
|
@jparrill: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold cancel |
|
/verified bypass More info in here: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1775814810815969?thread_ts=1775808691.857879&cid=G01QS0P2F6W |
|
@jparrill: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
|
||
| const ( | ||
| verifyTimeout = 30 * time.Second | ||
| completionTimeout = 10 * time.Minute |
There was a problem hiding this comment.
I do not have a lot of info concerning this. I'm OK to say this is OK, but isn't the 10 minutes a little bit tight to backup a big cluster etcd?
Couldn't we make this configurable, eventually in a follow-up?
|
|
||
| crdExists, err := common.CRDExists(ctx, "hcpetcdbackups.hypershift.openshift.io", p.client) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check for HCPEtcdBackup CRD: %w", err) |
There was a problem hiding this comment.
'nit`: either the error is too vague either the function checks other things and not only existence... I see the idea but I would rename something here.
There was a problem hiding this comment.
More about Exist function in my comment to the rabbit :)
| return fmt.Errorf("failed to get OADP namespace: %w", err) | ||
| } | ||
|
|
||
| p.etcdOrchestrator = etcdbackup.NewOrchestrator(p.log, p.client, p.hoNamespace, oadpNS) |
There was a problem hiding this comment.
Don't undertand this. At line 262 we check etcdOrchestrator AND etcdOrchestrator. IsCreated what if the pointer is not null but the IsCreated() return false. Here we overwrite p.etcdOrchestrator?
I'm lost here. What is checking isCreated? Is it really useful?
| return nil | ||
| } | ||
|
|
||
| snapshotURL, err := p.etcdOrchestrator.WaitForCompletion(ctx) |
|
|
||
| if err := p.etcdOrchestrator.CreateEtcdBackup(ctx, backup, p.hcp.Namespace, hc); err != nil { | ||
| if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { | ||
| p.log.Warnf("Failed to cleanup credential Secret after create error: %v", cleanupErr) |
|
|
||
| if err := p.etcdOrchestrator.VerifyInProgress(ctx); err != nil { | ||
| if cleanupErr := p.etcdOrchestrator.CleanupCredentialSecret(ctx); cleanupErr != nil { | ||
| p.log.Warnf("Failed to cleanup credential Secret after verify error: %v", cleanupErr) |
|
/lgtm |
Summary
Adds
etcdSnapshotas a new etcd backup method for HyperShift HostedClusters. This PR covers only the backup flow; the restore flow (with pre-signed URL generation) will follow in a separate PR.etcdBackupMethodconfiguration key (volumeSnapshotdefault,etcdSnapshotfor native etcd snapshots)HCPEtcdBackupCR that triggers a nativeetcdctl snapshot saveand uploads the snapshot to the Velero BSL object storeetcdSnapshotto prevent CSI VolumeSnapshots{prefix}/backups/{backup-name}/etcd-backup/)Dependencies
LastSuccessfulEtcdBackupURLfield in HostedClusterStatus (CNTRLPLANE-3173)Once both are merged, a vendor update is required to remove local constants (
BackupInProgressReason,BackupRejectedReason) in favor of API-defined ones.Related
Changes
fix(deps)HCPEtcdBackuptypes)feat(backup)docsTest plan
etcdBackupMethodandhoNamespacerecognized as valid config keysgo build ./...passesgo test ./...passesetcdBackupMethod=etcdSnapshot, run OADP backup, verify HCPEtcdBackup CR created and snapshot uploadedvolumeSnapshotmethod (default) behavior is unchanged🤖 Generated with Claude Code