CNTRLPLANE-2685, CNTRLPLANE-2707: inject pre-signed etcd snapshot URL during OADP restore#239
Conversation
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds HCPEtcdBackup orchestration for etcd snapshot backups/restores, a new S3 SigV4 presign package, credential copy/remap logic, plugin config and scheme/type updates, many unit/integration tests, design docs, and removal of a legacy AWS restore helper. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jparrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/etcdbackup/orchestrator.go`:
- Around line 58-107: The HCPEtcdBackup name is made unstable by the random
suffix (crName) so retries create new CRs; in CreateEtcdBackup remove the
utilrand.String(4) suffix and derive a stable CR name from the Velero backup
(e.g., "oadp-"+backup.Name) so subsequent calls hit the
apierrors.IsAlreadyExists reuse path; update the crName usage in the
HCPEtcdBackup ObjectMeta, logging, and the assignments to
o.BackupName/o.BackupNamespace accordingly to ensure idempotent create behavior.
- Around line 207-240: mapAWSBSLToStorage currently returns empty Bucket/Region
which produces invalid HCPEtcdBackup specs; update the validation so we reject
AWS BSLs that lack required fields. In mapBSLToStorage (and/or change
mapAWSBSLToStorage signature to return (*hyperv1.HCPEtcdBackupStorage, error)),
check that bsl.Spec.ObjectStorage.Bucket and bsl.Spec.Config["region"] are
non-empty for AWS providers and return a descriptive error (e.g. "invalid AWS
BSL: missing bucket" or "missing region") instead of constructing a storage
object; ensure callers handle the error path so we never create/copy credentials
for an invalid HCPEtcdBackup.
- Around line 115-128: The switch handling backup condition reasons must treat
controller terminal states as errors: in both VerifyInProgress and
WaitForCompletion add a case for common.BackupAlreadyInProgressReason that
returns false with a descriptive error (e.g., "backup already in progress:
<cond.Message>") so it doesn't fall through and keep polling; additionally, add
hyperv1.EtcdUnhealthyReason to the WaitForCompletion switch to return false with
an error ("etcd unhealthy: <cond.Message>") instead of continuing to poll.
Ensure you reference the existing switch on cond.Reason in both functions and
use the same logging/error pattern as the other terminal cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b4188f4e-0fd5-4272-a534-316dd9138bd5
⛔ Files ignored due to path filters (25)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/api/config/v1/types_apiserver.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_authentication.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_infrastructure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_insights.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (19)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdgo.modpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.gopkg/s3presign/presign.gopkg/s3presign/presign_test.gotests/integration/s3presign/presign_integration_test.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
b168ebd to
5c53288
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/etcdbackup/orchestrator.go (1)
155-165:⚠️ Potential issue | 🟡 MinorAdd
EtcdUnhealthyReasontoWaitForCompletionterminal failures.
EtcdUnhealthyReasonis handled inVerifyInProgress(line 126-127) but missing fromWaitForCompletion. If etcd becomes unhealthy during the backup,WaitForCompletionwill continue polling until timeout instead of failing fast.Suggested fix
// Terminal failures switch cond.Reason { case hyperv1.BackupFailedReason: return false, fmt.Errorf("HCPEtcdBackup failed: %s", cond.Message) case common.BackupRejectedReason: return false, fmt.Errorf("HCPEtcdBackup rejected: %s", cond.Message) + case hyperv1.EtcdUnhealthyReason: + return false, fmt.Errorf("etcd unhealthy: %s", cond.Message) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/etcdbackup/orchestrator.go` around lines 155 - 165, The WaitForCompletion polling switch on cond.Reason currently handles hyperv1.BackupFailedReason and common.BackupRejectedReason but omits EtcdUnhealthyReason; add a case for the EtcdUnhealthyReason (the same constant used/handled in VerifyInProgress) in the switch inside WaitForCompletion so that when cond.Reason == EtcdUnhealthyReason the function returns false and an error (e.g., fmt.Errorf("HCPEtcdBackup etcd unhealthy: %s", cond.Message)) to fail fast instead of continuing to poll.
🧹 Nitpick comments (2)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)
470-473: Add language specifier to fenced code block.The example output block lacks a language specifier.
Suggested fix
The output should be an array containing a **pre-signed HTTPS URL**, e.g.: -``` +```text ["https://my-bucket.s3.us-east-1.amazonaws.com/path/to/snapshot.db?X-Amz-Algorithm=AWS4-HMAC-SHA256&..."]</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdaround lines
470 - 473, The fenced code block that shows the example output (the array with
the pre-signed HTTPS URL) is missing a language specifier; update that fenced
block to include a language tag (e.g., "text") so the snippet is rendered
correctly — locate the block containing ["https://my-bucket..."] and changetotext.</details> </blockquote></details> <details> <summary>pkg/core/restore.go (1)</summary><blockquote> `231-287`: **Pre-signed URL expiry may be insufficient for slow restores.** The 1-hour expiry (`DefaultPresignExpiry`) should suffice for typical restores, but large etcd snapshots or slow networks could exceed this. The `etcd-init` container will fail if the URL expires mid-download. Consider documenting this constraint or making expiry configurable for edge cases. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@pkg/core/restore.go` around lines 231 - 287, presignS3URL currently always uses s3presign.DefaultPresignExpiry which can expire during slow restores; make the expiry configurable and use that value when calling s3presign.GeneratePresignedGetURL. Update RestorePlugin.presignS3URL to read a configurable expiry (e.g., from the Backup object annotation or a RestorePlugin field), parse it as a time.Duration with a sensible fallback to s3presign.DefaultPresignExpiry, and pass that duration into the PresignOptions. Adjust any related types (RestorePlugin config or annotation key) and ensure GeneratePresignedGetURL is called with PresignOptions.Expiry set to the parsed value (not always DefaultPresignExpiry). ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In@pkg/common/utils.go:
- Around line 31-34: SetK8sSAFilePath currently mutates the package-global
k8sSAFilePath without synchronization which races with reads in
GetCurrentNamespace; protect access by introducing a package-level sync.RWMutex
(e.g., k8sSAFileMu) and use k8sSAFileMu.Lock() / Unlock() in SetK8sSAFilePath
and k8sSAFileMu.RLock() / RUnlock() around reads of k8sSAFilePath inside
GetCurrentNamespace (and any other readers) so concurrent tests/usage are safe.In
@pkg/s3presign/presign.go:
- Around line 249-257: The uriEncode function currently uses url.PathEscape
which leaves AWS SigV4 reserved characters unencoded; replace its usage in
uriEncode with a custom percent-encoder that encodes every byte except the SigV4
unreserved characters (A-Z, a-z, 0-9, '-', '_', '.', '~'), preserving segment
boundaries by continuing to split on '/' (as in segments := strings.Split(path,
"/")). For each segment implement byte-wise encoding: if a rune/byte is in the
unreserved set, append as-is; otherwise append percent-encoded (%HH) uppercase
hex for each UTF-8 byte; then join segments with "/" and return the result so
signatures match AWS SigV4 rules.
Duplicate comments:
In@pkg/etcdbackup/orchestrator.go:
- Around line 155-165: The WaitForCompletion polling switch on cond.Reason
currently handles hyperv1.BackupFailedReason and common.BackupRejectedReason but
omits EtcdUnhealthyReason; add a case for the EtcdUnhealthyReason (the same
constant used/handled in VerifyInProgress) in the switch inside
WaitForCompletion so that when cond.Reason == EtcdUnhealthyReason the function
returns false and an error (e.g., fmt.Errorf("HCPEtcdBackup etcd unhealthy: %s",
cond.Message)) to fail fast instead of continuing to poll.
Nitpick comments:
In@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md:
- Around line 470-473: The fenced code block that shows the example output (the
array with the pre-signed HTTPS URL) is missing a language specifier; update
that fenced block to include a language tag (e.g., "text") so the snippet is
rendered correctly — locate the block containing ["https://my-bucket..."] and
changetotext.In
@pkg/core/restore.go:
- Around line 231-287: presignS3URL currently always uses
s3presign.DefaultPresignExpiry which can expire during slow restores; make the
expiry configurable and use that value when calling
s3presign.GeneratePresignedGetURL. Update RestorePlugin.presignS3URL to read a
configurable expiry (e.g., from the Backup object annotation or a RestorePlugin
field), parse it as a time.Duration with a sensible fallback to
s3presign.DefaultPresignExpiry, and pass that duration into the PresignOptions.
Adjust any related types (RestorePlugin config or annotation key) and ensure
GeneratePresignedGetURL is called with PresignOptions.Expiry set to the parsed
value (not always DefaultPresignExpiry).</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Pro **Run ID**: `ba6b6f6f-7044-4f04-9b63-1049542a4b99` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between b168ebd83adc4ca29d331d83a64fc8b27881a744 and 5c532880c1b8359ba870a7da9cf684927879be72. </details> <details> <summary>📒 Files selected for processing (18)</summary> * `docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` * `pkg/common/scheme.go` * `pkg/common/types.go` * `pkg/common/utils.go` * `pkg/common/utils_test.go` * `pkg/core/backup.go` * `pkg/core/restore.go` * `pkg/core/restore_test.go` * `pkg/core/types/types.go` * `pkg/core/validation/backup.go` * `pkg/core/validation/backup_test.go` * `pkg/core/validation/restore.go` * `pkg/etcdbackup/orchestrator.go` * `pkg/etcdbackup/orchestrator_test.go` * `pkg/platform/aws/aws.go` * `pkg/s3presign/presign.go` * `pkg/s3presign/presign_test.go` * `tests/integration/s3presign/presign_integration_test.go` </details> <details> <summary>💤 Files with no reviewable changes (1)</summary> * pkg/platform/aws/aws.go </details> <details> <summary>✅ Files skipped from review due to trivial changes (3)</summary> * pkg/common/scheme.go * pkg/core/restore_test.go * pkg/common/utils_test.go </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (5)</summary> * pkg/core/validation/backup_test.go * pkg/core/validation/restore.go * pkg/core/types/types.go * pkg/etcdbackup/orchestrator_test.go * tests/integration/s3presign/presign_integration_test.go </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
5c53288 to
a970ce6
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 423-428: The prose instructs operators to verify
HostedCluster.status.lastSuccessfulEtcdBackupURL but the example command uses
the annotation path
metadata.annotations.hypershift.openshift.io/etcd-snapshot-url; update the
example so the jsonpath targets the status field instead (or change the prose to
match the annotation), e.g., reference
HostedCluster.status.lastSuccessfulEtcdBackupURL in the text and adjust the oc
get jsonpath to query that status field rather than the
metadata.annotations.hypershift.openshift.io/etcd-snapshot-url key so both prose
and command are consistent.
In `@pkg/core/restore.go`:
- Around line 246-249: The code blindly uses backup.Spec.StorageLocation when
fetching the BackupStorageLocation (see bslName, p.client.Get and
types.NamespacedName{Name: bslName, Namespace: oadpNS}, bsl) which fails if the
field is empty; change the logic to detect an empty backup.Spec.StorageLocation
and then lookup the BSL by the Velero label "velero.io/storage-location" in the
oadp namespace (e.g., list BackupStorageLocation objects filtered by that label
and pick the one matching the backup’s label), then fall back to using its name
for the Get; apply the same fallback in the CreateEtcdBackup path that calls
fetchBSL so both codepaths handle Velero’s default storage location correctly.
In `@pkg/s3presign/presign.go`:
- Around line 179-213: The presigned URL is constructed with the raw urlPath
while the canonicalRequest uses uriEncode(urlPath), causing mismatch for keys
with reserved chars; update the presigned URL construction (where presignedURL
is built) to use the same encoded path used in canonicalRequest (e.g., call
uriEncode(urlPath) or store it as encodedPath) so the request target matches
what was signed (ensure sortedQueryString(params) remains unchanged and params
includes X-Amz-Signature).
In `@tests/integration/s3presign/presign_integration_test.go`:
- Around line 103-115: The createBucket helper currently calls aws s3 mb with
--create-bucket-configuration which fails for non-us-east-1 regions; update
createBucket (function name) to call e.awsCmd(t, ...) with "s3api",
"create-bucket", "--bucket", e.bucket and, for non-us-east-1, include
"--create-bucket-configuration", fmt.Sprintf("LocationConstraint=%s", e.region)
(keep the original us-east-1 branch using "s3", "mb" for compatibility), and
preserve the existing error handling around the e.awsCmd call so failures still
call t.Fatalf.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce04944a-3fc4-4a81-87ef-e046a664a66f
📒 Files selected for processing (7)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdpkg/common/utils.gopkg/core/restore.gopkg/core/restore_test.gopkg/s3presign/presign.gopkg/s3presign/presign_test.gotests/integration/s3presign/presign_integration_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/core/restore_test.go
|
@jparrill: This pull request references CNTRLPLANE-2685 which is a valid jira issue. This pull request references CNTRLPLANE-2707 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
mgencur
left a comment
There was a problem hiding this comment.
Reviewed the last commit. Posted a few questions.
| ``` | ||
|
|
||
| > **Important**: The `restoreSnapshotURL` field only takes effect during HostedCluster bootstrap (initial etcd creation). Patching it on an already running cluster will not trigger an etcd restore. To test a full restore, the HostedCluster must be deleted and recreated via the OADP restore flow. | ||
| > **Important**: The `restoreSnapshotURL` field only takes effect during HostedCluster bootstrap (initial etcd creation). The `etcd-init` container skips the restore if `/var/lib/data` is not empty. To test a full restore, the HostedCluster must be deleted and recreated via the OADP restore flow. |
There was a problem hiding this comment.
Hmm. The control-plane might be lost and the actual hosted cluster still be running (AFAIK). Then restoring the control plane will ignore the restoreSnapshotURL? What happens in that case? We currently don't delete the data plane during backup/restore tests.
There was a problem hiding this comment.
Correct — if the etcd PV already has data, the etcd-init container skips the restore entirely (line 3 of etcd-init.sh):
[ "$(ls -A /var/lib/data)" ] && echo "/var/lib/data not empty, not restoring snapshot" && exit 0This is by design per the RestoreSnapshotURL API godoc: "This snapshot will be restored on initial startup, only when the etcd PV is empty."
In the OADP restore flow, the PVC is deleted and recreated fresh, so the PV will be empty and the restore will proceed. If the data plane is still running and the control plane is being restored, the etcd PV is new (empty) so the snapshot restore will execute correctly.
| return nil, fmt.Errorf("error converting item to HostedCluster: %v", err) | ||
| } | ||
| if hc.Spec.Etcd.Managed != nil { | ||
| hc.Spec.Etcd.Managed.Storage.RestoreSnapshotURL = []string{snapshotURL} |
There was a problem hiding this comment.
Is it a concern that anybody who can access this RestoreSnapshotURL can download the snapshot without furthe credentials? probably not
There was a problem hiding this comment.
The pre-signed URL is time-limited (1 hour TTL by default) and is only stored in spec.etcd.managed.storage.restoreSnapshotURL on the HostedCluster object, which requires RBAC to read. Once the etcd-init container uses it, the URL expires and becomes useless. Additionally, the etcd snapshot itself is already stored in the S3 bucket which the BSL credentials have access to, so the pre-signed URL doesn't grant any additional access beyond what those credentials already provide. We're also working on CEL immutability for this field (hypershift#8186) to prevent day-2 tampering.
a970ce6 to
9fe273c
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/core/backup.go (1)
134-144:⚠️ Potential issue | 🟠 MajorReset cached state when the backup changes and ensure credential cleanup on wait failure.
These short-circuits cache
p.hcp,p.etcdOrchestrator, andp.etcdSnapshotURLon the plugin instance without checking whichBackupis being processed. Since Velero reuses one item-action instance across backups, a later backup can skip HCPEtcdBackup creation and reuse the previous snapshot URL/HCP. Additionally, ifWaitForCompletion()fails, the credential secret is never cleaned up (cleanup only happens after successful wait on line 323–327).Suggested fix
type BackupPlugin struct { log logrus.FieldLogger ctx context.Context @@ etcdOrchestrator *etcdbackup.Orchestrator hoNamespace string etcdBackupMethod string etcdSnapshotURL string // populated after HCPEtcdBackup completes + currentBackupKey types.NamespacedName } @@ func (p *BackupPlugin) Execute(item runtime.Unstructured, backup *velerov1.Backup) (runtime.Unstructured, []velero.ResourceIdentifier, error) { p.log.Debug("Entering Hypershift backup plugin") ctx := context.Context(p.ctx) + + backupKey := types.NamespacedName{Name: backup.Name, Namespace: backup.Namespace} + if p.currentBackupKey != backupKey { + p.currentBackupKey = backupKey + p.hcp = nil + p.etcdOrchestrator = nil + p.etcdSnapshotURL = "" + } if returnEarly, err := common.ShouldEndPluginExecution(ctx, backup, p.client, p.log); returnEarly {For cleanup on failure, call
CleanupCredentialSecret()before returning the error fromWaitForCompletion().Also applies to: 262–264, 314–316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/backup.go` around lines 134 - 144, Reset the cached plugin state when processing a different Backup and ensure credentials are cleaned up on wait failure: clear p.hcp, p.etcdOrchestrator, and p.etcdSnapshotURL at the start of handling a new backup (e.g., when entering the code that sets p.hcp from common.GetHCP or when backup.Spec changes) so cached values from previous backups aren't reused; and if WaitForCompletion() returns an error, call CleanupCredentialSecret() before returning the error so the credential secret is always removed on failure. Reference p.hcp, p.etcdOrchestrator, p.etcdSnapshotURL for the caches and WaitForCompletion()/CleanupCredentialSecret() for the cleanup path.
♻️ Duplicate comments (2)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md (1)
423-428:⚠️ Potential issue | 🟡 MinorMake the command match the field being described.
The text says to verify
status.lastSuccessfulEtcdBackupURL, but the jsonpath reads the annotation instead. That sends operators to the wrong field.Suggested fix
oc get hostedcluster <name> -n <hosted-cluster-namespace> \ - -o jsonpath='{.metadata.annotations.hypershift\.openshift\.io/etcd-snapshot-url}' + -o jsonpath='{.status.lastSuccessfulEtcdBackupURL}'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md` around lines 423 - 428, The verification command references the annotation but the text expects the status field; change the jsonpath to read the HostedCluster status.lastSuccessfulEtcdBackupURL (replace the current jsonpath '{.metadata.annotations.hypershift\.openshift\.io/etcd-snapshot-url}' with one that targets '{.status.lastSuccessfulEtcdBackupURL}') so the command actually checks the lastSuccessfulEtcdBackupURL status field mentioned in the text.pkg/core/restore.go (1)
246-249:⚠️ Potential issue | 🟠 MajorHandle backups that rely on Velero’s default BSL.
backup.Spec.StorageLocationis optional in Velero. When it is empty, this does aGetwith an empty BSL name and valid restores using the default storage location fail before presigning. Fall back to the recordedvelero.io/storage-locationlabel, or another default-BSL resolution path, before the lookup.Suggested fix
bslName := backup.Spec.StorageLocation + if bslName == "" && backup.Labels != nil { + bslName = backup.Labels["velero.io/storage-location"] + } + if bslName == "" { + return "", fmt.Errorf("backup %q does not record a BackupStorageLocation", backup.Name) + } if err := p.client.Get(ctx, types.NamespacedName{Name: bslName, Namespace: oadpNS}, bsl); err != nil { return "", fmt.Errorf("error getting BackupStorageLocation %q: %w", bslName, err) }Run this to verify the concern against the vendored Velero API and the current lookup path:
#!/bin/bash set -euo pipefail echo "Velero BackupSpec.StorageLocation definition:" fd -p 'backup_types.go' vendor/github.com/vmware-tanzu/velero | xargs -r rg -n -C2 'StorageLocation' echo echo "Current restore lookup:" sed -n '239,249p' pkg/core/restore.go echo echo "Repository references to the storage-location label:" rg -n -C2 'velero\.io/storage-location' .🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/restore.go` around lines 246 - 249, When resolving the BackupStorageLocation in pkg/core/restore.go, guard against an empty backup.Spec.StorageLocation (bslName) by first checking backup.ObjectMeta.Labels["velero.io/storage-location"] and using that as bslName if StorageLocation is empty before calling p.client.Get; update the lookup in the restore flow (the code that sets bslName and calls p.client.Get) to use the label fallback and only call p.client.Get with a non-empty name, and if still empty consider the existing default-BSL resolution path or return a clear error mentioning both StorageLocation and velero.io/storage-location.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 470-473: The fenced code block showing the example output is
missing a language tag (triggering markdownlint MD040); update that fenced block
in HCPEtcdBackup-implementation.md to include the JSON language tag (```json) so
the snippet is treated as JSON and properly linted—ensure the block starts with
```json and keep the array example content unchanged.
In `@pkg/core/backup.go`:
- Around line 318-327: The WaitForCompletion error path currently returns before
removing the copied BSL credentials; call
p.etcdOrchestrator.CleanupCredentialSecret(ctx) when WaitForCompletion returns
an error (and log any cleanup warning) before returning the formatted error.
Concretely, in the error branch after p.etcdOrchestrator.WaitForCompletion(ctx)
where you have "if err != nil { ... }", invoke CleanupCredentialSecret(ctx), log
any cleanupErr with p.log.Warnf("Failed to cleanup etcd backup credential
Secret: %v", cleanupErr), then return the original fmt.Errorf("HCPEtcdBackup
failed: %v", err); keep the existing successful-path code that sets
p.etcdSnapshotURL and logs the success unchanged.
In `@pkg/core/restore.go`:
- Around line 196-208: The code currently round-trips the resource through
runtime.DefaultUnstructuredConverter.FromUnstructured/ToUnstructured into a
HostedCluster (hc) which can drop unknown fields; instead mutate the nested
unstructured map directly: get the map via input.Item.UnstructuredContent(),
traverse/create the nested maps for "spec" -> "etcd" -> "managed" -> "storage",
and set "restoreSnapshotURL" to a string slice containing snapshotURL (i.e.
[]string{snapshotURL}) in that storage map, then call
input.Item.SetUnstructuredContent if you modified the root map and keep the
existing log statement; remove the FromUnstructured/ToUnstructured/HostedCluster
conversions to preserve unknown fields.
---
Outside diff comments:
In `@pkg/core/backup.go`:
- Around line 134-144: Reset the cached plugin state when processing a different
Backup and ensure credentials are cleaned up on wait failure: clear p.hcp,
p.etcdOrchestrator, and p.etcdSnapshotURL at the start of handling a new backup
(e.g., when entering the code that sets p.hcp from common.GetHCP or when
backup.Spec changes) so cached values from previous backups aren't reused; and
if WaitForCompletion() returns an error, call CleanupCredentialSecret() before
returning the error so the credential secret is always removed on failure.
Reference p.hcp, p.etcdOrchestrator, p.etcdSnapshotURL for the caches and
WaitForCompletion()/CleanupCredentialSecret() for the cleanup path.
---
Duplicate comments:
In `@docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.md`:
- Around line 423-428: The verification command references the annotation but
the text expects the status field; change the jsonpath to read the HostedCluster
status.lastSuccessfulEtcdBackupURL (replace the current jsonpath
'{.metadata.annotations.hypershift\.openshift\.io/etcd-snapshot-url}' with one
that targets '{.status.lastSuccessfulEtcdBackupURL}') so the command actually
checks the lastSuccessfulEtcdBackupURL status field mentioned in the text.
In `@pkg/core/restore.go`:
- Around line 246-249: When resolving the BackupStorageLocation in
pkg/core/restore.go, guard against an empty backup.Spec.StorageLocation
(bslName) by first checking
backup.ObjectMeta.Labels["velero.io/storage-location"] and using that as bslName
if StorageLocation is empty before calling p.client.Get; update the lookup in
the restore flow (the code that sets bslName and calls p.client.Get) to use the
label fallback and only call p.client.Get with a non-empty name, and if still
empty consider the existing default-BSL resolution path or return a clear error
mentioning both StorageLocation and velero.io/storage-location.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2d193e69-6c15-4931-a354-6fbdaaa53ac8
⛔ Files ignored due to path filters (25)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/api/config/v1/types_apiserver.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_authentication.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_infrastructure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_insights.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/types_tlssecurityprofile.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/config/v1/zz_generated.swagger_doc_generated.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/types_network.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/api/operator/v1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azureprivatelinkservice_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/controlplaneversion_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/etcdbackup_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcp.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/gcpprivateserviceconnect_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hosted_controlplane.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/operator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yamlis excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (21)
docs/references/HCPEtcdBackup/HCPEtcdBackup-implementation.mdgo.modpkg/common/scheme.gopkg/common/types.gopkg/common/utils.gopkg/common/utils_test.gopkg/core/backup.gopkg/core/backup_test.gopkg/core/restore.gopkg/core/restore_test.gopkg/core/types/types.gopkg/core/validation/backup.gopkg/core/validation/backup_test.gopkg/core/validation/restore.gopkg/core/validation/restore_test.gopkg/etcdbackup/orchestrator.gopkg/etcdbackup/orchestrator_test.gopkg/platform/aws/aws.gopkg/s3presign/presign.gopkg/s3presign/presign_test.gotests/integration/s3presign/presign_integration_test.go
💤 Files with no reviewable changes (1)
- pkg/platform/aws/aws.go
✅ Files skipped from review due to trivial changes (5)
- pkg/common/scheme.go
- pkg/common/utils_test.go
- pkg/core/restore_test.go
- tests/integration/s3presign/presign_integration_test.go
- pkg/etcdbackup/orchestrator.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/core/validation/backup.go
- pkg/core/validation/restore.go
- go.mod
- pkg/core/types/types.go
- pkg/common/types.go
- pkg/s3presign/presign.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/core/restore_test.go (2)
55-62: Global SA path mutation may create future test-order couplingBoth tests mutate
common.SetK8sSAFilePath(...)(package-global state). It’s restored correctly, but this still becomes fragile if any tests in package start usingt.Parallel(). Consider centralizing this in a helper guarded by a package test mutex, or injecting the path into the code under test to avoid global mutation.As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 278-285
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/restore_test.go` around lines 55 - 62, The tests mutate package-global state via common.SetK8sSAFilePath (using common.DefaultK8sSAFilePath) which can cause test-order coupling when tests run in parallel; refactor by either (A) creating a test helper (e.g., setK8sSAPathForTest) that takes t *testing.T, uses a package-level sync.Mutex to serialize modifications, sets the path to t.TempDir()+"/namespace", and registers t.Cleanup to restore the original path, then replace the direct calls in restore_test.go (and at 278-285) with this helper; or (B) better yet, change the code under test to accept the SA file path as an injected parameter (constructor or function arg) and update tests to pass the temp path instead of calling common.SetK8sSAFilePath.
286-294: Avoid panic-prone type assertions in unstructured extractionThe chained
.(map[string]any)/.([]any)assertions can panic and obscure test intent when object shape changes. Preferunstructured.Nested*helpers so failures are explicit assertions instead of runtime panics.As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."Refactor sketch
- extractRestoreSnapshotURL := func(output *veleroapiv1.RestoreItemActionExecuteOutput) ([]any, bool) { - spec := output.UpdatedItem.UnstructuredContent()["spec"].(map[string]any) - etcd := spec["etcd"].(map[string]any) - managed := etcd["managed"].(map[string]any) - storage := managed["storage"].(map[string]any) - urls, ok := storage["restoreSnapshotURL"].([]any) - return urls, ok - } + extractRestoreSnapshotURL := func(t *testing.T, output *veleroapiv1.RestoreItemActionExecuteOutput) ([]any, bool) { + t.Helper() + urls, found, err := unstructured.NestedSlice( + output.UpdatedItem.UnstructuredContent(), + "spec", "etcd", "managed", "storage", "restoreSnapshotURL", + ) + if err != nil { + t.Fatalf("restoreSnapshotURL has unexpected type: %v", err) + } + return urls, found + }// call-site adjustments outside this hunk: urls, ok := extractRestoreSnapshotURL(t, output)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/restore_test.go` around lines 286 - 294, The helper extractRestoreSnapshotURL currently uses chained type assertions on output.UpdatedItem.UnstructuredContent() which can panic; replace it with the k8s unstructured helpers (e.g., unstructured.NestedMap / unstructured.NestedSlice or unstructured.NestedStringSlice) to safely traverse the hierarchy from the RestoreItemActionExecuteOutput UnstructuredContent, returning the slice and a bool or failing the test with clear assertions; update the function signature (extractRestoreSnapshotURL) to accept the output map or testing.T as needed and use unstructured.NestedMap(output, "spec", "etcd", "managed", "storage") and then unstructured.NestedSlice on "restoreSnapshotURL" to avoid runtime panics and make failures explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/core/restore_test.go`:
- Around line 55-62: The tests mutate package-global state via
common.SetK8sSAFilePath (using common.DefaultK8sSAFilePath) which can cause
test-order coupling when tests run in parallel; refactor by either (A) creating
a test helper (e.g., setK8sSAPathForTest) that takes t *testing.T, uses a
package-level sync.Mutex to serialize modifications, sets the path to
t.TempDir()+"/namespace", and registers t.Cleanup to restore the original path,
then replace the direct calls in restore_test.go (and at 278-285) with this
helper; or (B) better yet, change the code under test to accept the SA file path
as an injected parameter (constructor or function arg) and update tests to pass
the temp path instead of calling common.SetK8sSAFilePath.
- Around line 286-294: The helper extractRestoreSnapshotURL currently uses
chained type assertions on output.UpdatedItem.UnstructuredContent() which can
panic; replace it with the k8s unstructured helpers (e.g.,
unstructured.NestedMap / unstructured.NestedSlice or
unstructured.NestedStringSlice) to safely traverse the hierarchy from the
RestoreItemActionExecuteOutput UnstructuredContent, returning the slice and a
bool or failing the test with clear assertions; update the function signature
(extractRestoreSnapshotURL) to accept the output map or testing.T as needed and
use unstructured.NestedMap(output, "spec", "etcd", "managed", "storage") and
then unstructured.NestedSlice on "restoreSnapshotURL" to avoid runtime panics
and make failures explicit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9046ed8c-8764-4498-ae3a-a10bb3b04ce7
📒 Files selected for processing (1)
pkg/core/restore_test.go
4e6d26e to
d1dda3e
Compare
Document the full HCPEtcdBackup integration including architecture, backup/restore flows, configuration, credential handling, storage layout, dependency chain (PRs #8139, #8010, #8017, #8040, #8114, enhancement #1945), and troubleshooting guide. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
d1dda3e to
1bcf307
Compare
sdminonne
left a comment
There was a problem hiding this comment.
Just one comment but in general not that important and basically test related
| // Delete the bucket | ||
| out, err = e.awsCmd(t, "s3", "rb", fmt.Sprintf("s3://%s", e.bucket)) | ||
| if err != nil { | ||
| t.Logf("WARNING: failed to delete bucket %s: %v\n%s", e.bucket, err, out) |
There was a problem hiding this comment.
Just FYI if this fails with a warning we're going to leak the S3 right? Guess it should fail the test with a message and s3 bucket to be deleted
During restore, read the etcd snapshot S3 URL from the hypershift.openshift.io/etcd-snapshot-url annotation (persisted by the backup plugin since Velero strips status fields), convert it to a pre-signed HTTPS GET URL using AWS SigV4, and inject it into spec.etcd.managed.storage.restoreSnapshotURL so the etcd-init container can download the snapshot via curl. The s3presign package implements SigV4 signing using only the Go stdlib (no AWS SDK dependency), supporting virtual-hosted/path-style addressing, custom S3 endpoints, and STS session tokens. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…l Execute coverage Add table-driven tests for presignS3URL and restore Execute flow covering etcd snapshot URL injection, including s3://, https://, and no-annotation cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
1bcf307 to
07ca889
Compare
|
@jparrill: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/lgtm |
Summary
hypershift.openshift.io/etcd-snapshot-urlannotation (persisted by the backup plugin, since Velero strips status fields during restore)s3://URLs to pre-signed HTTPS GET URLs using AWS SigV4 (pure Go stdlib, no AWS SDK dependency) so theetcd-initcontainer can download the snapshot viacurlspec.etcd.managed.storage.restoreSnapshotURLon the HostedClusterDependencies
LastSuccessfulEtcdBackupURLfield in HostedCluster statusrestoreSnapshotURLCEL immutabilityNote for reviewers
New files
pkg/s3presign/presign.gopkg/s3presign/presign_test.gopkg/core/restore_test.gotests/integration/s3presign/presign_integration_test.goTest plan
go test ./pkg/s3presign/— 21 unit tests passgo test ./pkg/core/ -run TestPresign— 4 unit tests passgo test -tags=integration ./tests/integration/s3presign/— 4 integration tests pass against real S3etcdSnapshotmethod → delete HC → OADP restore → etcd-init downloads snapshot via pre-signed URL → etcd restores successfullyKnown issues
etcd-init.shfails on OCP 4.21+ because etcd 3.6 removedetcdctl snapshot restore(must useetcdutl). This is a HyperShift issue, not in this plugin.🤖 Generated with Claude Code