feat: add etcd data re-encryption after encryption key rotation#8219
feat: add etcd data re-encryption after encryption key rotation#8219muraee wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request introduces a new Sequence Diagram(s)sequenceDiagram
participant HC as HostedCluster
participant HCP as HostedControlPlane
participant Rec as Reencryption<br/>Reconciler
participant Secret as Encryption<br/>Config Secret
participant KAS as KAS<br/>Deployment
participant SVM as StorageVersion<br/>Migrator
HC->>Rec: Trigger (HCP/Secret/KAS event)
Rec->>HCP: Fetch HCP spec
alt No SecretEncryption configured
Rec->>HCP: Remove EtcdDataEncryptionUpToDate condition
Rec->>Rec: Exit
else SecretEncryption configured
Rec->>Rec: Compute active key fingerprint
alt Fingerprint cannot be computed
Rec->>Rec: Exit without changes
else Fingerprint computed
Rec->>Secret: Fetch encryption config Secret
alt Secret missing
Rec->>Rec: Requeue
else Secret exists
Rec->>Secret: Read EncryptionMigratedKeyHashAnnotation
alt Annotation absent
Rec->>Secret: Set annotation to active fingerprint
Rec->>Rec: Exit
else Annotation matches active fingerprint
Rec->>Rec: Exit without changes
else Annotation differs (key rotation detected)
Rec->>KAS: Check Deployment ready
alt KAS not ready
Rec->>HCP: Set EtcdDataEncryptionUpToDate=False<br/>(ReEncryptionWaitingForKAS)
Rec->>Rec: Requeue
else KAS ready
Rec->>SVM: Check migrator synced
alt Migrator not synced
Rec->>Rec: Requeue
else Migrator synced
Rec->>SVM: EnsureMigration for each<br/>encrypted resource
SVM->>SVM: Perform re-encryption
SVM-->>Rec: Return migration status
alt Migrations failed
Rec->>SVM: PruneMigration (for failed resources)
Rec->>HCP: Set EtcdDataEncryptionUpToDate=False<br/>(ReEncryptionFailedReason)
Rec->>Rec: Requeue
else Migrations in progress
Rec->>HCP: Set EtcdDataEncryptionUpToDate=False<br/>(ReEncryptionInProgressReason)
Rec->>Rec: Requeue
else All migrations succeeded
Rec->>HCP: Set EtcdDataEncryptionUpToDate=True<br/>(ReEncryptionCompletedReason)
Rec->>Secret: Update annotation to active fingerprint
end
end
end
end
end
end
end
🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…ion-migrator Vendor the KubeStorageVersionMigrator from library-go and the generated informer/lister packages from kube-storage-version-migrator. These are needed by the new HCCO re-encryption controller. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d5e5097 to
0d54a40
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 98-111: The code is writing the active key fingerprint and
unconditionally setting rekey-needed when the annotation was previously absent;
change the logic in the block that uses computeActiveKeyFingerprint,
EncryptionActiveKeyHashAnnotation and EncryptionRekeyNeededAnnotation so that
you always persist the fingerprint into secret.Annotations when fingerprint !=
"", but only set secret.Annotations[EncryptionRekeyNeededAnnotation] = "true" if
an existingHash was present (existingHash != "") and different from the new
fingerprint; do not mark rekey-needed when writing the hash for the first time.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-252: Reorder the finalization so the HCP status is updated to
show re-encryption completed before removing the rekey-needed annotation: call
setReEncryptionCondition(hcp, metav1.ConditionTrue,
hyperv1.ReEncryptionCompletedReason, ...) and persist that status update (as
done in the parent Reconcile flow) first, and only after the status update
succeeds call r.removeRekeyAnnotation(ctx, encryptionSecret); this ensures that
if the status update fails the reconcile will retry (via Reconcile's rekeyNeeded
logic) and that a later annotation-removal failure won't leave the condition
stuck in InProgress.
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 897-904: The current logic only sets EtcdDataEncryptionUpToDate on
hcluster when the HCP condition exists, which leaves a stale condition when
encryption is disabled, hcp is nil, or the HCP removes the condition; update the
block around hcp and hcluster.Spec.SecretEncryption to handle the
missing/disabled case by removing the bubbled condition instead of doing
nothing: check if hcp is nil or hcluster.Spec.SecretEncryption is nil or
meta.FindStatusCondition(hcp.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)) returns nil, and in those cases call
meta.RemoveStatusCondition(&hcluster.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)); otherwise (when cond exists)
continue to set cond.ObservedGeneration = hcluster.Generation and
meta.SetStatusCondition as currently done.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 3f62b72e-be23-40c7-8620-e6a59a833dde
⛔ Files ignored due to path filters (18)
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/errors.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess_processor.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/kubestorageversionmigrator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/metrics.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/types.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/factory.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/generic.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/internalinterfaces/factory_interfaces.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/interface.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/interface.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storagestate.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storageversionmigration.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/expansion_generated.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storagestate.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storageversionmigration.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (8)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.gocontrol-plane-operator/hostedclusterconfigoperator/api/scheme.gocontrol-plane-operator/hostedclusterconfigoperator/cmd.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go (1)
103-111:⚠️ Potential issue | 🟠 MajorDon't mark re-encryption needed on the first hash write.
When
EncryptionActiveKeyHashAnnotationis absent (empty string), the conditionexistingHash != fingerprintevaluates totrue, which setsencryption-rekey-needed=true. This will spuriously trigger re-encryption migrations for new clusters and for existing clusters upgrading to this version.The hash should always be persisted, but
rekey-neededshould only be set when a previous hash existed and differs from the new one.Suggested fix
if fingerprint != "" { if secret.Annotations == nil { secret.Annotations = map[string]string{} } existingHash := secret.Annotations[EncryptionActiveKeyHashAnnotation] - if existingHash != fingerprint { - secret.Annotations[EncryptionActiveKeyHashAnnotation] = fingerprint - secret.Annotations[EncryptionRekeyNeededAnnotation] = "true" + secret.Annotations[EncryptionActiveKeyHashAnnotation] = fingerprint + if existingHash != "" && existingHash != fingerprint { + secret.Annotations[EncryptionRekeyNeededAnnotation] = "true" } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go` around lines 103 - 111, The current logic in secrtencryption.go sets EncryptionRekeyNeededAnnotation when existingHash != fingerprint even if there was no previous hash, causing spurious re-encryption; update the block around secret.Annotations and EncryptionActiveKeyHashAnnotation so you always persist the new fingerprint but only set EncryptionRekeyNeededAnnotation="true" if an existing non-empty hash existed and is different (i.e., existingHash != "" && existingHash != fingerprint). Keep the nil-check for secret.Annotations and ensure the new fingerprint is written to EncryptionActiveKeyHashAnnotation in all cases.control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)
245-252:⚠️ Potential issue | 🟡 MinorConsider ordering: update status before removing annotation.
If the annotation removal at line 247 succeeds but the subsequent HCP status update in the parent
Reconcile(line 138) fails, the next reconcile will findrekeyNeeded != "true"(line 167-168) and return early. This leaves the condition potentially stuck in its previous state (e.g.,InProgress) rather thanCompleted.To ensure consistency, consider splitting this into two reconcile cycles or restructuring to ensure status is persisted before annotation removal is committed.
Alternative approach: set condition before annotation removal
// All migrations succeeded — remove the rekey-needed annotation and set condition True. log.Info("All re-encryption migrations completed successfully") + setReEncryptionCondition(hcp, metav1.ConditionTrue, hyperv1.ReEncryptionCompletedReason, + "All etcd data has been re-encrypted with the active encryption key") + + // Note: If annotation removal fails after status update succeeds in parent Reconcile, + // the next reconcile will re-process and retry annotation removal (migrations already done). if err := r.removeRekeyAnnotation(ctx, encryptionSecret); err != nil { return reconcile.Result{}, fmt.Errorf("failed to remove rekey-needed annotation: %w", err) } - setReEncryptionCondition(hcp, metav1.ConditionTrue, hyperv1.ReEncryptionCompletedReason, - "All etcd data has been re-encrypted with the active encryption key") return reconcile.Result{}, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go` around lines 245 - 252, Reorder the success path so the HCP status is updated to ReEncryptionCompleted before removing the rekey-needed annotation: call setReEncryptionCondition(hcp, metav1.ConditionTrue, hyperv1.ReEncryptionCompletedReason, ...) and ensure the status update is persisted (via the same status update code used in Reconcile) prior to calling removeRekeyAnnotation(ctx, encryptionSecret); if persisting status can fail, return the error and only remove the annotation after status update succeeds, ensuring the Reconcile loop sees the completed condition even if annotation removal later fails; references: setReEncryptionCondition, removeRekeyAnnotation, Reconcile, rekeyNeeded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 103-111: The current logic in secrtencryption.go sets
EncryptionRekeyNeededAnnotation when existingHash != fingerprint even if there
was no previous hash, causing spurious re-encryption; update the block around
secret.Annotations and EncryptionActiveKeyHashAnnotation so you always persist
the new fingerprint but only set EncryptionRekeyNeededAnnotation="true" if an
existing non-empty hash existed and is different (i.e., existingHash != "" &&
existingHash != fingerprint). Keep the nil-check for secret.Annotations and
ensure the new fingerprint is written to EncryptionActiveKeyHashAnnotation in
all cases.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-252: Reorder the success path so the HCP status is updated to
ReEncryptionCompleted before removing the rekey-needed annotation: call
setReEncryptionCondition(hcp, metav1.ConditionTrue,
hyperv1.ReEncryptionCompletedReason, ...) and ensure the status update is
persisted (via the same status update code used in Reconcile) prior to calling
removeRekeyAnnotation(ctx, encryptionSecret); if persisting status can fail,
return the error and only remove the annotation after status update succeeds,
ensuring the Reconcile loop sees the completed condition even if annotation
removal later fails; references: setReEncryptionCondition,
removeRekeyAnnotation, Reconcile, rekeyNeeded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 969834bc-33b1-4893-9f05-17b72afb37c2
⛔ Files ignored due to path filters (18)
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/errors.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess_processor.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/kubestorageversionmigrator.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/metrics.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/types.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/factory.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/generic.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/internalinterfaces/factory_interfaces.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/interface.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/interface.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storagestate.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storageversionmigration.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/expansion_generated.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storagestate.gois excluded by!vendor/**,!**/vendor/**vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storageversionmigration.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (8)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.gocontrol-plane-operator/hostedclusterconfigoperator/api/scheme.gocontrol-plane-operator/hostedclusterconfigoperator/cmd.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (1)
- control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- control-plane-operator/hostedclusterconfigoperator/cmd.go
- control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- api/hypershift/v1beta1/hostedcluster_conditions.go
0d54a40 to
e29501b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)
245-251:⚠️ Potential issue | 🟠 MajorPersist
ReEncryptionCompletedbefore clearingencryption-rekey-needed.
removeRekeyAnnotation()runs before the outerReconcile()call writeshcp.Status. If the secret update succeeds but that later status update fails, the next pass seesrekey-neededabsent and exits early, leavingEtcdDataEncryptionUpToDatestuck at its previous value. Move annotation removal to a second phase after the completed condition has been successfully persisted.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go` around lines 245 - 251, The code currently removes the encryption-rekey-needed annotation (removeRekeyAnnotation) before the controller persists the updated HCP status (setReEncryptionCondition), which can cause the requeue logic to skip remaining work if the status update later fails; change the order so you first set and persist the ReEncryptionCompleted condition on the HostedControlPlane (call setReEncryptionCondition and ensure the status update is saved via the status writer used in Reconcile), verify that the status update succeeded, and only then call removeRekeyAnnotation to delete the encryption-rekey-needed annotation; use the existing functions removeRekeyAnnotation, setReEncryptionCondition and the same status update flow in Reconcile to implement this two-phase commit.hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)
897-904:⚠️ Potential issue | 🟠 MajorRemove the bubbled condition when HCP no longer reports it.
This only copies
EtcdDataEncryptionUpToDatewhen the HCP currently has it. If encryption is disabled, the HCP is gone, or the reencryption controller removes the HCP condition, the old HostedCluster value sticks around and becomes stale.Suggested fix
- // Bubble up the EtcdDataEncryptionUpToDate condition from HCP when encryption is configured. - if hcp != nil && hcluster.Spec.SecretEncryption != nil { - cond := meta.FindStatusCondition(hcp.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)) - if cond != nil { - cond.ObservedGeneration = hcluster.Generation - meta.SetStatusCondition(&hcluster.Status.Conditions, *cond) - } - } + // Bubble up the EtcdDataEncryptionUpToDate condition from HCP when encryption is configured. + if hcluster.Spec.SecretEncryption == nil || hcp == nil { + meta.RemoveStatusCondition(&hcluster.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)) + } else if cond := meta.FindStatusCondition(hcp.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)); cond != nil { + cond.ObservedGeneration = hcluster.Generation + meta.SetStatusCondition(&hcluster.Status.Conditions, *cond) + } else { + meta.RemoveStatusCondition(&hcluster.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` around lines 897 - 904, The current bubble-up only sets EtcdDataEncryptionUpToDate when HCP has the condition, but never removes it when HCP no longer reports it; update the block in hostedcluster_controller.go that references hcp, hcluster, meta.FindStatusCondition and meta.SetStatusCondition to also handle the nil case: if hcp is nil, hcluster.Spec.SecretEncryption is nil, or meta.FindStatusCondition(...) returns nil, call meta.RemoveStatusCondition(&hcluster.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)) to delete the stale condition; otherwise keep the existing behavior of updating ObservedGeneration and calling meta.SetStatusCondition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 127-130: The Get calls that fetch the HostedControlPlane (hcp :=
resourcemanifests.HostedControlPlane(...) followed by r.cpClient.Get(...)) and
the similar Gets for the encryption config Secret and KAS Deployment should
treat apierrors.IsNotFound(err) as a transient condition instead of returning a
hard error; update the error handling in the r.cpClient.Get(...) blocks
(including the HCP fetch and the other Get calls around 158-176) to check
apierrors.IsNotFound(err) and return a non-error reconcile result (e.g.,
reconcile.Result{} or a short requeue with nil error) so the controller
idles/publishes waiting state instead of entering error backoff.
- Around line 165-166: The code uses
encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation] (and the
same pattern at the other occurrence around lines ~200–201) as the writeKey for
EnsureMigration which is a deterministic key fingerprint and can collide when
rotating back to a previously used key; change this to a monotonic rotation
token annotation (e.g., kas.EncryptionRotationTokenAnnotation) that is
incremented (or replaced with a monotonic nonce) each time a rotation is
initiated and persisted on encryptionSecret, and update the code that reads the
annotation (the places referencing
encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to read the
rotation token instead and pass that token as the writeKey into EnsureMigration
so each rotation event is unique even if the underlying key value repeats.
---
Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-251: The code currently removes the encryption-rekey-needed
annotation (removeRekeyAnnotation) before the controller persists the updated
HCP status (setReEncryptionCondition), which can cause the requeue logic to skip
remaining work if the status update later fails; change the order so you first
set and persist the ReEncryptionCompleted condition on the HostedControlPlane
(call setReEncryptionCondition and ensure the status update is saved via the
status writer used in Reconcile), verify that the status update succeeded, and
only then call removeRekeyAnnotation to delete the encryption-rekey-needed
annotation; use the existing functions removeRekeyAnnotation,
setReEncryptionCondition and the same status update flow in Reconcile to
implement this two-phase commit.
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 897-904: The current bubble-up only sets
EtcdDataEncryptionUpToDate when HCP has the condition, but never removes it when
HCP no longer reports it; update the block in hostedcluster_controller.go that
references hcp, hcluster, meta.FindStatusCondition and meta.SetStatusCondition
to also handle the nil case: if hcp is nil, hcluster.Spec.SecretEncryption is
nil, or meta.FindStatusCondition(...) returns nil, call
meta.RemoveStatusCondition(&hcluster.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)) to delete the stale condition;
otherwise keep the existing behavior of updating ObservedGeneration and calling
meta.SetStatusCondition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: fe17a0ce-bc5e-4d0c-87ad-e9027725f712
⛔ Files ignored due to path filters (4)
control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yamlis excluded by!**/testdata/**docs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (9)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.gocontrol-plane-operator/hostedclusterconfigoperator/api/scheme.gocontrol-plane-operator/hostedclusterconfigoperator/cmd.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.gogo.modhypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (1)
- control-plane-operator/hostedclusterconfigoperator/api/scheme.go
🚧 Files skipped from review as they are similar to previous changes (4)
- control-plane-operator/hostedclusterconfigoperator/cmd.go
- control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
- api/hypershift/v1beta1/hostedcluster_conditions.go
- control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8219 +/- ##
==========================================
+ Coverage 34.63% 34.71% +0.07%
==========================================
Files 767 768 +1
Lines 93186 93540 +354
==========================================
+ Hits 32277 32470 +193
- Misses 58236 58372 +136
- Partials 2673 2698 +25
🚀 New features to boost your workflow:
|
e29501b to
e1b396f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (2)
254-263:⚠️ Potential issue | 🟡 MinorCompleted status can still be lost on the failure path.
Line 261 persists annotation removal before the caller persists the updated HCP status. If
removeRekeyAnnotation()succeeds and the laterStatus().Update()fails, the next reconcile hitsrekeyNeeded != "true"and exits early, leavingEtcdDataEncryptionUpToDatestale instead ofReEncryptionCompleted.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go` around lines 254 - 263, The code removes the rekey-needed annotation before the updated HCP status is persisted, risking loss of the Completed condition if the subsequent status update fails; change the order so you persist the updated HCP status with the ReEncryptionCompleted condition (call the HCP status update via r.Client.Status().Update or the existing status update helper) first, check that it succeeded, and only then call removeRekeyAnnotation(encryptionSecret); reference setReEncryptionCondition, the HCP Status().Update call, and removeRekeyAnnotation to locate and reorder the operations.
169-171:⚠️ Potential issue | 🟠 MajorUse a rotation token here, not the active-key fingerprint.
activeKeyHashidentifies the current key, not the rotation event. If a cluster rotates A → B → A, this passes the samewriteKeyback intoEnsureMigration, so an older completed migration for key A can satisfy the check even though the latest rotation still needs objects rewritten from B to A.Also applies to: 208-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go` around lines 169 - 171, The code is using the active-key fingerprint (encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to drive EnsureMigration, which breaks for A→B→A rotations; instead read and use the rotation token annotation (e.g., encryptionSecret.Annotations[kas.EncryptionRotationTokenAnnotation]) as the writeKey passed to EnsureMigration so each rotation event is unique; update the usage in the block that reads rekeyNeeded and activeKeyHash and the duplicate occurrence around the EnsureMigration call (the lines referencing kas.EncryptionActiveKeyHashAnnotation and the EnsureMigration/writeKey variable) to use the rotation-token annotation and pass that token through.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 188-199: The fingerprinting currently includes the Secret name
(secretEncryption.AESCBC.ActiveKey.Name) which causes renames/copies to change
the fingerprint; change the input construction in the AESCBC case to only
fingerprint the key bytes: compute the SHA256 of
activeKeySecret.Data[hyperv1.AESCBCKeySecretKey] and set input to include only
the algorithm and the hash (e.g. "aescbc/<sha256>"), removing any use of
ActiveKey.Name; ensure you reference activeKeySecret and
hyperv1.AESCBCKeySecretKey when locating the key bytes and preserve existing
error handling around cpContext.Client.Get.
---
Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 254-263: The code removes the rekey-needed annotation before the
updated HCP status is persisted, risking loss of the Completed condition if the
subsequent status update fails; change the order so you persist the updated HCP
status with the ReEncryptionCompleted condition (call the HCP status update via
r.Client.Status().Update or the existing status update helper) first, check that
it succeeded, and only then call removeRekeyAnnotation(encryptionSecret);
reference setReEncryptionCondition, the HCP Status().Update call, and
removeRekeyAnnotation to locate and reorder the operations.
- Around line 169-171: The code is using the active-key fingerprint
(encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to drive
EnsureMigration, which breaks for A→B→A rotations; instead read and use the
rotation token annotation (e.g.,
encryptionSecret.Annotations[kas.EncryptionRotationTokenAnnotation]) as the
writeKey passed to EnsureMigration so each rotation event is unique; update the
usage in the block that reads rekeyNeeded and activeKeyHash and the duplicate
occurrence around the EnsureMigration call (the lines referencing
kas.EncryptionActiveKeyHashAnnotation and the EnsureMigration/writeKey variable)
to use the rotation-token annotation and pass that token through.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: b08bffff-7931-4567-990b-d445f0cabf23
⛔ Files ignored due to path filters (4)
control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yamlis excluded by!**/testdata/**docs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (9)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.gocontrol-plane-operator/hostedclusterconfigoperator/api/scheme.gocontrol-plane-operator/hostedclusterconfigoperator/cmd.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.gogo.modhypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (3)
- control-plane-operator/hostedclusterconfigoperator/api/scheme.go
- go.mod
- control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
- api/hypershift/v1beta1/hostedcluster_conditions.go
e1b396f to
d9f1f6a
Compare
|
I now have all the information needed for a complete root cause analysis. Here is the report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorRoot CauseThe first commit in the PR (`fab20778d3`) has a title that **violates the Conventional Commits specification** enforced by the repository's `.gitlint` configuration.The commit title is: The The prefix The second commit ( RecommendationsThese are some recommendations based on the failure analysis1. **Reword the first commit's title** to use an allowed Conventional Commits prefix. The most appropriate prefix for a vendoring commit is `chore` or `build`: ``` chore: vendor library-go migrators and kube-storage-version-migrator informer/lister ``` or: ``` build: vendor library-go migrators and kube-storage-version-migrator informer/lister ```
EvidenceThis is some collected evidence that helped reach the mentioned root cause
|
d9f1f6a to
968dac4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)
135-136:⚠️ Potential issue | 🟡 MinorHandle deleted
HostedControlPlaneas a no-op.If the watched HCP is already gone, this returns an error and the controller backs off during normal teardown. Treat
apierrors.IsNotFound(err)here the same way as the Secret/Deployment lookups and just exit cleanly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go` around lines 135 - 136, When fetching the HostedControlPlane (hcp) with r.cpClient.Get in the Reconcile handler, treat a not-found error as a benign delete and exit cleanly instead of returning an error; update the error handling around r.cpClient.Get(ctx, client.ObjectKeyFromObject(hcp), hcp) to check apierrors.IsNotFound(err) and return reconcile.Result{}, nil for that case, while preserving the existing fmt.Errorf path for other errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 185-200: The code treats a missing
EncryptionMigratedKeyHashAnnotation as "first deployment" and initializes it to
activeKeyHash (via setMigratedKeyHash), which incorrectly skips re-encryption
when kas-secret-encryption-config is recreated during a pending rotation; modify
the logic in the reconcile path that reads EncryptionMigratedKeyHashAnnotation
and compares it to activeKeyHash so that both the "initialize migrated key hash"
branch and the migratedKeyHash == activeKeyHash fast-path are only taken when
the rekey-needed signal (hypershift.openshift.io/encryption-rekey-needed) is not
present/true; i.e., check the hypershift.openshift.io/encryption-rekey-needed
annotation before calling setMigratedKeyHash or returning early on equal hashes
so pending rekey flows continue to perform re-encryption.
- Around line 363-367: The code uses
activeKeySecret.Data[hyperv1.AESCBCKeySecretKey] without validating it and then
hashes potentially empty bytes; update the logic around the c.Get(...) result in
reencryption.go to check that activeKeySecret.Data contains a non-empty entry
for hyperv1.AESCBCKeySecretKey before computing input (the fingerprint), and if
the key is missing or empty return a clear error (or handle it upstream) instead
of generating a fingerprint for empty bytes; adjust any callers that rely on
input/EncryptionMigratedKeyHashAnnotation to expect this error path.
---
Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 135-136: When fetching the HostedControlPlane (hcp) with
r.cpClient.Get in the Reconcile handler, treat a not-found error as a benign
delete and exit cleanly instead of returning an error; update the error handling
around r.cpClient.Get(ctx, client.ObjectKeyFromObject(hcp), hcp) to check
apierrors.IsNotFound(err) and return reconcile.Result{}, nil for that case,
while preserving the existing fmt.Errorf path for other errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 68ffe0d4-c58d-4278-b70b-957f7c8b917b
⛔ Files ignored due to path filters (4)
control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yamlis excluded by!**/testdata/**docs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (8)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.gocontrol-plane-operator/hostedclusterconfigoperator/api/scheme.gocontrol-plane-operator/hostedclusterconfigoperator/cmd.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.gogo.modhypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (3)
- control-plane-operator/hostedclusterconfigoperator/api/scheme.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
🚧 Files skipped from review as they are similar to previous changes (1)
- go.mod
|
Re: @coderabbitai's comment on AESCBC key validation: Done. Added validation to return empty fingerprint when key data is missing or empty. AI-assisted response via Claude Code |
|
Re: @coderabbitai's comment on missing migrated-hash during rotation: The The scenario described (secret recreated during pending rotation) requires manual deletion of an operator-managed secret, which would also disrupt KAS itself. In that edge case, the controller initializes the migrated hash to the current active key, which is correct — the CPO will recreate the secret with the current encryption config, and KAS will use the current key going forward. If re-encryption for the previous key was incomplete, the next key rotation will trigger a full re-encryption. AI-assisted response via Claude Code |
968dac4 to
2b0d02c
Compare
|
Tip For best results, initiate chat on the files or code changes.
|
|
Tip For best results, initiate chat on the files or code changes.
The fingerprint comparison approach (
|
There was a problem hiding this comment.
Review Summary
This is a solid PR overall — clean controller design, good test coverage, and the security-conscious approach to fingerprinting (SHA-256 of key identity, no raw material in annotations) is well done. A few things worth addressing before merge.
Things to fix
The most important issue: computeActiveKeyFingerprint silently returns ("", nil) for any unrecognized SecretEncryption.Type (reencryption.go:371-372). If someone adds a new encryption type to the API and forgets to update this function, re-encryption will just never trigger — no error, no log, nothing. This is especially surprising given that line 351 already correctly returns an error for unsupported KMS providers. The outer default case should do the same.
The PR description needs updating — it describes the CPO computing fingerprints and setting EncryptionActiveKeyHashAnnotation / EncryptionRekeyNeededAnnotation annotations, but the actual implementation has the HCCO computing fingerprints directly from the HCP spec using EncryptionMigratedKeyHashAnnotation. The description should match what was built.
ReEncryptionWaitingForKAS at hostedcluster_conditions.go:336 is missing the Reason suffix that the other three constants have (ReEncryptionInProgressReason, ReEncryptionCompletedReason, ReEncryptionFailedReason). Should be ReEncryptionWaitingForKASReason.
In reencryption.go:140-146, if r.reconcile() fails and the subsequent Status().Update() also fails, the original reconcile error gets silently dropped. Only the status update error bubbles up, which will make debugging painful.
The comment on EncryptionMigratedKeyHashAnnotation (reencryption.go:44-48) says it's "Written by this controller after all StorageVersionMigrations complete," but it's also written at line 192 during first-deployment initialization with no migrations involved. Worth fixing so it doesn't mislead someone later.
Worth considering
The no-op AddEventHandler at reencryption.go:113-116 looks like dead code, but it actually initializes the cacheSynced function pointer inside KubeStorageVersionMigrator — without it HasSynced() would panic. The comment should explain why it's there so nobody helpfully removes it.
If the EncryptionMigratedKeyHashAnnotation gets accidentally deleted after a key rotation (reencryption.go:187-194), the controller treats it as a fresh first deployment and just stamps the current hash without re-encrypting. The log at line 191 says "first deployment" which would be confusing in that scenario. A warning-level log with a more descriptive message would help.
On the testing side: the EtcdDataEncryptionUpToDate condition bubble-up logic in hostedcluster_controller.go:897-906 has three branches and is the user-facing surface of this feature, but it's untested. Similarly, EnsureMigration returning an infrastructure error is never exercised in tests since the mock always returns nil for the error. And there are no tests for NotFound scenarios (encryption config secret absent, KAS deployment absent, AESCBC key secret absent) — these are handled gracefully in the code but would benefit from coverage.
Minor: the condition documentation at hostedcluster_conditions.go:250-251 is inconsistent — the **False** description omits the WaitingForKASConvergence reason, and Absent: uses different formatting than **True**/**False** above it.
What's good
The controller design is clean — separating outer Reconcile for status orchestration from inner reconcile for business logic is a nice pattern. The test suite is thorough with 9 table-driven cases plus dedicated fingerprint and prune tests, all following project conventions (Gherkin naming, gomega, WithStatusSubresource). The condition bubble-up is actually more robust than existing patterns like ValidAWSKMSConfig/ValidAzureKMSConfig since it properly removes the condition when the HCP is nil or the condition is absent. Good reuse of existing utilities throughout.
| default: | ||
| return "", nil |
There was a problem hiding this comment.
critical: Silent empty return for unknown encryption types
The outer default case silently returns ("", nil) for any unrecognized SecretEncryption.Type. If a new encryption type is added to the API without updating this function, re-encryption will silently never trigger — no error, no log, no condition. The user gets no indication that their data isn't being re-encrypted.
Contrast with line 351 where an unsupported KMS provider correctly returns an error:
return "", fmt.Errorf("unsupported KMS provider: %s", secretEncryption.KMS.Provider)The outer default should follow the same pattern:
default:
return "", fmt.Errorf("unsupported encryption type: %s", secretEncryption.Type)There was a problem hiding this comment.
Done. Changed the outer default to return an error for unsupported encryption types. Added a corresponding test case in TestComputeActiveKeyFingerprint.
AI-assisted response via Claude Code
| ReEncryptionInProgressReason = "ReEncryptionInProgress" | ||
| ReEncryptionCompletedReason = "ReEncryptionCompleted" | ||
| ReEncryptionFailedReason = "ReEncryptionFailed" | ||
| ReEncryptionWaitingForKAS = "WaitingForKASConvergence" |
There was a problem hiding this comment.
important: Missing Reason suffix
The other three re-encryption reason constants all use the *Reason suffix (ReEncryptionInProgressReason, ReEncryptionCompletedReason, ReEncryptionFailedReason), but this one omits it. The overwhelming majority of reason constants in this file follow this convention.
| ReEncryptionWaitingForKAS = "WaitingForKASConvergence" | |
| ReEncryptionWaitingForKASReason = "WaitingForKASConvergence" |
(All references would need updating too.)
There was a problem hiding this comment.
Done. Renamed ReEncryptionWaitingForKAS to ReEncryptionWaitingForKASReason for consistency with the other three reason constants. Updated all references.
AI-assisted response via Claude Code
| result, err := r.reconcile(ctx, hcp) | ||
|
|
||
| // Update HCP status if changed. | ||
| if !reflect.DeepEqual(hcp.Status, originalHCP.Status) { | ||
| log.Info("Updating HCP status with re-encryption condition") | ||
| if updateErr := r.cpClient.Status().Update(ctx, hcp); updateErr != nil { | ||
| return reconcile.Result{}, fmt.Errorf("failed to update HCP status: %w", updateErr) |
There was a problem hiding this comment.
important: Original reconcile error lost when status update also fails
If r.reconcile() returns an error AND has mutated hcp.Status, and then Status().Update() also fails (e.g., conflict), the original reconcile error is silently dropped — only the status update error is returned. This makes debugging harder because the root cause error is lost.
Consider combining both errors:
if updateErr := r.cpClient.Status().Update(ctx, hcp); updateErr != nil {
if err != nil {
return reconcile.Result{}, fmt.Errorf("failed to update HCP status: %w (original error: %v)", updateErr, err)
}
return reconcile.Result{}, fmt.Errorf("failed to update HCP status: %w", updateErr)
}Or check the reconcile error first and return it immediately (the hcpstatus controller follows a similar pattern where it checks reconcile error before status update). In practice, the double-failure case is rare, but when it happens the lost error makes diagnosis very difficult.
There was a problem hiding this comment.
Done. Now preserves the original reconcile error in the status update failure message: fmt.Errorf("failed to update HCP status: %w (original error: %v)", updateErr, err).
AI-assisted response via Claude Code
| // EncryptionMigratedKeyHashAnnotation stores the fingerprint of the encryption | ||
| // key that etcd data was last re-encrypted with. Written by this controller | ||
| // after all StorageVersionMigrations complete. Compared against the active key | ||
| // fingerprint (computed from HCP spec) to decide whether re-encryption is needed. | ||
| EncryptionMigratedKeyHashAnnotation = "hypershift.openshift.io/encryption-migrated-key-hash" |
There was a problem hiding this comment.
important: Comment is factually incomplete
The comment says this annotation is "Written by this controller after all StorageVersionMigrations complete," but the annotation is also written at line 192 during first-deployment initialization without any migrations running. A future maintainer reading only this comment would incorrectly conclude that the annotation's presence always implies a full re-encryption cycle completed.
Consider something like:
Stores the fingerprint of the encryption key that etcd data is known to be encrypted with. Initialized on first deployment (assuming data is already encrypted with the current key), and updated after all StorageVersionMigrations complete following a key rotation.
There was a problem hiding this comment.
Done. Updated the annotation comment to describe the full lifecycle: initialized on first deployment assuming data is already encrypted, and updated after all StorageVersionMigrations complete following a key rotation. Also noted the comparison against the active key fingerprint.
AI-assisted response via Claude Code
| // Register an event handler on the SVM informer so the migrator's cache stays | ||
| // in sync. The controller reconciles on a requeue interval while migrations are | ||
| // in progress, so we just need a no-op handler for cache sync. | ||
| if _, err := migratorInstance.AddEventHandler(cache.ResourceEventHandlerFuncs{}); err != nil { |
There was a problem hiding this comment.
suggestion: Comment should explain the nil-deref risk if this is removed
The no-op cache.ResourceEventHandlerFuncs{} looks removable, but AddEventHandler actually initializes the cacheSynced function pointer inside KubeStorageVersionMigrator (see kubestorageversionmigrator.go:47). Without this call, HasSynced() at line 224 would panic with a nil pointer dereference.
Consider updating the comment:
// Register an event handler on the SVM informer. This call is required
// because AddEventHandler initializes the migrator's internal cacheSynced
// function; without it, HasSynced() will panic. The handler itself is a
// no-op because the controller reconciles on a requeue interval while
// migrations are in progress.There was a problem hiding this comment.
Done. Updated the comment to explain that AddEventHandler initializes the migrator's internal cacheSynced function pointer, and without it HasSynced() will panic with a nil dereference.
AI-assisted response via Claude Code
| if migratedKeyHash == "" { | ||
| // First deployment or upgrade — no prior migration record. | ||
| // Initialize the migrated hash to the active hash without re-encrypting, | ||
| // since the data is already encrypted with the current key. | ||
| log.Info("Initializing migrated key hash (first deployment)", "activeKeyHash", activeKeyHash) | ||
| if err := r.setMigratedKeyHash(ctx, encryptionSecret, activeKeyHash); err != nil { | ||
| return reconcile.Result{}, fmt.Errorf("failed to initialize migrated key hash: %w", err) | ||
| } |
There was a problem hiding this comment.
suggestion: Silent re-initialization risk if annotation is deleted
If the EncryptionMigratedKeyHashAnnotation is accidentally deleted (by a user, a bug, or a controller recreating the secret) after a key rotation has occurred but before re-encryption completed, the controller will treat this as "first deployment" and initialize the migrated hash to the active hash — silently skipping re-encryption of data still encrypted with the old key.
Consider logging at Warning level instead of Info here, so accidental annotation deletion is at least observable. Or adding a brief comment noting the assumption (that the annotation absence genuinely means first deployment, not data loss).
There was a problem hiding this comment.
Done. Changed to log.V(1).Info(...) so it appears in debug output, and added a caveat comment noting that if the annotation is accidentally deleted after a key rotation, this will skip re-encryption for that rotation but the next rotation will trigger a full re-encryption.
AI-assisted response via Claude Code
| // **False** means re-encryption is in progress or has failed. | ||
| // Absent: encryption is not configured or no key rotation has occurred. |
There was a problem hiding this comment.
suggestion: Condition doc could be slightly more precise
Two minor points:
**False**can also meanWaitingForKASConvergence(a distinct pre-condition state), not just "in progress or failed." Consider:**False** means the KAS deployment has not yet converged, re-encryption is in progress, or re-encryption has failed. Check the Reason field for specifics.Absent:uses different formatting than**True**and**False**above it. Consider**Absent**for consistency.
There was a problem hiding this comment.
Done. Added **Absent** with bold formatting for consistency, and updated the **False** description to mention the WaitingForKASConvergence reason alongside in-progress and failed states.
AI-assisted response via Claude Code
everettraven
left a comment
There was a problem hiding this comment.
I agree with @cblecker 's comments related to GoDoc verbosity on the new condition type for clarifying what each state+reason pairing represents, but other than that this LGTM from an API perspective.
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2b0d02c to
851c998
Compare
|
Approved from API perspective. /approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: everettraven, muraee The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@muraee: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Now I have all the information needed. Let me compile the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThree independent issues block this PR. Tide reports a merge conflict with the target branch — this is the "error" state, not a reflection of the other failures. Lint fails because a single Root Cause1. Merge Conflict (Tide — error state) 2. Lint Failure — 3. Verify Failure — stale generated docs
Recommendations
Evidence
|
Summary
StorageVersionMigrationCRs to re-encrypt all existing etcd data with the new active keyEtcdDataEncryptionUpToDatecondition on HCP/HostedClusterKubeStorageVersionMigratorand kube-storage-version-migrator informer/lister packagesDetails
Problem
HyperShift supports encryption key rotation (setting a new active key + backup key), but has no mechanism to re-encrypt existing etcd data with the new key. Old data remains encrypted with the previous key indefinitely, violating compliance requirements (OCPSTRAT-2527, OCPSTRAT-2540).
Solution
Key change detection is performed entirely within the HCCO re-encryption controller — no CPO changes needed. The controller computes a SHA-256 fingerprint of the active key identity from the HCP spec on each reconcile and compares it against a
hypershift.openshift.io/encryption-migrated-key-hashannotation on thekas-secret-encryption-configsecret. When the hashes differ, re-encryption is triggered.API (
api/hypershift/v1beta1/hostedcluster_conditions.go):EtcdDataEncryptionUpToDatecondition typeReEncryptionInProgress,ReEncryptionCompleted,ReEncryptionFailed,WaitingForKASConvergenceHCCO (
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/):reencryptioncontroller using library-go'sKubeStorageVersionMigratorcomputeActiveKeyFingerprint()— SHA-256 hash of active key identity (supports AWS KMS, Azure KMS, IBM Cloud KMS, AESCBC)StorageVersionMigrationCRs in the guest cluster for all encrypted resources (secrets, configmaps, routes, oauth tokens)PruneMigrationEtcdDataEncryptionUpToDatecondition on HCP with detailed progress messagesHyperShift Operator (
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go):EtcdDataEncryptionUpToDatecondition from HCP to HostedCluster (only when encryption is configured and condition exists)Vendoring
library-go/pkg/operator/encryption/controllers/migrators/—KubeStorageVersionMigrator,Migratorinterfacekube-storage-version-migrator/pkg/clients/informer/andlister/— SVM informer/listerRef: OCPSTRAT-2527, OCPSTRAT-2540
Enhancement: openshift/enhancements#1969
Test plan
computeActiveKeyFingerprint()— all provider types, determinism, different keys produce different hashes, unsupported type returns errorgo build ./...passesgo test -racepasses for all modified packages🤖 Generated with Claude Code