OCPSTRAT-2527, OCPSTRAT-2540: Enhancement: etcd data re-encryption for key rotation in HyperShift by muraee · Pull Request #1969 · openshift/enhancements

muraee · 2026-04-09T16:17:03Z

Summary

Add enhancement proposal for etcd data re-encryption after encryption key rotation in HyperShift
Introduces a new HCCO controller that leverages KubeStorageVersionMigrator from library-go to create StorageVersionMigration CRs in the guest cluster, transparently re-encrypting all encrypted resources with the active key
Adds EtcdDataEncryptionUpToDate condition on HCP/HostedCluster for progress tracking
Guards against premature backup key removal
Supports all encryption types (Azure KMS, AWS KMS, IBM Cloud KMS, AESCBC)

Tracks: OCPSTRAT-2527, OCPSTRAT-2540
Related: ARO-21568, ARO-21456

Test plan

Unit tests for key fingerprint computation and controller reconciliation logic
Integration tests for StorageVersionMigration CR lifecycle
E2E tests for Azure KMS and AESCBC key rotation with re-encryption

🤖 Generated with Claude Code

ardaguclu

I specifically focused on Why KubeStorageVersionMigrator Instead of MigrationController section. It looks good to me. I dropped a comment more about agreement instead of any objection.

Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot · 2026-04-16T12:46:34Z

@muraee: This pull request references OCPSTRAT-2527 which is a valid jira issue.

This pull request references OCPSTRAT-2540 which is a valid jira issue.

Details

In response to this:

Summary

Add enhancement proposal for etcd data re-encryption after encryption key rotation in HyperShift

Introduces a new HCCO controller that leverages KubeStorageVersionMigrator from library-go to create StorageVersionMigration CRs in the guest cluster, transparently re-encrypting all encrypted resources with the active key

Adds EtcdDataEncryptionUpToDate condition on HCP/HostedCluster for progress tracking

Guards against premature backup key removal

Supports all encryption types (Azure KMS, AWS KMS, IBM Cloud KMS, AESCBC)

Tracks: OCPSTRAT-2527, OCPSTRAT-2540
Related: ARO-21568, ARO-21456

Test plan

Unit tests for key fingerprint computation and controller reconciliation logic

Integration tests for StorageVersionMigration CR lifecycle

E2E tests for Azure KMS and AESCBC key rotation with re-encryption

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

csrwng · 2026-04-16T19:09:54Z

Just a couple of thoughts for kms in general:

we currently run the kube-storage-version-migrator in the data plane. I really see no good reason to do so, since we don't need workers to migrate storage, but having no workers would block migration if we leave it where it is.
(orthogonal but relevant) we're not consistently encrypting resources. Latest code seems to enable encryption for secrets if doing aescbc encryption, but secrets,configmaps,routes,oauthaccesstokens,oauthauthorizetokens when doing kms encryption. Furthermore, routes,oauthaccesstokens, and oauthauthorizetokens are not getting encrypted in any case because we would need the kms sidecar on the openshift apiservers that serve those resources.
we should fix the API. Currently, we only allow main/backup key of the same type (all aescbc, or all kms, etc). It should be possible to have multiple keys of different types given that it's what upstream kubernetes allows.
we're not doing key rotation properly. When we introduce a new main key, that key should first be added as a backup/read key to all instances of kube-apiserver. Once that has rolled out, then we can start making the new key the write key. Otherwise, when we introduce the new write key to the first instance of kube-apiserver, it could start encrypting with the new key, the other instances that have not been updated, will potentially start crashlooping because they cannot decode the secrets encoded with the new key.

csrwng · 2026-04-16T19:15:35Z

Could we reflect in HostedCluster status which keys are actively being used and reject spec changes that could potentially result in data loss?

…ackupKey - Deploy kube-storage-version-migrator in HCP namespace (control plane) instead of data plane, enabling re-encryption with zero worker nodes - Disable data-plane operator via annotation removal in cluster-kube-storage-version-migrator-operator repo - Add status.secretEncryption.activeKey field to HC/HCP with full key spec for rotation detection and EncryptionConfiguration resilience - Deprecate backupKey spec fields in favor of status-based tracking - Update workflow, architecture, risks, and support procedures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document per-provider behavior for cloud-initiated key rotation: Azure Key Vault requires explicit keyVersion (always detectable), AWS KMS automatic rotation keeps same ARN (not detectable but safe), IBM Cloud has explicit fields, AESCBC is user-managed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

csrwng · 2026-05-01T19:02:48Z

Thank you for the latest changes @muraee !

Some general comments:

We shouldn't rely on VAP for enforcing the behavior we want. I think we should build the feature assuming that there is no VAP and only add the VAP as a nice API UX improvement. To this end, we should have a way of ignoring a change to the spec while we're in the process of migrating storage. Or immediately accept the new spec, but keep the previously incomplete provider config so we can decrypt it and safely migrate to the new one.
We're not yet addressing the issue of breaking existing instances of kube apiserver while adding a new encryption provider. If we immediately start encrypting with whatever the user put in the spec, and roll that change out to the kube apiserver, only the first instance where the change is applied will be available, while the other 2 instances will potentially crash loop until their config changes. Rolling out a new key/provider should be done in 2 stages.
While we're deprecating the back up key configuration, we cannot ignore it completely. If you have a back up key in your spec, we need to include it in the KAS's configuration so we can decrypt with it.
For the status, it seems that what we need is a list of encryption configurations instead of just one active one. That would allow us to keep track of back up keys, incomplete migrations, etc. and also keep track of the state when rolling out a new write configuration.

Redesign the re-encryption workflow to use a safe two-stage KAS rollout: Stage 1 adds the new key as a read-only provider (all replicas can decrypt the new key before any writes with it), Stage 2 promotes it to write provider. This prevents decryption failures during rolling updates. Introduce a rolloutPhase-driven state machine (ReadOnlyDeploy → WritePromote → Migrating) in status.secretEncryption, with a snapshotted targetKey that the controller uses for the duration of a rotation. Mid-rotation spec changes are safely queued at the controller level (VAP-independent), completing the current rotation before starting a new one. Reframe the VAP as a UX improvement (clear admission-time error) rather than a safety requirement. Clarify that spec.backupKey can only be honored as a fallback when status.activeKey is nil, since during a rotation both KMS sidecar slots are occupied by activeKey and targetKey. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

muraee · 2026-05-04T13:40:38Z

Thanks for the thorough review @csrwng! I've updated the enhancement to address all four points. Here's a summary of the changes:

Point 1 (VAP-independent safety): The system is now designed to be safe without the VAP. When a rotation is in progress (rolloutPhase is not empty), the HCCO controller continues with the snapshotted targetKey and ignores subsequent spec changes. Once the current rotation completes, the controller detects the new spec key mismatch and starts a fresh rotation. The VAP is reframed as a UX improvement — it gives users a clear admission-time error ("wait for re-encryption to complete") rather than silently queuing the change.

Point 2 (Two-stage KAS rollout): The workflow now uses a three-phase state machine driven by status.secretEncryption.rolloutPhase:

Stage 1 (ReadOnlyDeploy): The new key is added as a read-only provider. The old key remains the write provider. During this rolling update, no pod writes with the new key, so all reads succeed across all replicas.
Stage 2 (WritePromote): The new key is promoted to write provider, old key becomes read-only. Pods still on the previous config already have the new key as a read-only provider (from Stage 1), so they can decrypt data written by new-config pods.
Stage 3 (Migrating): All KAS replicas have converged — StorageVersionMigration CRs are created to re-encrypt existing data.

The HCCO waits for full KAS convergence (updatedReplicas == replicas == readyReplicas) between each phase transition.

Point 3 (backupKey honored): spec.backupKey is used as a fallback when status.secretEncryption.activeKey is not yet set (upgrade transition — status not yet populated). However, once the status is initialized, backupKey cannot be honored alongside the status-driven keys because during a rotation both KMS sidecar slots are occupied by status.activeKey and status.targetKey — there is no third sidecar to service a spec.backupKey. For AESCBC (keys are inline, no sidecar needed) this constraint doesn't strictly apply, but for consistency and since backupKey is deprecated, we treat it uniformly as a transition-period fallback.

Point 4 (list vs single key status): Instead of a full list, the status now uses two named fields — activeKey (confirmed encrypted) and targetKey (being rolled out) — plus a rolloutPhase enum. This maps directly to the two-sidecar hardware constraint for KMS providers while providing the state tracking needed for the rollout and mid-rotation safety. The rolloutPhase field drives the CPO's EncryptionConfiguration generation via a simple lookup table.

During ReadOnlyDeploy (no data encrypted with target key yet), spec changes update targetKey in-place and restart the phase, allowing immediate correction of wrong-key mistakes. During WritePromote or Migrating, spec changes are queued until the current rotation completes (3 simultaneous keys would violate the two-sidecar constraint). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

csrwng

Review of the etcd data re-encryption enhancement. Please see inline comments for detailed feedback.

enxebre · 2026-05-06T08:14:39Z

+This enhancement adds etcd data re-encryption support to HyperShift
+by reusing library-go's `KubeStorageVersionMigrator` struct (which
+creates and monitors `StorageVersionMigration` CRs via the
+`migration.k8s.io/v1alpha1` API) within a new HCCO controller,


did you mean storagemigration.k8s.io/v1beta1? https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/
we wouldn't want to rely on an alpha API for this

@enxebre These are actually two different API groups:

migration.k8s.io/v1alpha1 — out-of-tree CRD from kube-storage-version-migrator, deployed by OpenShift's cluster-kube-storage-version-migrator-operator. This is what library-go's KubeStorageVersionMigrator uses (still current — no PRs to migrate it).

storagemigration.k8s.io/v1beta1 — built-in in-tree API (KEP-4192), promoted to beta in K8s 1.35 = OCP 4.22. Requires the StorageVersionMigrator feature gate.

We're using migration.k8s.io/v1alpha1 because we vendor library-go's KubeStorageVersionMigrator, which handles CR creation, stale detection, annotation tracking, status monitoring, and pruning (~130 lines of production-tested code).

Switching to storagemigration.k8s.io/v1beta1 would mean either waiting for library-go to migrate (no signs of that happening) or reimplementing KubeStorageVersionMigrator from scratch against the new API. Is that something we want to take on, or is using the existing library-go implementation with migration.k8s.io/v1alpha1 acceptable for now?

Move key change detection and two-stage KAS rollout (ReadOnlyDeploy, WritePromote) to the CPO main reconciler, which naturally owns KAS Deployment lifecycle. The HCCO re-encryption controller now only handles the Migrating phase — creating StorageVersionMigration CRs in the guest cluster and monitoring completion. The CPO main reconciler drives the state machine and writes status.secretEncryption fields via Status().Patch(). The adapt function reads rolloutPhase to generate the appropriate EncryptionConfiguration. The handoff to HCCO occurs when the CPO sets rolloutPhase=Migrating after Stage 2 convergence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

KMS sidecars are only configured on KAS today — routes, oauthaccesstokens, and oauthauthorizetokens are listed in KMSEncryptedObjects() but aren't actually encrypted because the OpenShift API servers lack KMS sidecars. AESCBC only encrypting secrets (not configmaps) is also a known gap. The re-encryption controller creates SVMs for whatever KMSEncryptedObjects() returns, so it will automatically cover new resources when the upstream encryption scope is expanded. Expanding the encryption scope and multi-API-server convergence tracking are explicitly listed as non-goals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove top-level RolloutPhase field — the current rotation phase is now history[0].state, following the ControlPlaneVersionStatus pattern where history[0] represents the current operation. History entries are created when a rotation starts (state= ReadOnlyDeploy), updated through phases (WritePromote, Migrating), and finalized (Completed or Interrupted). The CPO's adapt function reads history[0].state to generate the EncryptionConfiguration. API conventions applied: omitzero, +listType=atomic, +kubebuilder:validation markers, lowercase godoc, +k8s:deepcopy-gen. EncryptionMigrationState now covers both phases and outcomes: ReadOnlyDeploy, WritePromote, Migrating, Completed, Interrupted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add CEL union discriminator validation to SecretEncryptionKeyStatus (provider field gates which sub-struct is set). Apply lowercase godoc, +required/+optional markers, omitzero, and +k8s:deepcopy-gen to all status types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix Non-Goal 4: backupKey is deprecated but still functional as a fallback when status.activeKey is not yet initialized. Add explicit upgrade strategy for existing clusters: - No backupKey set (never rotated): initialize status.activeKey directly without re-encryption - backupKey set (rotation was in progress): trigger full re-encryption to ensure all data uses the active key, then transition to the status-driven mechanism Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add Goal 6 (metrics for alerting) and Goal 7 (rotation history). Add Prometheus metrics section with three metrics emitted by the HCCO re-encryption controller: migration state gauge, duration histogram, and failure counter. Include suggested alert rules for stuck migrations and persistent failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add SVM CR naming and conflict handling note (Group 11) - Add EncryptionConfiguration YAML examples per phase (Group 13) - Add CVO garbage-collection note for data-plane operator cleanup on upgrade (Group 15) - Add forward compatibility notes for cross-type migration and etcd sharding (Groups 16, 17) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add +union on the struct, +unionDiscriminator on the provider field, and +unionMember on each variant field. Update CEL XValidation rules to match the HyperShift convention (required when matching provider, forbidden otherwise). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move back to a single HCCO controller for the entire re-encryption lifecycle. The phase is now derived from observable state (spec vs status keys, EncryptionConfiguration contents, KAS convergence, SVM completion) on each reconciliation — history[0].state is a recorded reflection, not the source of truth. The CPO's adaptSecretEncryptionConfig() independently derives the correct EncryptionConfiguration from the same observable state, implementing the two-stage rollout without needing a stored phase. This eliminates the CPO main reconciler role added in the previous commit, avoiding split-state conflicts between CPO and HCCO writing to the same status fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CVO does not garbage-collect resources when manifests are removed from a profile. The data-plane operator Deployment is added to ResourcesToRemove() and the HCCO explicitly deletes it from the hosted cluster on upgrade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

IBM Cloud does not want additional control-plane deployments. The control-plane migrator is only deployed on non-IBM platforms via a CPO component platform predicate. On IBM Cloud, the data-plane operator continues to run. For non-IBM platforms, the data-plane operator Deployment is added to resourcesToRemove() which generates a cleanup manifest with release.openshift.io/delete annotation — the CVO in the hosted cluster processes it to delete the data-plane operator. Removes the separate cluster-kube-storage-version-migrator-operator repo change — all changes are now in the HyperShift repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

csrwng · 2026-05-07T17:58:30Z

/approve

openshift-ci · 2026-05-07T17:58:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~enhancements/hypershift/OWNERS~~ [csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-05-07T18:43:17Z

@muraee: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from csrwng and sjenning April 9, 2026 16:17

muraee force-pushed the hypershift-etcd-reencryption branch from ae1dfec to eabd02a Compare April 10, 2026 09:33

bryan-cox reviewed Apr 10, 2026

View reviewed changes

muraee force-pushed the hypershift-etcd-reencryption branch from eabd02a to 7b4c875 Compare April 10, 2026 14:39

ardaguclu reviewed Apr 13, 2026

View reviewed changes

Comment thread enhancements/hypershift/etcd-data-reencryption-on-key-rotation.md

muraee mentioned this pull request Apr 13, 2026

feat: add etcd data re-encryption after encryption key rotation openshift/hypershift#8219

Open

7 tasks

Enhancement: etcd data re-encryption for key rotation in HyperShift

82ddb23

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

muraee force-pushed the hypershift-etcd-reencryption branch from 7b4c875 to 82ddb23 Compare April 14, 2026 10:12

muraee changed the title ~~Enhancement: etcd data re-encryption for key rotation in HyperShift~~ OCPSTRAT-2527, OCPSTRAT-2540: Enhancement: etcd data re-encryption for key rotation in HyperShift Apr 16, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026

enxebre reviewed Apr 17, 2026

View reviewed changes

Comment thread enhancements/hypershift/etcd-data-reencryption-on-key-rotation.md Outdated

muraee force-pushed the hypershift-etcd-reencryption branch from 5c2acd4 to 4401ea1 Compare April 21, 2026 14:09

csrwng reviewed May 6, 2026

View reviewed changes