OCPSTRAT-2527, OCPSTRAT-2540: Enhancement: etcd data re-encryption for key rotation in HyperShift#1969
OCPSTRAT-2527, OCPSTRAT-2540: Enhancement: etcd data re-encryption for key rotation in HyperShift#1969muraee wants to merge 17 commits intoopenshift:masterfrom
Conversation
ae1dfec to
eabd02a
Compare
eabd02a to
7b4c875
Compare
ardaguclu
left a comment
There was a problem hiding this comment.
I specifically focused on Why KubeStorageVersionMigrator Instead of MigrationController section. It looks good to me. I dropped a comment more about agreement instead of any objection.
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7b4c875 to
82ddb23
Compare
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a re-encryption controller in the HCCO that triggers StorageVersionMigration after an encryption key rotation, ensuring all existing etcd data is re-encrypted with the new active key. Components: - API: EtcdDataEncryptionUpToDate condition type and reasons - CPO: key fingerprint computation and rekey-needed annotation on kas-secret-encryption-config secret - HCCO: new reencryption controller using library-go's KubeStorageVersionMigrator to drive StorageVersionMigration CRs - HyperShift Operator: condition bubble-up from HCP to HostedCluster Ref: OCPSTRAT-2527, OCPSTRAT-2540 Enhancement: openshift/enhancements#1969 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@muraee: This pull request references OCPSTRAT-2527 which is a valid jira issue. This pull request references OCPSTRAT-2540 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Just a couple of thoughts for kms in general:
|
|
Could we reflect in HostedCluster status which keys are actively being used and reject spec changes that could potentially result in data loss? |
…ackupKey - Deploy kube-storage-version-migrator in HCP namespace (control plane) instead of data plane, enabling re-encryption with zero worker nodes - Disable data-plane operator via annotation removal in cluster-kube-storage-version-migrator-operator repo - Add status.secretEncryption.activeKey field to HC/HCP with full key spec for rotation detection and EncryptionConfiguration resilience - Deprecate backupKey spec fields in favor of status-based tracking - Update workflow, architecture, risks, and support procedures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5c2acd4 to
4401ea1
Compare
Document per-provider behavior for cloud-initiated key rotation: Azure Key Vault requires explicit keyVersion (always detectable), AWS KMS automatic rotation keeps same ARN (not detectable but safe), IBM Cloud has explicit fields, AESCBC is user-managed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thank you for the latest changes @muraee ! Some general comments:
|
Redesign the re-encryption workflow to use a safe two-stage KAS rollout: Stage 1 adds the new key as a read-only provider (all replicas can decrypt the new key before any writes with it), Stage 2 promotes it to write provider. This prevents decryption failures during rolling updates. Introduce a rolloutPhase-driven state machine (ReadOnlyDeploy → WritePromote → Migrating) in status.secretEncryption, with a snapshotted targetKey that the controller uses for the duration of a rotation. Mid-rotation spec changes are safely queued at the controller level (VAP-independent), completing the current rotation before starting a new one. Reframe the VAP as a UX improvement (clear admission-time error) rather than a safety requirement. Clarify that spec.backupKey can only be honored as a fallback when status.activeKey is nil, since during a rotation both KMS sidecar slots are occupied by activeKey and targetKey. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thanks for the thorough review @csrwng! I've updated the enhancement to address all four points. Here's a summary of the changes: Point 1 (VAP-independent safety): The system is now designed to be safe without the VAP. When a rotation is in progress ( Point 2 (Two-stage KAS rollout): The workflow now uses a three-phase state machine driven by
The HCCO waits for full KAS convergence ( Point 3 (backupKey honored): Point 4 (list vs single key status): Instead of a full list, the status now uses two named fields — |
During ReadOnlyDeploy (no data encrypted with target key yet), spec changes update targetKey in-place and restart the phase, allowing immediate correction of wrong-key mistakes. During WritePromote or Migrating, spec changes are queued until the current rotation completes (3 simultaneous keys would violate the two-sidecar constraint). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
csrwng
left a comment
There was a problem hiding this comment.
Review of the etcd data re-encryption enhancement. Please see inline comments for detailed feedback.
| This enhancement adds etcd data re-encryption support to HyperShift | ||
| by reusing library-go's `KubeStorageVersionMigrator` struct (which | ||
| creates and monitors `StorageVersionMigration` CRs via the | ||
| `migration.k8s.io/v1alpha1` API) within a new HCCO controller, |
There was a problem hiding this comment.
did you mean storagemigration.k8s.io/v1beta1? https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/
we wouldn't want to rely on an alpha API for this
There was a problem hiding this comment.
@enxebre These are actually two different API groups:
migration.k8s.io/v1alpha1— out-of-tree CRD fromkube-storage-version-migrator, deployed by OpenShift'scluster-kube-storage-version-migrator-operator. This is what library-go'sKubeStorageVersionMigratoruses (still current — no PRs to migrate it).storagemigration.k8s.io/v1beta1— built-in in-tree API (KEP-4192), promoted to beta in K8s 1.35 = OCP 4.22. Requires theStorageVersionMigratorfeature gate.
We're using migration.k8s.io/v1alpha1 because we vendor library-go's KubeStorageVersionMigrator, which handles CR creation, stale detection, annotation tracking, status monitoring, and pruning (~130 lines of production-tested code).
Switching to storagemigration.k8s.io/v1beta1 would mean either waiting for library-go to migrate (no signs of that happening) or reimplementing KubeStorageVersionMigrator from scratch against the new API. Is that something we want to take on, or is using the existing library-go implementation with migration.k8s.io/v1alpha1 acceptable for now?
Move key change detection and two-stage KAS rollout (ReadOnlyDeploy, WritePromote) to the CPO main reconciler, which naturally owns KAS Deployment lifecycle. The HCCO re-encryption controller now only handles the Migrating phase — creating StorageVersionMigration CRs in the guest cluster and monitoring completion. The CPO main reconciler drives the state machine and writes status.secretEncryption fields via Status().Patch(). The adapt function reads rolloutPhase to generate the appropriate EncryptionConfiguration. The handoff to HCCO occurs when the CPO sets rolloutPhase=Migrating after Stage 2 convergence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KMS sidecars are only configured on KAS today — routes, oauthaccesstokens, and oauthauthorizetokens are listed in KMSEncryptedObjects() but aren't actually encrypted because the OpenShift API servers lack KMS sidecars. AESCBC only encrypting secrets (not configmaps) is also a known gap. The re-encryption controller creates SVMs for whatever KMSEncryptedObjects() returns, so it will automatically cover new resources when the upstream encryption scope is expanded. Expanding the encryption scope and multi-API-server convergence tracking are explicitly listed as non-goals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove top-level RolloutPhase field — the current rotation phase is now history[0].state, following the ControlPlaneVersionStatus pattern where history[0] represents the current operation. History entries are created when a rotation starts (state= ReadOnlyDeploy), updated through phases (WritePromote, Migrating), and finalized (Completed or Interrupted). The CPO's adapt function reads history[0].state to generate the EncryptionConfiguration. API conventions applied: omitzero, +listType=atomic, +kubebuilder:validation markers, lowercase godoc, +k8s:deepcopy-gen. EncryptionMigrationState now covers both phases and outcomes: ReadOnlyDeploy, WritePromote, Migrating, Completed, Interrupted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CEL union discriminator validation to SecretEncryptionKeyStatus (provider field gates which sub-struct is set). Apply lowercase godoc, +required/+optional markers, omitzero, and +k8s:deepcopy-gen to all status types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix Non-Goal 4: backupKey is deprecated but still functional as a fallback when status.activeKey is not yet initialized. Add explicit upgrade strategy for existing clusters: - No backupKey set (never rotated): initialize status.activeKey directly without re-encryption - backupKey set (rotation was in progress): trigger full re-encryption to ensure all data uses the active key, then transition to the status-driven mechanism Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Goal 6 (metrics for alerting) and Goal 7 (rotation history). Add Prometheus metrics section with three metrics emitted by the HCCO re-encryption controller: migration state gauge, duration histogram, and failure counter. Include suggested alert rules for stuck migrations and persistent failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add SVM CR naming and conflict handling note (Group 11) - Add EncryptionConfiguration YAML examples per phase (Group 13) - Add CVO garbage-collection note for data-plane operator cleanup on upgrade (Group 15) - Add forward compatibility notes for cross-type migration and etcd sharding (Groups 16, 17) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add +union on the struct, +unionDiscriminator on the provider field, and +unionMember on each variant field. Update CEL XValidation rules to match the HyperShift convention (required when matching provider, forbidden otherwise). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move back to a single HCCO controller for the entire re-encryption lifecycle. The phase is now derived from observable state (spec vs status keys, EncryptionConfiguration contents, KAS convergence, SVM completion) on each reconciliation — history[0].state is a recorded reflection, not the source of truth. The CPO's adaptSecretEncryptionConfig() independently derives the correct EncryptionConfiguration from the same observable state, implementing the two-stage rollout without needing a stored phase. This eliminates the CPO main reconciler role added in the previous commit, avoiding split-state conflicts between CPO and HCCO writing to the same status fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CVO does not garbage-collect resources when manifests are removed from a profile. The data-plane operator Deployment is added to ResourcesToRemove() and the HCCO explicitly deletes it from the hosted cluster on upgrade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IBM Cloud does not want additional control-plane deployments. The control-plane migrator is only deployed on non-IBM platforms via a CPO component platform predicate. On IBM Cloud, the data-plane operator continues to run. For non-IBM platforms, the data-plane operator Deployment is added to resourcesToRemove() which generates a cleanup manifest with release.openshift.io/delete annotation — the CVO in the hosted cluster processes it to delete the data-plane operator. Removes the separate cluster-kube-storage-version-migrator-operator repo change — all changes are now in the HyperShift repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@muraee: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
KubeStorageVersionMigratorfrom library-go to createStorageVersionMigrationCRs in the guest cluster, transparently re-encrypting all encrypted resources with the active keyEtcdDataEncryptionUpToDatecondition on HCP/HostedCluster for progress trackingTracks: OCPSTRAT-2527, OCPSTRAT-2540
Related: ARO-21568, ARO-21456
Test plan
🤖 Generated with Claude Code