CNTRLPLANE-2121: Add KMS key rotation section by tjungblu · Pull Request #2000 · openshift/enhancements

tjungblu · 2026-05-07T07:18:20Z

No description provided.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

openshift-ci-robot · 2026-05-07T07:18:24Z

@tjungblu: This pull request references CNTRLPLANE-2121 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-07T07:18:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

p0lyn0mial · 2026-05-07T08:04:33Z

+#### KMS Key Rotation

-When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement.
+At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption.


The plugin renews or rotates DEKs as it sees fit

is this true ?

i thought the admin initiates the process

it depends how you configure it, there's an auto rotation option. Probably also not relevant for this section, I'll remove it.

I think it is better to remove this statement. We don't involve this phase.

Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.

p0lyn0mial · 2026-05-07T08:21:44Z

+
+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.


are we proposing to add a new rotation controller ?

what alternatives do you see here? I don't think this should be part of the key_controller and it is not exactly part of the migrationController either.

In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.

there's some more discussion in slack, if you want to join Arda: https://redhat-internal.slack.com/archives/C0A0DMK3N7K/p1778138357610079

p0lyn0mial · 2026-05-07T08:24:23Z

+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
+- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. 


The operator uses it as the durable record by comparing

Today the operators don't interpret StorageVersionMigration resources.

yeah that's what's being proposed, obviously it doesn't know about the key id today

p0lyn0mial · 2026-05-07T08:31:00Z

+
+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.


please explain the impact on the existing migrator controller.
please explain the impact on the existing flow:

1. key_controller creates a new key Secret (no migrated-resources annotation yet) 2. migration_controller sees the write key lacks migration annotations and calls EnsureMigration() 3. After migration, stamps the Secret with migrated-timestamp + migrated-resources

please not that the key_controller will not create new key secret for the existing kms provider - updates will be delivered in-place.

my understanding is that the migration controller is not used, because it only migrates non-KMS-plugin data. There's no encryption config.

p0lyn0mial · 2026-05-07T08:31:29Z

+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
+- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. 


I think I would appreciate more details on how we will detect key_id change and trigger migration.

If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.

Since key_id is not needed in anywhere of the encryption controllers, we may not need to carry it. It is already persisted and sync'ed in operator condition. We just need a signal to initiate migration.

But we should be careful about the interaction between all controllers to not break the state machine.

For instance, based on the invariants, controllers stop reconciling when there is an active migration. What is the impact of this new annotation on the invariants.

ardaguclu · 2026-05-07T09:34:38Z

+#### KMS Key Rotation

-When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement.
+At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption.


I think it is better to remove this statement. We don't involve this phase.

Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.

ardaguclu · 2026-05-07T09:38:35Z

+The KMS v2 plugin exposes the effective KEK as `Status.key_id`, for Vault this aligns with the transit key version the plugin is using. The key can be rotated outside of OpenShift, which creates the need to detect changes to the KEK to trigger a storage migration of existing resources to the new key.
+
+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.


so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.

We rely on the convergence of the same revision number for all the apiservers (i.e. kas-o should not perform any action when kas-1, kas-2 are on revision 3 but kas-3 is on revision 2). This mechanism exists today https://github.com/openshift/library-go/blob/32460ef097300b3b53cc336e72636c9230c7bb93/pkg/operator/encryption/controllers/migration_controller.go#L210-L214

this is true for the plugin configuration, but the KEK is an potentially inconsistent object that you query from yet another system.

So it may be that kas-1 sees KEK1, but kas-2 sees KEK2 already. This convergence is not facilitated with static pod revisions.

When we own the key, that's obviously working the way you describe correctly.

ardaguclu · 2026-05-07T09:41:14Z

+
+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.


In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.

ardaguclu · 2026-05-07T09:48:47Z

+The detection works like this:
+- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
+- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
+- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. 


If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

openshift-ci · 2026-05-08T13:17:28Z

@tjungblu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

CNTRLPLANE-2121: Add KMS key rotation section

0a7ffc5

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026

openshift-ci Bot requested review from joepvd and sadasu May 7, 2026 07:18

p0lyn0mial reviewed May 7, 2026

View reviewed changes

ardaguclu reviewed May 7, 2026

View reviewed changes

revised after todays meeting

d5c7067

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

tjungblu force-pushed the kms_rotation branch from 74d0289 to d5c7067 Compare May 8, 2026 13:05

Conversation

tjungblu commented May 7, 2026

Uh oh!

openshift-ci-robot commented May 7, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

openshift-ci-robot commented May 7, 2026 •

edited by openshift-ci Bot

Loading