Skip to content

CNTRLPLANE-2121: Add KMS key rotation section#2000

Open
tjungblu wants to merge 2 commits intoopenshift:masterfrom
tjungblu:kms_rotation
Open

CNTRLPLANE-2121: Add KMS key rotation section#2000
tjungblu wants to merge 2 commits intoopenshift:masterfrom
tjungblu:kms_rotation

Conversation

@tjungblu
Copy link
Copy Markdown
Contributor

@tjungblu tjungblu commented May 7, 2026

No description provided.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 7, 2026

@tjungblu: This pull request references CNTRLPLANE-2121 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from joepvd and sadasu May 7, 2026 07:18
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

#### KMS Key Rotation

When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement.
At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin renews or rotates DEKs as it sees fit

is this true ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought the admin initiates the process

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it depends how you configure it, there's an auto rotation option. Probably also not relevant for this section, I'll remove it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to remove this statement. We don't involve this phase.

Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.


The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we proposing to add a new rotation controller ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what alternatives do you see here? I don't think this should be part of the key_controller and it is not exactly part of the migrationController either.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's some more discussion in slack, if you want to join Arda: https://redhat-internal.slack.com/archives/C0A0DMK3N7K/p1778138357610079

The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator uses it as the durable record by comparing

Today the operators don't interpret StorageVersionMigration resources.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's what's being proposed, obviously it doesn't know about the key id today


The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explain the impact on the existing migrator controller.
please explain the impact on the existing flow:

1. key_controller creates a new key Secret (no migrated-resources annotation yet)
2. migration_controller sees the write key lacks migration annotations and calls EnsureMigration()
3. After migration, stamps the Secret with migrated-timestamp + migrated-resources

please not that the key_controller will not create new key secret for the existing kms provider - updates will be delivered in-place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that the migration controller is not used, because it only migrates non-KMS-plugin data. There's no encryption config.

The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would appreciate more details on how we will detect key_id change and trigger migration.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since key_id is not needed in anywhere of the encryption controllers, we may not need to carry it. It is already persisted and sync'ed in operator condition. We just need a signal to initiate migration.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we should be careful about the interaction between all controllers to not break the state machine.

For instance, based on the invariants, controllers stop reconciling when there is an active migration. What is the impact of this new annotation on the invariants.

#### KMS Key Rotation

When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement.
At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to remove this statement. We don't involve this phase.

Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.

The KMS v2 plugin exposes the effective KEK as `Status.key_id`, for Vault this aligns with the transit key version the plugin is using. The key can be rotated outside of OpenShift, which creates the need to detect changes to the KEK to trigger a storage migration of existing resources to the new key.

The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.

We rely on the convergence of the same revision number for all the apiservers (i.e. kas-o should not perform any action when kas-1, kas-2 are on revision 3 but kas-3 is on revision 2). This mechanism exists today https://github.com/openshift/library-go/blob/32460ef097300b3b53cc336e72636c9230c7bb93/pkg/operator/encryption/controllers/migration_controller.go#L210-L214

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true for the plugin configuration, but the KEK is an potentially inconsistent object that you query from yet another system.

So it may be that kas-1 sees KEK1, but kas-2 sees KEK2 already. This convergence is not facilitated with static pod revisions.

When we own the key, that's obviously working the way you describe correctly.


The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.

The detection works like this:
- The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
- When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects.
- Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

@tjungblu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants