CNTRLPLANE-2121: Add KMS key rotation section#2000
CNTRLPLANE-2121: Add KMS key rotation section#2000tjungblu wants to merge 2 commits intoopenshift:masterfrom
Conversation
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
|
@tjungblu: This pull request references CNTRLPLANE-2121 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| #### KMS Key Rotation | ||
|
|
||
| When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement. | ||
| At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption. |
There was a problem hiding this comment.
The plugin renews or rotates DEKs as it sees fit
is this true ?
There was a problem hiding this comment.
i thought the admin initiates the process
There was a problem hiding this comment.
it depends how you configure it, there's an auto rotation option. Probably also not relevant for this section, I'll remove it.
There was a problem hiding this comment.
I think it is better to remove this statement. We don't involve this phase.
Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.
|
|
||
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. |
There was a problem hiding this comment.
are we proposing to add a new rotation controller ?
There was a problem hiding this comment.
what alternatives do you see here? I don't think this should be part of the key_controller and it is not exactly part of the migrationController either.
There was a problem hiding this comment.
In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.
There was a problem hiding this comment.
there's some more discussion in slack, if you want to join Arda: https://redhat-internal.slack.com/archives/C0A0DMK3N7K/p1778138357610079
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. | ||
| - Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. |
There was a problem hiding this comment.
The operator uses it as the durable record by comparing
Today the operators don't interpret StorageVersionMigration resources.
There was a problem hiding this comment.
yeah that's what's being proposed, obviously it doesn't know about the key id today
|
|
||
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. |
There was a problem hiding this comment.
please explain the impact on the existing migrator controller.
please explain the impact on the existing flow:
1. key_controller creates a new key Secret (no migrated-resources annotation yet)
2. migration_controller sees the write key lacks migration annotations and calls EnsureMigration()
3. After migration, stamps the Secret with migrated-timestamp + migrated-resources
please not that the key_controller will not create new key secret for the existing kms provider - updates will be delivered in-place.
There was a problem hiding this comment.
my understanding is that the migration controller is not used, because it only migrates non-KMS-plugin data. There's no encryption config.
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. | ||
| - Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. |
There was a problem hiding this comment.
I think I would appreciate more details on how we will detect key_id change and trigger migration.
There was a problem hiding this comment.
If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.
There was a problem hiding this comment.
Since key_id is not needed in anywhere of the encryption controllers, we may not need to carry it. It is already persisted and sync'ed in operator condition. We just need a signal to initiate migration.
There was a problem hiding this comment.
But we should be careful about the interaction between all controllers to not break the state machine.
For instance, based on the invariants, controllers stop reconciling when there is an active migration. What is the impact of this new annotation on the invariants.
| #### KMS Key Rotation | ||
|
|
||
| When a KMS plugin rotates its `key_id` (KEK), this triggers neither a new encryption key secret nor a new revision. The mechanism for detecting and handling `key_id` rotation is under evaluation and not covered in this enhancement. | ||
| At a high level the API server still encrypts etcd values with a **data encryption key (DEK)** per write, but in KMS mode that DEK is wrapped by the KMS plugin using a **key encryption key (KEK)** held in Vault (Transit). The plugin renews or rotates DEKs as it sees fit; the OpenShift encryption controllers do not track DEK rotation as the plugin makes those rotations opaque during encryption and decryption. |
There was a problem hiding this comment.
I think it is better to remove this statement. We don't involve this phase.
Just a note: Plugin renews/rotates KEK that is used to encrypt/decrypt DEK which is created and managed by kube-apiserver. AFAIK, kube-apiserver has a mechanism to detect any changes of KEK (similar to what we want to do here) and it creates new DEK when it detects KEK is rotated.
| The KMS v2 plugin exposes the effective KEK as `Status.key_id`, for Vault this aligns with the transit key version the plugin is using. The key can be rotated outside of OpenShift, which creates the need to detect changes to the KEK to trigger a storage migration of existing resources to the new key. | ||
|
|
||
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. |
There was a problem hiding this comment.
so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite.
We rely on the convergence of the same revision number for all the apiservers (i.e. kas-o should not perform any action when kas-1, kas-2 are on revision 3 but kas-3 is on revision 2). This mechanism exists today https://github.com/openshift/library-go/blob/32460ef097300b3b53cc336e72636c9230c7bb93/pkg/operator/encryption/controllers/migration_controller.go#L210-L214
There was a problem hiding this comment.
this is true for the plugin configuration, but the KEK is an potentially inconsistent object that you query from yet another system.
So it may be that kas-1 sees KEK1, but kas-2 sees KEK2 already. This convergence is not facilitated with static pod revisions.
When we own the key, that's obviously working the way you describe correctly.
|
|
||
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. |
There was a problem hiding this comment.
In my opinion, migration controller remains the sole controller that creates StorageVersionMigration. Rotation controller can signal something and migration controller initiates migration. For instance, did we explore the idea of removal of the migrated annotations from the key Secrets by rotation controller, so that maybe mechanism simply works.
| The detection works like this: | ||
| - The plugin monitor sidecar periodically calls KMS v2 `Status` for health checks and writes the observed `key_id` along into the operator status condition. Reports are per KMS plugin, per APIServer and per control-plane instance so the operator can tell whether all API servers see the same KEK version before it starts any cluster-wide rewrite. | ||
| - When all status conditions agree on a new `key_id`, a rotation controller creates `StorageVersionMigration` objects and let the `kube-storage-version-migrator` rewrite stored objects. | ||
| - Each migration object carries one operator-owned annotation, for example `encryption.apiserver.operator.openshift.io/kms-key-id`, set to the plugins `key_id` at the time the migration was requested. The migrator does not interpret it. The operator uses it as the durable record by comparing the live agreed `key_id` to the annotation on the latest successful migration for that resource. |
There was a problem hiding this comment.
If removal of the migration annotation does not work properly, this new annotation (i.e. encryption.apiserver.operator.openshift.io/kms-key-id) might make sense. But we should be careful about the interaction between all controllers to not break the state machine.
I agree that we need more detail to discuss.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
|
@tjungblu: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
No description provided.