
AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2NodeClass#7916

Closed
jkyros wants to merge 7 commits into openshift:main from jkyros:autoscale-558-expose-kubelet-config

Conversation

@jkyros
Member

@jkyros jkyros commented Mar 11, 2026

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Converts it into ignition, merges it with the existing default taint KubeletConfig, and adds it to the in-memory nodepool's list of MachineConfig object references
  • Much like Highlander, there can be only one (KubeletConfig) in the MachineConfig pool

Because:

  • Since we're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
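As a rough illustration of the merge described above (all type and function names here are hypothetical sketches, not the actual HyperShift code), the user's kubelet settings could be rendered over the base config so that user-supplied values win on collision:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KubeletConfiguration is a hypothetical minimal shape of the user-facing
// spec.kubelet settings; the real API type has many more fields.
type KubeletConfiguration struct {
	MaxPods     *int32 `json:"maxPods,omitempty"`
	CPUCFSQuota *bool  `json:"cpuCFSQuota,omitempty"`
}

// renderKubeletConfigJSON merges the user's settings over a base config
// (e.g. the default taint KubeletConfig), with user values taking precedence,
// and returns the merged config as JSON suitable for embedding in ignition.
func renderKubeletConfigJSON(base map[string]any, user KubeletConfiguration) (string, error) {
	merged := map[string]any{}
	for k, v := range base {
		merged[k] = v
	}
	if user.MaxPods != nil {
		merged["maxPods"] = *user.MaxPods
	}
	if user.CPUCFSQuota != nil {
		merged["cpuCFSQuota"] = *user.CPUCFSQuota
	}
	out, err := json.Marshal(merged)
	return string(out), err
}

func main() {
	maxPods := int32(500)
	base := map[string]any{"kind": "KubeletConfiguration", "maxPods": 250}
	s, _ := renderKubeletConfigJSON(base, KubeletConfiguration{MaxPods: &maxPods})
	fmt.Println(s)
}
```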

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I moved the base taint configMap out of the hostedcluster controller and into the Karpenter operator.
    • The thinking was it probably belonged to karpenter-operator anyway, and if I could access it as a "constant" I didn't need to request its contents from the cluster every reconcile to merge with KubeletConfig
  • The e2e for KubeletConfig might be a little extra with the fixture, but I forgot that the MCO splits systemReserved into a separate env file and thought I'd broken it because it was missing -- I wrote the fixture to figure that out

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added kubelet configuration support for OpenShiftEC2NodeClass, enabling customization of kubelet parameters including pod density, resource reservations, eviction policies, image garbage collection settings, and CPU CFS quota.
  • Tests

    • Added comprehensive test coverage for kubelet configuration reconciliation and end-to-end validation.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Mar 11, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision ( say, there is a KubeletConfig in config ) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Mar 11, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 11, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci Bot added the area/api, area/control-plane-operator, area/hypershift-operator, area/karpenter-operator, and area/testing labels, and removed the do-not-merge/needs-area label, on Mar 11, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 88e365ea-88b2-4cfc-8d7b-8999eb6e5edf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR extends Karpenter's OpenShift EC2 node class configuration to support kubelet settings management. It introduces a new KubeletConfiguration type to the v1beta1 API, allowing users to specify kubelet parameters via OpenshiftEC2NodeClass.spec.kubelet. The karpenterignition controller reconciles kubelet config ConfigMaps containing merged kubelet settings, while the karpenter controller manages base taint configuration. These configurations are propagated to nodes via MachineConfig. Supporting utilities, client apply-configurations, CRD schema updates, and comprehensive unit and E2E tests are included to validate kubelet propagation to provisioned nodes.
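The "createInMemoryNodePool() includes kubelet ConfigRef" step in this flow can be sketched roughly as follows; the helper, ConfigMap naming scheme, and reference type below are assumptions for illustration, not the actual controller code:

```go
package main

import "fmt"

// kubeletConfigMapName is a hypothetical naming helper; the real controller
// may name the per-nodeclass kubelet ConfigMap differently.
func kubeletConfigMapName(nodeClassName string) string {
	return "karpenter-kubelet-config-" + nodeClassName
}

// LocalObjectReference mirrors the core/v1 shape commonly used for
// NodePool config references.
type LocalObjectReference struct {
	Name string
}

// withKubeletConfigRef appends the per-nodeclass kubelet ConfigMap reference
// to the in-memory NodePool's existing list of MachineConfig references.
func withKubeletConfigRef(refs []LocalObjectReference, nodeClassName string) []LocalObjectReference {
	return append(refs, LocalObjectReference{Name: kubeletConfigMapName(nodeClassName)})
}

func main() {
	refs := []LocalObjectReference{{Name: "karpenter-base-taints"}}
	refs = withKubeletConfigRef(refs, "default")
	fmt.Println(refs[len(refs)-1].Name)
}
```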

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant KarpenterIgnition as KarpenterIgnition<br/>Controller
    participant Management as Management Cluster<br/>(ConfigMap)
    participant MachineConfig as MachineConfig<br/>System
    participant Node as Provisioned<br/>Node

    User->>KarpenterIgnition: Create/Update OpenshiftEC2NodeClass<br/>with spec.kubelet config
    activate KarpenterIgnition
    KarpenterIgnition->>KarpenterIgnition: Merge user kubelet config<br/>with base taint configuration
    KarpenterIgnition->>Management: Reconcile kubelet config<br/>ConfigMap (per-NodeClass)
    deactivate KarpenterIgnition
    
    activate Management
    Management->>Management: Store merged kubelet<br/>configuration as JSON
    Management->>MachineConfig: ConfigMap propagates to<br/>MachineConfig system
    deactivate Management
    
    User->>KarpenterIgnition: Create NodePool referencing<br/>OpenshiftEC2NodeClass
    activate KarpenterIgnition
    KarpenterIgnition->>KarpenterIgnition: createInMemoryNodePool()<br/>includes kubelet ConfigRef
    KarpenterIgnition->>Node: Provision new node<br/>with kubelet config reference
    deactivate KarpenterIgnition
    
    activate Node
    Node->>Management: Retrieve kubelet config<br/>from ConfigMap
    Node->>Node: Apply kubelet configuration<br/>(MaxPods, CPUCFSQuota, etc.)
    deactivate Node
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 30.77%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Structure And Quality — ⚠️ Warning. The e2e test violates single responsibility by combining numerous unrelated behaviors sequentially without meaningful assertion failure messages. Resolution: refactor the e2e test into multiple independent subtests using nested t.Run blocks and add meaningful failure messages to all assertions.
✅ Passed checks (3 passed)
  • Title check — ✅ Passed. The pull request title accurately summarizes the main change: exposing KubeletConfig on OpenShiftEC2NodeClass, which is the primary objective reflected throughout the changeset.
  • Stable And Deterministic Test Names — ✅ Passed. All test function names and test block titles (via t.Run()) are static and deterministic with no dynamic values.
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jkyros
Member Author

jkyros commented Mar 11, 2026

/test e2e-aws-autonode

@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision ( say, there is a KubeletConfig in config ) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Added kubelet configuration options to OpenShiftEC2NodeClass, including MaxPods, resource reservations, eviction policies, and image garbage collection settings.

  • Added support for referencing ConfigMaps containing additional machine configurations per node class.

  • Chores

  • Updated dependencies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
test/e2e/assets/karpenter-config-checker-ds.yaml (1)

19-19: Consider pinning the alpine image tag for reproducibility.

Same suggestion as karpenter-kubelet-checker-ds.yaml - pin to a specific version for test stability.

♻️ Suggested fix
-        image: alpine
+        image: alpine:3.21
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/assets/karpenter-config-checker-ds.yaml` at line 19, The image
currently uses an unpinned tag ("image: alpine"); change it to a specific pinned
tag (for example the same alpine:<version> used in
karpenter-kubelet-checker-ds.yaml) to ensure reproducible tests, updating the
image field in the karpenter-config-checker daemonset (the "image: alpine" line)
to the exact version string.
test/e2e/assets/karpenter-kubelet-checker-ds.yaml (1)

19-19: Consider pinning the alpine image tag for reproducibility.

Using alpine without a version tag means tests may behave differently over time as the latest tag changes. For e2e test stability, consider pinning to a specific version.

♻️ Suggested fix
-        image: alpine
+        image: alpine:3.21
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml` at line 19, The container
image is unpinned ("image: alpine"), which can cause non-deterministic test
behavior; update the YAML resource so the image field uses an explicit,
immutable tag (e.g., change "image: alpine" to a specific version like "image:
alpine:3.18" or another vetted version) to ensure reproducible e2e runs and pin
stability for the karpenter-kubelet-checker daemonset.
karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go (1)

260-261: Hardcoded AMD64 architecture.

The Arch field is hardcoded to ArchitectureAMD64. If ARM64 support is planned for Karpenter-managed nodes, this should be parameterized or derived from the NodeClass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`
around lines 260 - 261, The Arch field is hardcoded to
hyperv1.ArchitectureAMD64; change it to derive from the NodeClass/NodeClassSpec
rather than a constant: read the architecture value from the NodeClass (e.g.,
nodeClass.Spec.Architecture or equivalent) or accept it as a parameter to the
function that builds the config (where Config: configRefs is set),
validate/normalize values (AMD64/ARM64) and fall back to a safe default if
unspecified, and replace the literal hyperv1.ArchitectureAMD64 with that
derived/parameterized value so Karpenter-managed nodes can support ARM64.
karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml (1)

202-276: Consider adding validation rules for kubelet thresholds.

The spec.kubelet schema is well-structured, but consider adding CEL validation rules to enforce invariants such as:

  • imageGCHighThresholdPercent > imageGCLowThresholdPercent
  • Threshold percentages within 0-100 range

This would provide earlier feedback to users rather than failing at kubelet startup.
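For illustration only, the suggested invariant could be attached in the CRD's Go types with a kubebuilder CEL marker (field names are taken from the review comment; the exact rule placement would need verification against the real types). The same check is shown as plain Go below:

```go
package main

import "fmt"

// A CEL rule of the kind suggested could be expressed with a kubebuilder
// marker on the kubelet struct (hypothetical placement), e.g.:
//
//	// +kubebuilder:validation:XValidation:rule="!has(self.imageGCHighThresholdPercent) || !has(self.imageGCLowThresholdPercent) || self.imageGCHighThresholdPercent > self.imageGCLowThresholdPercent",message="imageGCHighThresholdPercent must be greater than imageGCLowThresholdPercent"
//
// validImageGCThresholds implements the same invariant in plain Go:
// both thresholds must be in [0, 100] and high must exceed low.
func validImageGCThresholds(high, low int32) bool {
	inRange := func(v int32) bool { return v >= 0 && v <= 100 }
	return inRange(high) && inRange(low) && high > low
}

func main() {
	fmt.Println(validImageGCThresholds(85, 80)) // valid pair
	fmt.Println(validImageGCThresholds(80, 85)) // invalid: high <= low
}
```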

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml`
around lines 202 - 276, The CRD lacks CEL validation for kubelet thresholds; add
CEL rules under the NodeClass CRD validation for spec.kubelet (fields
imageGCHighThresholdPercent and imageGCLowThresholdPercent and any other percent
thresholds) to enforce 0 <= imageGCLowThresholdPercent <= 100, 0 <=
imageGCHighThresholdPercent <= 100, and imageGCHighThresholdPercent >
imageGCLowThresholdPercent (and similar invariants for other *_ThresholdPercent
fields if present); place these expressions in the validation.rules (or the
validation.openAPIV3Schema.x-kubernetes-validations block) next to the kubelet
properties to provide early user feedback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml`:
- Around line 202-276: The CRD lacks CEL validation for kubelet thresholds; add
CEL rules under the NodeClass CRD validation for spec.kubelet (fields
imageGCHighThresholdPercent and imageGCLowThresholdPercent and any other percent
thresholds) to enforce 0 <= imageGCLowThresholdPercent <= 100, 0 <=
imageGCHighThresholdPercent <= 100, and imageGCHighThresholdPercent >
imageGCLowThresholdPercent (and similar invariants for other *_ThresholdPercent
fields if present); place these expressions in the validation.rules (or the
validation.openAPIV3Schema.x-kubernetes-validations block) next to the kubelet
properties to provide early user feedback.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`:
- Around line 260-261: The Arch field is hardcoded to hyperv1.ArchitectureAMD64;
change it to derive from the NodeClass/NodeClassSpec rather than a constant:
read the architecture value from the NodeClass (e.g.,
nodeClass.Spec.Architecture or equivalent) or accept it as a parameter to the
function that builds the config (where Config: configRefs is set),
validate/normalize values (AMD64/ARM64) and fall back to a safe default if
unspecified, and replace the literal hyperv1.ArchitectureAMD64 with that
derived/parameterized value so Karpenter-managed nodes can support ARM64.

In `@test/e2e/assets/karpenter-config-checker-ds.yaml`:
- Line 19: The image currently uses an unpinned tag ("image: alpine"); change it
to a specific pinned tag (for example the same alpine:<version> used in
karpenter-kubelet-checker-ds.yaml) to ensure reproducible tests, updating the
image field in the karpenter-config-checker daemonset (the "image: alpine" line)
to the exact version string.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml`:
- Line 19: The container image is unpinned ("image: alpine"), which can cause
non-deterministic test behavior; update the YAML resource so the image field
uses an explicit, immutable tag (e.g., change "image: alpine" to a specific
version like "image: alpine:3.18" or another vetted version) to ensure
reproducible e2e runs and pin stability for the karpenter-kubelet-checker
daemonset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ca5591d-b704-486e-95cb-aa9dc52a18ff

📥 Commits

Reviewing files that changed from the base of the PR and between 8403103 and 4d69815.

⛔ Files ignored due to path filters (4)
  • api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/karpenter_types.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/kubeletconfig.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/vendor/**, !vendor/**, !**/zz_generated*
📒 Files selected for processing (14)
  • api/go.mod
  • api/karpenter/v1beta1/karpenter_types.go
  • api/karpenter/v1beta1/kubeletconfig.go
  • api/karpenter/v1beta1/kubeletconfig_test.go
  • client/applyconfiguration/karpenter/v1beta1/kubeletconfiguration.go
  • client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go
  • client/applyconfiguration/utils.go
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • support/karpenter/karpenter.go
  • test/e2e/assets/karpenter-config-checker-ds.yaml
  • test/e2e/assets/karpenter-kubelet-checker-ds.yaml
  • test/e2e/karpenter_test.go

@jkyros
Member Author

jkyros commented Mar 11, 2026

/test e2e-aws-autonode

@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision ( say, there is a KubeletConfig in config ) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Per-nodeclass kubelet configuration added (cluster DNS, MaxPods, PodsPerCore, reservations, eviction and image GC settings).

  • Support for referencing ConfigMaps with additional machine configs per node class.

  • Tests

  • New unit and end-to-end tests and readiness DaemonSets to validate kubelet/config propagation and behavior.

  • Chores

  • Promoted a dependency to a direct requirement.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`:
- Around line 403-442: Add a deletion-handling branch in
KarpenterIgnitionReconciler.Reconcile() that checks if
openshiftEC2NodeClass.DeletionTimestamp != nil, calls
r.reconcileKubeletConfigMap(ctx, hcp, openshiftEC2NodeClass) to ensure the
per-nodeclass ConfigMap is deleted even when Spec.Kubelet is present, then
remove the OpenshiftEC2NodeClass finalizer (the same finalizer string used by
ec2_nodeclass_controller) and update the object via r.ManagementClient before
returning; this mirrors ec2_nodeclass_controller deletion flow and guarantees
cleanup happens prior to finalizer removal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4cafd0cf-a4ae-45ef-a46a-107c49448fcb

📥 Commits

Reviewing files that changed from the base of the PR and between 4d69815 and dd1e22c.

📒 Files selected for processing (2)
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go

@jkyros force-pushed the autoscale-558-expose-kubelet-config branch from dd1e22c to e50993a on March 11, 2026 at 18:33
@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision ( say, there is a KubeletConfig in config ) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Added kubelet configuration support to OpenShift EC2 NodeClass, enabling customization of pod limits (maxPods, podsPerCore), cluster DNS, node resource reservations, eviction policies, image garbage collection thresholds, and CPU CFS quota settings on provisioned nodes.

  • Tests

  • Added comprehensive test coverage for kubelet configuration management and validation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jkyros
Member Author

jkyros commented Mar 11, 2026

/test e2e-aws-autonode

@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision ( say, there is a KubeletConfig in config ) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features
  • Added kubelet configuration capabilities for OpenShift EC2 node classes, enabling per-nodeclass kubelet customization including MaxPods, PodsPerCore, resource reservations, eviction policies, image garbage collection thresholds, CPU CFS quota management, and cluster DNS configuration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (7)
test/e2e/karpenter_test.go (2)

756-759: Minimal NodePool used only for timeout calculation.

The minimalNP with Type: hyperv1.AWSPlatform is a workaround for eventuallyDaemonSetRollsOut which uses the NodePool's platform type to determine timeouts. This is acceptable but could benefit from a brief comment explaining the purpose.

Optional: Add clarifying comment
-			// nodePool arg to eventuallyDaemonSetRollsOut is only used for timeout calculation (KubeVirt gets longer)
+			// nodePool is only used by eventuallyDaemonSetRollsOut for timeout calculation
+			// (KubeVirt platform gets a longer timeout). We use AWS platform for standard timeouts.
 			minimalNP := &hyperv1.NodePool{}
 			minimalNP.Spec.Platform.Type = hyperv1.AWSPlatform
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 756 - 759, Add a brief clarifying
comment above the minimalNP construction explaining it is a minimal NodePool
used solely so eventuallyDaemonSetRollsOut can derive timeout values from the
NodePool platform; reference the minimalNP variable, the hyperv1.NodePool type
and the Spec.Platform.Type assignment, and mention that the value is
intentionally set to hyperv1.AWSPlatform as a workaround for timeout calculation
rather than representing a full NodePool.

740-761: Good use of DaemonSet for runtime kubelet verification.

Using the DaemonSet's readiness probe to verify kubelet configuration is an elegant approach: if the probe fails (maxPods ≠ 500), the pods won't become ready and the test will fail.

However, the checkerDS DaemonSet is not cleaned up after the test. Consider adding a deferred delete to avoid leaving test resources on the cluster.

Suggested cleanup
 		g.Expect(guestClient.Create(ctx, checkerDS)).To(Succeed())
 		t.Logf("Created karpenter-kubelet-checker DaemonSet")
+		defer func() {
+			_ = guestClient.Delete(ctx, checkerDS)
+		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 740 - 761, The DaemonSet checkerDS
created for kubelet verification is not cleaned up; add a deferred deletion
immediately after creating checkerDS to ensure test resources are removed: call
guestClient.Delete(ctx, checkerDS) in a defer, optionally wrapping it with
g.Eventually or error handling to ignore not-found errors, so the cleanup runs
even if the test fails (reference checkerDS, guestClient, ctx and the existing
create call just before eventuallyDaemonSetRollsOut).
api/karpenter/v1beta1/karpenter_types.go (1)

42-47: Consider adding validation for percentage fields.

ImageGCHighThresholdPercent and ImageGCLowThresholdPercent should logically be in the range 0-100. Adding kubebuilder validation markers would prevent invalid configurations at admission time.

Suggested validation markers
 	// ImageGCHighThresholdPercent is the percent of disk usage which triggers image garbage collection.
 	// +optional
+	// +kubebuilder:validation:Minimum=0
+	// +kubebuilder:validation:Maximum=100
 	ImageGCHighThresholdPercent *int32 `json:"imageGCHighThresholdPercent,omitempty"`
 	// ImageGCLowThresholdPercent is the percent of disk usage to which image garbage collection attempts to free.
 	// +optional
+	// +kubebuilder:validation:Minimum=0
+	// +kubebuilder:validation:Maximum=100
 	ImageGCLowThresholdPercent *int32 `json:"imageGCLowThresholdPercent,omitempty"`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/karpenter/v1beta1/karpenter_types.go` around lines 42 - 47, Add
kubebuilder validation markers to the ImageGCHighThresholdPercent and
ImageGCLowThresholdPercent fields so admission rejects values outside 0-100;
specifically, annotate the struct fields ImageGCHighThresholdPercent and
ImageGCLowThresholdPercent with an integer range validation (e.g.,
+kubebuilder:validation:Minimum=0 and +kubebuilder:validation:Maximum=100) and
ensure they remain optional pointers as declared.
test/e2e/assets/karpenter-kubelet-checker-ds.yaml (1)

19-19: Consider pinning the alpine image tag for reproducibility.

Using alpine without a version tag may lead to inconsistent test results if the base image changes. For E2E tests, pinning a specific version (e.g., alpine:3.19) ensures reproducible behavior.

Suggested fix
-        image: alpine
+        image: alpine:3.19
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml` at line 19, Replace the
unpinned image reference "image: alpine" in the DaemonSet container spec with a
specific, reproducible tag (for example "image: alpine:3.19"); update the
container image line in karpenter-kubelet-checker-ds.yaml so the test uses a
pinned Alpine version to avoid drift.
karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go (1)

230-233: Log message may be misleading when ConfigMap was already absent.

supportutil.DeleteIfNeeded returns (exists bool, err error) but the exists value is discarded. The log at line 233 unconditionally states "Deleted kubelet config ConfigMap" even if the ConfigMap was already absent. Consider checking the return value for more accurate logging.

📝 Suggested improvement
-	if _, err := supportutil.DeleteIfNeeded(ctx, r.ManagementClient, cm); err != nil {
+	existed, err := supportutil.DeleteIfNeeded(ctx, r.ManagementClient, cm)
+	if err != nil {
 		return ctrl.Result{}, fmt.Errorf("failed to delete kubelet config configmap %s: %w", configMapName, err)
 	}
-	log.Info("Deleted kubelet config ConfigMap", "name", configMapName)
+	if existed {
+		log.Info("Deleted kubelet config ConfigMap", "name", configMapName)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`
around lines 230 - 233, supportutil.DeleteIfNeeded returns (exists bool, err
error) but the code discards the exists value and always logs "Deleted kubelet
config ConfigMap"; update the call in karpenterignition_controller.go to capture
the exists boolean (e.g., exists, err := supportutil.DeleteIfNeeded(...)) and
then: on error return as before; if exists is true log that the ConfigMap was
deleted (using log.Info with configMapName), otherwise log that the ConfigMap
was already absent (or skip deletion log). Ensure you reference
supportutil.DeleteIfNeeded, the cm variable, and configMapName when making the
change so the log reflects the actual outcome.
karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go (2)

1171-1218: Test title is misleading – it manually patches the finalizer rather than testing Reconcile.

The test "When spec.kubelet is set it should add the finalizer and create the ConfigMap" calls reconcileKubeletConfigMap (which creates the ConfigMap) but then manually patches the finalizer at lines 1200-1203. This doesn't actually test that Reconcile() adds the finalizer.

Consider either:

  1. Renaming the test to clarify it tests reconcileKubeletConfigMap separately from finalizer logic
  2. Calling the full Reconcile() method to test the integrated behavior
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`
around lines 1171 - 1218, The test title is misleading because it manually
patches the finalizer after calling reconcileKubeletConfigMap; either rename the
test to state it only verifies reconcileKubeletConfigMap creates the ConfigMap
(and keep the manual finalizer patch) or change the test to exercise the full
controller integration by calling r.Reconcile(...) instead of
reconcileKubeletConfigMap so you validate that Reconcile() both creates the
ConfigMap and adds kubeletConfigFinalizer to the OpenshiftEC2NodeClass; update
assertions to fetch the NodeClass from fakeGuestClient and the ConfigMap from
fakeManagementClient as currently done, and remove the manual Patch(...) step
when calling Reconcile().

1220-1256: Test does not exercise any reconciler logic.

This test manually patches the finalizer removal (lines 1247-1250) without calling any reconciler method. It verifies the fake client accepts the patch but doesn't test actual controller behavior. Consider either removing this test (as it doesn't test meaningful behavior) or rewriting it to call Reconcile() with the appropriate setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`
around lines 1220 - 1256, The test currently mutates the fake client directly
instead of invoking the controller; fix by exercising the reconciler: keep the
initial nodeClass with Finalizers set and Spec.Kubelet nil, build r as shown,
then call r.Reconcile(ctx, reconcile.Request{NamespacedName:
types.NamespacedName{Name: testNodeClassName}}) (use the controller-runtime
reconcile.Request and types.NamespacedName), assert no error from Reconcile,
then fetch the object with fakeGuestClient.Get(...) and expect
kubeletConfigFinalizer to be removed; alternatively remove the test if you
prefer not to test Reconcile behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`:
- Around line 155-178: The finalizer is removed before attempting to delete the
kubelet ConfigMap which can orphan the ConfigMap if deletion fails; modify the
branch that handles openshiftEC2NodeClass.Spec.Kubelet == nil to first call
r.reconcileKubeletConfigMap(ctx, hcp, openshiftEC2NodeClass) and return on
error, and only after that success perform the DeepCopy,
controllerutil.RemoveFinalizer(...) and r.GuestClient.Patch(...) (the same
removal logic currently in the else block). Keep the existing error wrapping
messages and the kubeletConfigFinalizer, openshiftEC2NodeClass,
r.reconcileKubeletConfigMap and r.GuestClient.Patch symbol usages so the
ConfigMap cleanup completes before the finalizer is removed.

---

Nitpick comments:
In `@api/karpenter/v1beta1/karpenter_types.go`:
- Around line 42-47: Add kubebuilder validation markers to the
ImageGCHighThresholdPercent and ImageGCLowThresholdPercent fields so admission
rejects values outside 0-100; specifically, annotate the struct fields
ImageGCHighThresholdPercent and ImageGCLowThresholdPercent with an integer range
validation (e.g., +kubebuilder:validation:Minimum=0 and
+kubebuilder:validation:Maximum=100) and ensure they remain optional pointers as
declared.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`:
- Around line 1171-1218: The test title is misleading because it manually
patches the finalizer after calling reconcileKubeletConfigMap; either rename the
test to state it only verifies reconcileKubeletConfigMap creates the ConfigMap
(and keep the manual finalizer patch) or change the test to exercise the full
controller integration by calling r.Reconcile(...) instead of
reconcileKubeletConfigMap so you validate that Reconcile() both creates the
ConfigMap and adds kubeletConfigFinalizer to the OpenshiftEC2NodeClass; update
assertions to fetch the NodeClass from fakeGuestClient and the ConfigMap from
fakeManagementClient as currently done, and remove the manual Patch(...) step
when calling Reconcile().
- Around line 1220-1256: The test currently mutates the fake client directly
instead of invoking the controller; fix by exercising the reconciler: keep the
initial nodeClass with Finalizers set and Spec.Kubelet nil, build r as shown,
then call r.Reconcile(ctx, reconcile.Request{NamespacedName:
types.NamespacedName{Name: testNodeClassName}}) (use the controller-runtime
reconcile.Request and types.NamespacedName), assert no error from Reconcile,
then fetch the object with fakeGuestClient.Get(...) and expect
kubeletConfigFinalizer to be removed; alternatively remove the test if you
prefer not to test Reconcile behavior.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`:
- Around line 230-233: supportutil.DeleteIfNeeded returns (exists bool, err
error) but the code discards the exists value and always logs "Deleted kubelet
config ConfigMap"; update the call in karpenterignition_controller.go to capture
the exists boolean (e.g., exists, err := supportutil.DeleteIfNeeded(...)) and
then: on error return as before; if exists is true log that the ConfigMap was
deleted (using log.Info with configMapName), otherwise log that the ConfigMap
was already absent (or skip deletion log). Ensure you reference
supportutil.DeleteIfNeeded, the cm variable, and configMapName when making the
change so the log reflects the actual outcome.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml`:
- Line 19: Replace the unpinned image reference "image: alpine" in the DaemonSet
container spec with a specific, reproducible tag (for example "image:
alpine:3.19"); update the container image line in
karpenter-kubelet-checker-ds.yaml so the test uses a pinned Alpine version to
avoid drift.

In `@test/e2e/karpenter_test.go`:
- Around line 756-759: Add a brief clarifying comment above the minimalNP
construction explaining it is a minimal NodePool used solely so
eventuallyDaemonSetRollsOut can derive timeout values from the NodePool
platform; reference the minimalNP variable, the hyperv1.NodePool type and the
Spec.Platform.Type assignment, and mention that the value is intentionally set
to hyperv1.AWSPlatform as a workaround for timeout calculation rather than
representing a full NodePool.
- Around line 740-761: The DaemonSet checkerDS created for kubelet verification
is not cleaned up; add a deferred deletion immediately after creating checkerDS
to ensure test resources are removed: call guestClient.Delete(ctx, checkerDS) in
a defer, optionally wrapping it with g.Eventually or error handling to ignore
not-found errors, so the cleanup runs even if the test fails (reference
checkerDS, guestClient, ctx and the existing create call just before
eventuallyDaemonSetRollsOut).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e3b9a8bb-a0e3-4fd3-bc2e-18a8a8236788

📥 Commits

Reviewing files that changed from the base of the PR and between dd1e22c and e50993a.

⛔ Files ignored due to path filters (4)
  • api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/karpenter_types.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/kubeletconfig.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/vendor/**, !vendor/**, !**/zz_generated*
📒 Files selected for processing (13)
  • api/go.mod
  • api/karpenter/v1beta1/karpenter_types.go
  • api/karpenter/v1beta1/kubeletconfig.go
  • api/karpenter/v1beta1/kubeletconfig_test.go
  • client/applyconfiguration/karpenter/v1beta1/kubeletconfiguration.go
  • client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go
  • client/applyconfiguration/utils.go
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • support/karpenter/karpenter.go
  • test/e2e/assets/karpenter-kubelet-checker-ds.yaml
  • test/e2e/karpenter_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • api/go.mod
  • client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go
  • client/applyconfiguration/utils.go

Comment thread karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go Outdated
@jkyros jkyros force-pushed the autoscale-558-expose-kubelet-config branch from e50993a to 994277d on March 12, 2026 08:01
@openshift-ci-robot

openshift-ci-robot commented Mar 12, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Exposes config on OpenShiftEC2Nodeclass like it's exposed on non-Karpenter nodepools
  • Reconciles both of those into the ignition payload that is delivered via our ignition controller + "in-memory nodeclass"
  • In the event of a collision (say, there is a KubeletConfig in config) the user's direct on-nodeclass configuration wins

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes
  • ROSA expressed the desire to be able to reference ROSA-created configurations in the control plane

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I included both spec.kubelet here and spec.config, but I have them as separate commits, we can wring our hands over whether or not we truly want spec.config and what the precedence order should be
  • spec.config technically allows referencing any MachineConfig configmap from the control plane the way it's set up now, but NOT any from the data plane, since they are local references

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Per-nodeclass kubelet configuration for OpenShift EC2 node classes (MaxPods, PodsPerCore, resource reservations, eviction policies, image GC thresholds, CPU CFS quota, cluster DNS).

  • Operator support to create/manage per-nodeclass kubelet config and ensure taint configuration is applied.

  • Tests

  • New unit and end-to-end tests validating kubelet config manifest generation, per-nodeclass propagation, ConfigMap lifecycle, and runtime verification.

  • Chores

  • Dependency declaration updated.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Mar 12, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

This:

  • Exposes spec.kubelet on OpenShiftEC2NodeClass, but DOES NOT reconcile it to ec2nodeclass directly like the other fields
  • Converts it into ignition, merges it with the existing default taint KubeletConfig, and adds it to the in-memory nodepool's list of MachineConfig object references
  • Much like Highlander, there can be only one (KubeletConfig) in the MachineConfig pool

Because:

  • We're using the Custom AMIFamily, the spec.Kubelet from the ec2nodeclass does NOT get reconciled into userdata by Karpenter, so we have to stuff the KubeletConfig into the ignition/MachineConfig ourselves.
  • Users need to be able to configure KubeletConfig on their nodes

Which issue(s) this PR fixes:

Fixes AUTOSCALE-558

Special notes for your reviewer:

  • I moved the base taint configMap out of the hostedcluster controller and into the Karpenter operator.
  • The thinking was it probably belonged to karpenter-operator anyway, and if I could access it as a "constant" I didn't need to request its contents from the cluster every reconcile to merge with KubeletConfig
  • If you hate the KubeletConfig -> YAML helper in the api, I will move it out; there just seemed to be some precedent

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features
  • Added kubelet configuration capabilities for OpenShift EC2 node classes, enabling per-nodeclass kubelet customization including MaxPods, PodsPerCore, resource reservations, eviction policies, image garbage collection thresholds, CPU CFS quota management, and cluster DNS configuration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jkyros
Member Author

jkyros commented Mar 12, 2026

/test e2e-aws-autonode

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)

295-310: Consider setting an owner reference for cleanup.

The ConfigMap is created without an owner reference to the HCP, which means it won't be automatically garbage collected when the HCP is deleted. This aligns with the TODO at line 272, but should be tracked for follow-up.

💡 Suggested improvement
 func (r *Reconciler) reconcileTaintConfigMap(ctx context.Context, hcp *hyperv1.HostedControlPlane) error {
 	cm := &corev1.ConfigMap{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      karpenterutil.KarpenterTaintConfigMapName,
 			Namespace: hcp.Namespace,
 		},
 	}
 	_, err := r.CreateOrUpdate(ctx, r.ManagementClient, cm, func() error {
 		cm.Data = map[string]string{
 			"config": karpenterutil.KarpenterTaintConfigManifest(),
 		}
+		return controllerutil.SetControllerReference(hcp, cm, r.ManagementClient.Scheme())
-		return nil
 	})
 	return err
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around
lines 295 - 310, The ConfigMap created in reconcileTaintConfigMap is missing an
OwnerReference to the HostedControlPlane (hcp), so add an owner reference before
calling CreateOrUpdate: set cm.ObjectMeta.OwnerReferences (or use
controllerutil.SetControllerReference(hcp, cm, r.Scheme)) to point to the hcp
object, handle and return any error from setting the owner, and ensure the
reconciler's scheme knows the HCP GVK so the owner can be set correctly; keep
the owner setup adjacent to where cm is instantiated in reconcileTaintConfigMap.
karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go (1)

502-524: Shallow merge allows user-provided registerWithTaints to override the required taint.

The comment at line 513-514 states "registerWithTaints is never a user-settable field so the merge is always purely additive." However, mergeKubeletConfigMaps uses overlay-wins semantics, so if a user includes registerWithTaints in their KubeletConfiguration, it would silently override the required Karpenter taint.

While the API may not expose registerWithTaints as a field on KubeletConfiguration, the JSON round-trip at lines 515-522 would include any extra fields that might be present. Consider either:

  1. Explicitly overwriting registerWithTaints in the merged result (swap merge order or post-process), or
  2. Adding validation to reject user-provided registerWithTaints at admission time

Given the current API doesn't expose this field, the risk is low, but worth noting for future-proofing.

🔧 Defensive fix to ensure taint is always present
 	merged := mergeKubeletConfigMaps(taintBase, userMap)
+	// Ensure the required taint is always present, even if userMap contains registerWithTaints
+	merged["registerWithTaints"] = taintBase["registerWithTaints"]
 	manifest, err := kubeletConfigManifest(configMapName, merged)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`
around lines 502 - 524, The shallow merge can let user-provided
"registerWithTaints" override the required taint; after creating merged via
mergeKubeletConfigMaps(taintBase, userMap) ensure the required field is enforced
by explicitly setting merged["registerWithTaints"] =
taintBase["registerWithTaints"] (or re-run the merge with userMap overlaying
taintBase reversed) so that the taint from taintBase always wins; reference
taintBase, mergeKubeletConfigMaps, openshiftEC2NodeClass.Spec.Kubelet and merged
when making the change.
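The overlay-wins hazard described above can be sketched in isolation. The function name and maps below are illustrative assumptions, not the PR's actual mergeKubeletConfigMaps implementation:

```go
package main

import "fmt"

// mergeShallow sketches overlay-wins semantics: any top-level key present
// in overlay replaces the same key in base. This mirrors the hazard the
// review describes -- a stray user-supplied registerWithTaints would
// silently clobber the base taint unless it is re-asserted afterwards.
func mergeShallow(base, overlay map[string]any) map[string]any {
	merged := make(map[string]any, len(base)+len(overlay))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range overlay {
		merged[k] = v // overlay wins, including for registerWithTaints
	}
	return merged
}

func main() {
	taintBase := map[string]any{
		"registerWithTaints": []any{map[string]any{"key": "example/taint"}},
	}
	userMap := map[string]any{
		"maxPods":            500,
		"registerWithTaints": []any{}, // would clobber the required taint
	}
	merged := mergeShallow(taintBase, userMap)
	// Defensive step from the review: re-assert the required taint last.
	merged["registerWithTaints"] = taintBase["registerWithTaints"]
	fmt.Println(len(merged["registerWithTaints"].([]any))) // prints 1
}
```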
test/e2e/assets/karpenter-kubelet-checker-ds.yaml (1)

1-43: Privileged container is acceptable for this E2E test asset.

The static analysis warnings about privileged containers (CKV_K8S_16, CKV_K8S_20, CKV_K8S_23) are expected here. This DaemonSet requires host filesystem access to verify kubelet configuration, which necessitates privileged mode. The test code scopes it to specific test nodes via nodeSelector.

A few minor observations:

  1. Unpinned image tag: Using alpine without a version tag could lead to non-reproducible test runs. Consider pinning to a specific version (e.g., alpine:3.19).

  2. Missing resource limits: Only requests are specified. Adding limits would prevent resource exhaustion in edge cases.

  3. Magic number coupling: The value 427 appears here and must match the test's MaxPods setting. Consider documenting this coupling or extracting to a shared constant if feasible.

🔧 Suggested improvements
       containers:
       - name: karpenter-kubelet-checker
-        image: alpine
+        image: alpine:3.19
         command:
         - /bin/sleep
         - 24h
         resources:
           requests:
             cpu: 100m
             memory: 200Mi
+          limits:
+            cpu: 100m
+            memory: 200Mi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml` around lines 1 - 43, The
DaemonSet karpenter-kubelet-checker uses an unpinned image, lacks resource
limits, and embeds a magic number in the readinessProbe; update the container
karpenter-kubelet-checker to use a pinned image tag (e.g., alpine:3.19), add
matching resource.limits (cpu and memory) alongside the existing requests to
guard against exhaustion, and replace or document the hardcoded "427" in the
readinessProbe command (grep 'maxPods: 427') by either referencing a shared test
constant or adding a clear comment linking it to the test's MaxPods setting so
the coupling is explicit and maintainable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`:
- Around line 1156-1163: The test initializes AutoNode.Provisioner using a
pointer but the field is a value type; change the initialization in the block
setting AutoNode.Provisioner from &hyperv1.ProvisionerConfig{...} to
hyperv1.ProvisionerConfig{...} so the types match. Specifically update the
Provisioner assignment in the test where AutoNode, Provisioner,
ProvisionerConfig, KarpenterConfig (with Name: hyperv1.ProvisionerKarpenter and
Platform: hyperv1.AWSPlatform) are constructed to remove the leading &.

---

Nitpick comments:
In `@karpenter-operator/controllers/karpenter/karpenter_controller.go`:
- Around line 295-310: The ConfigMap created in reconcileTaintConfigMap is
missing an OwnerReference to the HostedControlPlane (hcp), so add an owner
reference before calling CreateOrUpdate: set cm.ObjectMeta.OwnerReferences (or
use controllerutil.SetControllerReference(hcp, cm, r.Scheme)) to point to the
hcp object, handle and return any error from setting the owner, and ensure the
reconciler's scheme knows the HCP GVK so the owner can be set correctly; keep
the owner setup adjacent to where cm is instantiated in reconcileTaintConfigMap.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go`:
- Around line 502-524: The shallow merge can let user-provided
"registerWithTaints" override the required taint; after creating merged via
mergeKubeletConfigMaps(taintBase, userMap) ensure the required field is enforced
by explicitly setting merged["registerWithTaints"] =
taintBase["registerWithTaints"] (or re-run the merge with userMap overlaying
taintBase reversed) so that the taint from taintBase always wins; reference
taintBase, mergeKubeletConfigMaps, openshiftEC2NodeClass.Spec.Kubelet and merged
when making the change.

In `@test/e2e/assets/karpenter-kubelet-checker-ds.yaml`:
- Around line 1-43: The DaemonSet karpenter-kubelet-checker uses an unpinned
image, lacks resource limits, and embeds a magic number in the readinessProbe;
update the container karpenter-kubelet-checker to use a pinned image tag (e.g.,
alpine:3.19), add matching resource.limits (cpu and memory) alongside the
existing requests to guard against exhaustion, and replace or document the
hardcoded "427" in the readinessProbe command (grep 'maxPods: 427') by either
referencing a shared test constant or adding a clear comment linking it to the
test's MaxPods setting so the coupling is explicit and maintainable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9ba50d46-4d73-4aa6-90c9-1718b4bb23e9

📥 Commits

Reviewing files that changed from the base of the PR and between e50993a and 994277d.

⛔ Files ignored due to path filters (8)
  • api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • client/applyconfiguration/karpenter/v1beta1/kubeletconfiguration.go is excluded by !client/**
  • client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml is excluded by !karpenter-operator/controllers/karpenter/assets/*.yaml
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/karpenter_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/kubeletconfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (12)
  • api/go.mod
  • api/karpenter/v1beta1/karpenter_types.go
  • api/karpenter/v1beta1/kubeletconfig.go
  • api/karpenter/v1beta1/kubeletconfig_test.go
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • support/karpenter/karpenter.go
  • test/e2e/assets/karpenter-kubelet-checker-ds.yaml
  • test/e2e/karpenter_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/karpenter/v1beta1/kubeletconfig_test.go
  • api/karpenter/v1beta1/karpenter_types.go

@jkyros jkyros force-pushed the autoscale-558-expose-kubelet-config branch from 994277d to 3ef1b25 on March 12, 2026 08:30
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
karpenter-operator/controllers/karpenter/karpenter_controller_test.go (1)

357-392: Idempotency subtest should verify repeated reconcile stability.

The subtest name says idempotent, but it only reconciles once. Also, Line 389 indexes taints[0] without a prior length assertion. Consider reconciling twice and asserting config is unchanged between runs, plus explicit taint length checks.

Suggested test hardening
 	err := r.reconcileTaintConfigMap(ctx, hcp)
 	g.Expect(err).NotTo(HaveOccurred())

 	cm := &corev1.ConfigMap{}
 	err = fakeManagementClient.Get(ctx, client.ObjectKey{Name: karpenterutil.KarpenterTaintConfigMapName, Namespace: namespace}, cm)
 	g.Expect(err).NotTo(HaveOccurred())
+	firstConfig := cm.Data["config"]
+
+	// Run reconcile again to verify idempotent output
+	err = r.reconcileTaintConfigMap(ctx, hcp)
+	g.Expect(err).NotTo(HaveOccurred())
+	err = fakeManagementClient.Get(ctx, client.ObjectKey{Name: karpenterutil.KarpenterTaintConfigMapName, Namespace: namespace}, cm)
+	g.Expect(err).NotTo(HaveOccurred())
+	g.Expect(cm.Data["config"]).To(Equal(firstConfig))

 	var cr map[string]interface{}
 	err = yaml.Unmarshal([]byte(cm.Data["config"]), &cr)
 	g.Expect(err).NotTo(HaveOccurred())
 	spec, ok := cr["spec"].(map[string]interface{})
 	g.Expect(ok).To(BeTrue())
 	kubeletConfig, ok := spec["kubeletConfig"].(map[string]interface{})
 	g.Expect(ok).To(BeTrue())
 	taints, ok := kubeletConfig["registerWithTaints"].([]interface{})
 	g.Expect(ok).To(BeTrue())
+	g.Expect(taints).To(HaveLen(len(karpenterutil.KarpenterBaseTaints)))
 	taint, ok := taints[0].(map[string]interface{})
 	g.Expect(ok).To(BeTrue())
 	g.Expect(taint["key"]).To(Equal(karpenterutil.KarpenterBaseTaints[0].Key))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@karpenter-operator/controllers/karpenter/karpenter_controller_test.go` around
lines 357 - 392, The subtest claims idempotency but only calls
r.reconcileTaintConfigMap once and indexes taints[0] without checking length;
update the test to call r.reconcileTaintConfigMap twice (using the same ctx and
r) and capture cm.Data["config"] after each reconcile to assert the config
string is unchanged, and before accessing taints[0] assert len(taints) > 0 (or
exact length equals len(karpenterutil.KarpenterBaseTaints)) so the index is
safe; keep references to existingCM, r.reconcileTaintConfigMap, cm, and
karpenterutil.KarpenterBaseTaints when making the changes.
karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go (1)

1173-1258: These subtests bypass the production finalizer flow.

Line 1201–1205 and Line 1248–1252 mutate finalizers directly, so they don't actually verify Reconcile() branch behavior. Please drive these cases through Reconcile() (or reconcileDeletedNodeClass) and assert finalizer/configmap outcomes from controller logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`
around lines 1173 - 1258, The tests are mutating finalizers directly
(nodeClass.Finalizers = append(...) / = nil) which bypasses controller logic;
instead invoke the controller path (call r.Reconcile with a ctrl.Request for the
NodeClass or call r.reconcileDeletedNodeClass(ctx, hcp, nodeClass) where
appropriate) after setting up fakeGuestClient/fakeManagementClient so the
reconciler executes the finalizer add/remove and ConfigMap create/delete logic,
then assert the ConfigMap existence via fakeManagementClient.Get and finalizer
presence via fakeGuestClient.Get; replace the direct Patch steps in the two
subtests with a Reconcile/reconcileDeletedNodeClass invocation and subsequent
assertions against the clients.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 737-738: The test uses identical SystemReserved and KubeReserved
values which makes the kubeReserved assertion ambiguous; update the KubeReserved
map in the test (the SystemReserved/KubeReserved map literal where KubeReserved
is set) to use "cpu": "610m" and "memory": "621Mi" (leave SystemReserved as
"510m"/"521Mi") and also update the corresponding kubelet checker manifest
expectations for kubeReserved to expect 610m / 621Mi so the grep-based check
uniquely verifies kubeReserved is applied.

---

Nitpick comments:
In `@karpenter-operator/controllers/karpenter/karpenter_controller_test.go`:
- Around line 357-392: The subtest claims idempotency but only calls
r.reconcileTaintConfigMap once and indexes taints[0] without checking length;
update the test to call r.reconcileTaintConfigMap twice (using the same ctx and
r) and capture cm.Data["config"] after each reconcile to assert the config
string is unchanged, and before accessing taints[0] assert len(taints) > 0 (or
exact length equals len(karpenterutil.KarpenterBaseTaints)) so the index is
safe; keep references to existingCM, r.reconcileTaintConfigMap, cm, and
karpenterutil.KarpenterBaseTaints when making the changes.

In
`@karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go`:
- Around line 1173-1258: The tests are mutating finalizers directly
(nodeClass.Finalizers = append(...) / = nil) which bypasses controller logic;
instead invoke the controller path (call r.Reconcile with a ctrl.Request for the
NodeClass or call r.reconcileDeletedNodeClass(ctx, hcp, nodeClass) where
appropriate) after setting up fakeGuestClient/fakeManagementClient so the
reconciler executes the finalizer add/remove and ConfigMap create/delete logic,
then assert the ConfigMap existence via fakeManagementClient.Get and finalizer
presence via fakeGuestClient.Get; replace the direct Patch steps in the two
subtests with a Reconcile/reconcileDeletedNodeClass invocation and subsequent
assertions against the clients.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro

Run ID: fe03755c-67e4-4ecf-981c-b58a497bb0f6

📥 Commits

Reviewing files that changed from the base of the PR and between 734f003 and de2ca6f.

⛔ Files ignored due to path filters (7)
  • api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • client/applyconfiguration/karpenter/v1beta1/kubeletconfiguration.go is excluded by !client/**
  • client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml is excluded by !karpenter-operator/controllers/karpenter/assets/*.yaml
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/karpenter_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (10)
  • api/karpenter/v1beta1/karpenter_types.go
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • support/karpenter/karpenter.go
  • support/karpenter/karpenter_test.go
  • test/e2e/karpenter_kubelet_checker_pod.yaml
  • test/e2e/karpenter_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • karpenter-operator/controllers/karpenter/karpenter_controller.go

Comment thread test/e2e/karpenter_test.go
@jkyros
Member Author

jkyros commented Mar 13, 2026

/test e2e-aws-autonode
EDIT: did I use my quota or what, lol, it just won't run


// KubeletConfiguration configures kubelet settings for nodes provisioned by this NodeClass.
// These settings are injected into the node's ignition configuration via MachineConfig.
type KubeletConfiguration struct {

@jkyros I'm pretty sure we'll need to expose podPidsLimit here as well. That is currently the only option available to customers using the OCM API, so I would expect them to need it here as well.
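A hypothetical sketch of what exposing that field could look like on the KubeletConfiguration type. The field name, tags, and type here are assumptions for illustration, not the merged API shape:

```go
// Sketch only: an abridged stand-in for the PR's KubeletConfiguration type,
// showing how a podPidsLimit field might sit next to the scheduling-related
// fields upstream Karpenter already exposes.
package main

import (
	"encoding/json"
	"fmt"
)

type KubeletConfiguration struct {
	// MaxPods is one of the fields upstream Karpenter uses for bin packing.
	MaxPods *int32 `json:"maxPods,omitempty"`
	// PodPidsLimit would be an OpenShiftEC2NodeClass-specific addition; the
	// upstream ec2nodeclass does not support it yet.
	PodPidsLimit *int64 `json:"podPidsLimit,omitempty"`
}

// render marshals the config the way it would be serialized into a CR.
func render(kc KubeletConfiguration) string {
	b, _ := json.Marshal(kc)
	return string(b)
}

func main() {
	limit := int64(4096)
	fmt.Println(render(KubeletConfiguration{PodPidsLimit: &limit}))
	// → {"podPidsLimit":4096}
}
```

Pointer fields plus `omitempty` keep unset values out of the serialized spec, which matters when the config is later merged with defaults.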

Member

@maxcao13 maxcao13 Mar 13, 2026


This will have to be an openshiftec2nodeclass specific field, the upstream ec2nodeclass doesn't support this yet, but it's in "priority/backlog". https://github.com/aws/karpenter-provider-aws/issues/7982

Not that this matters too much to us, since we don't propagate the fields to ec2nodeclass anyways, but just noting here that the API with the new field will look slightly different to customers. If upstream eventually decides to support this, I assume we would want the field naming to be similar to not be confusing.

EDIT: With that other PR, I guess naming doesn't really matter as much as I think :-)

Member Author

@jkyros jkyros Mar 13, 2026


I think #7952 is going to decide the direction here. If we're skewing from upstream already with that, I think this boils down to "what list of fields do we want exposed here", and we're deciding to more-or-less ignore upstream because our fields are already going to be weird and OpenShift-ified, right?

Member Author


I added podPidsLimit and made it sync the fields Karpenter cares about back to the ec2nodeclass (it apparently uses them for scheduling).

I put the "christmas tree" approach where it syncs from the upstream on top of it so we can see it. If we don't like it, I'll peel it off.

@maxcao13
Member

maxcao13 commented Mar 13, 2026

/test e2e-aws-autonode
/test e2e-aws-techpreview

EDIT: github bot was scaled to zero during the night and couldn't respond? 😛


// KubeletConfiguration configures kubelet settings for nodes provisioned by this NodeClass.
// These settings are injected into the node's ignition configuration via MachineConfig.
type KubeletConfiguration struct {
Member


This needs to be aware of #7952 and vice versa.

Member Author


I will align this with that once we get that figured out. 😄

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 17, 2026
@openshift-merge-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jkyros jkyros force-pushed the autoscale-558-expose-kubelet-config branch from de2ca6f to 6c416be on March 21, 2026 08:49
@openshift-ci openshift-ci Bot added the area/ci-tooling label (indicates the PR includes changes for CI or tooling) on Mar 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jkyros jkyros force-pushed the autoscale-558-expose-kubelet-config branch from 6c416be to 31de5ae on March 21, 2026 10:52
@openshift-ci openshift-ci Bot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 21, 2026
@jkyros
Member Author

jkyros commented Mar 21, 2026

/test e2e-aws-autonode
/test e2e-aws

jkyros added 7 commits March 24, 2026 17:37
…eclass

The Karpenter ec2nodeclass has a spec.kubelet field that we currently
do not expose on OpenShiftEC2Nodeclass. We currently use AMIType=Custom,
and as a result even if we did expose spec.kubelet on
OpenShiftEC2Nodeclass, it wouldn't get reconciled to the node anyway.

We do desire as much parity as possible with the upstream Karpenter API,
so what we _can_ do is expose the spec.kubelet on
OpenShiftEC2Nodeclass, then marshal that into a config that we give to
the node as part of the ignition bootstrapping process. This results in
the user being able to use OpenShiftEC2Nodeclass as they would a normal
ec2nodeclass (with regard to KubeletConfig) while still being able to
bootstrap our nodes the OpenShift way, and that is what we have done
here.

Signed-off-by: John Kyros <jkyros@redhat.com>
This is just the e2e test for KubeletConfig on OpenShiftEC2NodeClass: it
configures KubeletConfig, spins up a node, then runs a daemon on that
node to make sure the settings made it.

Signed-off-by: John Kyros <jkyros@redhat.com>
There is a default taint config that our ignition controller uses to
set the base taints on a node for first boot so workloads don't get
scheduled until Karpenter okays the nodes. The ignition controller
consumes configs via local object references, so it has to be in a
configmap, and rather than have each ec2nodeclass have its own
configmap, we made one central one.

That configmap was originally created in the hypershift operator's
hostedcluster controller where it sets up the karpenter component, with
the thought that all of the nodepools could use it, but we ran into
trouble when we exposed spec.kubelet on the OpenShiftEC2Nodeclass
because that ALSO becomes a KubeletConfig and there can only be one
KubeletConfig in the MachineConfig pool. So we needed to merge them for
nodeclasses that had spec.kubelet configured.

This just moves that default creation into the karpenter operator proper
so that we can get access to its contents directly as a "constant"
without having to retrieve the default one every time we reconcile. It
still does not get cleaned up, we will need to handle that when we
address deconfiguring autonode/karpenter.

Signed-off-by: John Kyros <jkyros@redhat.com>
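The "there can be only one KubeletConfig" merge described above can be sketched as follows, assuming the configs are handled as unstructured maps. The names and taint key here are illustrative stand-ins, not the controller's actual implementation:

```go
// Minimal sketch: overlay a nodeclass's spec.kubelet settings onto the base
// taint KubeletConfig, keeping the base taints intact so workloads don't get
// scheduled before Karpenter okays the node.
package main

import (
	"fmt"
	"sort"
)

// baseKubeletConfig stands in for the karpenter-operator's default taint
// config (the one moved out of the hostedcluster controller).
var baseKubeletConfig = map[string]any{
	"registerWithTaints": []any{
		map[string]any{"key": "karpenter.sh/unregistered", "effect": "NoExecute"},
	},
}

// mergeKubeletConfig returns base with user-supplied keys overlaid; the
// user's fields win, except the base taints, which must always survive.
func mergeKubeletConfig(base, user map[string]any) map[string]any {
	merged := map[string]any{}
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range user {
		if k == "registerWithTaints" {
			continue // base taints gate first-boot scheduling
		}
		merged[k] = v
	}
	return merged
}

func main() {
	user := map[string]any{"podPidsLimit": int64(4096)}
	merged := mergeKubeletConfig(baseKubeletConfig, user)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys)
	// → [podPidsLimit registerWithTaints]
}
```

The result is a single merged config that can then be marshaled into ignition/MachineConfig for the nodepool.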
The original test here was very opaque: it would wait for the daemonset
to come up, and if it failed you didn't know why; you just knew that the
health check on the daemon failed.

I also didn't remember that the MCO splits kubeletconfig settings into
multiple files, so we were looking for systemReserved in kubelet.conf
where it would never end up.

This is maybe _too_ exhaustive and I worry about having to keep this
updated in the future if the list/implementation changes, but at least
maybe we will catch some weirdness.

Signed-off-by: John Kyros <jkyros@redhat.com>
Signed-off-by: John Kyros <jkyros@redhat.com>
Upstream Karpenter only exposes a few KubeletConfig fields that get used
for bin packing (they are scheduling-related). Users generally want to
use more than that, and upstream, because they aren't using a custom AMI
type, they can specify those in userdata, but downstream they can't. So
this is an attempt at a generator such that we can sync with both the
upstream karpenter API (for parity) AND add additional fields from the
kubelet, while retaining the ability to deny-list any fields we don't
feel comfortable exposing.

Signed-off-by: John Kyros <jkyros@redhat.com>
Signed-off-by: John Kyros <jkyros@redhat.com>
@jkyros jkyros force-pushed the autoscale-558-expose-kubelet-config branch from 31de5ae to 74da398 on March 24, 2026 23:30
@jkyros
Member Author

jkyros commented Mar 24, 2026

Rebased
/test e2e-aws-autonode

@jkyros
Member Author

jkyros commented Mar 25, 2026

Now that this is fixed, we can:
/test e2e-aws-techpreview

Comment on lines +18 to +35
var excludedFields = map[string]string{
// Embedded TypeMeta
"TypeMeta": "embedded type metadata",

// Server / networking
"EnableServer": "server config, not user-facing",
"StaticPodPath": "static pod management",
"PodLogsDir": "pod log directory",
"SyncFrequency": "internal sync frequency",
"FileCheckFrequency": "file check frequency",
"HTTPCheckFrequency": "http check frequency",
"StaticPodURL": "static pod URL",
"StaticPodURLHeader": "static pod URL header",
"Address": "kubelet bind address",
"Port": "kubelet port",
"ReadOnlyPort": "read-only port",

// TLS
Member Author


I set my "ast atrocity" that generates this from the kubernetes kubeletconfig struct up for deny-list (because I figured we'd want everything except for dangerous ones), but we could also allow-list, or expose them all and then reject them via VAP/webhook/in the controller). I just wanted to vet the generation approach.

I'm sure we could spend too long bikeshedding which ones to disable (aside from ClusterDNS) 😄

@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

@jkyros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-techpreview | 74da398 | link | false | /test e2e-aws-techpreview |
| ci/prow/verify-workflows | 74da398 | link | true | /test verify-workflows |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 17, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Apr 17, 2026

The key findings are:

  1. verify-workflows (Build 2045115749147611136): Failed due to merge conflicts - the PR branch autoscale-558-expose-kubelet-config conflicts with the current main branch in 12 files.

  2. e2e-aws-techpreview (Build 2036612434927554560): The actual PR-related test (TestKarpenter) passed completely. The failure is TestAutoscaling/Teardown - a pre-existing flaky teardown that timed out waiting for AWS infrastructure resources to be cleaned up (context deadline exceeded after 22 minutes).

  3. tide (state: error): Tide reports error because the PR has merge conflicts (CONFLICTING state) and cannot be merged.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-verify-workflows / pull-ci-openshift-hypershift-main-e2e-aws-techpreview
  • Build ID: 2045115749147611136 (verify-workflows) / 2036612434927554560 (e2e-aws-techpreview)
  • PR: AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2NodeClass (#7916)
  • Branch: autoscale-558-expose-kubelet-config → main
  • Tide Status: ERROR (merge conflicts prevent merge)

Test Failure Analysis

Error

1) verify-workflows (2045115749147611136): Git merge conflict — "Automatic merge failed; fix conflicts and then commit the result."
   CONFLICT (content) in 12 files: Makefile, api/.golangci.yml, api/karpenter/v1beta1/zz_generated.deepcopy.go,
   client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go, client/applyconfiguration/utils.go,
   hypershift-operator/controllers/hostedcluster/karpenter.go,
   karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml,
   karpenter-operator/controllers/karpenter/karpenter_controller.go,
   karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller.go,
   karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller_test.go,
   test/e2e/karpenter_test.go, vendor/github.com/openshift/hypershift/api/karpenter/v1beta1/zz_generated.deepcopy.go

2) e2e-aws-techpreview (2036612434927554560): TestAutoscaling/Teardown — "Failed to wait for infra resources
   in guest cluster to be deleted: context deadline exceeded" (9 orphaned AWS resources after 22m timeout)

3) tide: ERROR — PR is in CONFLICTING state, cannot be merged.

Summary

There are two independent failures blocking this PR, neither caused by the PR's code changes.

The verify-workflows job fails immediately during the git clone/merge step because the PR branch has diverged from main — 12 files now have content conflicts (all in Karpenter-related code paths, indicating other Karpenter PRs have merged to main since this branch was created).

The e2e-aws-techpreview job ran the full e2e suite (536 tests, 22 skipped) and all PR-related tests (TestKarpenter) passed completely, including the new OpenshiftEC2NodeClass_Kubelet_propagation subtest. The only failure is TestAutoscaling/Teardown, an unrelated test that timed out waiting for AWS infrastructure cleanup (EC2 volumes and an S3 bucket for the autoscaling-rgw6b hosted cluster).

The tide ERROR is a direct consequence of the merge conflicts — Tide cannot merge a PR in CONFLICTING state.

Root Cause

Primary issue: Merge conflicts (verify-workflows + tide)

The PR branch autoscale-558-expose-kubelet-config was last updated on 2026-03-24. Since then, other PRs have been merged to main that modify the same Karpenter-related files. The base SHA at branch creation (1180bcaf5 — "Merge pull request #8208 from enxebre/worktree-fix-capacity-reservation-limit") has advanced significantly. The 12 conflicting files are all in the Karpenter API, controller, nodeclass, client, and e2e test paths — precisely the areas this PR touches.

The verify-workflows job attempted to merge commit 74da3986 (PR head) into 1180bcaf5 (main) and git reported Automatic merge failed. This causes the job to fail with no test execution. Tide independently reports ERROR because GitHub's merge state is DIRTY/CONFLICTING.

Secondary issue: TestAutoscaling/Teardown flake (e2e-aws-techpreview)

The TestAutoscaling/Teardown failure is an infrastructure cleanup timeout, not a code bug. The teardown function (validateAWSGuestResourcesDeletedFunc in test/e2e/util/fixture.go) polls the AWS Resource Groups Tagging API every 20 seconds for up to 15 minutes, waiting for cluster-owned resources to be deleted. In this run, 9 AWS resources (8 EC2 volumes from autoscaling node groups additional-lktqz and gshnz, plus 1 S3 image-registry bucket) were not cleaned up within the deadline. This is a known flaky behavior in the HyperShift e2e test framework when AWS resource deletion is slow.

Critically, all functional tests passed:

  • TestKarpenter: PASS (6115.25s) — all subtests passed, including the new OpenshiftEC2NodeClass_Kubelet_propagation
  • TestAutoscaling/Main: PASS (2611.19s)
  • TestAutoscaling/ValidateHostedCluster: PASS
  • TestAutoscaling/EnsureHostedCluster: PASS

Recommendations
  1. Rebase the PR branch onto current main — This is the only required action. The 12 merge conflicts must be resolved by rebasing autoscale-558-expose-kubelet-config onto the latest main. Given that all conflicts are in Karpenter-related files, careful attention is needed during the rebase to incorporate upstream changes.

  2. Re-trigger CI after rebase — Once conflicts are resolved, all CI jobs (including verify-workflows) should pass. The tide error will resolve automatically once the PR is no longer in a CONFLICTING state.

  3. The e2e-aws-techpreview failure requires no action — The TestAutoscaling/Teardown timeout is a pre-existing flaky test related to AWS resource cleanup latency, not caused by this PR. The PR's own test (TestKarpenter) passed all subtests. A /retest after the rebase should get a clean run, or the test can be retested individually with /test e2e-aws-techpreview.

  4. No code changes needed — The PR's functional changes are validated by the passing TestKarpenter suite, including the new Kubelet propagation test.

Evidence

| Evidence | Detail |
| --- | --- |
| verify-workflows build log | 12 CONFLICT (content) lines during git merge --no-ff 74da3986 into 1180bcaf5 (lines 136–163 of build-log.txt) |
| verify-workflows timing | Started 2026-04-17T12:22:45Z, failed 2026-04-17T12:23:12Z (27 seconds — failed at git clone, no tests ran) |
| PR merge state | GitHub API: mergeStateStatus: "DIRTY", mergeable: "CONFLICTING" |
| Conflicting files | Makefile, api/.golangci.yml, api/karpenter/v1beta1/zz_generated.deepcopy.go, client/applyconfiguration/karpenter/v1beta1/openshiftec2nodeclassspec.go, client/applyconfiguration/utils.go, hypershift-operator/controllers/hostedcluster/karpenter.go, karpenter-operator/controllers/karpenter/assets/*.yaml, karpenter-operator/controllers/karpenter/karpenter_controller.go, karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller.go, karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller_test.go, test/e2e/karpenter_test.go, vendor/.../zz_generated.deepcopy.go |
| TestKarpenter result | PASS (6115.25s) — all 7 Main subtests passed including OpenshiftEC2NodeClass_Kubelet_propagation (555.18s) |
| TestAutoscaling/Teardown | FAIL (1329.44s) — fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded |
| Orphaned AWS resources | 9 resources: 8 EC2 volumes (vol-0326601e, vol-0ebdde0b, vol-0e62507d, vol-0eb90a07, vol-07ee0565, vol-0549322e, vol-099af796, vol-02572242), 1 S3 bucket (autoscaling-rgw6b-image-registry-*) |
| e2e test summary | DONE 536 tests, 22 skipped, 2 failures — both failures are TestAutoscaling and TestAutoscaling/Teardown (same test) |
| tide status | state: "ERROR", targetUrl points to Prow PR query page — error is due to CONFLICTING merge state |

@jkyros
Member Author

jkyros commented May 7, 2026

closing in favor of #8192

@jkyros jkyros closed this May 7, 2026

Labels

area/api: Indicates the PR includes changes for the API
area/ci-tooling: Indicates the PR includes changes for CI or tooling
area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
area/karpenter-operator: Indicates the PR includes changes related to the Karpenter operator
area/testing: Indicates the PR includes changes for e2e testing
do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants