Skip to content

AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2Nodeclass as structured fields + preserveunknown/overflow#8192

Open
jkyros wants to merge 3 commits intoopenshift:mainfrom
jkyros:autoscale-558-kubeletconfig-overflow
Open

AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2Nodeclass as structured fields + preserveunknown/overflow#8192
jkyros wants to merge 3 commits intoopenshift:mainfrom
jkyros:autoscale-558-kubeletconfig-overflow

Conversation

@jkyros
Copy link
Copy Markdown
Member

@jkyros jkyros commented Apr 9, 2026

What this PR does / why we need it:

  • Exposes spec.Kubelet on OpenShiftEC2NodeClass as a set of structured fields (the ones Karpenter needs for scheduling/bin packing) + preserves unknown
  • Reconciles the structured fields to Karpenter's ec2nodeclass so it can use them
  • Preserves the unstructured fields and sends them on to ignition so they make it to the node

Which issue(s) this PR fixes:

Fixes
AUTOSCALE-558

Special notes for your reviewer:

  • CEL expressions can't see inside the unstructured 😞
  • This tries to give us the approximate behavior we wanted from our sync discussion
  • The API Guidelines for OpenShift APIs want the bools to be enums, but that's going to be a weird corner if the karpenter-specific bools are enums and the rest arent. I left them as bools and marked them out of the linter, I will adjust it however you want

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • NodeClass now accepts detailed kubelet configuration and maps selected fields to provisioned nodes; per-NodeClass kubelet ConfigMaps, a cluster taint ConfigMap, finalizers, and NodePool config reference selection are reconciled. Added helpers to generate Karpenter taint manifests and label/name helpers. Added privileged checker Pod manifest for node kubelet validation.
  • Tests

    • New unit and e2e tests covering JSON marshal/unmarshal, unknown-field preservation, mapping to upstream config, ConfigMap lifecycle, finalizers, manifest validation, and runtime kubelet checks.

@openshift-ci-robot
Copy link
Copy Markdown

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 9, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 9, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

  • Exposes spec.Kubelet on OpenShiftEC2NodeClass as a set of structured fields (the ones Karpenter needs for scheduling/bin packing) + preserves unknown
  • Reconciles the structured fields to Karpenter's ec2nodeclass so it can use them
  • Preserves the unstructured fields and sends them on to ignition so they make it to the node

Which issue(s) this PR fixes:

Fixes
AUTOSCALE-558

Special notes for your reviewer:

  • CEL expressions can't see inside the unstructured 😞
  • This tries to give us the approximate behavior we wanted from our sync discussion
  • The API Guidelines for OpenShift APIs want the bools to be enums, but that's going to be a weird corner if the karpenter-specific bools are enums and the rest arent. I left them as bools and marked them out of the linter, I will adjust it however you want

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces structured kubelet configuration: a new KubeletConfiguration API type with custom JSON marshal/unmarshal, IsZero semantics, and an optional kubelet field on OpenshiftEC2NodeClassSpec. Adds helpers/constants to generate a Karpenter taint KubeletConfig manifest. Controllers now reconcile a global Karpenter taint ConfigMap and per-NodeClass kubelet ConfigMaps (create/update/delete and finalizer lifecycle). Reconciliation copies selected kubelet fields into upstream EC2NodeClass Kubelet. Extensive unit and e2e tests validate JSON overflow preservation, ConfigMap lifecycle, manifest contents, and node-level kubelet propagation. Lint exclusions updated for kubelet fields.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant OSE2NC as OpenshiftEC2NodeClass
    participant KarpIgnition as KarpenterIgnitionController
    participant KubeletCM as Kubelet ConfigMap\n(Management Cluster)
    participant EC2NC as EC2NodeClass\n(Karpenter)
    participant NP as NodePool\n(Karpenter)
    participant Node as Provisioned Node

    User->>OSE2NC: Create/Update with spec.kubelet
    OSE2NC->>KarpIgnition: Notify controller of change

    alt spec.kubelet is set
        KarpIgnition->>KarpIgnition: Add kubeletConfigFinalizer
        KarpIgnition->>KarpIgnition: Merge user kubelet config + base taints
        KarpIgnition->>KubeletCM: Create/Update per-NodeClass ConfigMap (data["config"]=KubeletConfig YAML)
    else spec.kubelet is unset
        KarpIgnition->>KubeletCM: Delete per-NodeClass ConfigMap
        KarpIgnition->>KarpIgnition: Remove kubeletConfigFinalizer if present
    end

    User->>NP: Create NodePool referencing OpenshiftEC2NodeClass
    NP->>EC2NC: Populate EC2NodeClass.Kubelet via KarpenterKubeletConfiguration()
    NP->>Node: Provision node referencing Kubelet ConfigMap
    Node->>KubeletCM: Read and apply kubelet config
Loading
sequenceDiagram
    actor Operator
    participant KarController as KarpenterController
    participant MgrCM as Management ConfigMap\n(karpenter taint)
    participant Support as support/karpenter helpers

    Operator->>KarController: Reconcile loop
    KarController->>Support: KarpenterTaintConfigManifest()
    Support-->>KarController: YAML manifest
    KarController->>MgrCM: Create/Update global taint ConfigMap (data["config"]=manifest)
    MgrCM-->>KarController: Created/Updated or Error
Loading
🚥 Pre-merge checks | ✅ 7 | ❌ 5

❌ Failed checks (5 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test code manually patches finalizers instead of calling Reconcile(), verifying state mutation but not controller logic execution. Refactor finalizer tests to call r.Reconcile() directly instead of manually manipulating finalizers to verify the controller's own logic is exercised.
Microshift Test Compatibility ⚠️ Warning The e2e test TestKarpenter uses MicroShift-unavailable APIs (Karpenter NodePool, AWS EC2NodeClass) and features with no protection mechanisms to prevent execution on MicroShift clusters. Add [Skipped:MicroShift] label to test name or wrap with exutil.IsMicroShiftCluster() check that calls g.Skip() to prevent execution on MicroShift.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning The testKubeletPropagation e2e test lacks SNO protection and will fail on Single Node OpenShift clusters. Add [Skipped:SingleReplicaTopology] label to test or use exutil.IsSingleNode() check with g.Skip().
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Pod manifest specifies 'image: alpine' without registry prefix, requiring external pull from Docker Hub in disconnected environments; test lacks IP family detection for IPv6-only clusters. Use image from internal registry, add IP family detection with GetIPAddressFamily(), or add [Skipped:Disconnected] tag for disconnected cluster scenarios.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly summarizes the main change: exposing KubeletConfig on OpenShiftEC2NodeClass as structured fields with overflow preservation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test function names referenced in the PR (TestReconcileTaintConfigMap, TestCreateInMemoryNodePool, TestReconcileKubeletConfigMap, TestReconcileDeletedNodeClass) are stable, deterministic, and contain no dynamic values like timestamps, UUIDs, or generated identifiers.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces karpenter integration without topology-aware scheduling constraints. Changes include kubelet config API types, controller reconciliation logic, and e2e tests with no affinity rules or control-plane node assumptions.
Ote Binary Stdout Contract ✅ Passed Module-level variable initializer contains panic(err) in IIFE, but panic writes to stderr, not stdout. Embedded YAML is valid, so panic condition should never execute. Code does not emit non-JSON content to stdout.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 9, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 77.18447% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.63%. Comparing base (bded456) to head (0a5d587).

Files with missing lines Patch % Lines
.../karpenterignition/karpenterignition_controller.go 73.60% 22 Missing and 11 partials ⚠️
...ator/controllers/karpenter/karpenter_controller.go 66.66% 5 Missing and 1 partial ⚠️
support/karpenter/karpenter.go 80.76% 4 Missing and 1 partial ⚠️
...r-operator/controllers/nodeclass/karpenter_util.go 93.93% 1 Missing and 1 partial ⚠️
...trollers/hostedcluster/hostedcluster_controller.go 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8192      +/-   ##
==========================================
+ Coverage   37.53%   37.63%   +0.09%     
==========================================
  Files         751      751              
  Lines       92026    92194     +168     
==========================================
+ Hits        34544    34695     +151     
- Misses      54841    54845       +4     
- Partials     2641     2654      +13     
Files with missing lines Coverage Δ
...ft-operator/controllers/hostedcluster/karpenter.go 75.49% <100.00%> (+12.16%) ⬆️
.../controllers/nodeclass/ec2_nodeclass_controller.go 53.72% <100.00%> (+0.08%) ⬆️
...trollers/hostedcluster/hostedcluster_controller.go 43.23% <0.00%> (ø)
...r-operator/controllers/nodeclass/karpenter_util.go 84.86% <93.93%> (+2.51%) ⬆️
support/karpenter/karpenter.go 74.50% <80.76%> (+6.50%) ⬆️
...ator/controllers/karpenter/karpenter_controller.go 28.77% <66.66%> (+2.00%) ⬆️
.../karpenterignition/karpenterignition_controller.go 64.94% <73.60%> (+2.27%) ⬆️
Flag Coverage Δ
cmd-support 32.80% <80.76%> (+0.04%) ⬆️
cpo-hostedcontrolplane 36.77% <ø> (ø)
cpo-other 37.76% <ø> (ø)
hypershift-operator 47.99% <66.66%> (+0.05%) ⬆️
other 28.63% <76.83%> (+0.85%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented Apr 9, 2026

/test e2e-aws-autonode

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented Apr 9, 2026

Heyyy that is super cool, I don't have to have my claude watch and root cause test failures anymore
/test e2e-aws-autonode

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented Apr 10, 2026

KubeletConfig passed, teardown failure. One more time
/test e2e-aws-autonode

Comment thread api/karpenter/v1beta1/karpenter_types.go Outdated
Comment thread api/.golangci.yml Outdated
// +kubebuilder:object:generate=false
// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:XValidation:rule="!has(self.imageGCHighThresholdPercent) || !has(self.imageGCLowThresholdPercent) || self.imageGCHighThresholdPercent > self.imageGCLowThresholdPercent",message="imageGCHighThresholdPercent must be greater than imageGCLowThresholdPercent"
type KubeletConfiguration struct {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need test coverage once d448ab4 merges

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can either come back afterwards and add it, or we can wait for that to merge and then do this one, and I can add it here. I'll set a test up based on your branch in the mean time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added envtest test coverage for KubeletConfiguration

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented Apr 10, 2026

/test e2e-aws-autonode

@maxcao13
Copy link
Copy Markdown
Member

tests are just taking too long i think 😂

We can see TestKarpenter hitting the 2 hour mark which is when it stops.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026
@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch from c11822a to 4d6300e Compare April 15, 2026 07:03
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 15, 2026

@jkyros: This pull request references AUTOSCALE-558 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details

In response to this:

What this PR does / why we need it:

  • Exposes spec.Kubelet on OpenShiftEC2NodeClass as a set of structured fields (the ones Karpenter needs for scheduling/bin packing) + preserves unknown
  • Reconciles the structured fields to Karpenter's ec2nodeclass so it can use them
  • Preserves the unstructured fields and sends them on to ignition so they make it to the node

Which issue(s) this PR fixes:

Fixes
AUTOSCALE-558

Special notes for your reviewer:

  • CEL expressions can't see inside the unstructured 😞
  • This tries to give us the approximate behavior we wanted from our sync discussion
  • The API Guidelines for OpenShift APIs want the bools to be enums, but that's going to be a weird corner if the karpenter-specific bools are enums and the rest arent. I left them as bools and marked them out of the linter, I will adjust it however you want

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

Release Notes

  • New Features

  • Added kubelet configuration field to NodeClass specifications with support for image garbage collection thresholds, eviction policies, and resource reservation settings that are applied to provisioned nodes.

  • Tests

  • Added comprehensive tests for kubelet configuration lifecycle management, YAML serialization, and end-to-end validation of kubelet settings on nodes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread api/karpenter/v1/kubelet_config.go
Comment thread api/karpenter/v1/kubelet_config.go
Comment thread api/karpenter/v1/kubelet_config.go Outdated
Comment thread api/karpenter/v1/kubelet_config.go
Comment thread api/karpenter/v1/kubelet_config.go Outdated
Comment thread api/karpenter/v1/kubelet_config.go
Comment thread api/karpenter/v1/kubelet_config.go
Comment thread api/karpenter/v1/kubelet_config.go
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 5, 2026

@JoelSpeed I'm generally not a fan of the approach I've taken here, I was trying to unblock this by giving you the closest thing to "validate the fields we care about inside the rawExtension" behavior you desired from our Karpenter sync that I could. We need this for Karpenter GA, but if we're going to have to "OpenShift-ify" all these upstream fields every time it's really going to hurt.

If over time you wanted to add a new field to the KubeletConfiguration, that was originally part of the raw extension, how do you think that would go?

This whole thing is going to be a torture device if we're deviating from upstream in breaking ways. Going forward we'd probably need to not change the fields names/types like we're trying to do in review here. (Types in a breaking way for sure at least, Names...if we leave it so "overflow wins the merge" that's weirdly probably okay because the users existing config will continue to work) 😛

Is there a migration path that doesn't break existing users?

My plan was:

  • We add the field to the list of structured fields
  • The unmarshal pulls the value out into the structure vs the "overflow"
  • The user never notices because their value cleanly unmarshals into the struct field
  • Migration successful

Any existing structured client would no longer see the data where they expect to see it, and would have to update their code to read from the new structured location. Who do we expect to be reading/writing to this API?

Users are going to be sticking values in spec.Kubelet, and we're going to reconcile all of that to the correct places (ignition, karpenter ec2nodeclass) from there. We have full control.

  • Users are dealing in YAML so they're fine
  • For nodes/ignition our controllers just take everything and push it through to ignition regardless so they're fine
  • The only issue would be what gets sent to the ec2nodeclass and we'd be updating this API in concert with however upstream did (i.e. they add a field, we add a field we'd need to true up, so the controller could handle that transparently and that would seem to be fine)

What happens today if I were to put something into the Go raw extension that overlaps with, say, maxPods? Does this serialise correctly?

It does, but "overflow" currently wins. I can just re-order that block so the structured fields marshal in afterwards, but someone would have to intentionally abuse that for it to matter.

EDIT: actually I think overflow winning is good, that means that if we change the name, their existing config will work without breaking, and we could reconcile that under the hood (e.g. they specify Something, we've renamed it to SomethingSeconds once upstream Karpenter adds it/we extend our API to match, user's value ends up in overflow, controller coerces it into Something under the hood anyway without them knowing)

@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch 2 times, most recently from 7adb331 to 26ae4c2 Compare May 5, 2026 22:11
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 6, 2026

/test e2e-aws

@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch from 26ae4c2 to b4b1fa2 Compare May 7, 2026 19:24
@openshift-ci openshift-ci Bot added the area/ci-tooling Indicates the PR includes changes for CI or tooling label May 7, 2026
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 7, 2026

  • Rebased
  • structured now wins over unstructured (for serialization concerns)
  • added envtest for ratcheting API validation to ensure fields from unstructured to structured that should balk if a breaking change is introduced
  • added documentation for our decisions that:
    • this will need to follow upstream convention to preserve compatibility
    • we can (and should) continue to add CEL validation to prevent silly values
  • I have an upgrade e2e test but I did not include it here - it's chicken and egg since we don't have kubeletConfig in yet, testing adding a field to it and making sure nodes don't roll is impossible between old payload -> new payload , so I'm thinking I'll put that up as a separate PR and test it with a hypothetical field promotion

@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch from ed024e1 to 844d836 Compare May 7, 2026 20:52
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 7, 2026

Quay are you done being broken?
/test images

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 7, 2026

quay, you said you were fixed, but that doesn't seem fixed:
/test images

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 8, 2026

Let's see if quay is sane again, and I don't know how codecov can flake, but it is.
/test images
/test codecov

@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 8, 2026

I expect that image stream being globally broken is going to prevent tests from working, but let's try:
/test e2e-aws

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

@jkyros: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws a2887fe link true /test e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented May 8, 2026

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-images / pull-ci-openshift-hypershift-main-e2e-aws
  • Build IDs: 2052581617074966528 (images) / 2052591525044424704 (e2e-aws)
  • PR: AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2Nodeclass as structured fields + preserveunknown/overflow #8192 (AUTOSCALE-558: Expose KubeletConfig on OpenShiftEC2Nodeclass)
  • Base SHA: 37f46b94
  • Failed Step: [release-inputs:latest] — "Find all of the input images from ocp/5.0 and tag them into the stable stream"
  • Result: Both Prow jobs failed with identical error; all 4 image builds (hypershift, hypershift-operator, hypershift-tests, hypershift-cli) succeeded

Test Failure Analysis

Error

step [release-inputs:latest] failed: failed to wait for importing imagestreamtags on
ci-op-f54q187x/stable: failed to reimport the tag ci-op-f54q187x/stable:vcf-migration-operator:
unable to import tag ci-op-f54q187x/stable:vcf-migration-operator with message Internal error
occurred: [dockerimage.image.openshift.io "quay.io/openshift/ci:ocp_5.0_vcf-migration-operator"
not found, dockerimage.image.openshift.io "quay-proxy.ci.openshift.org/openshift/ci:ocp_5.0_vcf-migration-operator"
not found] on the image stream even after (6) imports: timed out waiting for the condition

Summary

All four failures on PR #8192 are unrelated to the PR's code changes. The two Prow jobs (ci/prow/images and ci/prow/e2e-aws) failed with an identical CI infrastructure error: the vcf-migration-operator image is missing from the OCP 5.0 release stream (ocp_5.0_vcf-migration-operator) in both quay.io and the CI proxy. This failure occurs during the [release-inputs:latest] step — before any test code from the PR is executed — making it a release-stream/image-registry infrastructure problem affecting all jobs that depend on the OCP 5.0 stable stream. The codecov/project failure is a coverage threshold drop (34.02%, down 3.40%) which is a pre-existing coverage gap unrelated to the build failure. The Konflux enterprise-contract failure is a separate integration test infrastructure issue.

Root Cause

The root cause is a missing container image in the OCP 5.0 CI release stream. Specifically:

  1. Missing image: vcf-migration-operator is registered as a component in the OCP 5.0 release stream but the actual container image does not exist at either mirror location:

    • quay.io/openshift/ci:ocp_5.0_vcf-migration-operator — not found
    • quay-proxy.ci.openshift.org/openshift/ci:ocp_5.0_vcf-migration-operator — not found
  2. Failure mechanism: The ci-operator's [release-inputs:latest] step attempts to import all images from the ocp/5.0 stream into the test namespace's stable imagestream. When it encounters vcf-migration-operator, it retries 6 times over ~4 minutes, then fails with a timeout. This blocks the entire job pipeline — no test steps can proceed.

  3. Shared namespace: Both jobs ran in the same CI namespace ci-op-f54q187x (the e2e-aws job reused cached builds from the images job), confirming this is a single infrastructure failure propagating to both jobs.

  4. Not PR-related: The PR's code changes (exposing KubeletConfig fields on OpenShiftEC2Nodeclass) compiled and built successfully — all 4 image builds passed. The failure is purely in the release-stream import step that is common CI infrastructure.

  5. Other failures:

    • codecov/project: Coverage dropped to 34.02% (-3.40% vs base). This is a pre-existing coverage threshold issue, not caused by this PR's changes.
    • Konflux enterprise-contract: Snapshot hypershift-operator-20260508-001921-000 failed the enterprise-contract integration test — this is a separate Konflux pipeline issue unrelated to the PR code.
Recommendations
  1. Retest the Prow jobs: Run /retest on the PR. If the vcf-migration-operator image has been published to the OCP 5.0 stream since the failure, the jobs will pass. This is the most likely resolution since image publishing issues are typically transient.

  2. If retests continue to fail: File a bug against the OCP Release / CI Infrastructure team reporting that ocp_5.0_vcf-migration-operator is missing from both quay.io/openshift/ci and quay-proxy.ci.openshift.org/openshift/ci. The image may need to be built and pushed, or the release stream configuration may need to be updated.

  3. Codecov failure: This is a separate concern — the project coverage threshold (34.02%) is below the configured gate. If the team wants this to pass, either add tests to improve coverage or adjust the codecov threshold configuration. This is pre-existing and not caused by this PR.

  4. Konflux failure: This is an independent integration test failure in the Konflux pipeline. Monitor for resolution or contact the Konflux/enterprise-contract team if it persists across retests.

Evidence
Evidence Detail
Error location [release-inputs:latest] step in ci-operator graph, before any test execution
Missing image quay.io/openshift/ci:ocp_5.0_vcf-migration-operator and quay-proxy.ci.openshift.org/openshift/ci:ocp_5.0_vcf-migration-operator
Import retries 6 attempts, all failed with "not found"
CI namespace ci-op-f54q187x (shared by both jobs)
Image builds All 4 succeeded: hypershift (3m2s), hypershift-operator (5m8s), hypershift-tests (8m59s), hypershift-cli (1m12s)
images job Failed after 11m5s at creating_release_images phase
e2e-aws job Failed after 13m55s at creating_release_images phase (reused cached builds)
codecov/project 34.02% coverage (-3.40% vs base commit 2d3b34d) — threshold failure
Konflux EC Snapshot hypershift-operator-20260508-001921-000 failed enterprise-contract scenario
JUnit XML junit_operator.xml confirms single failure: "Find all of the input images from ocp/5.0" testcase

@@ -0,0 +1,147 @@
apiVersion: apiextensions.k8s.io/v1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it be nice to populate Expected: for these cases

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, populated expected for these


// Overflow holds additional kubelet configuration fields not explicitly defined above.
// These fields are preserved during serialization and deserialization, allowing arbitrary
// kubelet configuration to pass through to the node's ignition/MachineConfig.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be mention overflow fields bypass all CRD validation, invalid values will manifest as node bootstrap failures (kubelet crash loop), not admission errors.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated language to mention the kubelet crash loop

Comment thread api/AGENTS.md Outdated

When graduating a field from overflow to a typed struct field:

- **Match upstream Karpenter's field name and JSON tag exactly.** Our structured fields must use the
Copy link
Copy Markdown
Member

@enxebre enxebre May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens with overflow fields set in existing configs, if upstream Karpenter introduce support for it, and we want to promote it but karpenter upstream choose to differ from upstream kubelet on how the semantic is exposed?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually want to match kubelet. Users today will set kubelet fields which end up in our overflow, and we want to maintain compatibility with those. If Karpenter decide to change the field, that would break our upgrades if we made the field structured and kept them aligned to karpenter, so we would have to align to kubelet, then translate to karpenter (so that it can then translate back to kubelet)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Match kubelet. I changed that one weird EvictionSoftGracePeriod Karpenter time.Duration back to a string (so we're matching upstream Kube not Karpenter, it will serialize the same) and updated the AGENTS, etc language to match this.

// +kubebuilder:validation:MinProperties=1
// +optional
KubeReserved map[string]string `json:"kubeReserved,omitempty"`
// evictionHard is a map of signal names to quantities that defines hard eviction thresholds.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to validate that evictionHard and EvictionSoft doesn't clash any values?

Copy link
Copy Markdown
Member Author

@jkyros jkyros May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as much CEL as we can to validate this.

They can be either quantities or percentages (yay!) and that makes it difficult to compare since CEL doesn't know node capacity. But we can can compare percentages to percentages and values to values so I added what ended up being kind of a gnarly expression that compares them if it can and fails open if they are mix/match quantity/percentage.

EDIT: Nope. Too gnarly. quantity() and compareTo() are too expensive. And apparently...I can't put max lengths on the map key strings to help without type aliasing them so I have somewhere to attach. Yuck. (unless we want to hack on the CRD like adjust-cel.sh does or something? I don't think we do )

I'm going to turn into Joel here with the "ugh, this API sucks" 😛

Yeah I think (because contents can be either type) we either have to hack on the CRD or type alias it to pass budget and still validate.

Copy link
Copy Markdown
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On each of the maps, could we add a maximum limit too? Seems like the realistic user case is ~4/5 entries in each map, what if we limited to 32 to give plenty of buffer but bring estimated CEL costs down?

Comment thread api/karpenter/v1/kubelet_config.go Outdated
Comment on lines +36 to +37
// When graduating new fields from overflow to typed fields, match upstream Karpenter's
// field names and types exactly. See api/AGENTS.md "KubeletConfiguration Field Graduation"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it karpenter, or kubelet that we must match?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relates to #8192 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching Kubelet, adjusted accordingly.

jkyros added 2 commits May 8, 2026 17:15
…eClass

Add KubeletConfiguration type with custom JSON marshal/unmarshal to support
both explicitly typed fields (maxPods, systemReserved, eviction thresholds,
etc.) and arbitrary overflow fields that pass through to the node's kubelet
config via MachineConfig.

Typed fields get CEL validation at admission time (range checks,
cross-field rules like imageGCHigh > imageGCLow and evictionSoft >=
evictionHard, key-set validation on maps). Overflow fields bypass CRD
validation entirely — invalid values surface as kubelet crash loops at
node bootstrap, not admission errors.

The overflow mechanism uses a runtime.RawExtension with json:"-" and
custom MarshalJSON/UnmarshalJSON to split known fields into the struct
and unknown fields into overflow. On marshal, structured fields win
over overflow on conflict.

Field types match upstream kubelet (k8s.io/kubelet/config/v1beta1) as
the primary compatibility target, with upstream Karpenter as a secondary
reference. The one deliberate deviation is evictionMaxPodGracePeriod
(*int32 vs kubelet's int32) required by the API linter since 0 is a
valid value with no Minimum constraint.

Signed-off-by: John Kyros <jkyros@redhat.com>
…d mapping

Wire the KubeletConfiguration from OpenshiftEC2NodeClass through the
karpenter-operator's ignition controller to inject kubelet settings into
node ignition via MachineConfig.

Add type mapping function (karpenterKubeletConfigurationFromNodeClassSpec)
that converts our kubelet types to upstream Karpenter types, including a
string-to-Duration conversion for evictionSoftGracePeriod where our API
matches kubelet's map[string]string but Karpenter uses map[string]metav1.Duration.

Move taint ConfigMap creation from the hypershift-operator to the
karpenter-operator for centralized management.

Signed-off-by: John Kyros <jkyros@redhat.com>
@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch from a2887fe to 31ee3c0 Compare May 8, 2026 22:53
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 8, 2026

envtest...cancelled itself? hmmm
/retest

Add envtest suites for CEL validation rules including threshold
ordering, field promotion ratcheting, and generated ratcheting tests.
Add e2e test infrastructure for verifying kubelet config on nodes.

Signed-off-by: John Kyros <jkyros@redhat.com>
@jkyros jkyros force-pushed the autoscale-558-kubeletconfig-overflow branch from 31ee3c0 to 0a5d587 Compare May 9, 2026 02:07
@jkyros
Copy link
Copy Markdown
Member Author

jkyros commented May 9, 2026

/test e2e-aws-autonode

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants