Skip to content

AUTOSCALE-615: include Karpenter node vCPUs in billing metric#8265

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
maxcao13:karpenter-node-billing
May 1, 2026
Merged

AUTOSCALE-615: include Karpenter node vCPUs in billing metric#8265
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
maxcao13:karpenter-node-billing

Conversation

@maxcao13
Copy link
Copy Markdown
Member

@maxcao13 maxcao13 commented Apr 16, 2026

What this PR does / why we need it:

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • HostedCluster status now includes an optional AutoNode vCPU count; metrics now seed and aggregate AutoNode plus native node vCPUs for improved billing and capacity visibility.
    • Karpenter now sums live NodeClaim CPU capacity when computing AutoNode vCPUs and reacts to NodeClaim changes that affect CPU or assignment.
  • Tests

    • Added unit and e2e tests covering vCPU aggregation, error reporting, metric convergence, and billing/vCPU validation during provisioning and consolidation.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Apr 16, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds vCPU capacity tracking for Karpenter-managed AutoNodes: a new optional VCPUs *int32 on HostedCluster.Status.AutoNode; the guest-side Karpenter controller now watches NodeClaim create/delete and updates that change CPU capacity or Status.NodeName, filters NodeClaims to those whose Status.NodeName corresponds to a live corev1.Node, sums their CPU capacity, and writes the total to Status.AutoNode.VCPUs. The nodepool metrics collector seeds per-cluster vCPU totals from AutoNode.VCPUs (when set), aggregates native NodePool vCPUs, and emits a computation-error metric on lookup failures. Unit and e2e tests validate summation, nil behavior, lookup errors, metric emission, and billing integration.

Sequence Diagram(s)

sequenceDiagram
    participant KC as Karpenter Controller
    participant NC as NodeClaim Resources
    participant N as corev1.Node List
    participant HCP as HostedCluster (Status.AutoNode)
    participant MC as Metrics Collector
    participant Prom as Prometheus

    Note over KC,NC: NodeClaim watch + vCPU computation
    KC->>NC: Receive create/update/delete events
    NC-->>KC: NodeClaim.Status.Capacity[CPU], Status.NodeName
    KC->>N: Query list of Nodes (verify backing node exists)
    N-->>KC: List of node names

    Note over KC: Sum only NodeClaims with NodeName in Nodes
    KC->>KC: sumNodeClaimVCPUs(nodeClaims, liveNodes)

    Note over KC,HCP: Status update
    KC->>HCP: Update Status.AutoNode.VCPUs with computed sum
    HCP-->>HCP: Persist AutoNode.VCPUs

    Note over MC,Prom: Metrics aggregation
    MC->>HCP: Read Status.AutoNode.VCPUs
    MC->>MC: Seed hclusterData.vCpusCount with AutoNode.VCPUs (if present)
    MC->>MC: Add native NodePool vCPUs (from instance-type lookup)
    MC->>Prom: Emit hypershift_cluster_vcpus and error metrics
    Prom-->>Prom: Store/serve metrics
Loading
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning The testBillingConsolidationAndPDB test uses pod anti-affinity requiring different nodes, which is incompatible with Single Node OpenShift deployments that have only one node. Add [Skipped:SingleReplicaTopology] label to the test or guard with runtime topology checks using exutil.IsSingleNode() to skip on SNO clusters.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning The new testBillingConsolidationAndPDB e2e test requires pulling container image quay.io/openshift/origin-pod:4.22.0 from public Quay.io registry, which will fail in IPv6-only disconnected CI environments without internet access. The test lacks the [Skipped:Disconnected] annotation. Add the [Skipped:Disconnected] annotation to the test name to skip it on disconnected clusters, or use an internally-available container image instead of quay.io.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding Karpenter node vCPUs to the billing metric, which is the core objective of this multi-file changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the pull request are stable and deterministic without any dynamic information. The test titles clearly describe what is being validated using static, descriptive strings.
Test Structure And Quality ✅ Passed Tests demonstrate solid Ginkgo and Go testing practices with proper timeout specifications, meaningful assertions, and cleanup patterns.
Microshift Test Compatibility ✅ Passed The new e2e tests in test/e2e/karpenter_test.go are protected by a platform guard that prevents execution on MicroShift.
Topology-Aware Scheduling Compatibility ✅ Passed Pull request adds vCPU tracking for Karpenter nodes through API types and metrics without introducing pod scheduling constraints incompatible with SNO, Two-Node, or HyperShift topologies.
Ote Binary Stdout Contract ✅ Passed PR's e2e test code properly uses crzap logger (outputs to stderr by default) for all process-level logging with no direct fmt.Print calls to stdout.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@maxcao13
Copy link
Copy Markdown
Member Author

/test e2e-aws-techpreview

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Added vCPU capacity tracking and reporting for Karpenter-managed nodes, enabling better visibility into resource allocation and billing metrics.

  • Tests

  • Added comprehensive test coverage for vCPU metric aggregation and reporting across various scenarios.

  • Updated end-to-end tests to validate vCPU metric convergence during node provisioning and consolidation workflows.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)

398-414: Precompute the live-node set before summing vCPUs.

This helper is currently quadratic because every NodeClaim re-scans the full node list. On larger clusters that turns each Node/NodeClaim event into an avoidable hot path in the billing reconcile loop.

♻️ Suggested refactor
 func sumNodeClaimVCPUs(nodeClaims []karpenterv1.NodeClaim, allNodes []corev1.Node) int32 {
+	liveNodes := make(map[string]struct{}, len(allNodes))
+	for i := range allNodes {
+		liveNodes[allNodes[i].Name] = struct{}{}
+	}
+
 	var total int64
 	for i := range nodeClaims {
 		nc := &nodeClaims[i]
-		if !slices.ContainsFunc(allNodes, func(n corev1.Node) bool {
-			return n.Name == nc.Status.NodeName
-		}) {
+		if _, ok := liveNodes[nc.Status.NodeName]; !ok {
 			continue
 		}
 		if cpu, ok := nc.Status.Capacity[corev1.ResourceCPU]; ok {
 			total += cpu.Value()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around
lines 398 - 414, The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1368-1421: The predicate in e2eutil.EventuallyObject currently
treats a nil hc.Status.AutoNode.VCPUs as zero which can falsely pass; change the
predicate (inside the anonymous func passed to e2eutil.EventuallyObject that
reads hc.Status.AutoNode.VCPUs) to explicitly require the pointer be non-nil
before considering the value (return false if nil), and only succeed when
*hc.Status.AutoNode.VCPUs == expectedVCPUs. Likewise, tighten the metric check
in the wait.PollUntilContextTimeout loop (the loop that reads
npmetrics.VCpusCountByHClusterMetricName) to match the metric series by both the
"name" label and the "namespace" label (require l.GetName()=="name" &&
l.GetValue()==hostedCluster.Name and also require a label "namespace" with value
hostedCluster.Namespace) so the correct series is selected.

---

Nitpick comments:
In `@karpenter-operator/controllers/karpenter/karpenter_controller.go`:
- Around line 398-414: The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 41073511-5238-4fd7-b7a0-bb8022ab9509

📥 Commits

Reviewing files that changed from the base of the PR and between 533aad6 and 29df349.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go

Comment thread test/e2e/karpenter_test.go
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 43.58974% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.42%. Comparing base (a9b5118) to head (8147c32).
⚠️ Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
...ator/controllers/karpenter/karpenter_controller.go 31.25% 22 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8265   +/-   ##
=======================================
  Coverage   36.42%   36.42%           
=======================================
  Files         765      765           
  Lines       93300    93336   +36     
=======================================
+ Hits        33981    33996   +15     
- Misses      56604    56625   +21     
  Partials     2715     2715           
Files with missing lines Coverage Δ
...t-operator/controllers/nodepool/metrics/metrics.go 70.20% <100.00%> (+0.43%) ⬆️
...ator/controllers/karpenter/karpenter_controller.go 26.76% <31.25%> (+0.55%) ⬆️
Flag Coverage Δ
cmd-support 30.37% <ø> (ø)
cpo-hostedcontrolplane 37.08% <ø> (ø)
cpo-other 35.69% <ø> (ø)
hypershift-operator 47.89% <100.00%> (+0.01%) ⬆️
other 27.77% <31.25%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 29df349 to 5cd4832 Compare April 16, 2026 22:22
@maxcao13
Copy link
Copy Markdown
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • HostedCluster status now exposes AutoNode vCPU counts; cluster vCPU metrics include both AutoNode and native node contributions for improved billing and capacity visibility.

  • Karpenter reconciliation now accounts for live NodeClaim vCPU capacity when computing AutoNode vCPUs.

  • Tests

  • Added unit and integration tests validating vCPU aggregation, error handling, and end-to-end metric convergence during provisioning and consolidation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maxcao13 maxcao13 marked this pull request as ready for review April 17, 2026 16:53
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and jparrill April 17, 2026 16:55
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 5cd4832 to 3d550fe Compare April 17, 2026 16:59
@maxcao13
Copy link
Copy Markdown
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 17, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • HostedCluster status now reports AutoNode vCPU counts; cluster vCPU metrics aggregate AutoNode and native node contributions for improved billing and capacity visibility.

  • Karpenter now includes live NodeClaim CPU capacity when computing AutoNode vCPUs.

  • Tests

  • Added unit and end-to-end tests validating vCPU aggregation, error reporting, and billing-metric convergence during provisioning and consolidation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1171-1178: The baseline vCPU snapshot is taken before AutoNode
convergence so it may include transient Karpenter vCPUs; change the order so you
call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to
ensure Karpenter vCPUs are zero, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline and assert found—update the code around
getVCPUsMetric and waitForAutoNodeStatusVCPUs to reflect this ordering and keep
the baseline variable logic the same.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ad81a3ff-ac87-4709-8c6f-23548d604839

📥 Commits

Reviewing files that changed from the base of the PR and between 5cd4832 and 3d550fe.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go
✅ Files skipped from review due to trivial changes (1)
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • api/hypershift/v1beta1/hostedcluster_types.go

Comment thread test/e2e/karpenter_test.go Outdated
@joshbranham
Copy link
Copy Markdown
Contributor

For HyperShift NodePools we return -1 if we hit an error calculating a vCPU count. We are not doing that here, as proposed, but I think that is fine since that error/calculation happens downstream in karpenter.

/lgtm
/test e2e-aws-techpreview

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 30, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@enxebre
Copy link
Copy Markdown
Member

enxebre commented Apr 30, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@maxcao13
Copy link
Copy Markdown
Member Author

/retest-required
/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 30, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@maxcao13: This PR has been marked as verified by e2e.

Details

In response to this:

/retest-required
/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

…us billing metric

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from
NodeClaim capacity and cross-referenced against live Node objects.
The metrics collector seeds per-cluster vCPU count from this field
before accumulating native NodePool vCPUs on top.

- NodeClaim is the authority for Karpenter ownership (no label dependency)
- e2e tests validate vCPU status + metric at 0, scale-up, and consolidation

Made-with: Cursor
Signed-off-by: Max Cao <macao@redhat.com>
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from d75fe34 to 8147c32 Compare April 30, 2026 15:31
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 30, 2026
@joshbranham
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@maxcao13
Copy link
Copy Markdown
Member Author

this has gone through so many rebases now, that it would be a disservice not to check if autonode tests still pass

/test e2e-aws-autonode

@maxcao13
Copy link
Copy Markdown
Member Author

/test e2e-aws-4-22

@maxcao13
Copy link
Copy Markdown
Member Author

/verified by e2e-tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 30, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@maxcao13: This PR has been marked as verified by e2e-tests.

Details

In response to this:

/verified by e2e-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maxcao13
Copy link
Copy Markdown
Member Author

/test e2e-aws-4-22

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0ee6567 and 2 for PR HEAD 8147c32 in total

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 1, 2026

@maxcao13: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify-workflows 8147c32 link true /test verify-workflows

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit fb6955d into openshift:main May 1, 2026
52 checks passed
@hypershift-jira-solve-ci
Copy link
Copy Markdown

✅ Analysis complete!
📄 Report: .work/prow-job-analyze-test-failure/2049925899784032256/analysis.md

💡 Tip: The Markdown report can be copied directly into a JIRA Description field.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants