AUTOSCALE-615: include Karpenter node vCPUs in billing metric#8265
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds vCPU capacity tracking for Karpenter-managed AutoNodes: a new optional Sequence Diagram(s)sequenceDiagram
participant KC as Karpenter Controller
participant NC as NodeClaim Resources
participant N as corev1.Node List
participant HCP as HostedCluster (Status.AutoNode)
participant MC as Metrics Collector
participant Prom as Prometheus
Note over KC,NC: NodeClaim watch + vCPU computation
KC->>NC: Receive create/update/delete events
NC-->>KC: NodeClaim.Status.Capacity[CPU], Status.NodeName
KC->>N: Query list of Nodes (verify backing node exists)
N-->>KC: List of node names
Note over KC: Sum only NodeClaims with NodeName in Nodes
KC->>KC: sumNodeClaimVCPUs(nodeClaims, liveNodes)
Note over KC,HCP: Status update
KC->>HCP: Update Status.AutoNode.VCPUs with computed sum
HCP-->>HCP: Persist AutoNode.VCPUs
Note over MC,Prom: Metrics aggregation
MC->>HCP: Read Status.AutoNode.VCPUs
MC->>MC: Seed hclusterData.vCpusCount with AutoNode.VCPUs (if present)
MC->>MC: Add native NodePool vCPUs (from instance-type lookup)
MC->>Prom: Emit hypershift_cluster_vcpus and error metrics
Prom-->>Prom: Store/serve metrics
🚥 Pre-merge checks | ✅ 9 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/test e2e-aws-techpreview |
|
@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)
398-414: Precompute the live-node set before summing vCPUs.This helper is currently quadratic because every
NodeClaimre-scans the full node list. On larger clusters that turns each Node/NodeClaim event into an avoidable hot path in the billing reconcile loop.♻️ Suggested refactor
func sumNodeClaimVCPUs(nodeClaims []karpenterv1.NodeClaim, allNodes []corev1.Node) int32 { + liveNodes := make(map[string]struct{}, len(allNodes)) + for i := range allNodes { + liveNodes[allNodes[i].Name] = struct{}{} + } + var total int64 for i := range nodeClaims { nc := &nodeClaims[i] - if !slices.ContainsFunc(allNodes, func(n corev1.Node) bool { - return n.Name == nc.Status.NodeName - }) { + if _, ok := liveNodes[nc.Status.NodeName]; !ok { continue } if cpu, ok := nc.Status.Capacity[corev1.ResourceCPU]; ok { total += cpu.Value() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around lines 398 - 414, The sumNodeClaimVCPUs helper is O(N*M) because it calls slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups against that map (check nc.Status.NodeName exists) before summing nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in sumNodeClaimVCPUs accordingly to use the precomputed map instead of slices.ContainsFunc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1368-1421: The predicate in e2eutil.EventuallyObject currently
treats a nil hc.Status.AutoNode.VCPUs as zero which can falsely pass; change the
predicate (inside the anonymous func passed to e2eutil.EventuallyObject that
reads hc.Status.AutoNode.VCPUs) to explicitly require the pointer be non-nil
before considering the value (return false if nil), and only succeed when
*hc.Status.AutoNode.VCPUs == expectedVCPUs. Likewise, tighten the metric check
in the wait.PollUntilContextTimeout loop (the loop that reads
npmetrics.VCpusCountByHClusterMetricName) to match the metric series by both the
"name" label and the "namespace" label (require l.GetName()=="name" &&
l.GetValue()==hostedCluster.Name and also require a label "namespace" with value
hostedCluster.Namespace) so the correct series is selected.
---
Nitpick comments:
In `@karpenter-operator/controllers/karpenter/karpenter_controller.go`:
- Around line 398-414: The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 41073511-5238-4fd7-b7a0-bb8022ab9509
⛔ Files ignored due to path filters (12)
api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!**/zz_generated*.go,!**/zz_generated*api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**client/applyconfiguration/hypershift/v1beta1/autonodestatus.gois excluded by!client/**cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamldocs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**,!**/zz_generated*.go,!**/zz_generated*
📒 Files selected for processing (6)
api/hypershift/v1beta1/hostedcluster_types.gohypershift-operator/controllers/nodepool/metrics/metrics.gohypershift-operator/controllers/nodepool/metrics/metrics_test.gokarpenter-operator/controllers/karpenter/karpenter_controller.gokarpenter-operator/controllers/karpenter/karpenter_controller_test.gotest/e2e/karpenter_test.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8265 +/- ##
=======================================
Coverage 36.42% 36.42%
=======================================
Files 765 765
Lines 93300 93336 +36
=======================================
+ Hits 33981 33996 +15
- Misses 56604 56625 +21
Partials 2715 2715
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
29df349 to
5cd4832
Compare
|
/test e2e-aws-techpreview |
|
@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
5cd4832 to
3d550fe
Compare
|
/test e2e-aws-techpreview |
|
@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1171-1178: The baseline vCPU snapshot is taken before AutoNode
convergence so it may include transient Karpenter vCPUs; change the order so you
call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to
ensure Karpenter vCPUs are zero, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline and assert found—update the code around
getVCPUsMetric and waitForAutoNodeStatusVCPUs to reflect this ordering and keep
the baseline variable logic the same.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ad81a3ff-ac87-4709-8c6f-23548d604839
⛔ Files ignored due to path filters (12)
api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!**/zz_generated*.go,!**/zz_generated*api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**client/applyconfiguration/hypershift/v1beta1/autonodestatus.gois excluded by!client/**cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamldocs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.gois excluded by!vendor/**,!**/vendor/**,!**/zz_generated*.go,!**/zz_generated*
📒 Files selected for processing (6)
api/hypershift/v1beta1/hostedcluster_types.gohypershift-operator/controllers/nodepool/metrics/metrics.gohypershift-operator/controllers/nodepool/metrics/metrics_test.gokarpenter-operator/controllers/karpenter/karpenter_controller.gokarpenter-operator/controllers/karpenter/karpenter_controller_test.gotest/e2e/karpenter_test.go
✅ Files skipped from review due to trivial changes (1)
- karpenter-operator/controllers/karpenter/karpenter_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- hypershift-operator/controllers/nodepool/metrics/metrics.go
- api/hypershift/v1beta1/hostedcluster_types.go
|
For HyperShift NodePools we return /lgtm |
|
/lgtm |
|
Scheduling tests matching the |
|
/retest-required |
|
@maxcao13: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
…us billing metric Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top. - NodeClaim is the authority for Karpenter ownership (no label dependency) - e2e tests validate vCPU status + metric at 0, scale-up, and consolidation Made-with: Cursor Signed-off-by: Max Cao <macao@redhat.com>
d75fe34 to
8147c32
Compare
|
/lgtm |
|
Scheduling tests matching the |
|
this has gone through so many rebases now, that it would be a disservice not to check if autonode tests still pass /test e2e-aws-autonode |
|
/test e2e-aws-4-22 |
|
/verified by e2e-tests |
|
@maxcao13: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws-4-22 |
|
@maxcao13: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
✅ Analysis complete! 💡 Tip: The Markdown report can be copied directly into a JIRA Description field. |
What this PR does / why we need it:
Add
VCPUsfield toAutoNodeStatus, computed by karpenter-operator fromNodeClaimcapacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.NodeClaims and karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.
We need this because ROSA HCP uses the hypershift_cluster_vcpus metric for billing our customers. This metric is not correctly reflecting the vCPUs for Nodes that are provisoned by Karpenter. This will result in under-billing of customers that are using Karpenter to provision Nodes within their cluster.
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615
Special notes for your reviewer:
Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the
AutoNodeStatus, I think it makes sense to be consistent here.Made-with: Cursor
Checklist:
Summary by CodeRabbit
New Features
Tests