CNTRLPLANE-2262: Add Azure scale-from-zero support#8337
CNTRLPLANE-2262: Add Azure scale-from-zero support#8337jhjaggars wants to merge 6 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@jhjaggars: This pull request references CNTRLPLANE-2262 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jhjaggars The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR extends scale-from-zero support to Azure in addition to AWS. CRD XValidation for NodePoolSpec now permits ChangesScale-from-zero: Azure support + Azure instancetype provider
Sequence Diagram(s)sequenceDiagram
participant OperatorMain as Operator (main/boot)
participant AzureSKUs as Azure Resource SKUs API
participant InstTypeProv as Azure Instancetype Provider
participant K8sAPI as Kubernetes API (CAPI objects)
participant NodePoolCtrl as NodePool Controller
OperatorMain->>AzureSKUs: Read credentials & location\ncreate SKUs client
OperatorMain->>InstTypeProv: NewProvider(skuClient, location)
Note over InstTypeProv: Provider initialized (cache empty)
NodePoolCtrl->>K8sAPI: Get NodePool / AzureMachineTemplate
K8sAPI-->>NodePoolCtrl: return AzureMachineTemplate (VMSize)
NodePoolCtrl->>InstTypeProv: GetInstanceTypeInfo(ctx, vmSize)
InstTypeProv->>AzureSKUs: ListPager() / Paginate SKUs
AzureSKUs-->>InstTypeProv: SKU pages
InstTypeProv-->>NodePoolCtrl: InstanceTypeInfo (vCPUs, Memory, GPUs)
NodePoolCtrl->>K8sAPI: Patch node template annotations\n(scale-from-zero capacity/taints)
K8sAPI-->>NodePoolCtrl: Patch result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/install/install.go (1)
251-263:⚠️ Potential issue | 🟡 MinorExpose Azure in the CLI help text as well.
Validation accepts
azurehere, but the--scale-from-zero-providerhelp string still saysPlatform type for scale-from-zero autoscaling (aws)at Line 394.hypershift install --helpwill still advertise AWS-only support.✏️ Suggested follow-up
- cmd.PersistentFlags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws)") + cmd.PersistentFlags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws, azure)")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/install/install.go` around lines 251 - 263, The CLI help text still advertises AWS-only for --scale-from-zero-provider even though supportedProviders includes "azure"; update the help string where the flag for ScaleFromZeroProvider (the option described as "Platform type for scale-from-zero autoscaling (aws)") to list both aws and azure (or a dynamic list based on supportedProviders) so the help matches validation in supportedProviders and the ScaleFromZeroProvider flag behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/hypershift/v1beta1/nodepool_types.go`:
- Line 109: Update the comment on the NodePoolAutoScaling.Min field to reflect
that scale-from-zero (min=0) is supported for both AWS and Azure platforms;
locate the NodePoolAutoScaling struct and the Min field comment in
nodepool_types.go (symbol: NodePoolAutoScaling.Min) and change the text that
currently mentions only AWS to mention "AWS and Azure" so the CRD/OpenAPI schema
and docs match the XValidation rule. Ensure the wording mirrors the XValidation
message: "Scale-from-zero (autoScaling.min=0) is supported for AWS and Azure
platforms."
In `@hypershift-operator/controllers/nodepool/instancetype/azure/provider.go`:
- Around line 48-50: The cache is being set before the full Azure SKU pagination
succeeds, causing partial results to persist after pager/NextPage() failures;
modify loadSKUs to populate a local temporary map (e.g., tempCache) while
walking pages and only assign it to p.cache (and any related fields) after the
entire walk succeeds, and ensure GetInstanceTypeInfo still checks p.cache==nil
to trigger reloads; apply the same pattern to the other similar block around the
63-75 logic so that p.cache is only updated on successful completion of the full
SKU load.
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Line 437: The current platform check returns true for Azure unconditionally
which lets Azure NodePools enter the scale-from-zero path even when the
configured provider (scale-from-zero-provider) is AWS; update the platform gate
to require both the cluster platform and the configured InstanceTypeProvider
match: modify the function that currently returns "return platform ==
hyperv1.AWSPlatform || platform == hyperv1.AzurePlatform" to instead check the
configured provider (e.g., scaleFromZeroProvider/InstanceTypeProvider) and only
return true when platform==AWS && provider==aws OR platform==Azure &&
provider==azure (use the actual flag/field name used to hold the
--scale-from-zero-provider value and the InstanceTypeProvider symbol in the
reconciler).
In `@hypershift-operator/main.go`:
- Around line 487-499: The code parses Azure credentials into the azureCreds
struct but only validates Location; update the validation after json.Unmarshal
to ensure SubscriptionID, ClientID, ClientSecret and TenantID are non-empty
before creating the client. Specifically, in the block that defines azureCreds
and calls json.Unmarshal, add checks for azureCreds.SubscriptionID,
azureCreds.ClientID, azureCreds.ClientSecret and azureCreds.TenantID and return
descriptive fmt.Errorf errors (or a single aggregated error) if any are empty so
client creation logic (using these fields) never runs with missing values.
---
Outside diff comments:
In `@cmd/install/install.go`:
- Around line 251-263: The CLI help text still advertises AWS-only for
--scale-from-zero-provider even though supportedProviders includes "azure";
update the help string where the flag for ScaleFromZeroProvider (the option
described as "Platform type for scale-from-zero autoscaling (aws)") to list both
aws and azure (or a dynamic list based on supportedProviders) so the help
matches validation in supportedProviders and the ScaleFromZeroProvider flag
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ffada2e8-18eb-4f47-9fc5-f9200533350a
📒 Files selected for processing (11)
api/hypershift/v1beta1/nodepool_types.gocmd/install/install.gohypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/instancetype/azure/provider.gohypershift-operator/controllers/nodepool/instancetype/azure/provider_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/scale_from_zero.gohypershift-operator/controllers/nodepool/scale_from_zero_test.gohypershift-operator/main.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8337 +/- ##
==========================================
+ Coverage 37.53% 37.58% +0.04%
==========================================
Files 751 752 +1
Lines 92025 92199 +174
==========================================
+ Hits 34543 34649 +106
- Misses 54841 54908 +67
- Partials 2641 2642 +1
... and 1 file with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
0b01bc1 to
b3477bf
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/capi.go`:
- Around line 774-775: The code sets effectiveMin to 1 for Azure based solely on
nodePool.Spec.Platform.Type; change this to also require that the
operator/runtime has Azure scale-from-zero support wired up. Update the
condition around effectiveMin (the place that checks nodePool.Spec.Platform.Type
and sets effectiveMin) to call the runtime-configured check (e.g., a function or
map such as supportsScaleFromZero(platform) or scaleFromZeroProviders[platform])
and only force effectiveMin=1 when the platform is not supported for
scale-from-zero or when the runtime config does not indicate Azure
scale-from-zero is enabled; apply the same guarded change to the other identical
branch referenced by the comment (the block around the other effectiveMin
handling). Ensure you reference and use the existing runtime config/provider
flag/function rather than just nodePool.Spec.Platform.Type.
In `@hypershift-operator/main.go`:
- Around line 527-537: The Azure scale-from-zero path creates credentials and a
ResourceSKUs client with nil options, ignoring AZURE_CLOUD_NAME; update the
NewClientSecretCredential and armcompute.NewResourceSKUsClient calls to use the
same cloud-specific client options used elsewhere (the resolved
cloud/environment options derived from AZURE_CLOUD_NAME) so the credential and
skuClient target the correct sovereign endpoints (use azureCreds and the
resolved azure cloud options when constructing cred and skuClient before
assigning instanceTypeProvider and scaleFromZeroPlatform and logging).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 255fb7ac-fd14-43de-b4ad-7f7fc9cd02c4
⛔ Files ignored due to path filters (8)
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yamlis excluded by!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (12)
api/hypershift/v1beta1/nodepool_types.gocmd/install/install.gogo.modhypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/instancetype/azure/provider.gohypershift-operator/controllers/nodepool/instancetype/azure/provider_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/scale_from_zero.gohypershift-operator/controllers/nodepool/scale_from_zero_test.gohypershift-operator/main.go
✅ Files skipped from review due to trivial changes (1)
- hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- hypershift-operator/controllers/nodepool/scale_from_zero_test.go
- cmd/install/install.go
- api/hypershift/v1beta1/nodepool_types.go
- hypershift-operator/controllers/nodepool/conditions.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
b3477bf to
242417f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
hypershift-operator/controllers/nodepool/capi.go (1)
774-775:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAzure
min=0guard still checks only platform type, not runtime scale-from-zero configuration.The exemption added for
hyperv1.AzurePlatformallowseffectiveMin=0for any Azure NodePool, regardless of whether the operator was started with the Azure scale-from-zero provider wired (--scale-from-zero-provider=azure). Without that provider, the scale-from-zero capacity annotations are never written, so the autoscaler receives a zero-minimum pool with no instance-type metadata and cannot scale back up — permanently stalling the pool.The fix should key this exemption off the runtime-configured provider set rather than the static platform type. The identical issue exists in both
setMachineDeploymentReplicas(Line 774) andsetMachineSetReplicas(Line 1083).#!/bin/bash # Look for any existing runtime check or helper that exposes whether a given platform # has scale-from-zero support configured (e.g. supportedScaleFromZeroPlatform, # scaleFromZeroProviders, or similar). rg -n --type=go -C4 'scaleFromZero|ScaleFromZero|scale_from_zero' \ --glob '!*_test.go'Also applies to: 1083-1085
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/capi.go` around lines 774 - 775, The Azure min=0 exemption currently checks nodePool.Spec.Platform.Type directly; update the condition in both setMachineDeploymentReplicas and setMachineSetReplicas so it verifies the runtime-configured scale-from-zero provider set (e.g., call the existing helper/flag that exposes configured providers such as scaleFromZeroProviders / supportedScaleFromZeroPlatform or the operator config tied to --scale-from-zero-provider) instead of checking hyperv1.AzurePlatform; change the if that sets effectiveMin to 0 to require the runtime provider to include "azure" (or the helper to return true) before allowing effectiveMin==0 so pools only get zero-min when the operator actually supports Azure scale-from-zero.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/instancetype/azure/provider.go`:
- Around line 130-136: The code currently ignores parse errors and accepts
negative GPU counts; instead validate and fail fast: after calling
getCapabilityValue and obtaining gpuStr, attempt strconv.ParseInt(gpuStr, 10,
32) and if err != nil or the parsed value is negative, return an error (or log
and propagate) with context including gpuStr and the SKU identifier rather than
silently setting info.GPU; only assign info.GPU = int32(gpu) when parsing
succeeds and gpu >= 0. Use the existing gpuStr, getCapabilityValue,
strconv.ParseInt and info.GPU symbols to locate and implement the checks and
error propagation.
---
Duplicate comments:
In `@hypershift-operator/controllers/nodepool/capi.go`:
- Around line 774-775: The Azure min=0 exemption currently checks
nodePool.Spec.Platform.Type directly; update the condition in both
setMachineDeploymentReplicas and setMachineSetReplicas so it verifies the
runtime-configured scale-from-zero provider set (e.g., call the existing
helper/flag that exposes configured providers such as scaleFromZeroProviders /
supportedScaleFromZeroPlatform or the operator config tied to
--scale-from-zero-provider) instead of checking hyperv1.AzurePlatform; change
the if that sets effectiveMin to 0 to require the runtime provider to include
"azure" (or the helper to return true) before allowing effectiveMin==0 so pools
only get zero-min when the operator actually supports Azure scale-from-zero.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4a1461b0-6355-45e0-b5b5-14af5c2bc3e3
⛔ Files ignored due to path filters (10)
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yamlis excluded by!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yamlis excluded by!**/zz_generated.crd-manifests/**,!cmd/install/assets/**/*.yamldocs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mddocs/content/reference/api.mdis excluded by!docs/content/reference/api.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (12)
api/hypershift/v1beta1/nodepool_types.gocmd/install/install.gogo.modhypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/instancetype/azure/provider.gohypershift-operator/controllers/nodepool/instancetype/azure/provider_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/scale_from_zero.gohypershift-operator/controllers/nodepool/scale_from_zero_test.gohypershift-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (9)
- api/hypershift/v1beta1/nodepool_types.go
- hypershift-operator/controllers/nodepool/conditions.go
- cmd/install/install.go
- hypershift-operator/controllers/nodepool/capi_test.go
- hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
- hypershift-operator/controllers/nodepool/scale_from_zero_test.go
- hypershift-operator/main.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/nodepool/scale_from_zero.go
|
Now I have all the evidence needed. Let me compile the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryAll three Konflux pipeline runs failed identically at the Root CauseTransient Konflux infrastructure networking issue causing git clone timeouts from the Konflux build cluster ( The failure chain:
Evidence this is NOT a code issue:
Recommendations
Evidence
|
Extend the existing scale-from-zero autoscaling framework to support Azure by implementing an Azure instance type provider that queries the Azure Resource SKUs API for VM size specifications and writing capacity annotations on MachineDeployments. Changes: - Add Azure instancetype.Provider using armcompute.ResourceSKUsClient - Add AzureMachineTemplate case to scale_from_zero.go type switch - Extend supportedScaleFromZeroPlatform() for Azure - Extend reconcileScaleFromZeroAnnotations() for Azure - Update autoscalerEnabledCondition() to accept Azure with min=0 - Update effectiveMin guard in capi.go to allow min=0 for Azure - Add "azure" to supportedProviders in main.go and install.go - Add Azure provider initialization with credential file parsing - Update CRD CEL validation to allow min=0 for Azure platform - Add unit tests for Azure provider and extended type switches Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update NodePoolAutoScaling.Min field comment and CRD validation rule to reflect Azure support alongside AWS - Regenerate CRD manifests with updated docs and validation - Fix partial SKU cache on Azure pager failure: build into local map and assign to cache only after full walk succeeds - Tighten platform gate: add ScaleFromZeroPlatform field so annotations are only set when nodepool platform matches the configured provider - Validate all required Azure credential fields (subscriptionId, clientId, clientSecret, tenantId, location) upfront with a clear error listing missing fields Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update CRD test suite to match the updated validation rule that allows autoScaling.min=0 on Azure platform: - Change Azure min=0 test from expecting failure to expecting success - Update Agent and KubeVirt error messages to include Azure Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run make generate update to sync generated API docs, aggregated docs, go.mod, and vendored type files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Lowercase error string for Azure scale-from-zero credentials - Fix gci import ordering in main.go, provider_test.go, scale_from_zero_test.go, and nodepool_test.go Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
242417f to
826f26a
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
hypershift-operator/controllers/nodepool/capi.go (2)
1083-1085:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSame unguarded Azure platform-type check in
setMachineSetReplicas.Same issue as lines 774–776:
effectiveMin=0is allowed for Azure based on platform type alone, with no check for runtime-configured Azure scale-from-zero support.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/capi.go` around lines 1083 - 1085, In setMachineSetReplicas the AzurePlatform check is unguarded so effectiveMin may be set to 1 based solely on platform type; update the condition to also verify the runtime-configured Azure scale-from-zero feature flag (the same runtime check used earlier around lines with the guarded Azure check) before allowing effectiveMin to remain 0. Specifically, change the if that references nodePool.Spec.Platform.Type == hyperv1.AzurePlatform to require the runtime-scale-from-zero check (e.g., call the existing isAzureScaleFromZeroEnabled/clusterConfig.ScaleFromZero.Azure/feature helper used elsewhere) so Azure only gets scale-from-zero behavior when the runtime config enables it, leaving AWS and other logic unchanged.
774-776:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift[Still unresolved from previous review] Gate Azure
min=0on runtime-configured provider, not just platform type.This guard continues to permit
effectiveMin=0for any Azure NodePool based solely onnodePool.Spec.Platform.Type, regardless of whether the operator was started with--scale-from-zero-provider=azure. If the Azure instancetype provider is not wired up at startup, the scale-from-zero annotation path won't populate the capacity metadata the autoscaler needs—leaving the pool permanently stuck at0replicas with no recovery path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/capi.go` around lines 774 - 776, The current guard sets effectiveMin=1 for non-AWS/Azure based only on nodePool.Spec.Platform.Type, which still allows Azure pools to stay at 0 when the operator wasn't started with the Azure scale-from-zero provider; update the condition in the block that assigns effectiveMin (around the effectiveMin variable usage) to also check the operator's runtime configuration for enabled scale-from-zero providers (e.g., consult the operator config or the ScaleFromZeroProviders/scaleFromZeroProvider flag mechanism used at startup) and only permit min=0 for Azure when the Azure provider is actually enabled; in practice change the if that references nodePool.Spec.Platform.Type and hyperv1.AzurePlatform to require both platform==Azure and providerEnabled("azure") (or the equivalent runtime-config boolean/collection used by the operator) before allowing effectiveMin to remain 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@hypershift-operator/controllers/nodepool/capi.go`:
- Around line 1083-1085: In setMachineSetReplicas the AzurePlatform check is
unguarded so effectiveMin may be set to 1 based solely on platform type; update
the condition to also verify the runtime-configured Azure scale-from-zero
feature flag (the same runtime check used earlier around lines with the guarded
Azure check) before allowing effectiveMin to remain 0. Specifically, change the
if that references nodePool.Spec.Platform.Type == hyperv1.AzurePlatform to
require the runtime-scale-from-zero check (e.g., call the existing
isAzureScaleFromZeroEnabled/clusterConfig.ScaleFromZero.Azure/feature helper
used elsewhere) so Azure only gets scale-from-zero behavior when the runtime
config enables it, leaving AWS and other logic unchanged.
- Around line 774-776: The current guard sets effectiveMin=1 for non-AWS/Azure
based only on nodePool.Spec.Platform.Type, which still allows Azure pools to
stay at 0 when the operator wasn't started with the Azure scale-from-zero
provider; update the condition in the block that assigns effectiveMin (around
the effectiveMin variable usage) to also check the operator's runtime
configuration for enabled scale-from-zero providers (e.g., consult the operator
config or the ScaleFromZeroProviders/scaleFromZeroProvider flag mechanism used
at startup) and only permit min=0 for Azure when the Azure provider is actually
enabled; in practice change the if that references nodePool.Spec.Platform.Type
and hyperv1.AzurePlatform to require both platform==Azure and
providerEnabled("azure") (or the equivalent runtime-config boolean/collection
used by the operator) before allowing effectiveMin to remain 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d92e2f97-9c70-40b5-9329-3c634ae4f483
⛔ Files ignored due to path filters (1)
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (22)
api/hypershift/v1beta1/nodepool_types.goapi/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yamlapi/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yamlapi/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yamlcmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yamlcmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yamlcmd/install/install.godocs/content/reference/aggregated-docs.mddocs/content/reference/api.mdgo.modhypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/instancetype/azure/provider.gohypershift-operator/controllers/nodepool/instancetype/azure/provider_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/scale_from_zero.gohypershift-operator/controllers/nodepool/scale_from_zero_test.gohypershift-operator/main.gotest/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (3)
- test/e2e/nodepool_test.go
- docs/content/reference/aggregated-docs.md
- cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml
🚧 Files skipped from review as they are similar to previous changes (9)
- hypershift-operator/controllers/nodepool/conditions.go
- cmd/install/install.go
- hypershift-operator/controllers/nodepool/capi_test.go
- go.mod
- hypershift-operator/controllers/nodepool/scale_from_zero.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
- hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
- hypershift-operator/main.go
|
/test all |
|
@jhjaggars: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- Gate effectiveMin=0 on runtime-configured scaleFromZeroPlatform instead of static platform type check, preventing stalled pools when the scale-from-zero provider isn't wired up - Resolve AZURE_CLOUD_NAME for credential and SKU client construction in scale-from-zero init, matching sovereign cloud support used elsewhere - Return errors on invalid/negative GPU values in transformSKU instead of silently skipping, with VM size in error messages for debuggability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the existing scale-from-zero autoscaling framework to support Azure by implementing an Azure instance type provider that queries the Azure Resource SKUs API for VM size specifications and writing capacity annotations on MachineDeployments.
Changes:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests