CNTRLPLANE-2207: Upgrade to CAPI 1.11 #7590

Open
clebs wants to merge 10 commits into openshift:main from clebs:capi-1.11-bump

Conversation

@clebs
Member

@clebs clebs commented Jan 27, 2026

What this PR does / why we need it:

Bumps hypershift to use CAPI v1.11 including the following tasks:

  • Update CAPI and all providers to a v1.11 compatible version in go.mod.
  • Removes @csrwng's fork containing a temporary fix.
  • Update controller-gen goal in Makefile.
  • Update install assets: CAPI CRDs.
  • Patch CAPI CRDs to use v1beta1 as storage version.
  • Adds conversion webhooks for v1beta1 <-> v1beta2.
  • Removes the temporary CAPI image overrides (OCPBUGS-74247: CAPI image overrides aware of registry config #7575).
  • Check in updated vendored dependencies.
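
The storage-version patch listed above can be pictured with a minimal Go sketch. It is not the PR's implementation: `crdVersion` and `setStorageVersion` are hypothetical names that only mirror the name/served/storage shape of a CRD version entry, and the sketch assumes the upstream CRDs mark v1beta2 as storage.

```go
package main

import "fmt"

// crdVersion is a hypothetical, trimmed-down stand-in for one entry of a
// CustomResourceDefinition's version list.
type crdVersion struct {
	Name    string
	Served  bool
	Storage bool
}

// setStorageVersion returns a copy of versions in which exactly one entry,
// the one named want, is flagged as the storage version. It fails if want is
// absent so a CRD is never left without a storage version.
func setStorageVersion(versions []crdVersion, want string) ([]crdVersion, error) {
	found := false
	for _, v := range versions {
		if v.Name == want {
			found = true
			break
		}
	}
	if !found {
		return nil, fmt.Errorf("version %q not found", want)
	}
	out := make([]crdVersion, len(versions))
	for i, v := range versions {
		v.Storage = v.Name == want
		out[i] = v
	}
	return out, nil
}

func main() {
	// Assumed upstream state: both versions served, v1beta2 stored.
	upstream := []crdVersion{
		{Name: "v1beta1", Served: true, Storage: false},
		{Name: "v1beta2", Served: true, Storage: true},
	}
	patched, err := setStorageVersion(upstream, "v1beta1")
	if err != nil {
		panic(err)
	}
	for _, v := range patched {
		fmt.Printf("%s served=%t storage=%t\n", v.Name, v.Served, v.Storage)
	}
}
```

Both versions stay served in this sketch; only the storage flag moves, which is why conversion between the two versions is still required.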

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-2207

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

Release Notes

  • Chores
    • Updated Go module dependencies to newer versions for improved stability and security.
    • Adjusted build configuration for CRD generation to optimize the build process.
    • Updated linter configuration to account for a deprecated package.

@coderabbitai
Contributor

coderabbitai Bot commented Jan 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The pull request makes three configuration and dependency-related updates. It adds a new staticcheck exclusion rule in .golangci.yml for a deprecated Kubernetes Cluster API package path. The Makefile is updated to narrow the controller-gen CRD generation input paths from broader vendor scans to more specific paths under api/core and api/ipam. The api/go.mod is updated with version bumps for multiple indirect dependencies, reorganization of go-openapi submodule requirements, and removal of some unused dependencies.

🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'CAPI 1.11' clearly identifies the primary objective of upgrading Cluster API to version 1.11, which is directly reflected in all three modified files (linter config, Makefile, and go.mod). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test files were modified in this PR, which only changes .golangci.yml, Makefile, and api/go.mod configuration files. |
| Test Structure And Quality | ✅ Passed | PR contains only configuration and dependency changes (.golangci.yml, Makefile, api/go.mod) with no modifications to Ginkgo test code. |
| Microshift Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests; it only modifies configuration and dependency files (.golangci.yml, Makefile, api/go.mod). |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This pull request does not add any new Ginkgo e2e tests. The PR only modifies three configuration and dependency files: .golangci.yml, Makefile, and api/go.mod. Since no new test definitions using Ginkgo patterns are introduced, the SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR exclusively modifies build configuration and dependencies to upgrade Cluster API v1.11 without introducing scheduling constraints or topology-dependent logic. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only configuration and dependency files with no changes to Go source code or test files, so OTE Binary Stdout Contract cannot be violated. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies only configuration and dependency files (.golangci.yml, Makefile, api/go.mod) with no new Ginkgo e2e tests added. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci openshift-ci Bot added do-not-merge/needs-area do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 27, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Jan 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jan 27, 2026
@clebs clebs changed the title WIP: upgrade to CAPI 1.11 CNTRLPLANE-2207: upgrade to CAPI 1.11 Jan 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2026
@openshift-ci-robot

openshift-ci-robot commented Jan 27, 2026

@clebs: This pull request references CNTRLPLANE-2207 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Bumps hypershift to use CAPI v1.11 including the following tasks:

  • Update CAPI and all providers to a v1.11 compatible version in go.mod.
  • Removes @csrwng's fork containing a temporary fix.
  • Update controller-gen goal in Makefile.
  • Update install assets: CAPI CRDs.
  • Adds conversion webhooks for v1beta1 <-> v1beta2.
  • Removes the temporary CAPI image overrides (OCPBUGS-74247: CAPI image overrides aware of registry config #7575).
  • Check in updated vendored dependencies.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-2207

Special notes for your reviewer:

⚠️ This is a WIP opened for collaboration on a large task. Do not approve, lgtm or merge yet!

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clebs clebs changed the title CNTRLPLANE-2207: upgrade to CAPI 1.11 CNTRLPLANE-2207: Upgrade to CAPI 1.11 Jan 27, 2026
@openshift-ci-robot

openshift-ci-robot commented Jan 27, 2026

@clebs: This pull request references CNTRLPLANE-2207 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2026
@openshift-ci openshift-ci Bot added the area/platform/gcp PR/issue for GCP (GCPPlatform) platform label Jan 29, 2026
@clebs
Member Author

clebs commented Jan 29, 2026

/test e2e-aws-minimal verify

@clebs
Member Author

clebs commented Apr 20, 2026

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2046311527698403328 | Cost: $4.17 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@clebs
Member Author

clebs commented Apr 21, 2026

/test e2e-aws e2e-aks

@clebs
Member Author

clebs commented Apr 22, 2026

/test e2e-aks

@clebs
Member Author

clebs commented Apr 28, 2026

/test e2e-aws e2e-aks

@clebs
Member Author

clebs commented Apr 28, 2026

/test e2e-aws-v2 e2e-aks-4-22 e2e-aws-4-22 e2e-kubevirt-aws-ovn-reduced e2e-aws-upgrade-hypershift-operator

@clebs
Member Author

clebs commented Apr 29, 2026

/test all

@openshift-ci-robot

openshift-ci-robot commented May 4, 2026

@clebs: This pull request references CNTRLPLANE-2207 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

@everettraven
Contributor

Approving the API changes as they are gomod/vendor only updates.

/approve

@jparrill
Contributor

jparrill commented May 5, 2026

/retest-required

@jparrill
Contributor

jparrill commented May 5, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clebs, everettraven, jparrill

@clebs
Member Author

clebs commented May 5, 2026

/test e2e-gke e2e-v2-gke

@LiangquanLi930
Contributor

As we discussed, I'll continue working on Upgrade HO clients to use v1beta2 once this PR is merged.

return err
}

imageOverride, err = backwardcompat.GetBackwardCompatibleCAPIImage(cpContext, pullSecret, r.RegistryProvider.GetReleaseProvider(), releaseVersion, ImageStreamCAPI)
Member

can this func declaration be removed: GetBackwardCompatibleCAPIImage?

Member Author

Yes, this can be removed.

Name: sa.Name,
Namespace: sa.Namespace,
},
{
Member

Can you point me to where the provider needs access to this?

Should we instead create a dedicated, targeted ClusterRole with only what's needed here, in hypershift/control-plane-operator/controllers/hostedcontrolplane/v2/assets/capi-provider?

Member Author

It's been a while, but IIRC I added this so the provider would have access to CRDs to be able to do conversions.

I will look into creating a separate role + binding + SA for this.

Member Author

@clebs clebs May 7, 2026

I took a second look at this, and the Role we are binding capi-provider to (hypershift-cluster-api) is already minimal. It only grants get, watch and list on CRDs and is only used by the capi-manager SA and the capi-provider SA.

I am not sure we would benefit from having a dedicated role here.

clebs and others added 10 commits May 6, 2026 14:03
- Upgrade all CAPI modules to 1.11.
- Update changed import paths
- Silence deprecation linter errors
- Update make cluster-api goal.
CAPI 1.11 defaults to v1beta2 storage. Override to v1beta1 for HyperShift compatibility.

Signed-off-by: Borja Clemente <bclement@redhat.com>
Signed-off-by: Borja Clemente <bclement@redhat.com>
Signed-off-by: Borja Clemente <bclement@redhat.com>
Remove the temporary hardcoded CAPI image overrides now that hypershift
supports CAPI 1.11

Signed-off-by: Borja Clemente <bclement@redhat.com>
For conversion to work, the CAPI provider needs to be able to access
CRDs cluster-wide to list available versions.

Signed-off-by: Borja Clemente <bclement@redhat.com>
Update TestScaleFromZero to support both CAPI 1.11+ native Status.Capacity
and pre-1.11 annotation-based capacity information.

In CAPI 1.11, cluster-api-provider-aws now populates Status.Capacity
directly on AWSMachineTemplate, making the workaround annotations
unnecessary. The HyperShift controller detects this and skips setting
annotations when Status.Capacity is present.

The test now:
- First checks AWSMachineTemplate.Status.Capacity (CAPI 1.11+)
- Falls back to MachineDeployment annotations (pre-CAPI 1.11)
- Logs the capacity source for debugging

This makes the test backward compatible and fixes the failure in PR openshift#7590.
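
The lookup order described in this commit message (status first, annotations as fallback, with the source logged) could look roughly like the following Go sketch. The types, `resolveCapacity`, and the annotation prefix are assumptions made for illustration; they are not taken from the test itself.

```go
package main

import (
	"fmt"
	"strings"
)

// capacityAnnotationPrefix is assumed here to be the scale-from-zero
// annotation prefix; check it against the real test before relying on it.
const capacityAnnotationPrefix = "capacity.cluster-autoscaler.kubernetes.io/"

// awsMachineTemplate and machineDeployment are hypothetical, trimmed-down
// stand-ins for the real API objects.
type awsMachineTemplate struct {
	StatusCapacity map[string]string
}

type machineDeployment struct {
	Annotations map[string]string
}

// resolveCapacity prefers the template's status capacity and falls back to
// capacity annotations. It also reports which source was used.
func resolveCapacity(tpl awsMachineTemplate, md machineDeployment) (map[string]string, string, error) {
	if len(tpl.StatusCapacity) > 0 {
		return tpl.StatusCapacity, "status", nil
	}
	fromAnnotations := map[string]string{}
	for key, value := range md.Annotations {
		if strings.HasPrefix(key, capacityAnnotationPrefix) {
			fromAnnotations[strings.TrimPrefix(key, capacityAnnotationPrefix)] = value
		}
	}
	if len(fromAnnotations) > 0 {
		return fromAnnotations, "annotations", nil
	}
	return nil, "", fmt.Errorf("no capacity found in status or annotations")
}

func main() {
	md := machineDeployment{Annotations: map[string]string{
		capacityAnnotationPrefix + "cpu":    "4",
		capacityAnnotationPrefix + "memory": "16G",
	}}
	capacity, source, err := resolveCapacity(awsMachineTemplate{}, md)
	if err != nil {
		panic(err)
	}
	fmt.Printf("capacity source=%s cpu=%s memory=%s\n", source, capacity["cpu"], capacity["memory"])
}
```

Returning the source alongside the values keeps the "logs the capacity source" step trivial for the caller.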
Setting the MinReadySeconds default to 0 explicitly on the nodepool
controller causes infinite reconciliation due to a lossy v1beta1 ->
v1beta2 conversion that flips the value between 0 and nil.

Removing the explicit setting should not have any other side effect
since the zero value of the field is the same.

Signed-off-by: Borja Clemente <bclement@redhat.com>
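
A simplified model of the 0-versus-nil flip described in this commit message is sketched below. It assumes, purely for illustration, that the round trip through the other API version drops an explicit zero and returns nil; `roundTrip` and `writesUntilStable` are hypothetical and are not the CAPI conversion code.

```go
package main

import "fmt"

func int32Ptr(v int32) *int32 { return &v }

// roundTrip models a lossy conversion in which an explicit zero cannot be
// told apart from "unset" and therefore comes back as nil (an assumption).
func roundTrip(in *int32) *int32 {
	if in == nil || *in == 0 {
		return nil
	}
	return int32Ptr(*in)
}

// equal compares optional values and treats nil and 0 as different, which is
// how a pointer-aware diff sees them.
func equal(a, b *int32) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// writesUntilStable counts how many updates a controller issues before the
// stored value matches the desired one, giving up after maxLoops.
func writesUntilStable(desired *int32, maxLoops int) int {
	var stored *int32 // what the API server hands back after conversion
	writes := 0
	for i := 0; i < maxLoops; i++ {
		if equal(desired, stored) {
			break
		}
		stored = roundTrip(desired)
		writes++
	}
	return writes
}

func main() {
	fmt.Println("explicit 0:", writesUntilStable(int32Ptr(0), 5))
	fmt.Println("unset:     ", writesUntilStable(nil, 5))
	fmt.Println("explicit 5:", writesUntilStable(int32Ptr(5), 5))
}
```

Under this model an explicit 0 never converges and uses up every loop iteration, while leaving the field unset needs no write at all, which matches the reasoning for dropping the explicit default.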
…mplete check

Replace the 1-second sleep workaround for OCPBUGS-77922 with a deterministic
cross-check of the v1beta2 conversion-data annotation. In CAPI v1.11+, the
v1beta1 UpdatedReplicas field maps from deprecated.v1beta1.updatedReplicas
rather than the native upToDateReplicas, which can transiently disagree.
When v1beta1 fields indicate completion, we now verify against the authoritative
v1beta2 status in the conversion-data annotation before declaring complete.

Jira: https://issues.redhat.com/browse/OCPBUGS-77922

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
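
The cross-check described in this commit message might be sketched as follows. The annotation key is taken to be `cluster.x-k8s.io/conversion-data`, and the JSON shape (a `status.upToDateReplicas` field) is an assumption about the v1beta2 payload. `rolloutComplete` is a hypothetical helper, and trusting v1beta1 when the annotation is absent is a design choice of this sketch only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// conversionDataAnnotation is assumed to be the key under which the hub
// (v1beta2) object is preserved on a v1beta1 object.
const conversionDataAnnotation = "cluster.x-k8s.io/conversion-data"

// hubStatus decodes only the fields this sketch needs from the annotation.
type hubStatus struct {
	Status struct {
		UpToDateReplicas *int32 `json:"upToDateReplicas"`
	} `json:"status"`
}

// rolloutComplete reports completion only when the v1beta1 view says the
// rollout is done and the v1beta2 data in the annotation agrees.
func rolloutComplete(annotations map[string]string, desired, v1beta1Updated int32) (bool, error) {
	if v1beta1Updated != desired {
		return false, nil
	}
	raw, ok := annotations[conversionDataAnnotation]
	if !ok {
		// No hub data to compare against: fall back to the v1beta1 view.
		return true, nil
	}
	var hub hubStatus
	if err := json.Unmarshal([]byte(raw), &hub); err != nil {
		return false, fmt.Errorf("decoding conversion data: %w", err)
	}
	if hub.Status.UpToDateReplicas == nil {
		return false, nil
	}
	return *hub.Status.UpToDateReplicas == desired, nil
}

func main() {
	annotations := map[string]string{
		conversionDataAnnotation: `{"status":{"upToDateReplicas":2}}`,
	}
	done, err := rolloutComplete(annotations, 3, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("complete:", done)
}
```

In the example the v1beta1 count already equals the desired 3, but the annotation still reports 2, so the helper does not declare the rollout complete.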
The word uptodate and all its casing variants are a false positive on
codespell. They are defined as such in CAPI.

Signed-off-by: Borja Clemente <bclement@redhat.com>
// Initialize CAPI v1beta1 conversion support.
// CAPI v1beta1 types need an apiVersionGetter to convert object references
// from v1beta2 (Hub) ContractVersionedObjectReference back to v1beta1 corev1.ObjectReference.
// The getter resolves GroupKind to the preferred API version string using the scheme.
Member

Which object references to core objects might we have in CAPI resources?

Member Author

As discussed, this block will be removed because it actually always falls back to the default and therefore has no effect.

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@clebs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks-4-21 | 80d3f1f | link | true | /test e2e-aks-4-21 |
| ci/prow/e2e-aws-4-21 | 80d3f1f | link | true | /test e2e-aws-4-21 |
| ci/prow/e2e-azure-self-managed | e7c06ee | link | true | /test e2e-azure-self-managed |
| ci/prow/unit | e7c06ee | link | true | /test unit |
| ci/prow/e2e-aks-4-22 | 562b85a | link | true | /test e2e-aks-4-22 |
| ci/prow/e2e-v2-gke | 06b2474 | link | false | /test e2e-v2-gke |
| ci/prow/e2e-gke | 06b2474 | link | false | /test e2e-gke |

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 7, 2026

Test Failure Analysis Complete

Job Information

  • Prow Job 1: pull-ci-openshift-hypershift-main-e2e-v2-gke
  • Build ID 1: 2051996625710092288
  • Prow Job 2: pull-ci-openshift-hypershift-main-e2e-gke
  • Build ID 2: 2051996625659760640
  • PR: CNTRLPLANE-2207: Upgrade to CAPI 1.11 #7590
  • Failed Steps: create-hostedcluster (e2e-v2-gke, pre phase), hypershift-gcp-run-e2e / TestCreateCluster (e2e-gke, test phase)

Test Failure Analysis

Error

e2e-v2-gke: Hosted cluster version history never reached "Completed" state (25m timeout, exit code 124).
  Degraded=True: UnavailableReplicas(cluster-api deployment has 1 unavailable replicas)

e2e-gke: Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded.
  expected 2 nodes, got 0.
  Degraded=True: UnavailableReplicas(cluster-api deployment has 1 unavailable replicas)
  ClusterVersionSucceeding=False: ClusterOperatorsNotAvailable(Cluster operators dns, insights,
    kube-storage-version-migrator, monitoring, network, node-tuning, openshift-samples, service-ca
    are not available)

CAPI conversion webhook failure (from e2e-v2-gke dump):
  storage is (re)initializing: failed to list cluster.x-k8s.io/v1beta2, Kind=Cluster:
    conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed:
    Post "https://operator.hypershift.svc:443/convert?timeout=30s":
    dial tcp 10.4.1.25:9443: connect: connection refused

Summary

Both GKE e2e jobs fail with identical root cause: the CAPI v1.11 upgrade (PR #7590) introduces cluster.x-k8s.io/v1beta2 API resources, but the CAPI CRD conversion webhook served by the HyperShift operator at operator.hypershift.svc:9443 is refusing connections. This causes the cluster-api deployment in the hosted control plane to remain unavailable (1 unavailable replica), which blocks all Machine provisioning — zero worker nodes ever join the cluster. Without worker nodes, most cluster operators (dns, network, monitoring, etc.) cannot run, and the cluster version never completes installation.

Root Cause

The PR upgrades Cluster API to v1.11, which introduces the v1beta2 API version alongside v1beta1. This adds a CRD version conversion path: when any component requests cluster.x-k8s.io/v1beta1 resources (the old API version), Kubernetes must call a conversion webhook to translate to/from v1beta2 (the new stored version). That conversion webhook is served by the HyperShift operator pod behind operator.hypershift.svc (service port 443, target port 9443).

The conversion webhook endpoint is unreachable (dial tcp 10.4.1.25:9443: connect: connection refused). This means either:

  1. The HyperShift operator pod is crashing or not starting after the CAPI 1.11 upgrade — possibly due to an incompatible controller-runtime or CAPI manager initialization error introduced by the bump.
  2. The webhook server within the operator isn't registering the v1beta1↔v1beta2 conversion handler — the operator may not yet implement the conversion webhook that the new CRD definitions require.
  3. The CRD webhook configuration points to a service/port that the operator doesn't serve — the CAPI 1.11 CRDs may ship with a spec.conversion.webhook configuration that expects a specific service endpoint not yet configured in HyperShift's operator deployment.

Cascade of failures:

  • Conversion webhook down → cannot list/watch/create any cluster.x-k8s.io resources (Cluster, Machine, MachineDeployment, MachineSet)
  • No CAPI resources → no Machines created → 0 worker nodes
  • 0 worker nodes → cluster operators that need DaemonSets on workers (dns, network, monitoring, etc.) never become available
  • Cluster operators unavailable → ClusterVersion never reaches "Completed"
  • e2e-v2-gke: 25-minute timeout on version completion → exit code 124
  • e2e-gke: 45-minute timeout waiting for 2 nodes → TestCreateCluster/ValidateHostedCluster fails

Both jobs exhibit the exact same failure mode, confirming this is a deterministic regression introduced by the CAPI 1.11 bump, not a flake.

Recommendations
  1. Verify the HyperShift operator pod health after the CAPI bump: Check if the operator pod is crash-looping after the upgrade. Look at the operator's container logs for CAPI manager initialization errors, missing conversion webhook registration, or controller-runtime version incompatibilities.

  2. Ensure the conversion webhook is implemented: CAPI 1.11's CRDs with v1beta2 as the stored version require a conversion webhook from v1beta1. Verify the HyperShift operator registers a conversion webhook handler for all CAPI CRDs (Cluster, Machine, MachineDeployment, MachineSet, MachinePool). If HyperShift vendors CAPI types, the conversion functions must be implemented for the hub-spoke pattern.

  3. Check CRD webhook configuration alignment: Ensure the spec.conversion.webhook configuration in the CAPI CRDs points to the correct service (operator.hypershift.svc), correct port (443 → targetPort 9443), and correct path. Mismatched webhook configurations will cause the exact "connection refused" error observed.

  4. Validate locally before re-pushing: Deploy the HyperShift operator from the PR branch to a local/dev cluster and verify that kubectl get clusters.cluster.x-k8s.io succeeds without conversion webhook errors. Also verify the operator pod remains healthy and doesn't crash-loop.

  5. Consider a multi-version CRD strategy: If the conversion webhook isn't ready, consider whether the CRDs can temporarily keep v1beta1 as the stored version while the conversion webhook is developed, to unblock CI.
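
As a companion to the local-validation recommendation, a small Go probe can tell "connection refused" apart from a timeout when checking the webhook endpoint. This is only a sketch: `probe` and `defaultTimeout` are hypothetical names, the default address is a placeholder for a port-forwarded operator pod, and the refused-connection check relies on `syscall.ECONNREFUSED`, which is assumed to match on Linux and macOS.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
	"time"
)

// defaultTimeout bounds how long a single probe may take.
const defaultTimeout = 2 * time.Second

// probe attempts a TCP connection and classifies the outcome as
// "reachable", "refused", "timeout", or a generic error string.
func probe(addr string, timeout time.Duration) string {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err == nil {
		conn.Close()
		return "reachable"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "refused"
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "timeout"
	}
	return "error: " + err.Error()
}

func main() {
	// Placeholder address, e.g. a locally port-forwarded webhook port.
	addr := "127.0.0.1:9443"
	if len(os.Args) > 1 {
		addr = os.Args[1]
	}
	fmt.Println(addr, probe(addr, defaultTimeout))
}
```

A "refused" result means nothing is listening on the target port, which points at the serving pod rather than at the CRD's webhook configuration; a "timeout" more often suggests a network or service routing problem.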

Evidence
Evidence Detail
e2e-v2-gke failed step create-hostedcluster (pre phase) — timed out after 30m55s waiting for cluster version to reach "Completed"
e2e-gke failed test TestCreateCluster/ValidateHostedCluster — 0 of 2 expected nodes joined in 45 minutes
Degraded condition Degraded=True: UnavailableReplicas(cluster-api deployment has 1 unavailable replicas) — identical in both jobs
Conversion webhook error storage is (re)initializing: failed to list cluster.x-k8s.io/v1beta2, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://operator.hypershift.svc:443/convert?timeout=30s": dial tcp 10.4.1.25:9443: connect: connection refused
Also affected Kind=MachineDeployment — same conversion webhook failure for MachineDeployment v1beta1→v1beta2
ClusterVersion state controlPlaneVersion state is Partial, expected Completed
Unavailable operators dns, insights, kube-storage-version-migrator, monitoring, network, node-tuning, openshift-samples, service-ca
e2e-v2-gke timeout exit Exit code 124 (bash timeout) — timeout 25m bash -c 'until status==Completed'
e2e-gke test duration TestCreateCluster ran for 7196s (~2h), ValidateHostedCluster for 2894s (~48m)
Cluster Available HostedCluster reached Available condition in both jobs (API server came up), confirming control plane partially initialized
Deterministic Both independent GKE e2e jobs failed identically, confirming regression from PR #7590

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

9 participants