feat(etcd): implement multi-shard etcd support #7375

Draft
jhjaggars wants to merge 11 commits into openshift:main from jhjaggars:etc-sharded

Conversation


@jhjaggars (Contributor) commented Dec 11, 2025

Summary

This PR implements multi-shard etcd support for HyperShift, enabling hosted clusters to distribute etcd data across multiple independent etcd clusters for improved scalability and performance.

Key Features

  • Configurable etcd sharding: Support for defining multiple etcd shards with independent resource prefixes
  • Named shards: Pre-defined shard configurations (main, events, leases)
  • CLI support: New flags for specifying etcd shard configuration during cluster creation
  • Flexible resource distribution: Ability to map Kubernetes resource types to specific etcd shards
  • Independent scaling: Each shard can be configured with different replica counts and storage settings

Changes

  • Implemented etcd shard API types and effective shard resolution
  • Added CLI flags for etcd sharding configuration (--etcd-shard-*)
  • Integrated etcd sharding with control-plane-operator reconciliation
  • Updated Kube API Server to connect to multiple etcd endpoints
  • Fixed OAuth APIServer NO_PROXY configuration for sharded etcd
  • Added comprehensive CLI documentation
  • Fixed regression: restored ServiceAccountMaxTokenExpiration field

Related Issues

This work is part of the etcd sharding initiative to improve hosted cluster scalability.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Full etcd sharding (managed + unmanaged): per-shard names, prefixes, priority, replicas, storage, backup schedules; CLI flags for file/inline shard config; shard-aware services, StatefulSets, ServiceMonitors, PDBs, init/restore flow, aggregated health reporting, and orphaned-shard cleanup.
  • Documentation
    • New etcd sharding guide with examples, CLI usage, validation and operational guidance.
  • Tests
    • Unit tests for shard parsing, reconciliation, status aggregation, cleanup, and API integrations.

@openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 11, 2025

openshift-ci Bot commented Dec 11, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


coderabbitai Bot commented Dec 11, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds etcd sharding: new API types/CRD schema, helper methods to derive effective shards, CLI flags/parsing, shard-aware manifests and reconcilers (StatefulSets, Services, ServiceMonitors, PDBs), orphaned-shard cleanup, PKI updates, kube-apiserver/other component adaptations, init/restore script, tests, and user documentation/examples.

Cohort / File(s) Summary
API types & helpers / CRDs / docs
api/hypershift/v1beta1/hostedcluster_types.go, api/hypershift/v1beta1/hostedcluster_helpers.go, api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/..., docs/content/reference/*, docs/content/how-to/etcd-sharding*
Introduce Managed/Unmanaged Etcd shard specs, EtcdShardPriority enum, EffectiveShards helpers, extensive CRD schema additions/validations for shards, and API/docs updates/examples.
CLI create
cmd/cluster/core/create.go
Add --etcd-sharding-config and repeatable --etcd-shard flags, parsing, defaults application, validation, and wiring shards into rendered/prototype manifests.
Control-plane operator — etcd reconcilers & helpers
control-plane-operator/controllers/hostedcontrolplane/etcd/*.go, etcd-init.sh, statefulset.go, services.go, pdb.go, monitoring.go, params.go, cleanup.go, shards.go, status.go, shards_test.go, statefulset_test.go
Add shard reconciliation surface: per-shard manifests, ReconcileStatefulSet and per-resource reconcile functions, shard params builder, orphan cleanup, ServiceMonitor/PDB handling, aggregate status calculation, embedded init/restore script, and unit tests.
Manifests & manifest factories
control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go
Add shard-aware resource naming and new manifest factory functions that produce shard-scoped StatefulSets, Services, ServiceMonitors, and PDBs.
PKI changes
control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
Extend server/peer secret reconciliation to accept shards and include shard-specific SANs/wildcards (preserving default shard compatibility).
v2 component/adapters and control-plane integrations
control-plane-operator/controllers/hostedcontrolplane/v2/..., hostedcontrolplane_controller.go
Adapt v2 component adapters and controller flows to be shard-aware: service/statefulset/servicemonitor adapters, kas/oapi/oauth_apiserver adaptations for shard overrides and NO_PROXY, and switch managed etcd reconciliation to the new non-v2 shard flow.
KAS / OAPI / OAuth APIServer params & tests
.../v2/kas/*, .../v2/oapi/*, .../v2/oauth_apiserver/*, tests
Build etcd shard override maps for API server args, update init/wait logic to nslookup all shard client services, compute NO_PROXY entries, and add unit tests for configs.
Docs examples / CLI usage
docs/content/how-to/etcd-sharding/*, docs/content/reference/*, docs/content/how-to/etcd-sharding/example-3-shard.yaml, example-inline-usage.sh
Add comprehensive how-to, examples, and reference docs covering sharding concepts, configuration formats, CLI usage, and operational guidance.
sequenceDiagram
    participant HCP as HostedControlPlane Controller
    participant REC as Etcd Shard Reconciler
    participant MAN as Manifest Provider
    participant K8S as Kubernetes API

    HCP->>REC: reconcileEtcdShards(hcp, params)
    REC->>REC: compute EffectiveShards(hcp)
    loop Per shard
        REC->>MAN: EtcdStatefulSetForShard(ns, shard)
        MAN-->>REC: StatefulSet manifest
        REC->>K8S: CreateOrUpdate(StatefulSet)
        REC->>MAN: EtcdClientServiceForShard(ns, shard)
        MAN-->>REC: Client Service manifest
        REC->>K8S: CreateOrUpdate(Service)
        REC->>MAN: EtcdDiscoveryServiceForShard(ns, shard)
        MAN-->>REC: Discovery Service manifest
        REC->>K8S: CreateOrUpdate(Service)
        REC->>MAN: EtcdServiceMonitorForShard(ns, shard)
        MAN-->>REC: ServiceMonitor manifest
        REC->>K8S: CreateOrUpdate(ServiceMonitor)
        alt HighlyAvailable
            REC->>MAN: EtcdPodDisruptionBudgetForShard(ns, shard)
            MAN-->>REC: PDB manifest
            REC->>K8S: CreateOrUpdate(PodDisruptionBudget)
        end
    end
    REC->>K8S: CleanupOrphanedShards(activeShards)
    K8S-->>REC: orphaned resources removed
    REC-->>HCP: Update aggregated etcd status / ControlPlaneComponent

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes


@openshift-ci Bot added the area/api (changes for the API), area/cli (changes for the CLI), area/control-plane-operator (changes for the control plane operator, in an OCP release), and area/documentation (changes for documentation) labels, and removed the do-not-merge/needs-area label, on Dec 11, 2025
@jtaleric

/test ?


openshift-ci Bot commented Dec 12, 2025

@jtaleric: The following commands are available to trigger required jobs:

/test e2e-aks
/test e2e-aks-4-20
/test e2e-aks-override
/test e2e-aws
/test e2e-aws-4-20
/test e2e-aws-override
/test e2e-aws-upgrade-hypershift-operator
/test e2e-kubevirt-aws-ovn-reduced
/test images
/test okd-scos-images
/test security
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-aws-autonode
/test e2e-aws-metrics
/test e2e-aws-minimal
/test e2e-aws-techpreview
/test e2e-azure-aks-ovn-conformance
/test e2e-conformance
/test e2e-kubevirt-aws-ovn
/test e2e-kubevirt-azure-ovn
/test e2e-kubevirt-metal-conformance
/test e2e-openstack-aws
/test e2e-openstack-aws-conformance
/test e2e-openstack-aws-csi-cinder
/test e2e-openstack-aws-csi-manila
/test e2e-openstack-aws-nfv
/test reqserving-e2e-aws

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-hypershift-main-e2e-aks
pull-ci-openshift-hypershift-main-e2e-aks-4-20
pull-ci-openshift-hypershift-main-e2e-aws
pull-ci-openshift-hypershift-main-e2e-aws-upgrade-hypershift-operator
pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
pull-ci-openshift-hypershift-main-images
pull-ci-openshift-hypershift-main-okd-scos-images
pull-ci-openshift-hypershift-main-security
pull-ci-openshift-hypershift-main-unit
pull-ci-openshift-hypershift-main-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines +3 to +5
import (
"k8s.io/utils/ptr"
)
Bad practice to include dependencies like this in your API types. Impacts users building projects against your APIs. Prefer to have utils like this somewhere separate and not as methods on the types themselves please

Comment on lines +1577 to +1578
// When shards are specified, this serves as the default for all shards
// unless overridden per-shard.

Could also make it so that this field cannot be set, forcing a choice in each shard to be explicit

// When specified, exactly one shard must have "/" in its resourcePrefixes.
// +optional
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=10

Any particular decision process to get to 10 as a maximum?

Should document in the godoc that this list accepts at most 10 items

// +kubebuilder:validation:MaxItems=10
// +listType=map
// +listMapKey=name
// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"

The rule itself checks that at least 1 item has / in the resourcePrefixes, but the message says exactly one, perhaps you want

Suggested change
// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
// +kubebuilder:validation:XValidation:rule="self.exists_one(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
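For readers less familiar with CEL macros, the distinction the suggestion hinges on can be sketched in plain Go. The shard type and function names below are illustrative stand-ins, not HyperShift's actual types:

```go
package main

import (
	"fmt"
	"slices"
)

// shard is an illustrative stand-in for ManagedEtcdShardSpec.
type shard struct {
	name             string
	resourcePrefixes []string
}

// existsCatchAll mirrors CEL's exists(): true if AT LEAST ONE shard
// lists "/" among its resourcePrefixes.
func existsCatchAll(shards []shard) bool {
	for _, s := range shards {
		if slices.Contains(s.resourcePrefixes, "/") {
			return true
		}
	}
	return false
}

// existsOneCatchAll mirrors CEL's exists_one(): true only if EXACTLY
// ONE shard lists "/", which is what the error message promises.
func existsOneCatchAll(shards []shard) bool {
	count := 0
	for _, s := range shards {
		if slices.Contains(s.resourcePrefixes, "/") {
			count++
		}
	}
	return count == 1
}

func main() {
	// Two catch-all shards: exists() passes, exists_one() rejects.
	dup := []shard{
		{name: "main", resourcePrefixes: []string{"/"}},
		{name: "extra", resourcePrefixes: []string{"/"}},
	}
	fmt.Println(existsCatchAll(dup), existsOneCatchAll(dup)) // true false
}
```

With the exists() rule, a spec with two "/" shards would pass admission while violating the documented invariant.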

// +listType=map
// +listMapKey=name
// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
// +kubebuilder:validation:XValidation:rule="self.all(s, s.resourcePrefixes.all(p, p == '/' || p.endsWith('#')))",message="non-default prefixes must end with '#'"

Why is this rule here vs on the resourcePrefixes field itself?

// All CRITICAL shards must have quorum
if criticalTotal > 0 && criticalReady < criticalTotal/2+1 {
return &metav1.Condition{
Type: string(hyperv1.EtcdAvailable),

This gets thrown away in the parent scope where this function is called; why are we creating other conditions based on this one vs this being the condition we set?
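The quorum arithmetic the snippet relies on is plain integer division; a minimal standalone sketch (not the PR's code):

```go
package main

import "fmt"

// quorum returns the minimum number of ready members for an etcd
// cluster of the given size, matching the n/2 + 1 integer arithmetic
// in the quoted snippet (criticalReady < criticalTotal/2+1).
func quorum(total int) int {
	return total/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d\n", n, quorum(n))
	}
}
```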

// resourceNameForShard generates resource names for etcd shards
// For backward compatibility, the default shard uses the base name without suffix
// Named shards use the pattern: baseName-shardName
func resourceNameForShard(baseName, shardName string) string {

This is duplicated in at least one other location in this PR, is there a way to avoid the duplication/possibility of drift
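One way to address the drift concern is to keep the naming rule in a single shared helper that both call sites import. This is a hedged sketch assuming "main" is the default shard name; the PR's actual sentinel may differ:

```go
package main

import "fmt"

// defaultShardName is an assumption for this sketch; the PR's actual
// default shard name may differ.
const defaultShardName = "main"

// resourceNameForShard centralizes the naming rule quoted above so
// duplicated copies cannot drift: the default shard keeps the base
// name (backward compatible), named shards get a -shardName suffix.
func resourceNameForShard(baseName, shardName string) string {
	if shardName == "" || shardName == defaultShardName {
		return baseName
	}
	return fmt.Sprintf("%s-%s", baseName, shardName)
}

func main() {
	fmt.Println(resourceNameForShard("etcd", defaultShardName)) // etcd
	fmt.Println(resourceNameForShard("etcd", "events"))         // etcd-events
}
```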

if len(shards) == 0 {
return fmt.Errorf("no etcd shards configured")
}
defaultShard := shards[0]

Is this a good assumption to make?

args.Set("etcd-servers", p.EtcdURL)

// Build etcd-servers-overrides if shards are configured
// Format per Kubernetes docs: /group/resource#url1;url2,/group2/resource2#url3

Why do we require the # in the API vs just constructing it here? Seems to be passing complexity to end users vs something that would be relatively straightforward for us to handle?
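The simplification the reviewer suggests, constructing the `#` separator server-side instead of requiring users to supply it, could look like the following standalone sketch (the `override` type and sample endpoints are illustrative, not the PR's):

```go
package main

import (
	"fmt"
	"strings"
)

// override pairs a group/resource prefix with the etcd endpoints
// serving it.
type override struct {
	groupResource string   // e.g. "/events"
	endpoints     []string // e.g. ["https://etcd-client-events:2379"]
}

// buildEtcdServersOverrides renders kube-apiserver's
// --etcd-servers-overrides value in the documented format:
// prefix#url1;url2 entries joined by commas. Users supply plain
// prefixes and endpoints; the "#" and ";" separators are added here.
func buildEtcdServersOverrides(overrides []override) string {
	parts := make([]string, 0, len(overrides))
	for _, o := range overrides {
		parts = append(parts, o.groupResource+"#"+strings.Join(o.endpoints, ";"))
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(buildEtcdServersOverrides([]override{
		{groupResource: "/events", endpoints: []string{"https://etcd-client-events:2379"}},
	}))
}
```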

- "/events#"
priority: Low
replicas: 1
# No backupSchedule = no backups for this shard

Docs on the field say it falls back to priority-based backup?

@openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 16, 2025
@openshift-merge-robot

PR needs rebase.


@JoelSpeed
Copy link
Copy Markdown
Contributor

JoelSpeed commented Jan 13, 2026

The changes to control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go in this PR don't actually achieve anything since the functions there are not currently called.

Importantly, this means that the defrag bits are not deployed if you are trying to use an HA control plane for testing. This will need to be fixed; for now I will work around it by deploying the components manually.

@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci Bot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Apr 17, 2026
jhjaggars and others added 5 commits April 17, 2026 09:04
Moved etcd management from v2 component framework to direct reconciliation
to enable dynamic multi-shard deployments. The v2 framework's single-workload
limitation prevented deploying multiple etcd shards.

Implementation:
- New etcd package with multi-shard reconciliation (shards.go, statefulset.go,
  services.go, monitoring.go, status.go, cleanup.go, pdb.go)
- Shard-aware manifest constructors loading from v2 assets
- ControlPlaneComponent resource for dependency tracking
- Priority-based status aggregation (Critical/Medium/Low)
- Automatic orphan cleanup when shards removed
- Asset loading for complete base structures

Integration updates:
- kube-apiserver: --etcd-servers-overrides with correct format
- openshift-apiserver: etcd URL configuration for default shard
- KAS wait-for-etcd init container: waits for all shard services
- PKI certificates: SANs for all shard-specific services
- NO_PROXY: includes all shard service FQDNs

Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Add comprehensive documentation for the newly implemented CLI flags
for etcd sharding configuration (--etcd-sharding-config and --etcd-shard).

Changes:
- Add new "CLI Usage" section with file-based and inline examples
- Update "Limitations" section to remove outdated "No CLI support"
- Update "Future Enhancements" to mark CLI integration as complete
- Add example-inline-usage.sh demonstrating inline flag usage

The documentation now covers both configuration methods with examples
for AWS, Azure, and other platforms, including global storage defaults,
manifest rendering, and validation rules.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement CLI flags for configuring etcd sharding during cluster creation.
Users can now specify etcd shard configuration via command-line flags or
configuration files.

New flags:
- --etcd-sharding-config: Path to YAML/JSON configuration file
- --etcd-shard: Inline shard definition (repeatable)

Features:
- File-based configuration for complex multi-shard setups
- Inline flags for simple 2-3 shard configurations
- Validation of shard configuration (catch-all requirement, DNS-1035 names, etc.)
- Global storage defaults via --etcd-storage-class and --etcd-storage-size
- Support for all shard properties: name, prefixes, priority, replicas, storage, backup schedule
- Mutual exclusivity between file and inline configuration methods
- Full render support for previewing configurations

Implementation includes:
- RawCreateOptions struct fields for both configuration methods
- Flag bindings in bindCoreOptions()
- Validation in Validate() method
- Shard building in prototypeResources()
- Helper functions: buildEtcdShards, parseShardingConfigFile, parseInlineShards,
  parseInlineShard, applyGlobalDefaults, validateEtcdSharding

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The OAuth APIServer was failing to connect to sharded etcd services because
NO_PROXY only contained the short hostname (e.g., etcd-client-main) but the
actual connection URL used the FQDN (e.g., etcd-client-main.namespace.svc).
This caused connections to incorrectly route through the konnectivity proxy,
resulting in connection failures.

Changes:
- Extract etcd FQDN from the etcd URL for sharded configurations
- Add both short hostname and FQDN to NO_PROXY list
- Support for managed etcd sharding with default shard detection
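The short-name/FQDN split described above can be derived from the etcd URL with the standard library; a hedged sketch (the sample URL and function name are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// noProxyHostsForEtcd returns both the short service hostname and its
// FQDN for a given etcd client URL, so in-cluster connections bypass
// the konnectivity proxy whichever form the dialer uses.
func noProxyHostsForEtcd(etcdURL string) ([]string, error) {
	u, err := url.Parse(etcdURL)
	if err != nil {
		return nil, err
	}
	fqdn := u.Hostname()                     // e.g. etcd-client-main.ns.svc (port stripped)
	short := strings.SplitN(fqdn, ".", 2)[0] // e.g. etcd-client-main
	if short == fqdn {
		return []string{fqdn}, nil
	}
	return []string{short, fqdn}, nil
}

func main() {
	hosts, _ := noProxyHostsForEtcd("https://etcd-client-main.clusters-example.svc:2379")
	fmt.Println(strings.Join(hosts, ","))
}
```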

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
This field was accidentally removed during etcd sharding implementation.
Restores the field that enables --service-account-max-token-expiration
flag support for graceful service account key rotation.

Fixes regression from commits 4ce7874 and cbb3666.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

openshift-ci Bot commented Apr 17, 2026

@jhjaggars: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit   Required   Rerun command
ci/prow/security                 91da625  true       /test security
ci/prow/unit                     91da625  true       /test unit
ci/prow/e2e-azure-self-managed   91da625  true       /test e2e-azure-self-managed
ci/prow/verify-workflows         91da625  true       /test verify-workflows

Full PR test history. Your PR dashboard.


@openshift-ci Bot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 17, 2026

openshift-ci Bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jhjaggars
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 19

♻️ Duplicate comments (3)
control-plane-operator/controllers/hostedcontrolplane/etcd/pdb.go (1)

18-21: ⚠️ Potential issue | 🟠 Major

Comment and code disagree on quorum requirements.

The comment states "For a 3-node etcd cluster, we need at least 2 available (quorum = n/2 + 1)" but MinAvailable is set to 1. With MinAvailable: 1, up to 2 nodes could be unavailable simultaneously, breaking quorum.

Either update MinAvailable to 2, or use MaxUnavailable: 1 to ensure only one node can be disrupted at a time, or correct the comment if the current value is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/pdb.go` around
lines 18 - 21, The comment and code disagree: the comment says a 3-node etcd
cluster requires quorum of 2 but the pdb.Spec.MinAvailable is set to 1
(minAvailable variable). Update the PDB so the code matches the comment by
setting minAvailable to 2 (i.e., set pdb.Spec.MinAvailable via minAvailable :=
intstr.FromInt(2)), or alternatively replace the MinAvailable usage with
MaxUnavailable := intstr.FromInt(1) and set pdb.Spec.MaxUnavailable to enforce
only one disruption; ensure the chosen change is reflected in the comment and
remove the inconsistency between pdb.Spec.MinAvailable/minAvailable and the
comment.
cmd/cluster/core/create.go (1)

1268-1280: ⚠️ Potential issue | 🟠 Major

--etcd-sharding-config rejects the checked-in example shape.

This parser only accepts a top-level shards: document. The example in docs/content/how-to/etcd-sharding/example-3-shard.yaml is a full HostedCluster with shards under spec.etcd.managed.shards, so passing that file to --etcd-sharding-config will always fail with “must contain at least one shard”.

Suggested fix
-	var config struct {
-		Shards []hyperv1.ManagedEtcdShardSpec `json:"shards"`
-	}
+	var config struct {
+		Shards []hyperv1.ManagedEtcdShardSpec `json:"shards"`
+		Spec   struct {
+			Etcd struct {
+				Managed struct {
+					Shards []hyperv1.ManagedEtcdShardSpec `json:"shards"`
+				} `json:"managed"`
+			} `json:"etcd"`
+		} `json:"spec"`
+	}
@@
-	if len(config.Shards) == 0 {
+	shards := config.Shards
+	if len(shards) == 0 {
+		shards = config.Spec.Etcd.Managed.Shards
+	}
+	if len(shards) == 0 {
 		return nil, fmt.Errorf("etcd sharding config file must contain at least one shard")
 	}
 
-	return config.Shards, nil
+	return shards, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create.go` around lines 1268 - 1280, The YAML parsing
currently only accepts a top-level "shards" document and rejects full
HostedCluster manifests; update the parsing to accept both shapes by first
unmarshalling into the existing config (with Shards
[]hyperv1.ManagedEtcdShardSpec) and if that yields zero shards, unmarshal (or
decode) into a HostedCluster-shaped struct that contains
Spec.Etcd.Managed.Shards and extract those into config.Shards; ensure you
continue to return config.Shards and keep the same error messages when no shards
are found. Use the existing variables/data and the config.Shards identifier and
the HostedCluster nested path (Spec -> Etcd -> Managed -> Shards) to locate and
extract the shards.
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go (1)

23-29: ⚠️ Potential issue | 🟠 Major

Don’t assume shards[0] is the catch-all shard.

EffectiveShards() preserves the user-defined order; it does not move the "/" shard to index 0. If events or leases is listed first, this StatefulSet will reconcile the wrong shard.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 23 - 29, The code incorrectly assumes the first element of the
slice returned by managedEtcdSpec.EffectiveShards(hcp) is the catch-all shard;
update the selection logic so you iterate the shards slice and pick the shard
whose Name (or equivalent identifier) equals "/" as the defaultShard, and only
if no "/" shard exists fall back to shards[0]; adjust any error handling around
the shards slice accordingly (look for managedEtcdSpec.EffectiveShards, the
shards variable and the defaultShard usage).
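The fix this prompt describes, selecting the catch-all shard by its "/" prefix rather than by position, can be sketched as follows (illustrative types, not the PR's; the fallback to index 0 mirrors the prompt's suggestion):

```go
package main

import (
	"fmt"
	"slices"
)

// shard is an illustrative stand-in for the PR's shard type.
type shard struct {
	name             string
	resourcePrefixes []string
}

// defaultShard picks the catch-all shard (the one whose prefixes
// include "/") rather than assuming it sits at index 0, since
// EffectiveShards preserves the user-defined order.
func defaultShard(shards []shard) (shard, error) {
	if len(shards) == 0 {
		return shard{}, fmt.Errorf("no etcd shards configured")
	}
	for _, s := range shards {
		if slices.Contains(s.resourcePrefixes, "/") {
			return s, nil
		}
	}
	return shards[0], nil // fallback when no "/" shard exists
}

func main() {
	shards := []shard{
		{name: "events", resourcePrefixes: []string{"/events"}},
		{name: "main", resourcePrefixes: []string{"/"}},
	}
	s, _ := defaultShard(shards)
	fmt.Println(s.name) // main, even though it is listed second
}
```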
🧹 Nitpick comments (4)
docs/content/how-to/etcd-sharding.md (2)

194-194: Fix spaces inside emphasis markers.

The line has spaces inside emphasis markers (e.g., * * * * * being interpreted as emphasis). Escape the asterisks or use backticks to prevent markdown parsing issues.

Suggested fix
-- `backup-schedule`: Cron format (e.g., "*/30 * * * *")
+- `backup-schedule`: Cron format (e.g., `*/30 * * * *`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/how-to/etcd-sharding.md` at line 194, The markdown line
documenting `backup-schedule` contains a cron example with unescaped asterisks
("*/30 * * * *") that can be parsed as emphasis; update the text for the
`backup-schedule` entry so the cron expression is protected by inline code or
escaped asterisks (e.g., wrap "*/30 * * * *" in backticks or escape each *),
ensuring the `backup-schedule` label and its cron example render correctly.

181-183: Add language identifier to fenced code block.

The code block showing inline flag format is missing a language identifier, which triggers markdownlint MD040. Since this is showing command-line syntax, consider using text or bash.

Suggested fix
 **Inline Flag Format:**
-```
+```text
 --etcd-shard name=<name>,prefixes=<prefix1>|<prefix2>,priority=<priority>,replicas=<1|3>,storage-size=<size>,storage-class=<class>,backup-schedule=<cron>
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/content/how-to/etcd-sharding.md around lines 181 - 183, The fenced code
block that shows the inline flag format is missing a language identifier
(triggers MD040); update the triple-backtick fence that contains the line
starting with "--etcd-shard name=..." to include a language identifier such as
text or bash (e.g., ```text) so the code block is properly annotated.


control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go (2)

91-101: Fragile string replacement for script modification.

The approach of using `strings.ReplaceAll` to modify shell script content (lines 94-98) is fragile. If the original script in the YAML asset changes format (e.g., extra whitespace, different quoting), these replacements will silently fail to match.

Consider either:
1. Regenerating the script entirely with the correct service names
2. Using environment variables in the script that can be set via the container spec

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go`
around lines 91 - 101, The current use of strings.ReplaceAll on c.Args[1]
(variable script) to substitute service names is fragile and can silently fail;
instead either (A) regenerate the entire script string explicitly using
fmt.Sprintf and assign it to c.Args (replace the strings.ReplaceAll block with a
constructed script string that interpolates clientService and discoveryService
into the exact script you need), or (B) avoid editing the script text and set
container environment variables (e.g., ETCDCTL_ENDPOINTS and a DISCOVERY_SERVICE
or similar) in the pod spec and update the asset script to reference those env
vars; update the code paths that touch c.Args, script, clientService and
discoveryService accordingly.

---

192-198: Defrag controller container added but not removed when no longer needed.

The defrag controller is added when `params.NeedsDefragController` is true, but there's no corresponding logic to remove it if the availability policy changes from HighlyAvailable to SingleReplica. This could leave orphaned containers.

However, since StatefulSet selector is immutable and changing availability policy likely requires cluster recreation, this may be acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go`
around lines 192 - 198, The code adds the etcd-defrag controller container when
params.NeedsDefragController is true but never removes it when that flag becomes
false; update the logic around params.NeedsDefragController so it also removes
the container and resets ServiceAccountName when no longer needed: detect
presence with hasContainer(sts.Spec.Template.Spec.Containers, "etcd-defrag"),
remove the container entry from sts.Spec.Template.Spec.Containers when
params.NeedsDefragController is false, and clear or restore
sts.Spec.Template.Spec.ServiceAccountName
(manifests.EtcdDefragControllerServiceAccount("")) as appropriate; use
buildEtcdDefragControllerContainer only when adding and ensure removal is safe
for immutable selector scenarios.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @api/hypershift/v1beta1/hostedcluster_types.go:

  • Around line 2058-2085: Add XValidation rules on the UnmanagedEtcdSpec type to
    enforce that either Shards is provided or both Endpoint and TLS are provided,
    and to prevent supplying Endpoint/TLS when Shards is set; specifically add a
    rule like "has(self.shards) || (self.endpoint != '' && has(self.tls))" with an
    appropriate message, and a second rule that forbids endpoint/tls when shards
    exists (e.g. "not(has(self.shards)) || (self.endpoint == '' &&
    not(has(self.tls)))") so UnmanagedEtcdSpec (type name) cannot be left without
    configuration and cannot mix legacy fields (Endpoint, TLS) with Shards.

In @cmd/cluster/core/create.go:

  • Around line 1302-1310: parseInlineShard currently naively splits def on every
    comma which breaks cron expressions; change the parsing in parseInlineShard so
    commas that occur inside a value (e.g. a cron schedule) are not treated as
    separators: when iterating pairs from strings.Split(def, ",") if
    strings.SplitN(pair, "=", 2) yields len != 2 then treat that token as part of
    the previous value (append ","+pair to the last accumulated token) instead of
    returning an error; update the loop that builds pairs and the kv handling to
    only error if you encounter a token without "=" when there is no previous token
    to attach to, keeping references to def, pairs, and the kv split logic for
    location.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/etcd-init.sh:

  • Line 12: The curl invocation using the RESTORE_URL variable is unquoted which
    can cause word-splitting or unexpected behavior; update the command that writes
    the snapshot (the curl call that currently references ${RESTORE_URL}) to quote
    the variable (use "${RESTORE_URL}") so the entire URL is passed as a single
    argument to curl.
  • Around line 1-3: Add a shebang as the first line of etcd-init.sh (before the
    existing mkdir -p /var/lib/data and the ls check) so the script runs with a
    defined shell interpreter (e.g., use #!/bin/bash or #!/bin/sh); ensure the
    shebang is the very first line to guarantee correct interpreter selection when
    executing the etcd-init.sh script that contains the mkdir -p /var/lib/data and
    the [ "$(ls -A /var/lib/data)" ] check.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/params.go:

  • Around line 66-70: The code accesses hcp.Spec.Networking.ClusterNetwork[0]
    without checking the slice length; add a defensive bounds check before calling
    util.IsIPv4CIDR in the function that returns (*ShardParams, error) so that if
    len(hcp.Spec.Networking.ClusterNetwork) == 0 you return a clear error (e.g.,
    fmt.Errorf("empty ClusterNetwork in HCP spec")) instead of panicking; then
    proceed to call
    util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String()) as before.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/services.go:

  • Around line 17-22: The Service's shard label is only being set on
    svc.Spec.Selector but must also be added to Service.metadata.labels so
    ServiceMonitor discovery can match it; update the code in the service
    reconciliation (the block that sets
    svc.Spec.Selector["hypershift.openshift.io/etcd-shard"] = shard.Name and the
    similar block at the later occurrence) to ensure svc.Labels is initialized (if
    nil) and then set svc.Labels["hypershift.openshift.io/etcd-shard"] = shard.Name
    (and keep the existing selector assignment) so ReconcileServiceMonitor can
    discover shard-specific etcd Services.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go:

  • Around line 26-31: The test's scheme variable used in the ReconcileEtcdShards
    test is missing the prometheusoperatorv1 registration, so the fake client cannot
    recognize ServiceMonitor GVKs created by manifests.EtcdServiceMonitorForShard;
    add prometheusoperatorv1.AddToScheme(scheme) alongside the existing AddToScheme
    calls (where scheme is created) so ServiceMonitor objects are properly
    registered before invoking ReconcileEtcdShards.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/shards.go:

  • Around line 26-30: ReconcileEtcdShards currently calls
    hcp.Spec.Etcd.Managed.EffectiveShards(hcp) without ensuring Managed is non-nil;
    add a defensive nil check for hcp.Spec.Etcd.Managed at the top of
    ReconcileEtcdShards and return a clear error (or nil/short-circuit as
    appropriate) if Managed is nil before calling EffectiveShards, referencing the
    ReconcileEtcdShards function and the hcp.Spec.Etcd.Managed.EffectiveShards(hcp)
    call so the fix is applied where the dereference occurs.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go:

  • Around line 215-226: The code indexes sts.Spec.VolumeClaimTemplates[0] without
    ensuring the slice is non-empty; modify the block handling storage.Type ==
    hyperv1.PersistentVolumeEtcdStorage (and when storage.PersistentVolume pv !=
    nil) to first check len(sts.Spec.VolumeClaimTemplates) > 0 before accessing [0],
    and if it is zero either append a new corev1.PersistentVolumeClaim template
    (constructed to match the expected fields) or skip the mutation depending on
    intended behavior; ensure you update the same sts.Spec.VolumeClaimTemplates
    slice used elsewhere so subsequent code (and functions referencing sts) see the
    new template if you choose to append.
  • Around line 44-48: The v2 adapter's service name construction for Etcd shards
    is using the incorrect pattern ("etcd-<shard>-discovery") and must match the
    main implementation's "etcd-discovery" or "etcd-discovery-<shard>" pattern;
    update the code in the v2 statefulset creation (the block that sets
    sts.Spec.ServiceName in
    control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go) to
    use the same logic as the main implementation (if shard.Name == "default" then
    "etcd-discovery" else fmt.Sprintf("etcd-discovery-%s", shard.Name)) so it aligns
    with manifests.EtcdDiscoveryServiceForShard/resourceNameForShard.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/status.go:

  • Around line 143-150: The current quorum check uses aggregated counters
    (criticalTotal/criticalReady) which can hide a single shard lacking quorum;
    update the logic in the Etcd availability calculation (the block that returns a
    metav1.Condition with Type=string(hyperv1.EtcdAvailable) and
    Reason=hyperv1.EtcdWaitingForQuorumReason) to iterate over each critical shard
    instead of using criticalTotal/criticalReady, compute quorum per shard as
    (replicas/2 + 1) and ensure each shard's ready count >= its quorum, collect
    per-shard failure messages into messages and only return the Condition when any
    shard fails its own quorum check.
  • Line 122: The code dereferences sts.Spec.Replicas into requiredReplicas
    without checking for nil; update the logic around the assignment of
    requiredReplicas to first check if sts.Spec != nil and sts.Spec.Replicas != nil
    and handle the nil case (e.g., set requiredReplicas to a safe default like 1 or
    return an error/early exit), so replace the direct dereference of
    sts.Spec.Replicas with a guarded read that uses the chosen fallback and avoid a
    possible nil pointer panic when evaluating requiredReplicas.

In @control-plane-operator/controllers/hostedcontrolplane/v2/etcd/service.go:

  • Around line 23-29: Service names for etcd shards are using the wrong pattern
    ("etcd-<shard>-client"/"etcd-<shard>-discovery") instead of the canonical
    "etcd-client-<shard>"/"etcd-discovery-<shard>", causing hostname mismatches;
    update the assignments where svc.Name is set (the block that checks
    defaultShard.Name == "default" and the similar block at lines 53-59)
    so that the default shard uses "etcd-client" / "etcd-discovery" and non-default
    shards use fmt.Sprintf("etcd-client-%s", defaultShard.Name) /
    fmt.Sprintf("etcd-discovery-%s", defaultShard.Name) respectively, leaving the
    default-case naming unchanged.

In
@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/servicemonitor.go:

  • Around line 43-54: The code accesses sm.Spec.Endpoints[0] (setting
    MetricRelabelConfigs and calling util.ApplyClusterIDLabel) before confirming
    there is at least one endpoint, which can panic; change the logic to first check
    len(sm.Spec.Endpoints) > 0 and only then set
    sm.Spec.Endpoints[0].MetricRelabelConfigs, call
    util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID),
    and append the shard RelabelConfig (using defaultShard) to
    sm.Spec.Endpoints[0].RelabelConfigs; ensure all references to
    cpContext.MetricsSet and cpContext.HCP.Spec.ClusterID are used inside that
    guarded block so no access occurs when there are no endpoints.

In
@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go:

  • Around line 45-50: The Service name for named shards is using the wrong
    convention; when defaultShard.Name != "default" update sts.Spec.ServiceName (and
    any code building ETCD_INITIAL_CLUSTER entries) to use the canonical pattern
    "etcd-discovery-<shard>" instead of "etcd-<shard>-discovery"; leave sts.Name as
    "etcd-<shard>" if that is desired but ensure every reference that computes the
    discovery Service (including where ETCD_INITIAL_CLUSTER or related env vars are
    constructed) uses fmt.Sprintf("etcd-discovery-%s", defaultShard.Name) for named
    shards.

In @control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go:

  • Around line 98-124: The wait-for-etcd init container builds DNS names using
    the wrong shard service naming convention; in UpdateContainer("wait-for-etcd",
    ...) where serviceName is constructed for non-default shards, change the
    fmt.Sprintf pattern from "etcd-client-%s" to "etcd-%s-client" so it matches the
    service names produced by hcp.Spec.Etcd.Managed.EffectiveShards and the service
    creation in v2/etcd/service.go; then run a quick grep (including kas/params.go
    and other kas-related components) to ensure all consumers use the same
    "etcd-{shard}-client" format and update any mismatches.

In @control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go:

  • Around line 255-271: buildUnmanagedEtcdConfig can return an empty defaultURL
    when no shard has the "/" prefix; update the function (buildUnmanagedEtcdConfig)
    to guarantee a non-empty default by either (A) adding a fallback: if defaultURL
    == "" and len(shards) > 0 then set defaultURL = shards[0].Endpoint (or pick the
    first shard you deem appropriate) and optionally add a warning log, or (B) more
    robustly change the signature to return an error (string, map[string]string,
    error), validate that a catch‑all "/" shard exists and return an error to
    callers if it does not—apply the chosen fix and update all callers to handle the
    fallback or the new error return.

In
@control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go:

  • Around line 39-48: The current loop that constructs the etcd client service
    name using shard.Name (in the code inspecting
    cpContext.HCP.Spec.Etcd.Managed.Shards and comparing ResourcePrefixes to "/")
    will produce "etcd-client-default" when the shard.Name is "default", but the
    canonical service is "etcd-client"; update the logic inside that loop (where
    serviceName is built and etcdURL/etcdHostname are set) to special-case
    shard.Name == "default" and use the unsuffixed "etcd-client" serviceName in that
    case, otherwise keep the existing fmt.Sprintf("etcd-client-%s", shard.Name)
    behavior so --etcd-servers and NO_PROXY point to the real Service.

Duplicate comments:
In @cmd/cluster/core/create.go:

  • Around line 1268-1280: The YAML parsing currently only accepts a top-level
    "shards" document and rejects full HostedCluster manifests; update the parsing
    to accept both shapes by first unmarshalling into the existing config (with
    Shards []hyperv1.ManagedEtcdShardSpec) and if that yields zero shards, unmarshal
    (or decode) into a HostedCluster-shaped struct that contains
    Spec.Etcd.Managed.Shards and extract those into config.Shards; ensure you
    continue to return config.Shards and keep the same error messages when no shards
    are found. Use the existing variables/data and the config.Shards identifier and
    the HostedCluster nested path (Spec -> Etcd -> Managed -> Shards) to locate and
    extract the shards.

In @control-plane-operator/controllers/hostedcontrolplane/etcd/pdb.go:

  • Around line 18-21: The comment and code disagree: the comment says a 3-node
    etcd cluster requires quorum of 2 but the pdb.Spec.MinAvailable is set to 1
    (minAvailable variable). Update the PDB so the code matches the comment by
    setting minAvailable to 2 (i.e., set pdb.Spec.MinAvailable via minAvailable :=
    intstr.FromInt(2)), or alternatively replace the MinAvailable usage with
    MaxUnavailable := intstr.FromInt(1) and set pdb.Spec.MaxUnavailable to enforce
    only one disruption; ensure the chosen change is reflected in the comment and
    remove the inconsistency between pdb.Spec.MinAvailable/minAvailable and the
    comment.

In
@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go:

  • Around line 23-29: The code incorrectly assumes the first element of the slice
    returned by managedEtcdSpec.EffectiveShards(hcp) is the catch-all shard; update
    the selection logic so you iterate the shards slice and pick the shard whose
    Name (or equivalent identifier) equals "/" as the defaultShard, and only if no
    "/" shard exists fall back to shards[0]; adjust any error handling around the
    shards slice accordingly (look for managedEtcdSpec.EffectiveShards, the shards
    variable and the defaultShard usage).

Nitpick comments:
In @control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go:

  • Around line 91-101: The current use of strings.ReplaceAll on c.Args[1]
    (variable script) to substitute service names is fragile and can silently fail;
    instead either (A) regenerate the entire script string explicitly using
    fmt.Sprintf and assign it to c.Args (replace the strings.ReplaceAll block with a
    constructed script string that interpolates clientService and discoveryService
    into the exact script you need), or (B) avoid editing the script text and set
    container environment variables (e.g., ETCDCTL_ENDPOINTS and a DISCOVERY_SERVICE
    or similar) in the pod spec and update the asset script to reference those env
    vars; update the code paths that touch c.Args, script, clientService and
    discoveryService accordingly.
  • Around line 192-198: The code adds the etcd-defrag controller container when
    params.NeedsDefragController is true but never removes it when that flag becomes
    false; update the logic around params.NeedsDefragController so it also removes
    the container and resets ServiceAccountName when no longer needed: detect
    presence with hasContainer(sts.Spec.Template.Spec.Containers, "etcd-defrag"),
    remove the container entry from sts.Spec.Template.Spec.Containers when
    params.NeedsDefragController is false, and clear or restore
    sts.Spec.Template.Spec.ServiceAccountName
    (manifests.EtcdDefragControllerServiceAccount("")) as appropriate; use
    buildEtcdDefragControllerContainer only when adding and ensure removal is safe
    for immutable selector scenarios.

In @docs/content/how-to/etcd-sharding.md:

  • Line 194: The markdown line documenting backup-schedule contains a cron
    example with unescaped asterisks (`*/30 * * * *`) that can be parsed as
    emphasis; update the text for the backup-schedule entry so the cron expression
    is protected by inline code or escaped asterisks (e.g., wrap `*/30 * * * *` in
    backticks or escape each `*`), ensuring the backup-schedule label and its cron
    example render correctly.
  • Around line 181-183: The fenced code block that shows the inline flag format
    is missing a language identifier (triggers MD040); update the triple-backtick
    fence that contains the line starting with "--etcd-shard name=..." to include a
    language identifier such as text or bash (e.g., ```text) so the code block is
    properly annotated.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Repository YAML (base), Central YAML (inherited)

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `e1f21541-991d-4a6f-b3d9-789587f4c738`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 1180bcaf557d7df5da61fc97197c69c68c32c4e2 and a83140a3e23eed929f4b312d37b0500c5d86a477.

</details>

<details>
<summary>⛔ Files ignored due to path filters (34)</summary>

* `api/hypershift/v1beta1/zz_generated.deepcopy.go` is excluded by `!**/zz_generated*.go`, `!**/zz_generated*`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ImageStreamImportMode.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AAA_ungated.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDC.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/GCPPlatform.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ImageStreamImportMode.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/KMSEncryptionProvider.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/OpenStack.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
* `client/applyconfiguration/hypershift/v1beta1/managedetcdshardspec.go` is excluded by `!client/**`
* `client/applyconfiguration/hypershift/v1beta1/managedetcdspec.go` is excluded by `!client/**`
* `client/applyconfiguration/hypershift/v1beta1/unmanagedetcdshardspec.go` is excluded by `!client/**`
* `client/applyconfiguration/hypershift/v1beta1/unmanagedetcdspec.go` is excluded by `!client/**`
* `client/applyconfiguration/utils.go` is excluded by `!client/**`
* `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
* `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-Default.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
* `docs/content/reference/api.md` is excluded by `!docs/content/reference/api.md`
* `vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_helpers.go` is excluded by `!vendor/**`, `!**/vendor/**`
* `vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go` is excluded by `!vendor/**`, `!**/vendor/**`
* `vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go` is excluded by `!vendor/**`, `!**/vendor/**`, `!**/zz_generated*.go`, `!**/zz_generated*`

</details>

<details>
<summary>📒 Files selected for processing (30)</summary>

* `api/hypershift/v1beta1/hostedcluster_helpers.go`
* `api/hypershift/v1beta1/hostedcluster_types.go`
* `cmd/cluster/core/create.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/cleanup.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/etcd-init.sh`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/monitoring.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/params.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/pdb.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/services.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/shards.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset_test.go`
* `control-plane-operator/controllers/hostedcontrolplane/etcd/status.go`
* `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
* `control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go`
* `control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/etcd/service.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/etcd/servicemonitor.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/kas/config.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/oapi/config.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/oapi/deployment.go`
* `control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go`
* `docs/content/how-to/etcd-sharding.md`
* `docs/content/how-to/etcd-sharding/example-3-shard.yaml`
* `docs/content/how-to/etcd-sharding/example-inline-usage.sh`

</details>

</details>


Comment on lines +2058 to +2085
// integrate with an externally managed etcd cluster.
type UnmanagedEtcdSpec struct {
// endpoint is the full etcd cluster client endpoint URL. For example:
//
// https://etcd-client:2379
//
// If the URL uses an HTTPS scheme, the TLS field is required.
//
// endpoint is the full etcd cluster client endpoint URL.
// Used only when shards is not specified (legacy single-etcd mode).
// When shards are specified, this field is ignored.
// +optional
// +kubebuilder:validation:Pattern=`^https://`
// +kubebuilder:validation:MaxLength=255
Endpoint string `json:"endpoint,omitempty"`

// tls specifies TLS configuration for HTTPS etcd client endpoints.
// Used only when shards is not specified (legacy single-etcd mode).
// When shards are specified, this field is ignored.
// +optional
TLS *EtcdTLSConfig `json:"tls,omitempty"`

// shards configures etcd sharding by Kubernetes resource kind.
// When not specified, uses endpoint and tls fields (legacy single-etcd mode).
// When specified, exactly one shard must have "/" in its resourcePrefixes.
// +optional
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=10
// +listType=map
// +listMapKey=name
// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
// +kubebuilder:validation:XValidation:rule="self.all(s, s.resourcePrefixes.all(p, p == '/' || p.endsWith('#')))",message="non-default prefixes must end with '#'"
Shards []UnmanagedEtcdShardSpec `json:"shards,omitempty"`
}

⚠️ Potential issue | 🟠 Major

Validation gap allows invalid UnmanagedEtcdSpec state.

The endpoint and tls fields are marked optional, and documentation states they're "ignored when shards is specified." However, there's no validation ensuring that when shards is NOT specified, endpoint and tls ARE provided. This allows an invalid state where neither legacy fields nor shards are configured.

Consider adding an XValidation rule:

// +kubebuilder:validation:XValidation:rule="has(self.shards) || (self.endpoint != '' && has(self.tls))",message="either shards must be specified, or both endpoint and tls are required"

The mutual exclusivity validation (preventing both shards and endpoint/tls from being set simultaneously) was also flagged in a previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/hypershift/v1beta1/hostedcluster_types.go` around lines 2058 - 2085, Add
XValidation rules on the UnmanagedEtcdSpec type to enforce that either Shards is
provided or both Endpoint and TLS are provided, and to prevent supplying
Endpoint/TLS when Shards is set; specifically add a rule like "has(self.shards)
|| (self.endpoint != '' && has(self.tls))" with an appropriate message, and a
second rule that forbids endpoint/tls when shards exists (e.g.
"not(has(self.shards)) || (self.endpoint == '' && not(has(self.tls)))") so
UnmanagedEtcdSpec (type name) cannot be left without configuration and cannot
mix legacy fields (Endpoint, TLS) with Shards.
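The two suggested CEL rules can be sanity-checked by mirroring them in plain Go. A rough sketch, assuming the spec reduces to three fields (these are not the real API types, just the shape the rules touch):

```go
package main

import (
	"errors"
	"fmt"
)

// spec mirrors only the fields the CEL rules inspect.
type spec struct {
	Endpoint string
	HasTLS   bool
	Shards   []string
}

// validate mirrors the two proposed XValidation rules:
//  1. has(self.shards) || (self.endpoint != '' && has(self.tls))
//  2. !has(self.shards) || (self.endpoint == '' && !has(self.tls))
func validate(s spec) error {
	hasShards := len(s.Shards) > 0
	if !hasShards && (s.Endpoint == "" || !s.HasTLS) {
		return errors.New("either shards must be specified, or both endpoint and tls are required")
	}
	if hasShards && (s.Endpoint != "" || s.HasTLS) {
		return errors.New("endpoint/tls must not be set when shards is specified")
	}
	return nil
}

func main() {
	fmt.Println(validate(spec{}))                                                   // neither mode configured: error
	fmt.Println(validate(spec{Endpoint: "https://etcd-client:2379", HasTLS: true})) // legacy single-etcd mode: ok
	fmt.Println(validate(spec{Shards: []string{"main"}}))                           // sharded mode: ok
}
```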

Comment on lines +1302 to +1310
pairs := strings.Split(def, ",")

var hasName, hasPrefixes bool

for _, pair := range pairs {
kv := strings.SplitN(pair, "=", 2)
if len(kv) != 2 {
return shard, fmt.Errorf("invalid key=value pair: %s", pair)
}

⚠️ Potential issue | 🟠 Major

Comma-delimited parsing conflicts with cron syntax.

parseInlineShard splits on every `,`, but valid cron expressions can contain commas (for example `0,30 * * * *`). That makes a documented backup-schedule value impossible to express with --etcd-shard.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create.go` around lines 1302 - 1310, parseInlineShard
currently naively splits def on every comma which breaks cron expressions;
change the parsing in parseInlineShard so commas that occur inside a value (e.g.
a cron schedule) are not treated as separators: when iterating pairs from
strings.Split(def, ",") if strings.SplitN(pair, "=", 2) yields len != 2 then
treat that token as part of the previous value (append ","+pair to the last
accumulated token) instead of returning an error; update the loop that builds
pairs and the kv handling to only error if you encounter a token without "="
when there is no previous token to attach to, keeping references to def, pairs,
and the kv split logic for location.
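One way the suggested fix could look: re-attach any token that lacks "=" to the previous value, so commas inside a value survive the split. A self-contained sketch (the function name and the map return type are illustrative, not the repo's actual signature):

```go
package main

import (
	"fmt"
	"strings"
)

// parseShardDef splits a comma-separated key=value list, but re-attaches
// tokens without "=" to the previous value, so cron schedules such as
// "0,30 * * * *" are not broken apart.
func parseShardDef(def string) (map[string]string, error) {
	var pairs []string
	for _, tok := range strings.Split(def, ",") {
		if strings.Contains(tok, "=") {
			pairs = append(pairs, tok)
			continue
		}
		if len(pairs) == 0 {
			// A token with no "=" and nothing to attach to is invalid.
			return nil, fmt.Errorf("invalid key=value pair: %s", tok)
		}
		// No "=": this token is the continuation of the previous value.
		pairs[len(pairs)-1] += "," + tok
	}
	out := map[string]string{}
	for _, pair := range pairs {
		kv := strings.SplitN(pair, "=", 2)
		out[kv[0]] = kv[1]
	}
	return out, nil
}

func main() {
	m, err := parseShardDef("name=events,backup-schedule=0,30 * * * *")
	fmt.Println(m["name"], "|", m["backup-schedule"], err)
}
```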

Comment on lines +1 to +3
# If /var/lib/data is not empty we exit early, since this means an etcd database already exists
mkdir -p /var/lib/data
[ "$(ls -A /var/lib/data)" ] && echo "/var/lib/data not empty, not restoring snapshot" && exit 0

⚠️ Potential issue | 🟡 Minor

Missing shebang line.

The script lacks a shebang (#!/bin/bash or #!/bin/sh) at the start. This could cause issues depending on how the script is executed, as the shell interpreter won't be explicitly specified.

🔧 Proposed fix

```diff
+#!/bin/sh
 # If /var/lib/data is not empty we exit early, since this means an etcd database already exists
 mkdir -p /var/lib/data
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/etcd-init.sh`
around lines 1 - 3, Add a shebang as the first line of etcd-init.sh (before the
existing mkdir -p /var/lib/data and the ls check) so the script runs with a
defined shell interpreter (e.g., use #!/bin/bash or #!/bin/sh); ensure the
shebang is the very first line to guarantee correct interpreter selection when
executing the etcd-init.sh script that contains the mkdir -p /var/lib/data and
the [ "$(ls -A /var/lib/data)" ] check.

# and also when a pre-signed URL is expired.
# In this case we get an XML file which can be detected with `file` so we show
# the contents via the logs then exit with an error status
curl -o /tmp/snapshot ${RESTORE_URL}
⚠️ Potential issue | 🟡 Minor

Quote the URL variable to prevent word splitting.

The ${RESTORE_URL} variable should be quoted to prevent unexpected behavior if the URL contains spaces or special characters.

🔧 Proposed fix
-curl -o /tmp/snapshot ${RESTORE_URL}
+curl -o /tmp/snapshot "${RESTORE_URL}"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 12-12: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/etcd-init.sh` at
line 12, The curl invocation using the RESTORE_URL variable is unquoted which
can cause word-splitting or unexpected behavior; update the command that writes
the snapshot (the curl call that currently references ${RESTORE_URL}) to quote
the variable (use "${RESTORE_URL}") so the entire URL is passed as a single
argument to curl.

Comment on lines +66 to +70
) (*ShardParams, error) {
ipv4, err := util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String())
if err != nil {
return nil, fmt.Errorf("error checking the ClusterNetworkCIDR: %w", err)
}
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if ClusterNetwork validation is performed elsewhere before this function is called
rg -n "ClusterNetwork" --type go -A3 -B3 | head -100

Repository: openshift/hypershift

Length of output: 10172


🏁 Script executed:

#!/bin/bash
# Find the params.go file in control-plane-operator
find . -name "params.go" -path "*/hostedcontrolplane/etcd/*" | head -5

Repository: openshift/hypershift

Length of output: 134


🏁 Script executed:

#!/bin/bash
# Look for HostedControlPlane definition and ClusterNetwork field
rg -n "type HostedControlPlane struct" --type go -A 20 | head -50

Repository: openshift/hypershift

Length of output: 5643


🏁 Script executed:

#!/bin/bash
# Search for Networking struct definition
rg -n "type.*Networking struct" --type go -A 10 | head -50

Repository: openshift/hypershift

Length of output: 6178


🏁 Script executed:

#!/bin/bash
# Read the params.go file to see the actual code
cat -n ./control-plane-operator/controllers/hostedcontrolplane/etcd/params.go | sed -n '50,85p'

Repository: openshift/hypershift

Length of output: 1494


🏁 Script executed:

#!/bin/bash
# Find the Networking field in HostedControlPlaneSpec
rg -n "Networking.*struct" --type go -A 10 | head -40

Repository: openshift/hypershift

Length of output: 4227


🏁 Script executed:

#!/bin/bash
# Find HostedControlPlaneSpec definition and Networking field
rg -n "type HostedControlPlaneSpec struct" --type go -A 150 | grep -A 150 "type HostedControlPlaneSpec" | grep -E "(Networking|ClusterNetwork)" | head -20

Repository: openshift/hypershift

Length of output: 320


🏁 Script executed:

#!/bin/bash
# Search for ClusterNetwork field type definition in api directory
rg -n "ClusterNetwork\s*\[\]" --type go | head -10

Repository: openshift/hypershift

Length of output: 1707


🏁 Script executed:

#!/bin/bash
# Find the Networking field definition
rg -n "Networking\s*ClusterNetworking" --type go | head -10

Repository: openshift/hypershift

Length of output: 559


🏁 Script executed:

#!/bin/bash
# Read ClusterNetworking struct definition to see validation constraints
rg -n "type ClusterNetworking struct" --type go -A 30 | head -80

Repository: openshift/hypershift

Length of output: 7547


🏁 Script executed:

#!/bin/bash
# Check if there's MinItems validation on ClusterNetwork field
rg -n "ClusterNetwork.*kubebuilder" --type go -B 2 -A 2 | head -30

Repository: openshift/hypershift

Length of output: 46


🏁 Script executed:

#!/bin/bash
# Check if there are other places accessing ClusterNetwork[0]
rg -n "ClusterNetwork\[0\]" --type go

Repository: openshift/hypershift

Length of output: 1887


🏁 Script executed:

#!/bin/bash
# Check if validation is enforced before this function is called
rg -n "NewShardParams" --type go -B 3 -A 3 | head -40

Repository: openshift/hypershift

Length of output: 1740


Add bounds check on ClusterNetwork slice access for defensive programming.

Line 67 accesses hcp.Spec.Networking.ClusterNetwork[0] without verifying the slice is non-empty. While Kubernetes schema validation with MinItems=1 guarantees this field is populated for valid objects, an explicit bounds check improves code robustness and clarity.

Proposed fix
 func NewShardParams(
 	hcp *hyperv1.HostedControlPlane,
 	etcdImage string,
 	controlPlaneOperatorImage string,
 	clusterEtcdOperatorImage string,
 	snapshotRestored bool,
 	securityContext *corev1.PodSecurityContext,
 ) (*ShardParams, error) {
+	if len(hcp.Spec.Networking.ClusterNetwork) == 0 {
+		return nil, fmt.Errorf("ClusterNetwork must have at least one entry")
+	}
 	ipv4, err := util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String())
 	if err != nil {
 		return nil, fmt.Errorf("error checking the ClusterNetworkCIDR: %w", err)
 	}
📝 Committable suggestion


Suggested change
) (*ShardParams, error) {
ipv4, err := util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String())
if err != nil {
return nil, fmt.Errorf("error checking the ClusterNetworkCIDR: %w", err)
}
) (*ShardParams, error) {
if len(hcp.Spec.Networking.ClusterNetwork) == 0 {
return nil, fmt.Errorf("ClusterNetwork must have at least one entry")
}
ipv4, err := util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String())
if err != nil {
return nil, fmt.Errorf("error checking the ClusterNetworkCIDR: %w", err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/params.go` around
lines 66 - 70, The code accesses hcp.Spec.Networking.ClusterNetwork[0] without
checking the slice length; add a defensive bounds check before calling
util.IsIPv4CIDR in the function that returns (*ShardParams, error) so that if
len(hcp.Spec.Networking.ClusterNetwork) == 0 you return a clear error (e.g.,
fmt.Errorf("empty ClusterNetwork in HCP spec")) instead of panicking; then
proceed to call
util.IsIPv4CIDR(hcp.Spec.Networking.ClusterNetwork[0].CIDR.String()) as before.

Comment on lines +45 to +50
if defaultShard.Name == "default" {
sts.Name = "etcd"
sts.Spec.ServiceName = "etcd-discovery"
} else {
sts.Name = fmt.Sprintf("etcd-%s", defaultShard.Name)
sts.Spec.ServiceName = fmt.Sprintf("etcd-%s-discovery", defaultShard.Name)
⚠️ Potential issue | 🔴 Critical

Use the canonical shard discovery Service name.

Named shards are created as etcd-discovery-<shard>, not etcd-<shard>-discovery. With the current value, both sts.Spec.ServiceName and ETCD_INITIAL_CLUSTER point at non-existent DNS names for named shards.

Suggested fix
 	if defaultShard.Name == "default" {
 		sts.Name = "etcd"
 		sts.Spec.ServiceName = "etcd-discovery"
 	} else {
 		sts.Name = fmt.Sprintf("etcd-%s", defaultShard.Name)
-		sts.Spec.ServiceName = fmt.Sprintf("etcd-%s-discovery", defaultShard.Name)
+		sts.Spec.ServiceName = fmt.Sprintf("etcd-discovery-%s", defaultShard.Name)
 	}
@@
 		if defaultShard.Name == "default" {
 			podPrefix = "etcd"
 			discoveryService = "etcd-discovery"
 		} else {
 			podPrefix = fmt.Sprintf("etcd-%s", defaultShard.Name)
-			discoveryService = fmt.Sprintf("etcd-%s-discovery", defaultShard.Name)
+			discoveryService = fmt.Sprintf("etcd-discovery-%s", defaultShard.Name)
 		}

Also applies to: 70-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 45 - 50, The Service name for named shards is using the wrong
convention; when defaultShard.Name != "default" update sts.Spec.ServiceName (and
any code building ETCD_INITIAL_CLUSTER entries) to use the canonical pattern
"etcd-discovery-<shard>" instead of "etcd-<shard>-discovery"; leave sts.Name as
"etcd-<shard>" if that is desired but ensure every reference that computes the
discovery Service (including where ETCD_INITIAL_CLUSTER or related env vars are
constructed) uses fmt.Sprintf("etcd-discovery-%s", defaultShard.Name) for named
shards.
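
Drift like this is easy to prevent by centralizing the naming in one place. A minimal sketch of such helpers (the function names are hypothetical, assuming the canonical patterns "etcd-<shard>" for StatefulSets and "etcd-discovery-<shard>" for discovery Services, as the review describes):

```go
package main

import "fmt"

// shardStatefulSetName returns the StatefulSet name for a shard.
// The "default" shard keeps the legacy unsuffixed name "etcd".
func shardStatefulSetName(shard string) string {
	if shard == "default" {
		return "etcd"
	}
	return fmt.Sprintf("etcd-%s", shard)
}

// shardDiscoveryServiceName returns the headless discovery Service name,
// using the canonical "etcd-discovery-<shard>" pattern for named shards.
func shardDiscoveryServiceName(shard string) string {
	if shard == "default" {
		return "etcd-discovery"
	}
	return fmt.Sprintf("etcd-discovery-%s", shard)
}

func main() {
	fmt.Println(shardStatefulSetName("events"))       // etcd-events
	fmt.Println(shardDiscoveryServiceName("events"))  // etcd-discovery-events
	fmt.Println(shardDiscoveryServiceName("default")) // etcd-discovery
}
```

With both the StatefulSet adaptation and the ETCD_INITIAL_CLUSTER construction calling the same helper, the two call sites cannot disagree.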

Comment on lines 98 to 124
} else if hcp.Spec.Etcd.ManagementType == hyperv1.Managed {
// Update wait-for-etcd init container to wait for ALL shard services
util.UpdateContainer("wait-for-etcd", deployment.Spec.Template.Spec.InitContainers, func(c *corev1.Container) {
shards := hcp.Spec.Etcd.Managed.EffectiveShards(hcp)

// Build list of all client service names
var serviceNames []string
for _, shard := range shards {
var serviceName string
if shard.Name == "default" {
serviceName = "etcd-client"
} else {
serviceName = fmt.Sprintf("etcd-client-%s", shard.Name)
}
serviceNames = append(serviceNames, serviceName)
}

// Build script that waits for all services
var scriptLines []string
scriptLines = append(scriptLines, "#!/bin/sh")
for _, serviceName := range serviceNames {
scriptLines = append(scriptLines, fmt.Sprintf("while ! nslookup %s.$(POD_NAMESPACE).svc; do sleep 1; done", serviceName))
}

c.Args = []string{"-c", strings.Join(scriptLines, "\n")}
})
}
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Find all etcd client service naming patterns to verify consistency

echo "=== Searching for etcd-client service naming patterns ==="
rg -n 'etcd-client-%s|etcd-%s-client|etcd-client-' --type go -C2

Repository: openshift/hypershift

Length of output: 17507


Service naming inconsistency between etcd service creation and deployment wait logic.

In control-plane-operator/controllers/hostedcontrolplane/v2/etcd/service.go, non-default shards create services with the format etcd-{shard}-client (e.g., etcd-events-client), but this code expects etcd-client-{shard} (e.g., etcd-client-events). The wait-for-etcd init container will fail DNS resolution for non-default shards because the service names won't match those actually created.

Align the naming format in this file with v2/etcd/service.go, or update service.go to use the etcd-client-{shard} format consistently across the codebase (also verify kas/params.go and other components use the same format).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go`
around lines 98 - 124, The wait-for-etcd init container builds DNS names using
the wrong shard service naming convention; in UpdateContainer("wait-for-etcd",
...) where serviceName is constructed for non-default shards, change the
fmt.Sprintf pattern from "etcd-client-%s" to "etcd-%s-client" so it matches the
service names produced by hcp.Spec.Etcd.Managed.EffectiveShards and the service
creation in v2/etcd/service.go; then run a quick grep (including kas/params.go
and other kas-related components) to ensure all consumers use the same
"etcd-{shard}-client" format and update any mismatches.

Comment on lines +255 to +271
// buildUnmanagedEtcdConfig constructs etcd URLs for unmanaged etcd shards
func buildUnmanagedEtcdConfig(shards []hyperv1.UnmanagedEtcdShardSpec) (string, map[string]string) {
var defaultURL string
overrides := make(map[string]string)

for _, shard := range shards {
for _, prefix := range shard.ResourcePrefixes {
if prefix == "/" {
defaultURL = shard.Endpoint
} else {
overrides[prefix] = shard.Endpoint
}
}
}

return defaultURL, overrides
}
⚠️ Potential issue | 🟡 Minor

Potential empty defaultURL when no shard has "/" prefix.

If all shards have specific ResourcePrefixes (e.g., "/events#", "/coordination.k8s.io/leases#") but none has "/" (catch-all), defaultURL will be an empty string, leaving the kube-apiserver with no default etcd endpoint.

Consider validating that at least one shard has "/" or handling this case explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go`
around lines 255 - 271, buildUnmanagedEtcdConfig can return an empty defaultURL
when no shard has the "/" prefix; update the function (buildUnmanagedEtcdConfig)
to guarantee a non-empty default by either (A) adding a fallback: if defaultURL
== "" and len(shards) > 0 then set defaultURL = shards[0].Endpoint (or pick the
first shard you deem appropriate) and optionally add a warning log, or (B) more
robustly change the signature to return an error (string, map[string]string,
error), validate that a catch‑all "/" shard exists and return an error to
callers if it does not—apply the chosen fix and update all callers to handle the
fallback or the new error return.
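
Option (B) above can be sketched as follows, using a simplified stand-in for the shard spec type (the real type lives in the HyperShift API package; this is an illustration of the error-returning signature, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// UnmanagedEtcdShardSpec is a simplified stand-in carrying only the
// fields this helper reads.
type UnmanagedEtcdShardSpec struct {
	Endpoint         string
	ResourcePrefixes []string
}

// buildUnmanagedEtcdConfig maps resource prefixes to shard endpoints and
// fails loudly when no shard claims the catch-all "/" prefix, instead of
// silently returning an empty default endpoint.
func buildUnmanagedEtcdConfig(shards []UnmanagedEtcdShardSpec) (string, map[string]string, error) {
	var defaultURL string
	overrides := make(map[string]string)
	for _, shard := range shards {
		for _, prefix := range shard.ResourcePrefixes {
			if prefix == "/" {
				defaultURL = shard.Endpoint
			} else {
				overrides[prefix] = shard.Endpoint
			}
		}
	}
	if defaultURL == "" {
		return "", nil, errors.New(`no etcd shard declares the catch-all "/" resource prefix`)
	}
	return defaultURL, overrides, nil
}

func main() {
	url, overrides, err := buildUnmanagedEtcdConfig([]UnmanagedEtcdShardSpec{
		{Endpoint: "https://etcd-a:2379", ResourcePrefixes: []string{"/"}},
		{Endpoint: "https://etcd-b:2379", ResourcePrefixes: []string{"/events#"}},
	})
	fmt.Println(url, overrides, err)
}
```

Returning an error pushes the decision to callers, which matches how the surrounding params code already propagates validation failures.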

Comment on lines +94 to 116
case hyperv1.Managed:
// For managed etcd with sharding, find the default shard (with "/" prefix)
if hcp.Spec.Etcd.Managed != nil {
shards := hcp.Spec.Etcd.Managed.EffectiveShards(hcp)
for _, shard := range shards {
for _, prefix := range shard.ResourcePrefixes {
if prefix == "/" {
// Found the default shard - use its client service
var serviceName string
if shard.Name == "default" {
serviceName = "etcd-client"
} else {
serviceName = fmt.Sprintf("etcd-client-%s", shard.Name)
}
cfg.StorageConfig.EtcdConnectionInfo.URLs = []string{
fmt.Sprintf("https://%s.%s.svc:2379", serviceName, hcp.Namespace),
}
break
}
}
}
}
}
⚠️ Potential issue | 🔴 Critical

Service naming inconsistency with etcd/service.go.

The service name format etcd-client-%s (line 106) does not match etcd/service.go which uses etcd-%s-client. This is the same inconsistency flagged in kas/deployment.go.

Additionally, the break on line 111 only exits the inner loop over ResourcePrefixes. After finding the "/" prefix and setting the URL, the outer loop continues iterating through remaining shards unnecessarily.

🐛 Proposed fix
 	case hyperv1.Managed:
 		// For managed etcd with sharding, find the default shard (with "/" prefix)
 		if hcp.Spec.Etcd.Managed != nil {
 			shards := hcp.Spec.Etcd.Managed.EffectiveShards(hcp)
+		shardLoop:
 			for _, shard := range shards {
 				for _, prefix := range shard.ResourcePrefixes {
 					if prefix == "/" {
 						// Found the default shard - use its client service
 						var serviceName string
 						if shard.Name == "default" {
 							serviceName = "etcd-client"
 						} else {
-							serviceName = fmt.Sprintf("etcd-client-%s", shard.Name)
+							serviceName = fmt.Sprintf("etcd-%s-client", shard.Name)
 						}
 						cfg.StorageConfig.EtcdConnectionInfo.URLs = []string{
 							fmt.Sprintf("https://%s.%s.svc:2379", serviceName, hcp.Namespace),
 						}
-						break
+						break shardLoop
 					}
 				}
 			}
 		}
 	}
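
The labeled break in the fix above is standard Go: a label on the outer loop lets `break` exit both loops at once. A self-contained illustration of the same search pattern (the shard data here is made up):

```go
package main

import "fmt"

// findDefaultShard returns the index of the first shard whose prefixes
// include the catch-all "/", or -1 if none does.
func findDefaultShard(shards [][]string) int {
	found := -1
shardLoop:
	for i, prefixes := range shards {
		for _, p := range prefixes {
			if p == "/" {
				found = i
				break shardLoop // a plain `break` would only exit the inner loop
			}
		}
	}
	return found
}

func main() {
	shards := [][]string{{"/events#"}, {"/", "/leases#"}, {"/other#"}}
	fmt.Println(findDefaultShard(shards)) // 1
}
```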

jhjaggars and others added 2 commits April 17, 2026 10:09
…des prefixes

Non-core API group resource prefixes (e.g., /coordination.k8s.io/leases#,
/events.k8s.io/events#) were passed to KAS --etcd-servers-overrides with a
leading slash, which KAS rejects. KAS expects:
  - Core group resources:     /resource#servers  (leading slash preserved)
  - Non-core group resources: group/resource#servers (no leading slash)
Add normalizeOverridePrefix() to detect non-core groups by the presence of a
dot in the group segment and strip the leading slash for those entries. Core
resources (e.g., /events#) are unaffected.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
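
The normalization this commit describes can be sketched as follows (a reconstruction from the commit message, not the PR's actual helper; the dot-in-group heuristic and the prefix#servers layout are taken from the text above):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeOverridePrefix adjusts a KAS --etcd-servers-overrides entry of the
// form "prefix#servers": entries for non-core API groups (detected by a dot in
// the group segment) must not carry a leading slash, while core-group entries
// keep it.
func normalizeOverridePrefix(override string) string {
	prefixPart, _, _ := strings.Cut(override, "#")
	trimmed := strings.TrimPrefix(prefixPart, "/")
	group, _, hasGroup := strings.Cut(trimmed, "/")
	if hasGroup && strings.Contains(group, ".") {
		// e.g. "/coordination.k8s.io/leases#..." -> "coordination.k8s.io/leases#..."
		return strings.TrimPrefix(override, "/")
	}
	// core-group entries such as "/events#..." are returned unchanged
	return override
}

func main() {
	fmt.Println(normalizeOverridePrefix("/coordination.k8s.io/leases#https://etcd-leases:2379"))
	fmt.Println(normalizeOverridePrefix("/events#https://etcd-events:2379"))
}
```

Splitting at "#" first matters: the servers portion of the entry contains dots (hostnames, IPs), so the group detection must look only at the prefix portion.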
The oauth-apiserver was CrashLoopBackOff because it tried to connect to
etcd-client-default instead of etcd-client for the default shard. Apply
the same naming logic used in kas/params.go: the "default" shard uses
base name etcd-client, other shards use etcd-client-{name}.
Also fix nil pointer dereference when Managed is nil but ManagementType
is Managed.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment_test.go (1)

60-76: Add NO_PROXY assertions for sharded managed-etcd paths.

These tests validate URL selection well, but the same code path also builds shard-specific NO_PROXY entries (short host + FQDN). Adding assertions here would better guard the oauth-apiserver sharded-etcd regression fix.

✅ Suggested test additions
@@
 			validate: func(t *testing.T, g *GomegaWithT, hcp *hyperv1.HostedControlPlane) {
@@
 				container := util.FindContainer(ComponentName, deployment.Spec.Template.Spec.Containers)
 				g.Expect(container).ToNot(BeNil())
 				g.Expect(container.Args).To(ContainElement("--etcd-servers=https://etcd-client.test-ns.svc:2379"))
 				g.Expect(container.Args).ToNot(ContainElement(ContainSubstring("etcd-client-default")))
+				noProxyEnv := util.FindEnvVar("NO_PROXY", container.Env)
+				g.Expect(noProxyEnv).ToNot(BeNil())
+				g.Expect(noProxyEnv.Value).To(ContainSubstring("etcd-client"))
+				g.Expect(noProxyEnv.Value).To(ContainSubstring("etcd-client.test-ns.svc"))
 			},
 		},
@@
 			validate: func(t *testing.T, g *GomegaWithT, hcp *hyperv1.HostedControlPlane) {
@@
 				container := util.FindContainer(ComponentName, deployment.Spec.Template.Spec.Containers)
 				g.Expect(container).ToNot(BeNil())
 				g.Expect(container.Args).To(ContainElement("--etcd-servers=https://etcd-client-core.test-ns.svc:2379"))
+				noProxyEnv := util.FindEnvVar("NO_PROXY", container.Env)
+				g.Expect(noProxyEnv).ToNot(BeNil())
+				g.Expect(noProxyEnv.Value).To(ContainSubstring("etcd-client-core"))
+				g.Expect(noProxyEnv.Value).To(ContainSubstring("etcd-client-core.test-ns.svc"))
 			},
 		},

Also applies to: 107-122

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment_test.go`
around lines 60 - 76, The test's validate closure should be extended to assert
that the oauth-apiserver container's NO_PROXY entries include both the shard
short host and its FQDN for sharded managed-etcd; after calling adaptDeployment
and locating the container via util.FindContainer(ComponentName, ...), look up
the env var with util.FindEnvVar("NO_PROXY", container.Env) and add Gomega
assertions that its value contains both the shard short host (e.g.,
"etcd-client") and the FQDN (e.g., "etcd-client.test-ns.svc") so both forms are
present; apply the same additions to the other validate block that tests the
alternate scenario.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 857eb9b6-449d-4cc7-996f-f36ef4445243

📥 Commits

Reviewing files that changed from the base of the PR and between d12b64a and 71c4ca4.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go

Remove unused etcdv2 import, register ServiceMonitor/PDB types in test
schemes, fix discovery service naming convention in test expectations,
set Managed spec in KAS params test, and regenerate CRD manifests.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go (1)

320-323: ⚠️ Potential issue | 🟡 Minor

Assert NotFound explicitly after cleanup delete.

On Line 321, the test currently passes for any error, which can hide unrelated failures. Validate apierrors.IsNotFound(err) instead.

Proposed assertion hardening diff
 import (
 	"context"
 	"testing"
 
 	hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
 	"github.com/openshift/hypershift/api/util/ipnet"
 	"github.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane/manifests"
 	"github.com/openshift/hypershift/support/metrics"
 	"github.com/openshift/hypershift/support/upsert"
 
 	prometheusoperatorv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
 
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	policyv1 "k8s.io/api/policy/v1"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/utils/ptr"
@@
 	// Verify orphaned StatefulSet was deleted
 	err = fakeClient.Get(ctx, client.ObjectKeyFromObject(orphanedSts), orphanedSts)
 	if err == nil {
 		t.Error("Expected orphaned StatefulSet to be deleted")
+	}
+	if !apierrors.IsNotFound(err) {
+		t.Fatalf("Expected NotFound after deletion, got: %v", err)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go`
around lines 320 - 323, The test currently treats any non-nil error as success
after calling fakeClient.Get(ctx, client.ObjectKeyFromObject(orphanedSts),
orphanedSts); change this to explicitly assert that the error is a NotFound
error by using apierrors.IsNotFound(err) (import
"k8s.io/apimachinery/pkg/api/errors" as apierrors) and fail the test if the
error is nil or not a NotFound (e.g., if err == nil -> t.Error("expected
orphaned StatefulSet to be deleted") else if !apierrors.IsNotFound(err) ->
t.Fatalf("unexpected error getting orphaned StatefulSet: %v", err)); keep
references to orphanedSts and fakeClient.Get to locate the code.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go (1)

28-33: Consider extracting shared test setup helpers.

Scheme registration and HCP scaffolding are repeated across tests. A small helper (newTestScheme, newTestHCP) would cut duplication and make future shard-case additions easier.

Also applies to: 94-99, 185-188, 234-237, 283-289

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go`
around lines 28 - 33, Tests repeat scheme registration and hosted control plane
scaffolding; extract helpers to reduce duplication by adding a newTestScheme
helper that calls runtime.NewScheme() and registers appsv1, corev1, policyv1,
hyperv1, prometheusoperatorv1 (replacing the repeated runtime.NewScheme() +
AddToScheme calls), and a newTestHCP helper that returns a pre-populated
hyperv1.HostedControlPlane used by tests (replace the repeated HCP scaffolding).
Update tests to call newTestScheme() and newTestHCP() where the scheme variable
and HCP scaffolding currently appear (references: runtime.NewScheme(),
AddToScheme calls, and the HostedControlPlane construction).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 913f63c9-12cb-4099-b290-930b0de63d15

📥 Commits

Reviewing files that changed from the base of the PR and between 71c4ca4 and 9808425.

⛔ Files ignored due to path filters (34)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ImageStreamImportMode.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDC.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HCPEtcdBackup.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ImageStreamImportMode.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/KMSEncryptionProvider.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
📒 Files selected for processing (7)
  • control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go
  • control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset_test.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment_test.go
✅ Files skipped from review due to trivial changes (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/deployment_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go

Fix kubeapilinter violations on new etcd sharding API types: add
MinLength markers, use XValidation instead of Pattern, use +default
instead of +kubebuilder:default, add omitempty/omitzero tags, and
convert optional pointer fields to value types with omitzero.
Fix gci import ordering in shards_test.go, convert if/else chain
to tagged switch in kas/deployment.go, and update CLI code for
non-pointer Storage/BackupSchedule fields. Regenerate CRDs and docs.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (18)
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml (1)

2767-2901: ⚠️ Potential issue | 🟠 Major

Require exactly one unmanaged etcd mode.

After the legacy required fields were removed, managementType: Unmanaged now accepts {} and partial legacy configs like endpoint without tls. That punts an obviously invalid spec to reconciliation time. The schema should require either shards, or both legacy endpoint and tls—and ideally forbid mixing both modes.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or both legacy endpoint and tls must be configured
+                      rule: 'has(self.shards) || (has(self.endpoint) && has(self.tls))'
+                    - message: legacy endpoint/tls cannot be combined with shards
+                      rule: '!has(self.shards) || (!has(self.endpoint) && !has(self.tls))'

Also applies to: 2905-2913

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml`
around lines 2767 - 2901, The schema for Unmanaged etcd mode currently allows
empty or partial legacy configs; update the manifest validations so that when
managementType is Unmanaged the object requires either the "shards" array OR
both legacy "endpoint" and "tls" (and rejects configs mixing them). Concretely,
add x-kubernetes-validations on the Unmanaged spec (the block that defines
endpoint, shards, and tls) with rules: 1) ensure (self.shards != null) XOR
(self.endpoint != null && self.tls != null) so exactly one mode is provided, 2)
if shards is present forbid endpoint/tls, and 3) if endpoint or tls is present
require both endpoint and tls; reference the existing properties "shards",
"endpoint", "tls" and the Unmanaged type name (UnmanagedEtcdShardSpec / the
parent Unmanaged spec) when adding these validation rules.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml (1)

2788-2922: ⚠️ Potential issue | 🟠 Major

Unmanaged mode now accepts configs with no usable etcd connection.

After removing the old required endpoint/tls pair, there is no replacement cross-field validation requiring either shards or legacy endpoint and tls. As written, spec.etcd.unmanaged: {} and partial legacy configs are valid against the CRD but cannot be reconciled.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or both endpoint and tls must be set
+                      rule: 'has(self.shards) || (has(self.endpoint) && has(self.tls))'

If mixed legacy + shard config should also be rejected instead of ignored, this rule should be tightened further in the source API type before regenerating the manifest.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml`
around lines 2788 - 2922, The UnmanagedEtcdSpec validation is missing a
cross-field rule to require either a non-empty shards array or both legacy
endpoint and tls, so empty/partial configs like spec.etcd.unmanaged: {} validate
but cannot be reconciled; add an x-kubernetes-validations entry on the
UnmanagedEtcdSpec (the object that currently has properties endpoint, shards,
tls) with a rule such as self.shards != null && self.shards.size() > 0 ||
(self.endpoint != null && self.tls != null) and a clear message like "must
specify either shards or both endpoint and tls" to reject empty/partial configs
(adjust CEL expression syntax to match the other rules in this file).
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go (1)

125-136: ⚠️ Potential issue | 🟠 Major

Use the resolved shard storage for restore as well.

The restore init container still reads managedEtcdSpec.Storage.RestoreSnapshotURL, so a shard-level storage.restoreSnapshotURL override is ignored even though PVC sizing/class now uses defaultShard.Storage.

Suggested fix
+	storage := managedEtcdSpec.Storage
+	if defaultShard.Storage.Type != "" {
+		storage = defaultShard.Storage
+	}
+
 	snapshotRestored := meta.IsStatusConditionTrue(hcp.Status.Conditions, string(hyperv1.EtcdSnapshotRestored))
-	if managedEtcdSpec != nil && len(managedEtcdSpec.Storage.RestoreSnapshotURL) > 0 && !snapshotRestored {
+	if len(storage.RestoreSnapshotURL) > 0 && !snapshotRestored {
 		sts.Spec.Template.Spec.InitContainers = append(sts.Spec.Template.Spec.InitContainers,
-			buildEtcdInitContainer(managedEtcdSpec.Storage.RestoreSnapshotURL[0]), // RestoreSnapshotURL can only have 1 entry
+			buildEtcdInitContainer(storage.RestoreSnapshotURL[0]), // RestoreSnapshotURL can only have 1 entry
 		)
 	}
-
-	// adapt PersistentVolume using shard-specific storage or default
-	storage := managedEtcdSpec.Storage
-	if defaultShard.Storage.Type != "" {
-		storage = defaultShard.Storage
-	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 125 - 136, The restore init container is still using
managedEtcdSpec.Storage.RestoreSnapshotURL directly so shard-level overrides are
ignored; modify the logic that appends buildEtcdInitContainer to use the
resolved storage variable (the one set from defaultShard.Storage fallback)
instead of managedEtcdSpec.Storage.RestoreSnapshotURL, e.g. reference
storage.RestoreSnapshotURL (and check for nil/len>0) when calling
buildEtcdInitContainer and when computing snapshotRestored so shard-specific
storage.restoreSnapshotURL is honored; update usages involving snapshotRestored,
buildEtcdInitContainer, and sts.Spec.Template.Spec.InitContainers accordingly.
cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml (1)

3549-3683: ⚠️ Potential issue | 🟠 Major

Require either shard config or legacy endpoint/tls for unmanaged etcd.

After dropping the old required fields, managementType: Unmanaged now accepts unusable specs like unmanaged: {}. The CRD should reject configs that provide neither shards nor legacy endpoint + tls.

Suggested validation
                     type: object
+                    x-kubernetes-validations:
+                    - message: unmanaged etcd requires shards or legacy endpoint/tls
+                      rule: 'has(self.shards) || (has(self.endpoint) && has(self.tls))'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml`
around lines 3549 - 3683, The Unmanaged etcd schema currently allows an empty
spec (unmanaged: {}) because there's no validation requiring either shards OR
the legacy endpoint+tls pair; add an x-kubernetes-validations CEL rule on the
UnmanagedEtcdSpec object (the same object that declares properties endpoint,
shards, tls) that enforces "shards is present OR (endpoint and tls are
present)". Implement it as a validation entry with a clear message like "must
specify either shards or both endpoint and tls" and a rule such as: self.shards
!= null || (self.endpoint != null && self.tls != null).
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ImageStreamImportMode.yaml (1)

2647-2781: ⚠️ Potential issue | 🟠 Major

Restore validation for legacy unmanaged mode.

After dropping the object-level required fields, managementType: Unmanaged now accepts unmanaged: {} when shards is omitted. That pushes an invalid config past admission even though neither the legacy endpoint/tls path nor the shard list is actually configured.

Suggested validation fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: endpoint and tls are required when shards are not specified
+                      rule: has(self.shards) ? true : has(self.endpoint) && has(self.tls)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ImageStreamImportMode.yaml`
around lines 2647 - 2781, When managementType == "Unmanaged" ensure legacy
single-etcd mode cannot be empty by adding an object-level validation that
requires either shards be present with at least one item OR the legacy
endpoint+tls fields be populated; update the validation on the Unmanaged
field/object (referencing managementType, unmanaged, shards, endpoint, tls) to
enforce: if self.managementType == 'Unmanaged' then (self.shards != null &&
self.shards.size() > 0) || (self.endpoint != null && self.endpoint != '' &&
self.tls != null), and include a clear error message like "for Unmanaged
management either shards must be configured or endpoint and tls must be set".
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml (1)

3185-3318: ⚠️ Potential issue | 🟠 Major

Reject incomplete or mixed unmanaged etcd configs at admission time.

After making endpoint and tls optional, this schema no longer enforces either valid legacy single-etcd mode or valid sharded mode. managementType: Unmanaged can now admit {}, endpoint without tls, or shards plus legacy fields that are merely documented as ignored. That turns an API contract error into a reconcile/runtime failure.

Suggested validation fix
                   unmanaged:
                     description: |-
                       unmanaged specifies configuration which enables the control plane to
                       integrate with an externally managed etcd cluster.
                     properties:
                       endpoint:
                         description: |-
                           endpoint is the full etcd cluster client endpoint URL.
                           Used only when shards is not specified (legacy single-etcd mode).
                           When shards are specified, this field is ignored.
...
                       tls:
                         description: |-
                           tls specifies TLS configuration for HTTPS etcd client endpoints.
                           Used only when shards is not specified (legacy single-etcd mode).
                           When shards are specified, this field is ignored.
                         properties:
                           clientSecret:
                             ...
                         required:
                         - clientSecret
                         type: object
+                    x-kubernetes-validations:
+                    - message: endpoint and tls are required when shards are not configured
+                      rule: 'has(self.shards) ? true : has(self.endpoint) && has(self.tls)'
+                    - message: endpoint and tls must be omitted when shards are configured
+                      rule: '!has(self.shards) || (!has(self.endpoint) && !has(self.tls))'
                     type: object
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml`
around lines 3185 - 3318, The Unmanaged etcd schema now permits invalid/mixed
configs because making endpoint and tls optional allows empty objects or
combinations (e.g., endpoint without tls, or shards alongside legacy fields)
under managementType: Unmanaged; update the CRD validations to reject incomplete
or mixed unmanaged configs by adding validation rules that (1) when
managementType == "Unmanaged" require either (A) legacy single-etcd mode:
endpoint AND tls.clientSecret present and endpoint startsWith('https://'), or
(B) sharded mode: shards present with minItems >=1 and exactly one shard whose
resourcePrefixes contains "/" and each non-default prefix endsWith('#'), and (2)
forbid providing both shards and legacy endpoint/tls together; reference the
top-level fields endpoint, tls, shards, the UnmanagedEtcdShardSpec
(items.endpoint, items.tls, items.resourcePrefixes) and the
x-kubernetes-validations block to implement these checks.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/GCPPlatform.yaml (1)

2629-2763: ⚠️ Potential issue | 🟠 Major

Require a complete unmanaged config when shards is omitted.

After dropping the old required: [endpoint, tls], the schema now admits spec.etcd.unmanaged: {} and half-configured legacy specs like only endpoint or only tls. This should be guarded with an object-level validation requiring either shards, or both legacy fields together.

Suggested validation
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or both endpoint and tls must be set
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/GCPPlatform.yaml`
around lines 2629 - 2763, Add an object-level x-kubernetes-validations on the
unmanaged etcd spec to require either shards or both legacy fields (endpoint and
tls). Specifically, in the same object that contains the existing
shards/endpoint/tls definitions add a validation entry with message like "must
provide either shards, or both endpoint and tls" and rule: self.shards != null
|| (self.endpoint != null && self.tls != null). Update the existing
x-kubernetes-validations list for the unmanaged object (the one that contains
fields named "shards", "endpoint", and "tls") to include this new rule.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml (1)

2712-2846: ⚠️ Potential issue | 🟠 Major

Restore legacy single-etcd validation when shards is absent.

After relaxing the root-level required fields, managementType: Unmanaged now accepts {} or partial legacy configs as long as spec.etcd.unmanaged exists. That breaks admission-time validation for the non-sharded path and will surface later as reconcile failures.

🔧 Suggested fix
                   unmanaged:
                     description: |-
                       unmanaged specifies configuration which enables the control plane to
                       integrate with an externally managed etcd cluster.
                     properties:
                       endpoint:
                         description: |-
                           endpoint is the full etcd cluster client endpoint URL.
                           Used only when shards is not specified (legacy single-etcd mode).
                           When shards are specified, this field is ignored.
@@
                       tls:
                         description: |-
                           tls specifies TLS configuration for HTTPS etcd client endpoints.
                           Used only when shards is not specified (legacy single-etcd mode).
                           When shards are specified, this field is ignored.
@@
-                    type: object
+                    type: object
+                    x-kubernetes-validations:
+                    - message: endpoint and tls are required when shards is not specified
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml`
around lines 2712 - 2846, The Unmanaged etcd spec currently allows missing
legacy single-etcd fields when shards is absent; add CEL validations on the same
object that enforce the legacy path: when shards is null/absent require endpoint
and tls (and ensure endpoint startsWith('https://') and tls.clientSecret
exists). Concretely, add x-kubernetes-validations rules on the UnmanagedEtcdSpec
object (the same object that defines endpoint, shards, tls) such as: "rule:
self.shards == null implies (self.endpoint != null && self.tls != null)" and a
rule to validate endpoint format and tls.clientSecret presence (e.g.
"self.shards == null implies self.endpoint.startsWith('https://')" and
"self.shards == null implies self.tls.clientSecret != null") so legacy
single-etcd configs are rejected when required fields are missing.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml (1)

2651-2785: ⚠️ Potential issue | 🟠 Major

Restore validation for legacy unmanaged configs.

After removing the old required fields, this object no longer enforces the non-sharded path. managementType: Unmanaged can now validate with neither shards nor legacy endpoint/tls, leaving reconciliation with no usable etcd target.

🔧 Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards must be set, or both endpoint and tls are required
+                      rule: has(self.shards) ? true : has(self.endpoint) && has(self.tls)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml`
around lines 2651 - 2785, The schema no longer enforces that Unmanaged hosts
have an etcd target; add an x-kubernetes-validations rule on the same object
that requires, when managementType == "Unmanaged", either shards is present
(non-empty) OR the legacy single-etcd fields (endpoint and tls.clientSecret) are
present; implement using a CEL expression such as: self.managementType !=
'Unmanaged' || (self.shards != null && self.shards.size() > 0) || (self.endpoint
!= null && self.tls != null && self.tls.clientSecret != null) and provide a
clear message like "Unmanaged requires either shards or endpoint+tls".
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml (1)

3166-3300: ⚠️ Potential issue | 🟠 Major

Reject empty unmanaged etcd configurations.

Relaxing the legacy top-level requirements is fine for sharded mode, but this block now accepts unusable configs such as spec.etcd.unmanaged: {} or a legacy endpoint without tls when shards is absent. That should be rejected at admission time.

Suggested fix
                   unmanaged:
                     description: |-
                       unmanaged specifies configuration which enables the control plane to
                       integrate with an externally managed etcd cluster.
                     properties:
                       endpoint:
                         ...
                       shards:
                         ...
                       tls:
                         ...
+                    x-kubernetes-validations:
+                    - message: set shards, or set both endpoint and tls for legacy single-etcd mode
+                      rule: 'has(self.shards) || (has(self.endpoint) && has(self.tls))'
                     type: object

This preserves backward compatibility for legacy single-etcd mode while still allowing the new sharded shape.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml`
around lines 3166 - 3300, The unmanaged etcd schema currently permits empty or
incomplete configs (e.g., spec.etcd.unmanaged: {} or legacy endpoint without tls
when shards is absent); add validation to the parent unmanaged object so that
when shards is not provided the legacy single-etcd mode requires both endpoint
and tls and reject an empty unmanaged object, and ensure the existing shards
array validation remains unchanged; locate the unmanaged schema block (the
object containing properties endpoint, shards, tls and the
UnmanagedEtcdShardSpec), add an "x-kubernetes-validations" rule to require (a)
if shards is missing then endpoint AND tls must be present and non-empty, and
(b) the unmanaged object cannot be entirely empty, using self and exists/has
checks referencing endpoint, tls, and shards to enforce these constraints.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml (1)

2629-2763: ⚠️ Potential issue | 🟠 Major

Keep legacy unmanaged mode valid only when endpoint and tls are both present.

After removing the unconditional required list, managementType: Unmanaged now accepts unmanaged: {} when shards is omitted. That admits an invalid legacy single-etcd config.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or legacy endpoint/tls must be specified
+                      rule: 'has(self.shards) ? true : has(self.endpoint) && has(self.tls)'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml`
around lines 2629 - 2763, The Unmanaged legacy single-etcd mode currently allows
an empty unmanaged object when shards is omitted; update the CRD validations so
that when managementType == "Unmanaged" (or the top-level unmanaged block is
present) and shards is not specified, both endpoint and tls must be present: add
a conditional x-kubernetes-validations rule on the unmanaged object (or the
parent spec) that checks if shards is empty/absent then self.endpoint exists()
AND self.tls exists(), and ensure existing per-shard rules remain unchanged
(reference: managementType, unmanaged, endpoint, tls, shards,
UnmanagedEtcdShardSpec).
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml (1)

2712-2846: ⚠️ Potential issue | 🟠 Major

Restore CRD validation for legacy unmanaged etcd mode.

After removing the old top-level required pair, this schema now accepts .spec.etcd.unmanaged without shards and without a complete legacy {endpoint,tls} config. That lets invalid specs through API validation and defers the failure to reconciliation.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or both endpoint and tls must be specified
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml`
around lines 2712 - 2846, The schema for UnmanagedEtcdSpec currently permits
missing legacy fields because the top-level required constraint was removed; add
a conditional validation on the parent object (the UnmanagedEtcdSpec that
declares "endpoint", "shards", and "tls") to require that when "shards" is not
present, both "endpoint" and "tls" must be present (e.g. add an
x-kubernetes-validations rule that checks: if not self.exists(s, s == 'shards')
then self.exists(e, e == 'endpoint') and self.exists(t, t == 'tls') with an
explicit message like "when shards is not specified, endpoint and tls are
required"), referencing the existing properties "shards", "endpoint", and "tls"
so legacy single-etcd mode is validated again.
cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml (1)

3214-3348: ⚠️ Potential issue | 🟠 Major

spec.etcd.unmanaged can now be empty and still validate.

Dropping the old required: [endpoint, tls] without adding a replacement object-level rule means managementType: Unmanaged now accepts {}, {endpoint: ...}, or {tls: ...}. Those specs are incomplete and will fail later instead of being rejected by the API.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: either shards or both endpoint and tls must be specified
+                      rule: 'has(self.shards) || (has(self.endpoint) && has(self.tls))'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml`
around lines 3214 - 3348, The Unmanaged etcd spec currently allows empty objects
because the old required: [endpoint, tls] was removed; add an object-level
validation on the Unmanaged spec (the object that contains endpoint, shards,
tls) via x-kubernetes-validations to enforce that either shards is present OR
both endpoint and tls are present (e.g., rule: self.exists(s, s.shards) ||
(self.endpoint and self.tls) expressed in the CRD's CEL/validation syntax),
referencing the existing properties endpoint, shards, and tls so incomplete {}
or single-field specs are rejected by the API.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml (1)

2721-2855: ⚠️ Potential issue | 🟠 Major

Reject empty or partial legacy unmanaged etcd configs.

This schema now accepts managementType: Unmanaged with unmanaged: {} and also endpoint-only / tls-only objects when shards is omitted. In legacy single-etcd mode those objects are unusable, so they should be rejected at CRD validation time.

Because this file is generated, please add the conditional validation in the source API type/markers and regenerate.

Proposed fix
                   unmanaged:
                     description: |-
                       unmanaged specifies configuration which enables the control plane to
                       integrate with an externally managed etcd cluster.
                     properties:
                       endpoint:
                         ...
                       shards:
                         ...
                       tls:
                         ...
                     type: object
+                    x-kubernetes-validations:
+                    - message: configure either unmanaged.shards or legacy unmanaged.endpoint and unmanaged.tls
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml`
around lines 2721 - 2855, The CRD currently allows managementType: Unmanaged
with an empty unmanaged object and permits legacy single-etcd objects that have
only endpoint or only tls when shards is omitted; update the API type markers to
add conditional validations: when managementType == "Unmanaged" require that the
unmanaged spec object is non-empty (reject {}), and when shards is not specified
require both unmanaged.endpoint and unmanaged.tls to be present (reject
endpoint-only or tls-only); add the corresponding x-kubernetes-validation
markers on the UnmanagedEtcdSpec (and the parent Unmanaged field) to enforce
these rules, then regenerate the YAML so the generated file
(hostedclusters...AAA_ungated.yaml) includes the new validations.
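A cheap way to sanity-check a CEL rule like this before wiring it into markers and regenerating CRDs is to mirror it in a short script. The sketch below simulates the proposed has(self.shards) || (has(self.endpoint) && has(self.tls)) rule; the function name and dict shapes are illustrative, not from the PR:

```python
def unmanaged_spec_valid(spec: dict) -> bool:
    """Mirror of the proposed CEL rule:
    has(self.shards) || (has(self.endpoint) && has(self.tls))."""
    if "shards" in spec:
        return True
    return "endpoint" in spec and "tls" in spec

# Cases from the finding: empty and single-field specs are rejected,
# legacy endpoint+tls and shard-based specs pass.
assert not unmanaged_spec_valid({})
assert not unmanaged_spec_valid({"endpoint": "https://etcd-client:2379"})
assert not unmanaged_spec_valid({"tls": {"clientSecret": {"name": "etcd-client-tls"}}})
assert unmanaged_spec_valid({"endpoint": "https://etcd-client:2379",
                             "tls": {"clientSecret": {"name": "etcd-client-tls"}}})
assert unmanaged_spec_valid({"shards": [{"name": "main", "resourcePrefixes": ["/"]}]})
```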
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml (1)

3045-3179: ⚠️ Potential issue | 🟠 Major

Re-add legacy unmanaged requiredness when shards is unset.

After making endpoint and tls shard-conditional, the schema no longer requires either legacy config or shard config. managementType: Unmanaged can now admit unmanaged: {}, which is invalid and will only fail later.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: endpoint and tls are required when shards is not specified
+                      rule: has(self.shards) ? true : has(self.endpoint) && has(self.tls)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml`
around lines 3045 - 3179, The Unmanaged etcd schema allows an empty unmanaged
config because after making endpoint and tls conditional on shards, nothing
enforces legacy fields when managementType is Unmanaged; add a CEL validation at
the same object level that enforces when managementType == "Unmanaged" then
either shards is present or both endpoint and tls are present. Concretely, add
an x-kubernetes-validations entry referencing managementType/unmanaged with a
message like "Unmanaged must specify either shards or both endpoint and tls" and
a rule such as self.managementType != 'Unmanaged' || has(self.unmanaged.shards) ||
(has(self.unmanaged.endpoint) && has(self.unmanaged.tls)), placed on the parent
schema that contains managementType and the unmanaged object's shards, endpoint,
and tls properties.
cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-Default.crd.yaml (1)

3131-3265: ⚠️ Potential issue | 🟠 Major

Legacy unmanaged mode no longer validates required connection fields.

After removing the old required coupling, there is no replacement rule that requires endpoint and tls when shards is absent. That means invalid legacy configs like spec.etcd.unmanaged: {} now pass CRD validation even though single-etcd mode cannot work without both fields. Please add an object-level conditional validation in the source type and regenerate this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-Default.crd.yaml`
around lines 3131 - 3265, Add an object-level conditional validation to the
UnmanagedEtcdSpec/source type so that when shards is absent the fields endpoint
and tls are required: update the Go CRD struct tags or validation markers on the
UnmanagedEtcdSpec (or the parent type that defines spec.etcd.unmanaged) to
express "if shards == nil then endpoint and tls are required" (use
kubebuilder+validation marker for conditional/object-level validation), then
regenerate the CRD so the generated YAML for fields endpoint, tls, and shards
includes the corresponding x-kubernetes-validations rule enforcing that legacy
single-etcd mode must include both endpoint and tls when shards is not
specified.
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml (1)

2777-2911: ⚠️ Potential issue | 🟠 Major

Add a one-of validation for unmanaged single-etcd vs sharded mode.

This schema now accepts invalid unmanaged specs: neither endpoint/tls nor shards, or both at once. Both cases should be rejected at admission instead of creating an unreconcilable HostedCluster or silently ignoring fields.

Suggested fix
                     type: object
+                    x-kubernetes-validations:
+                    - message: endpoint and tls are required when shards is not specified
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
+                    - message: endpoint and tls must not be set when shards is specified
+                      rule: '!has(self.shards) || (!has(self.endpoint) && !has(self.tls))'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml`
around lines 2777 - 2911, The schema currently allows invalid unmanaged etcd
specs when neither or both modes are set; add a CEL validation on the parent
object (the UnmanagedEtcdSpec represented by the block containing endpoint,
shards, tls) to enforce exactly-one-of: either shards is present (self.shards !=
null) OR legacy single-etcd mode is present (both self.endpoint != null and
self.tls != null), but not both; implement this as an x-kubernetes-validations
rule referencing self.shards, self.endpoint and self.tls that returns true only
when (self.shards != null && self.endpoint == null && self.tls == null) ||
(self.shards == null && self.endpoint != null && self.tls != null), and include
a clear message like "must specify either shards or endpoint+tls (mutually
exclusive)".
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml (1)

2707-2846: ⚠️ Potential issue | 🟠 Major

Reject empty unmanaged etcd configs.

After removing the top-level required: [endpoint, tls], this schema no longer requires either the legacy single-etcd fields or shards. spec.etcd.managementType: Unmanaged with spec.etcd.unmanaged: {} now passes CRD validation and will fail later.

Suggested object-level validation
                   unmanaged:
                     description: |-
                       unmanaged specifies configuration which enables the control plane to
                       integrate with an externally managed etcd cluster.
                     properties:
                       endpoint:
                         ...
                       tls:
                         ...
+                    x-kubernetes-validations:
+                    - message: either shards or endpoint/tls must be set for unmanaged etcd
+                      rule: has(self.shards) || (has(self.endpoint) && has(self.tls))
                     type: object
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml`
around lines 2707 - 2846, The unmanaged etcd schema allows an empty object
because the top-level required for endpoint/tls was removed; update the
UnmanagedEtcdSpec validation to require either a non-empty shards array or the
legacy single-etcd fields (endpoint and tls). Concretely, add an
x-kubernetes-validations rule on the unmanaged object (the "unmanaged" schema)
that enforces: either self.shards != null && self.shards.size() >= 1 OR
(self.endpoint != null && self.tls != null), and keep the existing shard-level
validations for resourcePrefixes; reference the "unmanaged", "shards",
"endpoint", and "tls" schema entries when adding this rule.
♻️ Duplicate comments (3)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go (1)

45-50: ⚠️ Potential issue | 🔴 Critical

Use the canonical discovery Service name for named shards.

Named shards are created as etcd-discovery-<shard>, not etcd-<shard>-discovery. The current value breaks both sts.Spec.ServiceName and the peer DNS names in ETCD_INITIAL_CLUSTER.

Suggested fix
 	if defaultShard.Name == "default" {
 		sts.Name = "etcd"
 		sts.Spec.ServiceName = "etcd-discovery"
 	} else {
 		sts.Name = fmt.Sprintf("etcd-%s", defaultShard.Name)
-		sts.Spec.ServiceName = fmt.Sprintf("etcd-%s-discovery", defaultShard.Name)
+		sts.Spec.ServiceName = fmt.Sprintf("etcd-discovery-%s", defaultShard.Name)
 	}
@@
 		if defaultShard.Name == "default" {
 			podPrefix = "etcd"
 			discoveryService = "etcd-discovery"
 		} else {
 			podPrefix = fmt.Sprintf("etcd-%s", defaultShard.Name)
-			discoveryService = fmt.Sprintf("etcd-%s-discovery", defaultShard.Name)
+			discoveryService = fmt.Sprintf("etcd-discovery-%s", defaultShard.Name)
 		}

Also applies to: 69-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 45 - 50, The service name for named shards is constructed
incorrectly; when defaultShard.Name != "default" set sts.Name to
fmt.Sprintf("etcd-%s", defaultShard.Name) but sts.Spec.ServiceName must use the
canonical pattern "etcd-discovery-<shard>" (not "etcd-<shard>-discovery"), and
the same correction must be applied to the other occurrence around the
ETCD_INITIAL_CLUSTER construction (the block that builds peer DNS names and
service references); update sts.Spec.ServiceName and any code that formats
peer/service DNS (referencing defaultShard.Name, sts.Name, and
ETCD_INITIAL_CLUSTER assembly) to use "etcd-discovery-%s" for named shards.
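The canonical naming scheme the fix describes can be captured in one helper, which also makes it easy to unit-test that StatefulSet and discovery Service names stay in sync (a sketch; the helper name is an assumption, not code from the PR):

```python
def etcd_names(shard: str) -> tuple[str, str]:
    """Return (StatefulSet name, discovery Service name) following the
    canonical scheme described in the finding."""
    if shard == "default":
        return "etcd", "etcd-discovery"
    return f"etcd-{shard}", f"etcd-discovery-{shard}"

assert etcd_names("default") == ("etcd", "etcd-discovery")
# Named shards: the discovery Service is etcd-discovery-<shard>,
# not etcd-<shard>-discovery.
assert etcd_names("events") == ("etcd-events", "etcd-discovery-events")
```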
api/hypershift/v1beta1/hostedcluster_types.go (2)

1916-1926: ⚠️ Potential issue | 🟠 Major

The catch-all prefix rule is still “at least one”, not “exactly one”.

self.exists(s, '/' in s.resourcePrefixes) permits multiple shards with "/", even though the doc and error message say there must be exactly one. That ambiguity leaks directly into controller-side shard selection.

Suggested fix
-	// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
+	// +kubebuilder:validation:XValidation:rule="self.filter(s, '/' in s.resourcePrefixes).size() == 1",message="exactly one shard must have '/' prefix"
@@
-	// +kubebuilder:validation:XValidation:rule="self.exists(s, '/' in s.resourcePrefixes)",message="exactly one shard must have '/' prefix"
+	// +kubebuilder:validation:XValidation:rule="self.filter(s, '/' in s.resourcePrefixes).size() == 1",message="exactly one shard must have '/' prefix"

Also applies to: 2078-2088

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/hypershift/v1beta1/hostedcluster_types.go` around lines 1916 - 1926, The
XValidation rule on the Shards field currently uses self.exists(...) which
allows multiple shards to include "/" but the requirement is exactly one;
replace the rule with a CEL expression that asserts exactly one shard has "/"
(for example use self.filter(s, '/' in s.resourcePrefixes).size() == 1) and
update the XValidation message accordingly; apply the same change to the other
duplicate Shards validation instance elsewhere in the file.
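The behavioral difference between exists and filter(...).size() == 1 is easy to demonstrate in a quick simulation; in particular, the "two catch-alls" case passes the current rule but fails the corrected one (names below are illustrative):

```python
def exactly_one_catch_all(shards: list[dict]) -> bool:
    """Mirror of the corrected CEL rule:
    self.filter(s, '/' in s.resourcePrefixes).size() == 1."""
    return sum(1 for s in shards if "/" in s.get("resourcePrefixes", [])) == 1

assert not exactly_one_catch_all([{"resourcePrefixes": ["/events"]}])       # zero
assert exactly_one_catch_all([{"resourcePrefixes": ["/"]},
                              {"resourcePrefixes": ["/events"]}])           # one
# Two shards claim "/": exists() would accept this, filter().size() rejects it.
assert not exactly_one_catch_all([{"resourcePrefixes": ["/"]},
                                  {"resourcePrefixes": ["/", "/leases"]}])
```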

2062-2089: ⚠️ Potential issue | 🟠 Major

Reject mixed or empty unmanaged etcd configuration.

endpoint/tls are documented as legacy single-shard fields, but nothing currently enforces either “shards” or “endpoint + tls”. As written, this accepts both modes at once and also accepts an unmanaged spec with neither configured.

Suggested fix
+// +kubebuilder:validation:XValidation:rule="has(self.shards) || (self.endpoint != '' && has(self.tls.clientSecret.name))",message="either shards must be specified, or both endpoint and tls are required"
+// +kubebuilder:validation:XValidation:rule="!has(self.shards) || (self.endpoint == '' && !has(self.tls.clientSecret.name))",message="endpoint and tls must not be set when shards is specified"
 type UnmanagedEtcdSpec struct {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/hypershift/v1beta1/hostedcluster_types.go` around lines 2062 - 2089,
UnmanagedEtcdSpec currently allows empty or mixed configs; add kubebuilder
XValidation rules on UnmanagedEtcdSpec to enforce that exactly one mode is
chosen: either Shards is present (len>0) OR Endpoint is set (non-empty) — but
not both — and when Shards is empty Endpoint must be provided; update the
UnmanagedEtcdSpec validation to include an XValidation rule that enforces mutual
exclusivity between Shards and Endpoint/TLS (XOR) and another rule ensuring
Endpoint is non-empty when Shards is absent, referencing the UnmanagedEtcdSpec
type and its fields Endpoint, TLS, and Shards so the API server rejects mixed or
empty unmanaged etcd configurations.
🟡 Minor comments (3)
api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml-2452-2519 (1)

2452-2519: ⚠️ Potential issue | 🟡 Minor

Mirror the restore snapshot add/remove guard on per-shard storage.

spec.etcd.managed.storage prevents adding or removing restoreSnapshotURL after creation, but the duplicated per-shard storage schema does not. That lets users make day-2 changes that won't be honored once the shard has already bootstrapped its PV.

Suggested fix
                               required:
                               - type
                               type: object
+                              x-kubernetes-validations:
+                              - message: restoreSnapshotURL cannot be added or removed after creation
+                                rule: has(self.restoreSnapshotURL) == has(oldSelf.restoreSnapshotURL)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml`
around lines 2452 - 2519, The per-shard storage schema's restoreSnapshotURL must
enforce the same add/remove guard as spec.etcd.managed.storage: update the
per-shard storage property named restoreSnapshotURL to include
x-kubernetes-list-type: set and the same x-kubernetes-validations as the
top-level managed.storage (a max size of 1 rule like rule: self.size() <= 1, an
immutability rule like rule: self == oldSelf, and a URL-scheme validation like
rule: self.size() == 0 || self[0].matches('^(https|s3)://.*')) so users cannot
add/remove or change the URL after initial creation; locate the
restoreSnapshotURL block in the per-shard storage definition and add those
validations to match the behavior of
spec.etcd.managed.storage.restoreSnapshotURL.
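The proposed transition rule compares presence against oldSelf, so it only constrains updates, not the initial value; a tiny simulation makes the accepted and rejected transitions explicit (function name and spec shapes are illustrative):

```python
def restore_url_transition_ok(old_spec: dict, new_spec: dict) -> bool:
    """Mirror of the proposed transition rule:
    has(self.restoreSnapshotURL) == has(oldSelf.restoreSnapshotURL)."""
    return ("restoreSnapshotURL" in new_spec) == ("restoreSnapshotURL" in old_spec)

url = {"restoreSnapshotURL": ["s3://bucket/snap.db"]}
assert restore_url_transition_ok({}, {})        # never set: fine
assert restore_url_transition_ok(url, url)      # kept across updates: fine
assert not restore_url_transition_ok({}, url)   # added day-2: rejected
assert not restore_url_transition_ok(url, {})   # removed day-2: rejected
```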
docs/content/reference/aggregated-docs.md-17732-17733 (1)

17732-17733: ⚠️ Potential issue | 🟡 Minor

Reference links appear to target internal paths.

Lines 17732-17733 reference `.spec/...` files, which are typically not available in published docs. This will create dead references for readers.

Suggested doc fix
-- Implementation Plan: `.spec/001-etcd-shard/implementation-plan.md`
-- Requirements: `.spec/001-etcd-shard/requirements.md`
+- Implementation Plan: (link to published design doc/KEP in repository docs)
+- Requirements: (link to published requirements doc in repository docs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/reference/aggregated-docs.md` around lines 17732 - 17733, The
doc references use internal repo paths
`.spec/001-etcd-shard/implementation-plan.md` and
`.spec/001-etcd-shard/requirements.md` which will be dead in published docs;
update those link targets in the aggregated-docs entry so they point to
public/relative published pages or remove them and replace with a note
describing that the implementation plan and requirements are internal, e.g.,
change the link targets to the public equivalents or to a single
external/permalink for the ETCD shard design, and update the link text
(Implementation Plan / Requirements) accordingly in aggregated-docs.md.
docs/content/reference/api.md-6735-6760 (1)

6735-6760: ⚠️ Potential issue | 🟡 Minor

Add semantic descriptions for EtcdShardPriority enum values.

All four enum values are documented with empty descriptions, which makes priority-based behavior hard to interpret from API docs. Please define each value’s operational meaning (e.g., scheduling/backup expectations) in the source type comments so generated docs are self-explanatory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/reference/api.md` around lines 6735 - 6760, The docs show empty
descriptions for the EtcdShardPriority enum; update the source Go type comments
for the EtcdShardPriority enum (the declaration named EtcdShardPriority used by
ManagedEtcdShardSpec and UnmanagedEtcdShardSpec) to provide clear semantic
meanings for each value: "Critical" = highest operational priority (prefer
dedicated nodes, avoid eviction, highest backup frequency and fastest restore),
"High" = prefer scheduling on stable nodes with frequent backups and priority
for recovery, "Medium" = default operational priority with standard
scheduling/backups and normal eviction tolerance, and "Low" = lowest priority
(best-effort scheduling, infrequent backups, and acceptable for
eviction/deprioritized recovery); regenerate docs so the table entries for
Critical, High, Medium, Low contain these descriptions.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go (1)

91-101: Fragile string replacement may silently fail if base template changes.

The strings.ReplaceAll calls depend on exact string matches in the base template script. If the YAML asset is modified (e.g., formatting changes, variable reordering), these replacements will silently do nothing, leaving non-default shards with incorrect service names.

Consider either:

  1. Validating that the replacement actually occurred (check if string changed)
  2. Rebuilding the script programmatically rather than patching it
  3. Adding a comment documenting the exact strings expected in the base template
🛡️ Option 1: Add validation that replacement occurred
 		if len(c.Args) >= 2 {
 			script := c.Args[1]
+			originalScript := script
 			// Replace ETCDCTL_ENDPOINTS
 			script = strings.ReplaceAll(script, "export ETCDCTL_ENDPOINTS=https://etcd-client:2379",
 				fmt.Sprintf("export ETCDCTL_ENDPOINTS=https://%s:2379", clientService))
 			// Replace member add peer-urls
 			script = strings.ReplaceAll(script, "https://${HOSTNAME}.etcd-discovery.${NAMESPACE}.svc:2380",
 				fmt.Sprintf("https://${HOSTNAME}.%s.${NAMESPACE}.svc:2380", discoveryService))
+			// For non-default shards, verify replacements actually happened
+			if shard.Name != "default" && script == originalScript {
+				// Log warning or consider returning error - service names may be incorrect
+			}
 
 			c.Args = []string{"-c", script}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go`
around lines 91 - 101, The current replacement logic that mutates c.Args by
creating script and calling strings.ReplaceAll may silently fail if the base
template changes; modify the block that manipulates c.Args (the script variable
and the two strings.ReplaceAll calls) to validate that each replacement actually
occurred by comparing the script before and after each ReplaceAll (or using
strings.Contains to confirm expected tokens), and if a replacement did not
happen, log or return an error (or fallback to a programmatic rebuild)
referencing clientService and discoveryService so the failure is visible and
handled rather than leaving stale placeholders in c.Args.
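Option 1 amounts to a "checked replace" helper: replace only when the expected token is actually present, and fail loudly otherwise. A minimal sketch of the idea (the helper name is an assumption; the Go version would return an error instead of raising):

```python
def must_replace(script: str, old: str, new: str) -> str:
    """Checked replacement: fail loudly when the expected token is missing,
    instead of silently leaving the base template untouched."""
    if old not in script:
        raise ValueError(f"expected token not found in base template: {old!r}")
    return script.replace(old, new)

base = "export ETCDCTL_ENDPOINTS=https://etcd-client:2379"
patched = must_replace(base, "https://etcd-client:2379",
                       "https://etcd-client-events:2379")
assert patched == "export ETCDCTL_ENDPOINTS=https://etcd-client-events:2379"

# If the base template drifts, the error surfaces immediately:
try:
    must_replace("export ENDPOINTS=...", "https://etcd-client:2379", "x")
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```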

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5fd96a88-afb0-49cc-b91a-c5dae1cae4a1

📥 Commits

Reviewing files that changed from the base of the PR and between 9808425 and f4e2db7.

⛔ Files ignored due to path filters (4)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_helpers.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/vendor/**, !vendor/**, !**/zz_generated*
📒 Files selected for processing (41)
  • api/hypershift/v1beta1/hostedcluster_helpers.go
  • api/hypershift/v1beta1/hostedcluster_types.go
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ImageStreamImportMode.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AAA_ungated.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDC.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/GCPPlatform.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HCPEtcdBackup.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/ImageStreamImportMode.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/KMSEncryptionProvider.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/OpenStack.yaml
  • cmd/cluster/core/create.go
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-Default.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml
  • control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go
  • control-plane-operator/controllers/hostedcontrolplane/etcd/statefulset.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go
  • docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/cluster/core/create.go
  • control-plane-operator/controllers/hostedcontrolplane/etcd/shards_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/deployment.go

Comment on lines +53 to +63
// Add shard label to pod template
if sts.Spec.Template.Labels == nil {
sts.Spec.Template.Labels = make(map[string]string)
}
sts.Spec.Template.Labels["hypershift.openshift.io/etcd-shard"] = defaultShard.Name

// Update selector to include shard label
if sts.Spec.Selector.MatchLabels == nil {
sts.Spec.Selector.MatchLabels = make(map[string]string)
}
sts.Spec.Selector.MatchLabels["hypershift.openshift.io/etcd-shard"] = defaultShard.Name

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Is apps/v1 StatefulSet.spec.selector immutable after creation in Kubernetes?

💡 Result:

Yes, in Kubernetes apps/v1, StatefulSet.spec.selector is immutable after creation. Attempting to update it results in a validation error. The Kubernetes API server enforces this through validation logic in ValidateStatefulSetUpdate (pkg/apis/apps/validation/validation.go), which checks changes to spec fields other than the allowed ones like 'replicas', 'template', 'updateStrategy', etc. The selector is explicitly treated as immutable, as evidenced by ongoing migration to declarative validation with +k8s:immutable tag (PR #138080, March 2026). Historical context from issue #50808 confirms selectors for controllers (including StatefulSet) were made immutable in apps/v1 to prevent undefined behaviors. Multiple error reports confirm updates to spec.selector trigger "Forbidden: updates to statefulset spec for fields other than..." errors, where selector is not in the allowed list.


🏁 Script executed:

cat -n control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go | head -100

Repository: openshift/hypershift

Length of output: 1891


🏁 Script executed:

wc -l control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go

Repository: openshift/hypershift

Length of output: 1891


🏁 Script executed:

# Search for the function containing this code and understand its purpose
rg -A 20 -B 10 "Add shard label to pod template" control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go

Repository: openshift/hypershift

Length of output: 1891


🏁 Script executed:

# Search for where this function is called
rg -n "def\|func" control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go | head -20

Repository: openshift/hypershift

Length of output: 1891


Remove selector mutation or implement migration-safe path for existing StatefulSets.

Mutating sts.Spec.Selector.MatchLabels will fail on already-created StatefulSets. In Kubernetes apps/v1, spec.selector is immutable after creation—the API server enforces this through validation logic that explicitly excludes selector from the allowed mutation list. Attempting updates will trigger a "Forbidden: updates to statefulset spec for fields other than..." error. Either remove the selector mutation or add conditional logic that preserves the selector for existing StatefulSets while applying it only to new ones.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go`
around lines 53 - 63, The code is mutating sts.Spec.Selector.MatchLabels which
is immutable on existing StatefulSets; instead, only set the selector when
creating a new StatefulSet or when the selector is nil/empty (preserve existing
selector values for updates). Keep adding the shard label to
sts.Spec.Template.Labels (pod template) as before, but change the selector
mutation to a migration-safe check: if sts.Generation/ResourceVersion indicates
a new object or if sts.Spec.Selector == nil ||
len(sts.Spec.Selector.MatchLabels) == 0 then set
MatchLabels["hypershift.openshift.io/etcd-shard"]=defaultShard.Name, otherwise
leave sts.Spec.Selector.MatchLabels untouched so updates won’t be rejected by
the API server; reference the sts variable and the fields
sts.Spec.Selector.MatchLabels and sts.Spec.Template.Labels when making this
change.

Add nil check for hcp.Spec.Etcd.Managed in KAS deployment switch
to prevent panic when ManagementType is Managed but Managed spec
is not set. Update TestControlPlaneComponents fixtures for the
if/else to switch refactor.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add nil check for hcp.Spec.Etcd.Managed in KAS deployment switch to
prevent panic when ManagementType is Managed but Managed spec is not
set.
Remove ReconcileControlPlaneComponent call from etcd shard
reconciliation since the CRD is managed by the v2 framework and is
not available in all environments.
Update TestControlPlaneComponents fixtures for the switch refactor.
Signed-off-by: Jesse Jaggars <jjaggars@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 35.07463% with 783 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.71%. Comparing base (1180bca) to head (d24c75f).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
cmd/cluster/core/create.go 2.70% 208 Missing and 8 partials ⚠️
...controllers/hostedcontrolplane/etcd/statefulset.go 64.03% 77 Missing and 5 partials ⚠️
...ator/controllers/hostedcontrolplane/etcd/status.go 30.70% 78 Missing and 1 partial ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 0.00% 52 Missing ⚠️
.../controllers/hostedcontrolplane/v2/etcd/service.go 0.00% 50 Missing ⚠️
...r/controllers/hostedcontrolplane/manifests/etcd.go 0.00% 40 Missing ⚠️
...ator/controllers/hostedcontrolplane/etcd/params.go 0.00% 34 Missing ⚠️
...llers/hostedcontrolplane/v2/etcd/servicemonitor.go 0.00% 33 Missing ⚠️
...perator/controllers/hostedcontrolplane/pki/etcd.go 0.00% 30 Missing ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go 48.14% 19 Missing and 9 partials ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7375      +/-   ##
==========================================
+ Coverage   35.62%   35.71%   +0.09%     
==========================================
  Files         767      776       +9     
  Lines       93330    94542    +1212     
==========================================
+ Hits        33245    33768     +523     
- Misses      57396    58034     +638     
- Partials     2689     2740      +51     
Files with missing lines Coverage Δ
...ostedcontrolplane/v2/oauth_apiserver/deployment.go 100.00% <100.00%> (ø)
.../controllers/hostedcontrolplane/etcd/monitoring.go 90.32% <90.32%> (ø)
...or/controllers/hostedcontrolplane/v2/kas/params.go 90.34% <94.33%> (+1.54%) ⬆️
...perator/controllers/hostedcontrolplane/etcd/pdb.go 64.70% <64.70%> (ø)
...or/controllers/hostedcontrolplane/etcd/services.go 66.66% <66.66%> (ø)
...or/controllers/hostedcontrolplane/v2/kas/config.go 87.98% <46.66%> (-1.69%) ⬇️
...ontrollers/hostedcontrolplane/v2/etcd/component.go 19.23% <0.00%> (+5.34%) ⬆️
...tor/controllers/hostedcontrolplane/etcd/cleanup.go 78.18% <78.18%> (ø)
...r/controllers/hostedcontrolplane/v2/oapi/config.go 0.00% <0.00%> (ø)
...ator/controllers/hostedcontrolplane/etcd/shards.go 52.17% <52.17%> (ø)
... and 12 more

... and 5 files with indirect coverage changes


@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: codecov/patch (GitHub Actions / Codecov integration)
  • Build ID: Check Run 72152549656
  • Repository: openshift/hypershift
  • PR: #7375 feat(etcd): implement multi-shard etcd support
  • Check Result: failure — patch coverage 35.07% is below the 35.62% target threshold

Test Failure Analysis

Error

codecov/patch — 35.07% of diff hit (target 35.62%)

783 lines in the changeset lack test coverage. Patch coverage falls 0.55 percentage
points below the required threshold.

Summary

The codecov/patch check failed because the PR adds ~1,200 new lines of code for multi-shard etcd support, but only 35.07% of those changed lines are covered by tests — missing the 35.62% target by 0.55 percentage points. The gap is driven primarily by 6 entirely untested new files (0% coverage) and the CLI layer in cmd/cluster/core/create.go (2.70% coverage, 216 uncovered lines alone). The PR does include some tests (shards_test.go, statefulset_test.go, kas/params_test.go) that cover parts of the new code well, but the CLI parsing logic and several new infrastructure files were left without tests.

Root Cause

The codecov/patch gate enforces a minimum test coverage percentage on lines changed in the PR. This PR introduces a large feature (multi-shard etcd) spanning 9 new files and many modified files, but several key areas lack any test coverage:

  1. cmd/cluster/core/create.go — 298 new lines with only 2.70% coverage (216 uncovered lines). This is the single largest contributor to the coverage gap (~28% of all missing lines). It adds CLI flags, parsing logic, and validation for etcd sharding configuration but has no unit tests for these code paths.

  2. Six files at 0% coverage contribute ~239 additional uncovered lines:

    • control-plane-operator/.../v2/etcd/statefulset.go (52 lines)
    • control-plane-operator/.../v2/etcd/service.go (50 lines)
    • control-plane-operator/.../manifests/etcd.go (40 lines)
    • control-plane-operator/.../etcd/params.go (34 lines)
    • control-plane-operator/.../v2/etcd/servicemonitor.go (33 lines)
    • control-plane-operator/.../pki/etcd.go (30 lines)
  3. Partial coverage files also contribute:

    • etcd/status.go — 30.70% coverage (78 uncovered + 1 partial)
    • etcd/statefulset.go — 64.03% coverage (77 uncovered + 5 partial)
    • hostedcontrolplane_controller.go — 48.14% coverage (19 uncovered + 9 partial)

The project-level coverage (codecov/project) actually passed — project coverage rose +0.09% to 35.71%. The failure is strictly on the patch-level gate, meaning the new code is less well-tested than the existing codebase average.

Recommendations
  1. Highest impact fix: Add unit tests for cmd/cluster/core/create.go. This file alone accounts for 216 of the 783 uncovered lines. Even testing just the etcd sharding flag parsing and validation logic would likely push coverage above the threshold, since only a handful of additional covered lines are needed to close the gap (0.55% × ~1,200 changed lines ≈ 7 lines minimum, though more headroom is better).

  2. Add tests for 0%-coverage new files: The six new files in v2/etcd/, manifests/, pki/ packages have zero tests. These are infrastructure/construction functions (building StatefulSets, Services, ServiceMonitors, PKI certs) that are typically straightforward to unit test with table-driven tests.

  3. Expand existing test files: shards_test.go and statefulset_test.go already exist — add test cases for uncovered branches in etcd/status.go and etcd/shards.go to improve their partial coverage.

  4. Quick win: The threshold miss is only 0.55 percentage points. Covering just the most easily testable functions across 2-3 files would be sufficient to pass the gate.
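Recommendation 2 above calls out table-driven tests for the manifest-construction functions. A minimal sketch of that style, using a hypothetical buildService stand-in (the real functions in v2/etcd/service.go return corev1 objects):

```go
package main

import "fmt"

// buildService is a hypothetical stand-in for per-shard manifest
// construction; it derives a Service name from the shard name.
func buildService(shard string) map[string]string {
	return map[string]string{
		"name":  "etcd-" + shard,
		"shard": shard,
	}
}

func main() {
	// Table-driven style: one row per pre-defined shard, assert on the
	// derived name. Real tests would compare full objects with cmp.Diff.
	cases := []struct{ shard, wantName string }{
		{"main", "etcd-main"},
		{"events", "etcd-events"},
		{"leases", "etcd-leases"},
	}
	for _, tc := range cases {
		got := buildService(tc.shard)["name"]
		fmt.Printf("%s: got=%s ok=%v\n", tc.shard, got, got == tc.wantName)
	}
}
```

Table-driven tests like this make it cheap to cover each named shard (main, events, leases) plus edge cases in one function, which is why they suit the 0%-coverage construction files.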

Evidence
Evidence Detail
Check run conclusion failure — 35.07% patch coverage vs 35.62% target
Coverage gap 0.55 percentage points below threshold (783 uncovered lines)
Project coverage 35.71% (+0.09%) — passed
Largest uncovered file cmd/cluster/core/create.go — 2.70% patch coverage, 216 uncovered lines
Files at 0% coverage 6 new files in v2/etcd/, manifests/, pki/ packages (~239 lines)
Lines changed +1,212 lines added; +523 hits, +638 misses, +51 partials
Tests included in PR shards_test.go (338 lines), statefulset_test.go (270 lines), kas/params_test.go, oauth_apiserver/deployment_test.go
Well-covered files oauth_apiserver/deployment.go (100%), kas/params.go (94.33%)

@JoelSpeed
Contributor

/remove-lifecycle stale

@openshift-ci openshift-ci Bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2026

Labels

  • area/api: Indicates the PR includes changes for the API
  • area/cli: Indicates the PR includes changes for CLI
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/documentation: Indicates the PR includes changes for documentation
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.

5 participants