Skip to content

OCPBUGS-85243: Set aws-load-balancer-scheme on public HCP router service#8458

Open
typeid wants to merge 1 commit intoopenshift:mainfrom
typeid:OCPBUGS-85243-router-lb-scheme
Open

OCPBUGS-85243: Set aws-load-balancer-scheme on public HCP router service#8458
typeid wants to merge 1 commit intoopenshift:mainfrom
typeid:OCPBUGS-85243-router-lb-scheme

Conversation

@typeid
Copy link
Copy Markdown
Member

@typeid typeid commented May 7, 2026

Summary

  • Set service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on the public HCP router service so it creates an internet-facing NLB on both OpenShift (in-tree cloud provider) and EKS (AWS Load Balancer Controller) management clusters
  • Clean up mutually exclusive annotations (aws-load-balancer-internal vs aws-load-balancer-scheme) when switching between internal and external

Problem

The AWS Load Balancer Controller (used on EKS) defaults to scheme=internal when aws-load-balancer-scheme is not set (docs). The in-tree AWS cloud provider (used on OpenShift) defaults to internet-facing (docs).

For PublicAndPrivate clusters using type: Route with explicit hostnames, HyperShift deploys a dedicated HCP router with a public LoadBalancer service. Without the scheme annotation, this NLB is created as internal on EKS, making KAS and OAuth unreachable from outside the VPC.

Test plan

  • Existing unit tests updated to verify internet-facing annotation on public services
  • New tests for external scheme and internal-to-external switching
  • Verified internal services do NOT get the scheme annotation
  • Deploy PublicAndPrivate HCP on EKS and verify public router NLB is internet-facing

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed AWS load balancer annotation handling so external routers are marked internet-facing, internal routers use the internal-LB annotation, and conflicting scheme annotations are removed during transitions.
  • Tests

    • Added tests covering external vs internal AWS annotation behavior and mode-switching scenarios; updated fixtures to reflect the expected scheme for public/private routers.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-85243, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Set service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on the public HCP router service so it creates an internet-facing NLB on both OpenShift (in-tree cloud provider) and EKS (AWS Load Balancer Controller) management clusters
  • Clean up mutually exclusive annotations (aws-load-balancer-internal vs aws-load-balancer-scheme) when switching between internal and external

Problem

The AWS Load Balancer Controller (used on EKS) defaults to scheme=internal when aws-load-balancer-scheme is not set (docs). The in-tree AWS cloud provider (used on OpenShift) defaults to internet-facing (docs).

For PublicAndPrivate clusters using type: Route with explicit hostnames, HyperShift deploys a dedicated HCP router with a public LoadBalancer service. Without the scheme annotation, this NLB is created as internal on EKS, making KAS and OAuth unreachable from outside the VPC.

Test plan

  • Existing unit tests updated to verify internet-facing annotation on public services
  • New tests for external scheme and internal-to-external switching
  • Verified internal services do NOT get the scheme annotation
  • E2E: deploy PublicAndPrivate HCP on EKS and verify public router NLB is internet-facing

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74d796b7-d0b2-4bfd-addc-6e8cc3360b57

📥 Commits

Reviewing files that changed from the base of the PR and between e1bf047 and 01f2912.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go
✅ Files skipped from review due to trivial changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go

Walkthrough

Router service reconciliation now manages AWS load balancer scheme: for external routers it sets service.beta.kubernetes.io/aws-load-balancer-scheme=internet-facing and removes the internal annotation; for internal routers it ensures service.beta.kubernetes.io/aws-load-balancer-internal=true and removes any scheme annotation. Tests and fixtures updated accordingly.

Changes

AWS Router Scheme Handling

Layer / File(s) Summary
Fixture / Expected Data
control-plane-operator/controllers/hostedcontrolplane/infra/testdata/...zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml, ..._Public_Route.yaml
Added service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing to public router fixtures and adjusted private-router fixture expectations.
Core Reconciliation Logic
control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
When platform is AWS: if internal == false delete internal annotation and set aws-load-balancer-scheme=internet-facing; if internal == true delete scheme annotation and rely on internal-LB annotation.
Infra Test Expectations
control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
Updated expected public router service to include scheme annotation; expected private router derived from public fixture but removes scheme annotation while retaining internal-LB annotation.
Unit Tests
control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go
Added tests: external-AWS case asserts aws-load-balancer-scheme=internet-facing and absence of internal annotation; parameterized test verifies switching between internal/external removes conflicting annotations.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler
    participant KubeAPI as Kubernetes API (Service)
    participant AWS as AWS Load Balancer

    Reconciler->>KubeAPI: Get router Service
    alt platform == AWS and internal == false
        Reconciler->>KubeAPI: Remove aws-load-balancer-internal annotation (if present)
        Reconciler->>KubeAPI: Set aws-load-balancer-scheme = "internet-facing"
    else platform == AWS and internal == true
        Reconciler->>KubeAPI: Remove aws-load-balancer-scheme annotation (if present)
        Reconciler->>KubeAPI: Ensure aws-load-balancer-internal = "true"
    end
    Reconciler->>KubeAPI: Update Service
    KubeAPI->>AWS: AWS provisions/updates LB based on annotations
    AWS-->>KubeAPI: LB status
    KubeAPI-->>Reconciler: Service updated/status observed
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning New tests lack required assertion messages. Lines 72 and 117 use g.Expect(err).ToNot(HaveOccurred()) without message strings. Add message strings to error assertions: g.Expect(err).ToNot(HaveOccurred(), "message") for both test functions.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: adding the aws-load-balancer-scheme annotation to the public HCP router service, which is the primary objective addressed across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The repository uses standard Go testing package, not Ginkgo. New tests have stable, descriptive names without dynamic content. Check is not applicable.
Microshift Test Compatibility ✅ Passed The PR adds standard Go unit tests (not Ginkgo e2e tests). The custom check applies only to Ginkgo e2e tests. This PR does not add any Ginkgo e2e tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. The new tests are unit tests using Go testing.T, not integration/e2e tests. Unit tests for load balancer annotations do not have SNO topology concerns.
Topology-Aware Scheduling Compatibility ✅ Passed Changes only modify AWS LoadBalancer Service annotations, not pod affinity, replica counts, or node selection. No topology-aware scheduling constraints introduced.
Ote Binary Stdout Contract ✅ Passed PR modifies control-plane-operator unit tests and implementation, not an OTE e2e test binary. OTE Binary Stdout Contract check is not applicable. No stdout violations found.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. All new tests are Go unit tests using the standard testing.T interface. The custom check is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 7, 2026
@typeid typeid force-pushed the OCPBUGS-85243-router-lb-scheme branch from e53b716 to c29c038 Compare May 7, 2026 12:44
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.54%. Comparing base (2fc8a13) to head (01f2912).
⚠️ Report is 19 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8458      +/-   ##
==========================================
+ Coverage   37.50%   37.54%   +0.03%     
==========================================
  Files         751      751              
  Lines       91992    92034      +42     
==========================================
+ Hits        34505    34552      +47     
+ Misses      54844    54841       -3     
+ Partials     2643     2641       -2     
Files with missing lines Coverage Δ
...r/controllers/hostedcontrolplane/ingress/router.go 57.69% <100.00%> (+4.00%) ⬆️

... and 4 files with indirect coverage changes

Flag Coverage Δ
cmd-support 32.76% <ø> (+0.07%) ⬆️
cpo-hostedcontrolplane 36.80% <100.00%> (+0.03%) ⬆️
cpo-other 37.76% <ø> (+0.03%) ⬆️
hypershift-operator 47.93% <ø> (ø)
other 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@typeid typeid force-pushed the OCPBUGS-85243-router-lb-scheme branch from c29c038 to 8050e6e Compare May 7, 2026 12:50
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go (1)

629-633: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify that CompareWithFixture golden files have been regenerated with the new AWS load balancer annotation.

The fixture files in control-plane-operator/controllers/hostedcontrolplane/infra/testdata/ for AWS_Public_Route, AWS_PublicAndPrivate_Route, AWS_Public_KAS_LoadBalancer, and AWS_PublicAndPrivate_KAS_LoadBalancer test cases do not currently contain the service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing annotation on their router services. However, the test code defines the public router service with this annotation. When testutil.CompareWithFixture runs, it will compare generated resources (which now include the annotation) against these fixture files (which do not), causing test failures. The fixtures must be regenerated to include the new annotation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go`
around lines 629 - 633, The test failures are due to golden fixtures missing the
new AWS annotation on router Services; regenerate the fixtures used by
testutil.CompareWithFixture so the files under the testdata for the cases
AWS_Public_Route, AWS_PublicAndPrivate_Route, AWS_Public_KAS_LoadBalancer and
AWS_PublicAndPrivate_KAS_LoadBalancer include the
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing annotation
on the router Service objects; run the same test/golden-generation workflow you
used previously (the code path that builds resources via collectResources and
validates via testutil.CompareWithFixture) to update the fixture files so the
generated resources match the expected fixtures.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go (1)

62-103: ⚡ Quick win

Consider adding a reverse-direction test for the external→internal annotation cleanup.

TestReconcileRouterService_SwitchFromInternalToExternal covers the internal→external path. The symmetric case — where a service object already carries aws-load-balancer-scheme: internet-facing and is then reconciled as internal=true — is never explicitly asserted. That covers the delete at router.go line 32 in a meaningful way (currently TestReconcileRouterServiceAnnotations starts with an empty service so the delete is a no-op there).

🧪 Suggested additional test function
+// When switching a service from external to internal
+// it should remove the aws-load-balancer-scheme annotation.
+func TestReconcileRouterService_SwitchFromExternalToInternal(t *testing.T) {
+	hcp := &hyperv1.HostedControlPlane{}
+	hcp.Spec.Platform.Type = hyperv1.AWSPlatform
+
+	svc := &corev1.Service{}
+	svc.Annotations = map[string]string{
+		"service.beta.kubernetes.io/aws-load-balancer-scheme": "internet-facing",
+	}
+
+	if err := ReconcileRouterService(svc, true /* internal */, false /* cross-zone */, hcp); err != nil {
+		t.Fatalf("ReconcileRouterService returned error: %v", err)
+	}
+
+	if _, exists := svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"]; exists {
+		t.Fatalf("expected aws-load-balancer-scheme to be removed after switching to internal")
+	}
+	if got := svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]; got != "true" {
+		t.Fatalf("expected aws-load-balancer-internal to be 'true', got %q", got)
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go`
around lines 62 - 103, Add a symmetric test that starts with a Service already
annotated
"service.beta.kubernetes.io/aws-load-balancer-scheme":"internet-facing", call
ReconcileRouterService(svc, true /* internal */, true /* cross-zone */, hcp) and
assert that the scheme annotation is removed and the
"service.beta.kubernetes.io/aws-load-balancer-internal" annotation is now
present (e.g. set to "true"); name it
TestReconcileRouterService_SwitchFromExternalToInternal to mirror
TestReconcileRouterService_SwitchFromInternalToExternal and ensure this
exercises the code path in ReconcileRouterService that deletes the scheme
annotation and sets the internal annotation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go`:
- Around line 629-633: The test failures are due to golden fixtures missing the
new AWS annotation on router Services; regenerate the fixtures used by
testutil.CompareWithFixture so the files under the testdata for the cases
AWS_Public_Route, AWS_PublicAndPrivate_Route, AWS_Public_KAS_LoadBalancer and
AWS_PublicAndPrivate_KAS_LoadBalancer include the
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing annotation
on the router Service objects; run the same test/golden-generation workflow you
used previously (the code path that builds resources via collectResources and
validates via testutil.CompareWithFixture) to update the fixture files so the
generated resources match the expected fixtures.

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go`:
- Around line 62-103: Add a symmetric test that starts with a Service already
annotated
"service.beta.kubernetes.io/aws-load-balancer-scheme":"internet-facing", call
ReconcileRouterService(svc, true /* internal */, true /* cross-zone */, hcp) and
assert that the scheme annotation is removed and the
"service.beta.kubernetes.io/aws-load-balancer-internal" annotation is now
present (e.g. set to "true"); name it
TestReconcileRouterService_SwitchFromExternalToInternal to mirror
TestReconcileRouterService_SwitchFromInternalToExternal and ensure this
exercises the code path in ReconcileRouterService that deletes the scheme
annotation and sets the internal annotation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e3090993-8f32-4a26-a773-3c7e60c6300a

📥 Commits

Reviewing files that changed from the base of the PR and between 2fc8a13 and e53b716.

📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/ingress/router.go`:
- Around line 30-37: The svc annotation handling in router.go (inside the
reconcile that sets svc.Annotations based on the internal boolean) only adds the
desired AWS annotation but does not remove the opposite key, which leaves stale
mutually-exclusive keys (service.beta.kubernetes.io/aws-load-balancer-internal
vs service.beta.kubernetes.io/aws-load-balancer-scheme) after an
internal↔external flip; update the logic in that reconcile block (where svc and
internal are used) so that when internal is true you set/reset
"service.beta.kubernetes.io/aws-load-balancer-internal"="true" and explicitly
delete "service.beta.kubernetes.io/aws-load-balancer-scheme", and when internal
is false you set
"service.beta.kubernetes.io/aws-load-balancer-scheme"="internet-facing" and
explicitly delete "service.beta.kubernetes.io/aws-load-balancer-internal" to
ensure no stale mutually-exclusive annotations remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fbda403e-5c8f-4819-ae99-29888467b063

📥 Commits

Reviewing files that changed from the base of the PR and between e53b716 and 8050e6e.

⛔ Files ignored due to path filters (2)
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go

@typeid typeid force-pushed the OCPBUGS-85243-router-lb-scheme branch from 8050e6e to b891f34 Compare May 7, 2026 12:57
@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-85243, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
control-plane-operator/controllers/hostedcontrolplane/ingress/router.go (1)

30-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove stale mutually-exclusive AWS annotations during reconciliation.

Line 31 and Line 38 set opposite-mode annotations, but neither branch removes the other key. This can leave conflicting annotations on pre-existing Services and cause incorrect NLB exposure after mode changes.

Proposed fix
 		svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-type"] = "nlb"
 		if internal {
+			delete(svc.Annotations, "service.beta.kubernetes.io/aws-load-balancer-scheme")
 			svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"] = "true"
 		} else {
+			delete(svc.Annotations, "service.beta.kubernetes.io/aws-load-balancer-internal")
 			// The AWS Load Balancer Controller (used on EKS) defaults to scheme=internal:
 			//   https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/guide/service/annotations/
 			// The in-tree AWS cloud provider (used on OpenShift) defaults to internet-facing:
 			//   https://cloud-provider-aws.sigs.k8s.io/service_controller/
 			// Set the annotation explicitly so the public router works on both.
 			svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"] = "internet-facing"
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/ingress/router.go`
around lines 30 - 38, The reconciliation currently sets
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"] =
"true" in the internal branch and
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"] =
"internet-facing" in the public branch but never removes the opposite key,
leaving conflicting annotations on pre-existing Services; update the code in
router.go (the block that mutates svc.Annotations) so that when setting the
internal annotation you also delete
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"], and when
setting the scheme you delete
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"],
ensuring the Service has only the correct AWS annotation after reconciliation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@control-plane-operator/controllers/hostedcontrolplane/ingress/router.go`:
- Around line 30-38: The reconciliation currently sets
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"] =
"true" in the internal branch and
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"] =
"internet-facing" in the public branch but never removes the opposite key,
leaving conflicting annotations on pre-existing Services; update the code in
router.go (the block that mutates svc.Annotations) so that when setting the
internal annotation you also delete
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"], and when
setting the scheme you delete
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"],
ensuring the Service has only the correct AWS annotation after reconciliation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 73e026e2-d497-42d1-8885-c0b1c9286c69

📥 Commits

Reviewing files that changed from the base of the PR and between 8050e6e and b891f34.

⛔ Files ignored due to path filters (2)
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router.go
  • control-plane-operator/controllers/hostedcontrolplane/ingress/router_test.go

@typeid typeid marked this pull request as ready for review May 7, 2026 13:02
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci openshift-ci Bot requested review from csrwng and jparrill May 7, 2026 13:03
@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-85243, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/test conformance-e2e

@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-85243, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Set service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on the public HCP router service so it creates an internet-facing NLB on both OpenShift (in-tree cloud provider) and EKS (AWS Load Balancer Controller) management clusters
  • Clean up mutually exclusive annotations (aws-load-balancer-internal vs aws-load-balancer-scheme) when switching between internal and external

Problem

The AWS Load Balancer Controller (used on EKS) defaults to scheme=internal when aws-load-balancer-scheme is not set (docs). The in-tree AWS cloud provider (used on OpenShift) defaults to internet-facing (docs).

For PublicAndPrivate clusters using type: Route with explicit hostnames, HyperShift deploys a dedicated HCP router with a public LoadBalancer service. Without the scheme annotation, this NLB is created as internal on EKS, making KAS and OAuth unreachable from outside the VPC.

Test plan

  • Existing unit tests updated to verify internet-facing annotation on public services
  • New tests for external scheme and internal-to-external switching
  • Verified internal services do NOT get the scheme annotation
  • Deploy PublicAndPrivate HCP on EKS and verify public router NLB is internet-facing

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

  • Fixed AWS load balancer annotation handling so external routers receive an internet-facing scheme while internal routers use the internal annotation only.

  • Tests

  • Added tests confirming correct annotation behavior for external vs. internal AWS router services.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/test e2e-conformance

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/test e2e-aks e2e-aws e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/retest

1 similar comment
@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 7, 2026

/retest

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2052416128847712256 | Cost: $2.29890125 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2052416128709300224 | Cost: $3.168518449999999 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2052416128759631872 | Cost: $2.6617562499999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented May 7, 2026

Test Results

e2e-aws

e2e-aks

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 8, 2026

/retest

1 similar comment
@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 8, 2026

/retest

Copy link
Copy Markdown
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review with two suggestions -- both non-blocking. The fix is correct and safe for merge as-is.

// The in-tree AWS cloud provider (used on OpenShift) defaults to internet-facing:
// https://cloud-provider-aws.sigs.k8s.io/service_controller/
// Set the annotation explicitly so the public router works on both.
svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"] = "internet-facing"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): The PR description mentions "clean up mutually exclusive annotations" but the production code only adds -- it never deletes the conflicting annotation. While the current architecture uses separate Service objects for public/private (so the transition scenario can't happen today), defensive cleanup costs nothing and follows the pattern already used in oauth/service.go:77:

if internal {
	svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-internal"] = "true"
	delete(svc.Annotations, "service.beta.kubernetes.io/aws-load-balancer-scheme")
} else {
	svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-scheme"] = "internet-facing"
	delete(svc.Annotations, "service.beta.kubernetes.io/aws-load-balancer-internal")
}

This would also make the code match what the infra_test.go fixture helper already expects (the privateService helper does delete(s.Annotations, "...scheme")).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks! Updated.


// When reconciling an external (non-internal) AWS router service
// it should set aws-load-balancer-scheme to internet-facing.
func TestReconcileRouterService_WhenExternalAWS_ItShouldSetInternetFacingScheme(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Good test -- gomega + Gherkin naming, positive and negative assertions. Consider adding a complementary subtest that pre-populates the conflicting annotation to verify cleanup:

t.Run("When switching from internal to external it should remove the internal annotation", func(t *testing.T) {
	g := NewGomegaWithT(t)
	svc := &corev1.Service{}
	svc.Annotations = map[string]string{
		"service.beta.kubernetes.io/aws-load-balancer-internal": "true",
	}
	err := ReconcileRouterService(svc, false, true, hcp)
	g.Expect(err).ToNot(HaveOccurred())
	g.Expect(svc.Annotations).To(HaveKeyWithValue(
		"service.beta.kubernetes.io/aws-load-balancer-scheme", "internet-facing"))
	g.Expect(svc.Annotations).ToNot(HaveKey(
		"service.beta.kubernetes.io/aws-load-balancer-internal"))
})

This would pair with the defensive delete() calls suggested in router.go.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case for the switching back and forth.

@typeid typeid force-pushed the OCPBUGS-85243-router-lb-scheme branch from b891f34 to e1bf047 Compare May 8, 2026 10:20
The AWS Load Balancer Controller (used on EKS) defaults to
scheme=internal when the aws-load-balancer-scheme annotation is not
set, while the in-tree AWS cloud provider (used on OpenShift) defaults
to internet-facing. This causes the public HCP router NLB to be
created as internal on EKS, making KAS and OAuth unreachable from
outside the VPC for PublicAndPrivate clusters.

Set the annotation explicitly on the public router service so it works
on both OpenShift and EKS management clusters. Also clean up mutually
exclusive annotations (aws-load-balancer-internal vs
aws-load-balancer-scheme) when switching between internal and external.

Signed-off-by: Claudio Busse <cbusse@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@typeid typeid force-pushed the OCPBUGS-85243-router-lb-scheme branch from e1bf047 to 01f2912 Compare May 8, 2026 10:39
@hypershift-jira-solve-ci
Copy link
Copy Markdown

I have all the evidence needed. Here is the analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-security
  • Build ID: 2052699968610766848
  • Target: security
  • Job Type: presubmit
  • Cluster: build01
  • State: error
  • Duration: 30m 0s (entire duration spent in Pending)
  • PR: openshift/hypershift#8458 — OCPBUGS-85243: Set aws-load-balancer-scheme on public HCP router service

Test Failure Analysis

Error

Pod scheduling timeout. 0/73 nodes are available: 1 Insufficient memory, 1 node(s) had untolerated
taint {node-role.kubernetes.io/ci-builds-tmpfs-worker: ci-builds-tmpfs-worker}, 13 node(s) had
untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, 15 node(s) didn't
match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had
untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 3 node(s) had
untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint
{node-role.kubernetes.io/master: }, 3 node(s) were unschedulable, 30 node(s) had untolerated taint
{node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}. preemption: 0/73 nodes are available:
1 Insufficient memory, 2 No preemption victims found for incoming pod, 70 Preemption is not helpful
for scheduling.

Summary

This is a CI infrastructure failure, not a code or test problem. The Prow CI pod for the security job was never scheduled on the build01 cluster — it remained in Pending state for the full 30-minute scheduling timeout and was then terminated by Prow. The pod required nodes with ci-workload: prowjobs node selector and toleration for node-role.kubernetes.io/ci-prowjobs-worker, but none of the 73 nodes in the cluster could satisfy these constraints due to a combination of taints, insufficient memory, node affinity mismatches, and unschedulable nodes. 52 consecutive FailedScheduling events were emitted over 30 minutes. The PR's code changes are unrelated to this failure.

Root Cause

The CI pod could not be scheduled because the build01 cluster had no available nodes in the ci-prowjobs-worker pool at the time of execution. The pod's scheduling constraints required:

  1. Node selector: ci-workload: prowjobs
  2. Toleration: node-role.kubernetes.io/ci-prowjobs-worker

Of the 73 nodes in the cluster, all were excluded:

  • 30 nodes: tainted ci-tests-worker (not tolerated by prowjob pods)
  • 15 nodes: didn't match pod's node affinity/selector
  • 13 nodes: tainted ci-builds-worker (not tolerated)
  • 3 nodes: tainted master role
  • 3 nodes: tainted infra role
  • 3 nodes: marked unschedulable (cordoned/draining)
  • 2 nodes: tainted ci-longtests-worker
  • 2 nodes: anti-affinity conflict
  • 1 node: tainted ci-builds-tmpfs-worker
  • 1 node: insufficient memory

This means the small pool of nodes matching ci-prowjobs-worker was either fully consumed by other pods, had insufficient memory, or was cordoned. This is transient CI infrastructure pressure — not caused by the PR under test.

Recommendations
  1. Retest the PR — run /retest or /test security on the PR. This is a transient infrastructure issue and will likely pass on retry.
  2. No code changes needed — the PR's changes (setting aws-load-balancer-scheme on public HCP router service) are completely unrelated to this scheduling failure.
  3. If retests continue to fail with scheduling timeouts, escalate to the CI infrastructure team (Test Platform / DPTP) — it may indicate sustained capacity issues on the build01 cluster's ci-prowjobs-worker node pool.
Evidence
Evidence Detail
Job state error (not failure — the job errored before any test code ran)
Prow description "Pod scheduling timeout."
Pod phase Pending (never reached Running)
Duration in Pending 30 minutes (full scheduling timeout)
FailedScheduling events 52 events over 30 minutes
Build log Does not exist — pod never started
Test artifacts None — no artifacts/ directory exists
Node selector ci-workload: prowjobs
Pod toleration node-role.kubernetes.io/ci-prowjobs-worker
Cluster build01 (73 nodes, 0 schedulable for this pod)
Scheduling message "0/73 nodes are available" — all excluded by taints, affinity, memory, or unschedulable state
PR relation None — PR changes are to router service LB scheme annotations, unrelated to CI scheduling

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 8, 2026

/test ci/prow/security ci/prow/e2e-aks ci/prow/e2e-aws ci/prow/e2e-aws-upgrade-hypershift-operator ci/prow/e2e-azure-self-managed ci/prow/e2e-kubevirt-aws-ovn-reduced ci/prow/e2e-v2-aws

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 8, 2026

/test e2e-conformance

@typeid
Copy link
Copy Markdown
Member Author

typeid commented May 8, 2026

/test security e2e-aks e2e-aws e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws

@jparrill
Copy link
Copy Markdown
Contributor

jparrill commented May 8, 2026

/approve

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, typeid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

@typeid: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants