[occm] Tag load balancers with cluster identity to prevent name collisions #3103

Open
enginrect wants to merge 4 commits into kubernetes:master from enginrect:occm-cluster-id-tag
Conversation

@enginrect

What this PR does / why we need it:

OCCM identifies an existing Octavia load balancer for a Service by name on
the first reconcile (via getLoadbalancerByName). The name format
kube_service_<cluster-name>_<namespace>_<service> defaults to a
<cluster-name> of kubernetes, so two Kubernetes clusters in the same
OpenStack project that happen to use the default cluster-name and have
Services with identical namespace/name produce identical load balancer
names. Octavia does not enforce uniqueness of names, so OCCM in cluster B
ends up adopting and overwriting cluster A's load balancer. This has been
reported repeatedly (see #2241, #2571, #2624) and the standing guidance
"set a unique --cluster-name" is correct but does not actually defend
against the failure mode.
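
The collision can be reproduced on paper with a few lines of Go. This is a minimal sketch, assuming only the name format quoted above; `lbName` is a hypothetical helper, not the OCCM function, and the real code may differ in details such as length trimming.

```go
package main

import "fmt"

// lbName is a hypothetical helper that mirrors the name format quoted in
// the description: kube_service_<cluster-name>_<namespace>_<service>.
func lbName(clusterName, namespace, service string) string {
	return fmt.Sprintf("kube_service_%s_%s_%s", clusterName, namespace, service)
}

func main() {
	// Two independent clusters, both left on the default cluster name,
	// each with a Service "web" in namespace "default".
	clusterA := lbName("kubernetes", "default", "web")
	clusterB := lbName("kubernetes", "default", "web")

	fmt.Println(clusterA)             // kube_service_kubernetes_default_web
	fmt.Println(clusterA == clusterB) // true: a name-only lookup cannot tell them apart
}
```

Nothing in the name encodes which cluster created the load balancer, which is why a lookup by name alone can return another cluster's resource.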

This PR adds a stable Kubernetes cluster identifier - the UID of the
kube-system namespace - as a load balancer tag of the form
kube_cluster_id_<uid>. Lookup behaviour:

  • LBs that carry the matching kube_cluster_id_<our-uid> tag are kept.
  • LBs that carry no kube_cluster_id_* tag fall back to the legacy
    behaviour (preserves existing deployments and externally-created LBs).
  • LBs that carry only foreign kube_cluster_id_* tags are treated as
    NotFound, with a warning. OCCM will then create its own load balancer
    rather than overwriting one that belongs to another cluster.
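
The three rules above can be sketched as a small filter. This is an illustration under stated assumptions, not the code in this PR: the `lb` type, `keepForCluster`, and `filterByClusterID` are hypothetical names, and treating an empty UID as "safeguard disabled" follows the fallback described in the next paragraph.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterIDTagPrefix is the tag prefix described in the PR text.
const clusterIDTagPrefix = "kube_cluster_id_"

// lb is a hypothetical stand-in for an Octavia load balancer; only the
// fields needed for the illustration are present.
type lb struct {
	ID   string
	Tags []string
}

// keepForCluster applies the three rules to one load balancer:
// a matching tag keeps it, no cluster-id tag keeps it (legacy),
// and only foreign cluster-id tags drop it.
func keepForCluster(l lb, uid string) bool {
	hasClusterTag := false
	for _, t := range l.Tags {
		if !strings.HasPrefix(t, clusterIDTagPrefix) {
			continue // unrelated tag
		}
		if t == clusterIDTagPrefix+uid {
			return true // tagged for this cluster
		}
		hasClusterTag = true // tagged for some other cluster
	}
	return !hasClusterTag
}

// filterByClusterID returns the candidates this cluster may use.
// An empty uid disables the safeguard and returns the input unchanged.
func filterByClusterID(lbs []lb, uid string) []lb {
	if uid == "" {
		return lbs
	}
	var kept []lb
	for _, l := range lbs {
		if keepForCluster(l, uid) {
			kept = append(kept, l)
		}
	}
	return kept
}

func main() {
	candidates := []lb{
		{ID: "ours", Tags: []string{"kube_cluster_id_uid-a"}},
		{ID: "legacy", Tags: []string{"some_other_tag"}},
		{ID: "foreign", Tags: []string{"kube_cluster_id_uid-b"}},
	}
	for _, l := range filterByClusterID(candidates, "uid-a") {
		fmt.Println(l.ID) // prints "ours", then "legacy"
	}
}
```

In this sketch the caller would treat an empty result as NotFound and log the warning; the filter itself only decides which candidates survive.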

The cluster UID is read once at controller-manager start-up. If the
lookup fails (RBAC denial, missing namespace, etc.) the safeguard is
disabled and OCCM falls back to the legacy name-based behaviour, so the
change is strictly additive. Pre-existing load balancers also gain the
kube_cluster_id_* tag during the next reconciliation.
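
The graceful degradation described above might look roughly like the following. It is a sketch only: `namespaceUIDGetter`, `clusterUIDOrEmpty`, and `fakeGetter` are hypothetical names introduced so the example runs with the standard library alone. With client-go, the underlying call would be `CoreV1().Namespaces().Get(ctx, "kube-system", metav1.GetOptions{})`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// namespaceUIDGetter is a hypothetical abstraction over the Kubernetes
// client, reduced to the one lookup this sketch needs.
type namespaceUIDGetter interface {
	NamespaceUID(name string) (string, error)
}

// errForbidden simulates an RBAC denial for the demonstration.
var errForbidden = errors.New(`namespaces "kube-system" is forbidden`)

// clusterUIDOrEmpty returns the kube-system UID, or "" when the lookup
// fails. An empty result means the safeguard is disabled; it is never
// treated as a fatal error.
func clusterUIDOrEmpty(c namespaceUIDGetter) string {
	uid, err := c.NamespaceUID("kube-system")
	if err != nil {
		log.Printf("warning: cluster-id safeguard disabled, using name-based lookup: %v", err)
		return ""
	}
	return uid
}

// fakeGetter is a test double that returns a fixed UID or error.
type fakeGetter struct {
	uid string
	err error
}

func (f fakeGetter) NamespaceUID(string) (string, error) { return f.uid, f.err }

func main() {
	fmt.Printf("%q\n", clusterUIDOrEmpty(fakeGetter{uid: "1234-abcd"}))  // "1234-abcd"
	fmt.Printf("%q\n", clusterUIDOrEmpty(fakeGetter{err: errForbidden})) // ""
}
```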

Which issue this PR fixes (if applicable):
fixes #3102

Special notes for reviewers:

  • Backward compatibility:
    • Load balancers without any kube_cluster_id_* tag keep the previous
      behaviour. They are tagged on the next successful reconcile.
    • Load balancers looked up via the existing
      loadbalancer.openstack.org/load-balancer-id annotation (i.e. on
      every reconcile after the first one) go through GetLoadbalancerByID,
      which is unaffected.
  • New RBAC: get on namespaces is added to both the manifest
    ClusterRole (manifests/controller-manager/cloud-controller-manager-roles.yaml)
    and the helm chart (charts/openstack-cloud-controller-manager/templates/clusterrole.yaml).
    If the verb is unavailable the safeguard simply degrades to the legacy
    behaviour with a warning log; OCCM does not refuse to start.
  • Octavia API >= v2.5 (Stein) is required for the tag feature. This is
    already gated by svcConf.supportLBTags and behaves as before on older
    clouds.
  • New unit tests:
    • TestFilterLoadBalancersByClusterID covers the matching, legacy,
      foreign-only, and mixed cases.
    • TestFetchClusterUID covers happy path and graceful degradation
      (missing namespace, forbidden) with a fake clientset.
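
The "tagged on the next successful reconcile" behaviour from the backward-compatibility notes can be sketched as an idempotent tag merge. This is an assumption-laden illustration: `withClusterIDTag` is a hypothetical helper, and the real reconcile path may structure the update differently.

```go
package main

import "fmt"

const clusterIDTagPrefix = "kube_cluster_id_"

// withClusterIDTag returns the tag list with this cluster's identity tag
// present, plus a flag saying whether anything changed. The flag lets a
// caller skip the update call when the tag is already there. An empty uid
// (safeguard disabled) leaves the tags untouched.
func withClusterIDTag(tags []string, uid string) ([]string, bool) {
	if uid == "" {
		return tags, false
	}
	want := clusterIDTagPrefix + uid
	for _, t := range tags {
		if t == want {
			return tags, false
		}
	}
	// Copy before appending so the caller's slice is never modified.
	out := make([]string, 0, len(tags)+1)
	out = append(out, tags...)
	out = append(out, want)
	return out, true
}

func main() {
	tags := []string{"team_payments"}

	tags, changed := withClusterIDTag(tags, "uid-a")
	fmt.Println(tags, changed) // [team_payments kube_cluster_id_uid-a] true

	tags, changed = withClusterIDTag(tags, "uid-a")
	fmt.Println(tags, changed) // [team_payments kube_cluster_id_uid-a] false
}
```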

How to verify manually:

go test ./pkg/openstack/...

A reproduction of the original failure mode (two clusters in the same
project, same --cluster-name, same Service ns/name) is described in
#3102.

Release note:

[openstack-cloud-controller-manager] Octavia load balancers now carry a
stable cluster-identity tag (`kube_cluster_id_<kube-system-uid>`) so OCCM
will no longer adopt a load balancer that belongs to a different
Kubernetes cluster sharing the same OpenStack project, even when the load
balancer name collides. Pre-existing load balancers gain the tag on the
next reconcile; load balancers without the tag keep the previous
behaviour. The cloud-controller-manager ClusterRole gains `get` on
`namespaces`.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 30, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 30, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @enginrect!

It looks like this is your first PR to kubernetes/cloud-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @enginrect. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kayrus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 30, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 30, 2026
@enginrect
Author

Hi @kayrus @stephenfin @zetaab — first-time contributor here. This PR addresses the long-standing cross-cluster LB collision issue (refs #2241, #2571, #2624) with an additive, backward-compatible kube_cluster_id_<kube-system-uid> tag on Octavia load balancers. Lookups now reject LBs tagged for a different cluster, fall back to legacy behaviour for untagged LBs, and tag pre-existing LBs on the next reconcile. The safeguard degrades gracefully (warning log + legacy behaviour) if the new get on namespaces RBAC is not granted, so the change is strictly additive.

Could one of you take a look and add /ok-to-test when convenient? The failing "Lint Charts" check is unrelated to this PR — it is a pre-existing repository-policy issue on master where the workflow uses unpinned action tags, and it currently fails on every PR.

Thanks!

…sions

OCCM constructs Octavia load balancer names as
kube_service_<cluster-name>_<namespace>_<service>. When two Kubernetes
clusters share the same OpenStack project and use the same
--cluster-name (default "kubernetes"), services with identical
namespace/name produce identical load balancer names. Octavia does not
enforce uniqueness on load balancer names, so OCCM's first-time
name-based lookup can adopt and overwrite a load balancer that actually
belongs to a different cluster (see issues kubernetes#2241, kubernetes#2571, kubernetes#2624).

This commit adds a stable Kubernetes cluster identifier - the UID of
the kube-system namespace - as a load balancer tag of the form
kube_cluster_id_<uid>. getLoadbalancerByName now ignores load balancers
that carry a cluster-id tag for a different cluster and falls back to
the legacy behaviour for load balancers without any cluster-id tag, so
existing deployments keep working unchanged. Pre-existing load
balancers gain the new tag during the next reconciliation.

The cluster UID is read once at controller-manager start-up via the
kube-system namespace; failure to read it (RBAC denial, missing
namespace) is non-fatal and disables the safeguard, falling back to
legacy name-based lookup. The cloud-controller-manager ClusterRole and
the helm chart gain "get" on namespaces.

Made-with: Cursor
@enginrect enginrect force-pushed the occm-cluster-id-tag branch from 123ffe4 to bbf5f4f Compare May 4, 2026 05:35
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2026
@enginrect enginrect force-pushed the occm-cluster-id-tag branch from 1b0cbe1 to 9bb737f Compare May 4, 2026 05:40
The previous commit added a "get" on "namespaces" rule to the
Helm chart's ClusterRole template. chart-testing requires a
version bump on any chart modification. Bumping the patch
version since the change is additive and backward-compatible.
@enginrect enginrect force-pushed the occm-cluster-id-tag branch from 9bb737f to b620ccf Compare May 4, 2026 05:43
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 4, 2026
@enginrect
Author

Quick update: I've rebased on top of master (which now has the SHA-pinned actions from #3100) and bumped the helm chart patch version to 2.35.1 — Lint Charts and EasyCLA are both green now. The PR is still blocked on needs-ok-to-test. @kayrus @stephenfin @zetaab — would one of you mind adding /ok-to-test when convenient? Happy to address review feedback as it comes in.

Comment thread pkg/openstack/openstack.go Outdated
// identifier on OpenStack load balancer tags. May be empty if the lookup
// failed or RBAC does not allow it; in that case OCCM falls back to the
// legacy name-based load balancer identification.
clusterUID string
Contributor

clusterUID can be defined only in LoadBalancer struct, please remove it from here.

Author

Good catch — thanks! You're absolutely right, that was a cohesion miss on my part. clusterUID is only consumed by the LB code path so it shouldn't pollute the global OpenStack struct. Removed in 56756ea; the field now lives only on LoadBalancer where it belongs.

Comment thread pkg/openstack/openstack.go Outdated
klog.V(1).Info("Claiming to support LoadBalancer")

-return &LbaasV2{LoadBalancer{secret, network, lb, os.lbOpts, os.kclient, os.eventRecorder}}, true
+return &LbaasV2{LoadBalancer{secret, network, lb, os.lbOpts, os.kclient, os.eventRecorder, os.clusterUID}}, true
Contributor

you can set clusterUID value here with:

clusterID := fetchClusterUID(os.kclient)

Author

Thank you, that's a much cleaner pattern. Fetching lazily inside LoadBalancer() keeps the kube-system namespace lookup out of the global Initialize() path and limits the change to the LB construction site. As a nice side effect, clusters that disable LB now skip the lookup entirely. Done in 56756ea.

Comment on lines 192 to 194
opts := loadbalancers.ListOpts{
	Name: name,
}
Contributor

@kayrus kayrus May 6, 2026

you can filter by tags using ListOpts.Tags. filterLoadBalancersByClusterID doesn't make sense.
UPD: please ignore this comment

Comment thread pkg/openstack/loadbalancer.go Outdated
// balancer with a matching name that belongs to a different Kubernetes
// cluster (different cluster-id tag). The lookup is treated as NotFound
// so OCCM creates a new load balancer instead of stealing an existing one.
eventLBStolen = "LoadBalancerNameCollision"
Contributor

is it used somewhere?

Author

Great question — and the honest answer is: no, it's not used anywhere, sorry about the leftover. The original intent was to emit a Warning event via eventRecorder from getLoadbalancerByName when we drop a foreign-tagged LB, but plumbing the event recorder and *v1.Service into that free-standing function felt out of scope for this PR (the existing warning log already covers operator visibility), so I dropped the emission and forgot to remove the constant. Cleaned up in 56756ea.

- Remove duplicate clusterUID field from the OpenStack struct so the
  identifier lives only on the LoadBalancer struct that actually uses it
  (better cohesion).
- Drop fetchClusterUID() out of Initialize() and call it lazily inside
  the LoadBalancer() factory instead. Clusters that disable LB now skip
  the kube-system namespace lookup entirely, and the change touches only
  the LB construction path.
- Remove the unused eventLBStolen constant. It was a leftover from an
  earlier draft that intended to emit a Warning event from
  getLoadbalancerByName(); plumbing the eventRecorder + *v1.Service into
  that free-standing function felt out of scope, so the emission was
  dropped but the constant was left behind.
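
The effect of moving the lookup into the factory can be sketched as follows. Everything here is hypothetical scaffolding for illustration (`provider`, `loadBalancer`, the `lbEnabled` flag, and the `lookups` counter); it is not the OCCM code, only the shape of the change described in the second bullet.

```go
package main

import "fmt"

// provider and loadBalancer are hypothetical stand-ins for the OpenStack
// and LoadBalancer structs discussed in the review.
type provider struct {
	lbEnabled bool
	lookups   int // counts namespace lookups, for illustration only
}

type loadBalancer struct {
	clusterUID string
}

// fetchClusterUID stands in for the real kube-system namespace lookup.
func (p *provider) fetchClusterUID() string {
	p.lookups++
	return "uid-a"
}

// LoadBalancer is the factory. The UID lookup happens here rather than
// during provider initialisation, so it is skipped when LB support is off.
func (p *provider) LoadBalancer() (*loadBalancer, bool) {
	if !p.lbEnabled {
		return nil, false
	}
	return &loadBalancer{clusterUID: p.fetchClusterUID()}, true
}

func main() {
	off := &provider{lbEnabled: false}
	_, ok := off.LoadBalancer()
	fmt.Println(ok, off.lookups) // false 0

	on := &provider{lbEnabled: true}
	lb, ok := on.LoadBalancer()
	fmt.Println(ok, lb.clusterUID, on.lookups) // true uid-a 1
}
```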
@enginrect
Author

enginrect commented May 6, 2026

Thanks for the review @kayrus! Pushed the changes in 56756ea:

  • Dropped the duplicate clusterUID field from the OpenStack struct
    (cohesion fix; thanks for catching this).
  • Moved the kube-system UID lookup into the LoadBalancer() factory
    as a lazy fetch, so it lives in the LB code path only.
  • Removed the unused eventLBStolen constant — explained in the inline
    thread.

go test ./pkg/openstack/... is green locally and the existing checks
(Lint Charts, EasyCLA) are still passing.

@kayrus
Contributor

kayrus commented May 7, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 7, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 7, 2026
The pull-cloud-provider-openstack-check prow job runs golangci-lint
v2.3.1 with staticcheck enabled, which flags fake.NewSimpleClientset as
SA1019 (deprecated). TestFetchClusterUID, added earlier in this PR,
used the deprecated function. Swap it for fake.NewClientset; the
signature is identical (objects ...runtime.Object) and the unit tests
still pass.
@enginrect
Author

enginrect commented May 7, 2026

The check job was tripping on SA1019 because the new TestFetchClusterUID was using the deprecated fake.NewSimpleClientset. Pushed bdabd79 swapping it for fake.NewClientset (identical signature, same behaviour for our tests).

Verified locally:

  • go test ./pkg/openstack/... → pass
  • go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.3.1 run --timeout=20m ./... → 0 issues

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[occm] Cross-cluster load balancer name collision when multiple Kubernetes clusters share an OpenStack project

3 participants