
ARO-26764: narrow HostedCluster primary watch to actionable metadata updates #8424

Open
tuxerrante wants to merge 5 commits into openshift:main from
tuxerrante:tuxerrante/hostedcluster-generation-sync

Conversation

@tuxerrante

@tuxerrante tuxerrante commented May 5, 2026

What this PR does / why we need it:

The HostedCluster controller currently receives reconcile requests from its primary watch for updates that do not represent new desired state, especially status-only churn. That makes the controller noisier than it needs to be and makes it harder to reason about which reconciliations are actually driven by user intent versus controller-written status.

This change narrows only the primary HostedCluster watch so that it continues to reconcile on meaningful inputs, while avoiding self-induced reconciles from status-only updates.

The key design point is that this cannot be implemented as a pure generation-only filter. For HostedCluster, some metadata-only updates are already real control inputs for reconciliation and must continue to enqueue even when the object generation does not change. Examples include specific annotations consumed by reconciliation, mirrored api.openshift.com/* labels, scope transitions, and deletion start.

To keep the behavior correct while reducing noise, this PR introduces a dedicated primary predicate that allows:

  • generation-changing updates
  • deletion timestamp transitions
  • actionable annotation changes
  • actionable mirrored label changes

and filters out status-only updates from the primary watch.
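A minimal sketch of the enqueue decision described above, written against plain structs rather than controller-runtime so it stands alone. The `objectMeta` type, the `shouldEnqueueUpdate` name, and the allowlist contents (including the `example.io/...` annotation key) are assumptions made for illustration, not the PR's actual code, and scope transitions are left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// objectMeta is a hypothetical stand-in for the metadata fields the predicate inspects.
type objectMeta struct {
	Generation        int64
	DeletionTimestamp *time.Time
	Annotations       map[string]string
	Labels            map[string]string
}

// Hypothetical allowlists; the real lists in the PR may differ.
var (
	actionableAnnotations   = []string{"example.io/skip-release-image-validation"}
	actionableLabelPrefixes = []string{"api.openshift.com/"}
)

func isActionableAnnotation(key string) bool {
	for _, a := range actionableAnnotations {
		if key == a {
			return true
		}
	}
	return false
}

func hasActionableLabelPrefix(key string) bool {
	for _, p := range actionableLabelPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// actionableChanged reports whether any key accepted by match was added,
// removed, or changed between oldM and newM.
func actionableChanged(oldM, newM map[string]string, match func(string) bool) bool {
	for k, v := range oldM {
		if !match(k) {
			continue
		}
		if nv, ok := newM[k]; !ok || nv != v {
			return true
		}
	}
	for k := range newM {
		if !match(k) {
			continue
		}
		if _, ok := oldM[k]; !ok {
			return true
		}
	}
	return false
}

// shouldEnqueueUpdate mirrors the four allowed triggers; anything else,
// including a status-only write, is filtered out.
func shouldEnqueueUpdate(oldObj, newObj objectMeta) bool {
	if oldObj.Generation != newObj.Generation {
		return true
	}
	if (oldObj.DeletionTimestamp == nil) != (newObj.DeletionTimestamp == nil) {
		return true
	}
	if actionableChanged(oldObj.Annotations, newObj.Annotations, isActionableAnnotation) {
		return true
	}
	return actionableChanged(oldObj.Labels, newObj.Labels, hasActionableLabelPrefix)
}

func main() {
	old := objectMeta{Generation: 1, Labels: map[string]string{"api.openshift.com/id": "a"}}
	statusOnly := old // metadata untouched, as in a status-only write
	fmt.Println(shouldEnqueueUpdate(old, statusOnly)) // false

	now := time.Now()
	deleting := old
	deleting.DeletionTimestamp = &now
	fmt.Println(shouldEnqueueUpdate(old, deleting)) // true
}
```

Comparing only allowlisted keys is what keeps unrelated label or annotation edits from re-opening the noisy path.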

The change is intentionally narrow:

  • child-resource watches are left unchanged
  • the API / CRD surface is unchanged
  • the existing ObservedGeneration status model remains in place
  • the reconcile loop still re-reads the current HostedCluster, so stale queued requests continue to act on the latest available object state

Alongside the watch change, this PR fixes two same-area behaviors that became important once metadata-only updates were treated more explicitly:

  1. ValidReleaseImage is re-evaluated when SkipReleaseImageValidation is added or removed without a generation bump, so a previously cached True condition cannot remain stale.
  2. mirrored api.openshift.com/* labels are actively removed from HostedControlPlane when they are removed from the HostedCluster, instead of only being copied on add/update.
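One way to picture the first fix is as a small decision function: re-evaluate when the cached condition no longer matches either the generation or the presence of the skip annotation. The sketch below is an assumption about the shape of that check, not the PR's `shouldValidateReleaseImage`; the `condition` type and `needsRevalidation` name are hypothetical, and only the `ReleaseImageValidationSkipped` reason string is taken from the walkthrough in this conversation.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down status condition.
type condition struct {
	Status             string
	Reason             string
	ObservedGeneration int64
}

const skippedReason = "ReleaseImageValidationSkipped"

// needsRevalidation reports whether the cached ValidReleaseImage condition
// should be recomputed for the current object state.
func needsRevalidation(cond *condition, generation int64, skipAnnotated bool) bool {
	if cond == nil {
		return true // nothing cached yet
	}
	if cond.ObservedGeneration != generation {
		return true // spec changed since the condition was written
	}
	if cond.Status != "True" {
		return true // assumption: failures are always retried
	}
	// Same generation and cached True: recompute only if the skip annotation
	// was added or removed since the condition was recorded.
	wasSkipped := cond.Reason == skippedReason
	return wasSkipped != skipAnnotated
}

func main() {
	cached := &condition{Status: "True", Reason: "AsExpected", ObservedGeneration: 3}
	fmt.Println(needsRevalidation(cached, 3, false)) // false: cache still valid
	fmt.Println(needsRevalidation(cached, 3, true))  // true: annotation added
}
```

Recording the skip in the condition's reason is what lets a same-generation toggle be detected without any extra status field.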

Overall, the goal is to reduce unnecessary reconciliations without weakening any existing metadata-driven control path.

Which issue(s) this PR fixes:

Fixes ARO-26764

Notes:

This PR is easier to review if you read it as a queueing-boundary change rather than as a reconcile algorithm rewrite.

The new predicate does not try to guarantee one reconcile per generation. Instead, it makes the enqueue boundary stricter and more explicit. The "latest object wins" behavior still comes from the existing reconcile entrypoint, which fetches the current HostedCluster by NamespacedName when the request is processed. That means older queued requests still reconcile against the latest stored object state, not the old event payload.
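The "latest object wins" point can be illustrated with a toy store: a queued request carries only a key, and the reconcile entrypoint looks the object up when the request is processed. The `store`, `key`, and `reconcile` names below are hypothetical stand-ins for the client and the controller's reconcile function.

```go
package main

import "fmt"

// key is a hypothetical stand-in for a NamespacedName.
type key struct {
	Namespace string
	Name      string
}

// hostedCluster holds only the field this sketch needs.
type hostedCluster struct {
	Generation int64
}

// store plays the role of the API server or cache the reconciler reads from.
type store map[key]hostedCluster

// reconcile ignores whatever event produced the request and reads the
// current object by key, returning the generation it acted on.
func reconcile(s store, req key) (int64, bool) {
	hc, found := s[req]
	if !found {
		return 0, false // object is gone; nothing to do
	}
	return hc.Generation, true
}

func main() {
	req := key{Namespace: "clusters", Name: "demo"}
	s := store{req: {Generation: 1}}

	// The request is queued while the object is at generation 1, and the
	// object is updated before the request is processed.
	s[req] = hostedCluster{Generation: 2}

	gen, _ := reconcile(s, req)
	fmt.Println(gen) // 2: the stale request still acts on the latest state
}
```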

Because of that, the review focus should be:

  • whether the allowlist of actionable metadata inputs is complete
  • whether status-only updates are now correctly filtered out from the primary watch
  • whether same-generation metadata changes still propagate correctly
  • whether the blast radius is appropriately contained to the shared HostedCluster primary watch and adjacent regression fixes

Expected impact:

  • fewer self-induced HostedCluster reconciliations from status-only updates
  • no intended behavior change for spec updates
  • no intended behavior change for metadata-only inputs that already affect reconciliation
  • HCP-wide effect because this path is shared across Hosted Control Plane distributions, not specific to one platform

Test plan:

  • make verify
  • predicate unit tests cover generation changes, status-only updates, actionable annotations, actionable labels, deletion timestamp changes, and scope transitions
  • falsification tests verify that stale queued requests still reconcile the latest HostedCluster generation and latest actionable metadata
  • regression tests cover same-generation SkipReleaseImageValidation toggles
  • regression tests cover removal of mirrored api.openshift.com/* labels from HostedControlPlane

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • Added option to skip release image validation via annotation.
  • Bug Fixes

    • Improved synchronization of annotations and labels between hosted cluster components with automatic removal of stale entries.
    • Enhanced reconciliation condition tracking to align with current generation.
  • Tests

    • Added comprehensive test coverage for predicates, label handling, reconciliation conditions, and release image validation flows.

tuxerrante and others added 3 commits May 5, 2026 17:13
Reduce self-induced HostedCluster reconciles by filtering the primary watch to
spec changes and explicit metadata triggers that already affect reconciliation.

Preserve mirrored annotation and label behavior with focused falsification tests
so stale queued requests still reconcile the latest HostedCluster state.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Co-authored-by: Cursor <cursoragent@cursor.com>

Track when release image validation was skipped so same-generation annotation
changes can invalidate a cached True ValidReleaseImage condition.

Add a focused regression test matrix covering skip annotation add and remove
transitions without a generation bump.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Co-authored-by: Cursor <cursoragent@cursor.com>

Update the new HostedCluster predicate code to use the k8sutil-scoped
annotation constants introduced on current main so the branch verifies cleanly
after rebasing.

Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot

openshift-ci-robot commented May 5, 2026

@tuxerrante: This pull request references ARO-26764 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the enhancement to target either version "5.0." or "openshift-5.0.", but it targets "ARO-Installer-4.21" instead.

Details

In response to this:

Summary

  • narrow the shared HostedCluster primary watch so HCP stops reconciling on status-only churn while still enqueueing on explicit metadata-only control inputs
  • keep child-resource watches and API/CRD surfaces unchanged, while preserving actionable annotations, mirrored api.openshift.com/* labels, delete-start handling, and release-image validation semantics
  • add falsification and regression coverage for latest-object-wins behavior, same-generation actionable metadata changes, mirrored label removal, and SkipReleaseImageValidation toggles

Test plan

  • make verify
  • Focused HostedCluster predicate, reconciliation, and release-image validation tests covered through the package test suite exercised by make verify
  • Optional follow-up: manually confirm a draft PR title/body and Jira linkage rendering as expected

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9e56608c-0949-412c-a3b7-2cb35b87c624

📥 Commits

Reviewing files that changed from the base of the PR and between 1280cf1 and 037e346.

📒 Files selected for processing (1)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_predicates_test.go

📝 Walkthrough

The PR modifies HostedCluster reconciliation predicates, replacing the prior scoping predicate with hostedClusterPrimaryPredicate that fires on generation, deletion timestamp, or actionable annotation/label changes (including prefix-based changes). It changes HostedControlPlane annotation and label mirroring to remove stale prefixed entries and copy only keys matching configured actionable prefixes. The release image validation flow was refactored with helpers shouldValidateReleaseImage and hasSkipReleaseImageValidationAnnotation, allowing conditional skipping of validation and centralizing condition reasoning.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant API as API Server
    participant Controller as HostedCluster Controller
    participant HC as HostedCluster
    participant HCP as HostedControlPlane
    participant Validator as ReleaseImageValidator

    API->>Controller: HostedCluster update event
    Controller->>Controller: hostedClusterPrimaryPredicate(r) evaluates
    alt Predicate allows update
        Controller->>API: Get HostedCluster
        API-->>Controller: HostedCluster (annotations, labels, generation, deletionTimestamp)
        Controller->>HCP: Get HostedControlPlane
        HCP-->>Controller: HostedControlPlane (annotations, labels)
        Controller->>Controller: Compute actionable annotation/label diffs (prefix-aware)
        Controller->>HCP: Remove stale prefixed annotations/labels
        Controller->>HCP: Copy actionable prefixed annotations/labels from HC
        Controller->>Controller: shouldValidateReleaseImage(hcluster, condition)?
        alt Skip annotation present
            Controller->>Controller: Set ValidReleaseImage True (ReleaseImageValidationSkipped)
        else Validate
            Controller->>Validator: validateReleaseImage(releaseImageRef, secrets...)
            alt Validation succeeds
                Validator-->>Controller: success
                Controller->>Controller: Set ValidReleaseImage True (AsExpected)
            else Validation fails
                Validator-->>Controller: error
                Controller->>Controller: Set ValidReleaseImage False (reason)
            end
        end
        Controller->>API: Update HCP and HC status/conditions
        API-->>Controller: persisted
    else Predicate blocks update
        Controller-->>API: no reconcile triggered
    end
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.35% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality | ⚠️ Warning | Assertion messages lack diagnostic details. hcp_labels_test.go and predicates_test.go have assertions missing variables. Violates requirement 4. | Add variables to all assertions showing actual vs expected values. Example: change "expected stale label removed" to "expected stale label removed, got: %v" with the actual label map.
✅ Passed checks (10 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and concisely summarizes the main change: narrowing the HostedCluster primary watch to only respond to actionable metadata updates, which directly aligns with the core objective of reducing unnecessary reconciliations.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | All test names are stable and deterministic. Static test function names, table-driven test cases with hardcoded strings, no dynamic content like timestamps, UUIDs, or pod names.
Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. All new test files are standard Go unit tests using the testing package, not Ginkgo framework. The check is not applicable to this PR.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests present. PR adds only standard Go unit tests using func Test*(t *testing.T) patterns. SNO compatibility check applies only to Ginkgo e2e tests, not present here.
Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies controller reconciliation logic only. No scheduling constraints, deployment specs, affinity rules, or topology-incompatible assumptions are introduced.
Ote Binary Stdout Contract | ✅ Passed | PR does not contain OTE binary code. It adds standard Go unit tests (testing.T framework) with no Ginkgo/suite setup. No stdout-writing patterns detected in test or production code.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The PR adds only standard Go unit tests (func Test...(*testing.T)), not Ginkgo e2e tests. The custom check applies only to new Ginkgo e2e tests, therefore this check is not applicable to this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@openshift-ci openshift-ci Bot added the area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tuxerrante
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)

2399-2407: ⚡ Quick win

Avoid hardcoded actionable-label prefix in HCP label reconciliation.

Line 2400 and Line 2405 duplicate "api.openshift.com" instead of reusing hostedClusterActionableLabelPrefixes. Keeping one source of truth prevents future drift between watch triggering and label mirroring behavior.

♻️ Suggested refactor
-	for key := range hcp.Labels {
-		if strings.HasPrefix(key, "api.openshift.com") {
-			delete(hcp.Labels, key)
-		}
-	}
+	for key := range hcp.Labels {
+		for _, prefix := range hostedClusterActionableLabelPrefixes {
+			if strings.HasPrefix(key, prefix) {
+				delete(hcp.Labels, key)
+				break
+			}
+		}
+	}
 	for key, val := range hcluster.Labels {
-		if strings.HasPrefix(key, "api.openshift.com") {
-			hcp.Labels[key] = val
-		}
+		for _, prefix := range hostedClusterActionableLabelPrefixes {
+			if strings.HasPrefix(key, prefix) {
+				hcp.Labels[key] = val
+				break
+			}
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`
around lines 2399 - 2407, Replace the two hardcoded "api.openshift.com" checks
with the shared constant hostedClusterActionableLabelPrefixes: iterate the slice
hostedClusterActionableLabelPrefixes when pruning hcp.Labels and when copying
labels from hcluster to hcp (the code blocks that currently loop over hcp.Labels
and hcluster.Labels), check each label key against all prefixes in
hostedClusterActionableLabelPrefixes (e.g., using strings.HasPrefix in an inner
loop or a small helper like hasActionablePrefix(key)), and use that single
source of truth so both the deletion and assignment logic reference
hostedClusterActionableLabelPrefixes instead of the literal string.
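A self-contained sketch of the helper-based variant suggested above. `hasActionablePrefix` is the helper name proposed in the review; `mirrorActionableLabels` and the contents of the prefix slice are assumptions for illustration, with plain maps standing in for `hcluster.Labels` and `hcp.Labels`.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed contents; the prefix literal is the one shown in the diff.
var hostedClusterActionableLabelPrefixes = []string{"api.openshift.com"}

// hasActionablePrefix reports whether key starts with any shared prefix.
func hasActionablePrefix(key string) bool {
	for _, prefix := range hostedClusterActionableLabelPrefixes {
		if strings.HasPrefix(key, prefix) {
			return true
		}
	}
	return false
}

// mirrorActionableLabels prunes every actionable label from dst, then copies
// the actionable labels currently on src, so removals propagate as well.
// Labels without an actionable prefix are left untouched on dst.
func mirrorActionableLabels(src, dst map[string]string) map[string]string {
	if dst == nil {
		dst = map[string]string{}
	}
	for key := range dst {
		if hasActionablePrefix(key) {
			delete(dst, key) // deleting during range is safe in Go
		}
	}
	for key, val := range src {
		if hasActionablePrefix(key) {
			dst[key] = val
		}
	}
	return dst
}

func main() {
	hcLabels := map[string]string{"api.openshift.com/env": "prod", "team": "x"}
	hcpLabels := map[string]string{"api.openshift.com/stale": "1", "keep": "yes"}
	fmt.Println(mirrorActionableLabels(hcLabels, hcpLabels))
	// map[api.openshift.com/env:prod keep:yes]
}
```

Using the same helper for both the prune and the copy means the watch predicate and the mirroring logic cannot drift apart on which keys count as actionable.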

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 186ff352-7783-4ad9-9175-cd058d919415

📥 Commits

Reviewing files that changed from the base of the PR and between e09cc2d and 553451f.

📒 Files selected for processing (6)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_hcp_labels_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_predicates.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_predicates_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_reconciliation_condition_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_release_image_validation_test.go

@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.45%. Comparing base (e09cc2d) to head (037e346).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
...trollers/hostedcluster/hostedcluster_controller.go | 71.42% | 10 Missing ⚠️
...trollers/hostedcluster/hostedcluster_predicates.go | 89.04% | 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8424      +/-   ##
==========================================
+ Coverage   37.39%   37.45%   +0.05%     
==========================================
  Files         751      752       +1     
  Lines       91806    92008     +202     
==========================================
+ Hits        34333    34461     +128     
- Misses      54838    54906      +68     
- Partials     2635     2641       +6     
Files with missing lines | Coverage Δ
...trollers/hostedcluster/hostedcluster_predicates.go | 89.04% <89.04%> (ø)
...trollers/hostedcluster/hostedcluster_controller.go | 42.61% <71.42%> (-0.62%) ⬇️

... and 3 files with indirect coverage changes

Flag | Coverage Δ
cmd-support | 32.63% <ø> (+0.07%) ⬆️
cpo-hostedcontrolplane | 36.48% <ø> (ø)
cpo-other | 37.73% <ø> (ø)
hypershift-operator | 47.96% <83.33%> (+0.11%) ⬆️
other | 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Keep HostedControlPlane label mirroring aligned with the shared actionable label
prefix list so the watch predicate and reconciliation logic stay in sync.

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)

1191-1218: 💤 Low value

Optional: gate the validateReleaseImage call on skipReleaseImageValidation for clarity.

When the skip annotation is present, r.validateReleaseImage is still invoked but its result is discarded by the case skipReleaseImageValidation: branch. The function happens to short-circuit internally (line 3708), so this is functionally correct, but the caller reads as if validation always runs. Guarding the call makes intent explicit and avoids relying on the callee's internal guard surviving future edits.

♻️ Suggested refactor
 		if shouldValidateReleaseImage(hcluster, condition) {
 			condition := metav1.Condition{
 				Type:               string(hyperv1.ValidReleaseImage),
 				ObservedGeneration: hcluster.Generation,
 			}
 			skipReleaseImageValidation := hasSkipReleaseImageValidationAnnotation(hcluster)
-			err := r.validateReleaseImage(ctx, hcluster, releaseProvider)
+			var err error
+			if !skipReleaseImageValidation {
+				err = r.validateReleaseImage(ctx, hcluster, releaseProvider)
+			}
 			switch {
 			case skipReleaseImageValidation:
 				condition.Status = metav1.ConditionTrue
 				condition.Message = "Release image validation is skipped by annotation"
 				condition.Reason = releaseImageValidationSkippedReason
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`
around lines 1191 - 1218, The call to r.validateReleaseImage should be avoided
when the skip annotation is present to make the intent explicit: check
hasSkipReleaseImageValidationAnnotation(hcluster) (skipReleaseImageValidation)
before invoking r.validateReleaseImage and, if true, set the metav1.Condition
(Type string(hyperv1.ValidReleaseImage), Status True, Message "Release image
validation is skipped by annotation", Reason
releaseImageValidationSkippedReason) and call meta.SetStatusCondition without
calling validateReleaseImage; otherwise call r.validateReleaseImage and handle
err as currently done (setting Status/Message/Reason or AsExpectedReason) so the
skip path no longer relies on validateReleaseImage's internal short‑circuit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f30f4f99-4b2a-4808-a4c7-d870c58c50b9

📥 Commits

Reviewing files that changed from the base of the PR and between 553451f and 1280cf1.

📒 Files selected for processing (1)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go

@openshift-ci-robot

openshift-ci-robot commented May 6, 2026

@tuxerrante: This pull request references ARO-26764 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the enhancement to target the "5.0.0" version, but no target version was set.


Replace the dynamic platform override subtest titles with explicit static names
so the HostedCluster predicate tests satisfy deterministic naming checks
without changing the existing assertion style used in this package.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Co-authored-by: Cursor <cursoragent@cursor.com>
@tuxerrante marked this pull request as ready for review May 6, 2026 09:18
openshift-ci Bot removed the do-not-merge/work-in-progress label May 6, 2026
openshift-ci Bot requested review from Nirshal and muraee May 6, 2026 09:19

openshift-ci Bot commented May 6, 2026

@tuxerrante: all tests passed!


@hypershift-jira-solve-ci

I now have all the evidence needed to produce the report. Here is the complete analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: Red Hat Konflux / hypershift-cli-mce-50-on-pull-request
  • Build ID: hypershift-cli-mce-50-on-pull-request-pthb2
  • Pipeline: Konflux PipelineRun (not a Prow CI job)
  • PR: #8424 (ARO-26764: narrow HostedCluster primary watch to actionable metadata updates)
  • Commit: 037e3469f3afc4e63fe9e2aff4e3323e0f1b01cf
  • Started: 2026-05-06T09:18:27Z
  • Completed: 2026-05-06T09:37:28Z
  • Failed Task: sast-unicode-check (duration: 5 seconds)

Test Failure Analysis

Error

task sast-unicode-check has the status "TaskRunImagePullFailed":
the step "upload" in TaskRun "hypershift-cli-mce-50-on-pull-request-pthb2-sast-unicode-check"
failed to pull the image "". The pod errored with the message:
"Back-off pulling image "quay.io/konflux-ci/oras:latest@sha256:da693a7dcbadafc9f4422ae6600b41b2847944f7f14c5622827d6f58c727cf08"."

Summary

This is a transient Konflux infrastructure failure, not a code defect. The sast-unicode-check task failed because its upload step could not pull the container image quay.io/konflux-ci/oras:latest@sha256:da693a... from Quay.io. All 15 other pipeline tasks (including build-images, clone-repository, clair-scan, sast-snyk-check, etc.) completed successfully. The PR's code changes — limited to Go source files in the hostedcluster_controller package — have no relationship to the Konflux pipeline image configuration. The same hypershift-cli-mce-50-on-pull-request check is passing on other contemporaneous PRs (#8439, #8246, #8413, #8458), confirming this is a one-off image-pull failure.

Root Cause

The root cause is a transient container image pull failure in the Konflux CI infrastructure.

Specifically, the sast-unicode-check task's upload step attempted to pull quay.io/konflux-ci/oras:latest@sha256:da693a7dcbadafc9f4422ae6600b41b2847944f7f14c5622827d6f58c727cf08 and received a back-off error. This means the Kubernetes pod backing the TaskRun encountered an ImagePullBackOff condition — either because of a transient network failure between the Konflux build cluster (stone-prd-rh01.pg1f.p1.openshiftapps.com) and quay.io, a momentary Quay.io registry unavailability, or rate-limiting on the image pull.

Key evidence this is infrastructure-only:

  1. All other tasks succeeded: 15 of 16 pipeline tasks passed, including other tasks that pull images from quay.io (e.g., clair-scan, sast-snyk-check, clamav-scan).
  2. Other PRs pass the same check: #8439 (OCPBUGS-77856: fix: use NodePort for HCP router Service on non-cloud platforms), #8246 (OCPBUGS-81686: fix(authentication): use v2 auth validation for CEL and expression support), #8413 (GCP-636: feat(gcp): support for managing GCP OIDC discovery documents), and #8458 (OCPBUGS-85243: Set aws-load-balancer-scheme on public HCP router service) all show pass for the identical hypershift-cli-mce-50-on-pull-request check, which shows the pipeline definition is functional.
  3. PR changes are irrelevant: the PR only modifies 6 Go source files under hypershift-operator/controllers/hostedcluster/ (controller logic and unit tests). No Dockerfiles, build configs, Konflux pipeline definitions, or dependency files were changed.
  4. The task failed in 5 seconds: the extremely short duration indicates the task never ran its logic; it failed immediately at the image-pull stage before any analysis code executed.

Recommendations

  1. Re-trigger the pipeline: this is a transient failure. Re-running the Konflux pipeline (by pushing a new commit or using the Konflux UI to re-run the PipelineRun) should resolve it.
  2. No code changes needed: the PR's changes to hostedcluster_controller.go, hostedcluster_predicates.go, and associated test files are unrelated to this failure.
  3. If the failure persists on retry: check Quay.io status (https://status.quay.io) and verify the image quay.io/konflux-ci/oras:latest is accessible. If Quay.io is having issues, wait and retry later.
  4. Consider filing a Konflux infrastructure issue: if this pattern recurs frequently, it may warrant filing an issue with the Konflux team about image pull reliability or adding retry logic to the sast-unicode-check task definition.

Evidence

Evidence | Detail
--- | ---
Failed Task | sast-unicode-check, the only task that failed out of 16 total
Failure Status | TaskRunImagePullFailed, not a test/scan/build failure
Failed Step | upload step in TaskRun hypershift-cli-mce-50-on-pull-request-pthb2-sast-unicode-check
Failed Image | quay.io/konflux-ci/oras:latest@sha256:da693a7dcbadafc9f4422ae6600b41b2847944f7f14c5622827d6f58c727cf08
Error Message | Back-off pulling image (Kubernetes ImagePullBackOff)
Task Duration | 5 seconds (failed before executing any logic)
Other Tasks | All 15 other tasks succeeded (init, clone, build, scans, etc.)
Same Check on Other PRs | Passing on PRs #8439, #8246, #8413, #8458
PR Changed Files | 6 Go files in hypershift-operator/controllers/hostedcluster/ only
Build/Dockerfile Changes | None; no build config or pipeline config files modified
Pipeline Run | hypershift-cli-mce-50-on-pull-request-pthb2 in namespace crt-redhat-acm-tenant

Labels

  • area/hypershift-operator
  • jira/valid-reference
