ARO-26764: narrow HostedCluster primary watch to actionable metadata updates#8424
ARO-26764: narrow HostedCluster primary watch to actionable metadata updates#8424tuxerrante wants to merge 5 commits intoopenshift:mainfrom
Conversation
Reduce self-induced HostedCluster reconciles by filtering the primary watch to spec changes and explicit metadata triggers that already affect reconciliation. Preserve mirrored annotation and label behavior with focused falsification tests so stale queued requests still reconcile the latest HostedCluster state. Signed-off-by: Alessandro Affinito <aaffinit@redhat.com> Commit-Message-Assisted-by: Claude (via Claude Code) Co-authored-by: Cursor <cursoragent@cursor.com>
Track when release image validation was skipped so same-generation annotation changes can invalidate a cached True ValidReleaseImage condition. Add a focused regression test matrix covering skip annotation add and remove transitions without a generation bump. Signed-off-by: Alessandro Affinito <aaffinit@redhat.com> Commit-Message-Assisted-by: Claude (via Claude Code) Co-authored-by: Cursor <cursoragent@cursor.com>
Update the new HostedCluster predicate code to use the k8sutil-scoped annotation constants introduced on current main so the branch verifies cleanly after rebasing. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@tuxerrante: This pull request references ARO-26764 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the enhancement to target either version "5.0." or "openshift-5.0.", but it targets "ARO-Installer-4.21" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR modifies HostedCluster reconciliation predicates, replacing the prior scoping predicate with Sequence Diagram(s)sequenceDiagram
autonumber
participant API as API Server
participant Controller as HostedCluster Controller
participant HC as HostedCluster
participant HCP as HostedControlPlane
participant Validator as ReleaseImageValidator
API->>Controller: HostedCluster update event
Controller->>Controller: hostedClusterPrimaryPredicate(r) evaluates
alt Predicate allows update
Controller->>API: Get HostedCluster
API-->>Controller: HostedCluster (annotations, labels, generation, deletionTimestamp)
Controller->>HCP: Get HostedControlPlane
HCP-->>Controller: HostedControlPlane (annotations, labels)
Controller->>Controller: Compute actionable annotation/label diffs (prefix-aware)
Controller->>HCP: Remove stale prefixed annotations/labels
Controller->>HCP: Copy actionable prefixed annotations/labels from HC
Controller->>Controller: shouldValidateReleaseImage(hcluster, condition)?
alt Skip annotation present
Controller->>Controller: Set ValidReleaseImage True (ReleaseImageValidationSkipped)
else Validate
Controller->>Validator: validateReleaseImage(releaseImageRef, secrets...)
alt Validation succeeds
Validator-->>Controller: success
Controller->>Controller: Set ValidReleaseImage True (AsExpected)
else Validation fails
Validator-->>Controller: error
Controller->>Controller: Set ValidReleaseImage False (reason)
end
end
Controller->>API: Update HCP and HC status/conditions
API-->>Controller: persisted
else Predicate blocks update
Controller-->>API: no reconcile triggered
end
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tuxerrante The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)
2399-2407: ⚡ Quick winAvoid hardcoded actionable-label prefix in HCP label reconciliation.
Line 2400 and Line 2405 duplicate
"api.openshift.com"instead of reusinghostedClusterActionableLabelPrefixes. Keeping one source of truth prevents future drift between watch triggering and label mirroring behavior.♻️ Suggested refactor
- for key := range hcp.Labels { - if strings.HasPrefix(key, "api.openshift.com") { - delete(hcp.Labels, key) - } - } + for key := range hcp.Labels { + for _, prefix := range hostedClusterActionableLabelPrefixes { + if strings.HasPrefix(key, prefix) { + delete(hcp.Labels, key) + break + } + } + } for key, val := range hcluster.Labels { - if strings.HasPrefix(key, "api.openshift.com") { - hcp.Labels[key] = val - } + for _, prefix := range hostedClusterActionableLabelPrefixes { + if strings.HasPrefix(key, prefix) { + hcp.Labels[key] = val + break + } + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` around lines 2399 - 2407, Replace the two hardcoded "api.openshift.com" checks with the shared constant hostedClusterActionableLabelPrefixes: iterate the slice hostedClusterActionableLabelPrefixes when pruning hcp.Labels and when copying labels from hcluster to hcp (the code blocks that currently loop over hcp.Labels and hcluster.Labels), check each label key against all prefixes in hostedClusterActionableLabelPrefixes (e.g., using strings.HasPrefix in an inner loop or a small helper like hasActionablePrefix(key)), and use that single source of truth so both the deletion and assignment logic reference hostedClusterActionableLabelPrefixes instead of the literal string.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 2399-2407: Replace the two hardcoded "api.openshift.com" checks
with the shared constant hostedClusterActionableLabelPrefixes: iterate the slice
hostedClusterActionableLabelPrefixes when pruning hcp.Labels and when copying
labels from hcluster to hcp (the code blocks that currently loop over hcp.Labels
and hcluster.Labels), check each label key against all prefixes in
hostedClusterActionableLabelPrefixes (e.g., using strings.HasPrefix in an inner
loop or a small helper like hasActionablePrefix(key)), and use that single
source of truth so both the deletion and assignment logic reference
hostedClusterActionableLabelPrefixes instead of the literal string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 186ff352-7783-4ad9-9175-cd058d919415
📒 Files selected for processing (6)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_hcp_labels_test.gohypershift-operator/controllers/hostedcluster/hostedcluster_predicates.gohypershift-operator/controllers/hostedcluster/hostedcluster_predicates_test.gohypershift-operator/controllers/hostedcluster/hostedcluster_reconciliation_condition_test.gohypershift-operator/controllers/hostedcluster/hostedcluster_release_image_validation_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8424 +/- ##
==========================================
+ Coverage 37.39% 37.45% +0.05%
==========================================
Files 751 752 +1
Lines 91806 92008 +202
==========================================
+ Hits 34333 34461 +128
- Misses 54838 54906 +68
- Partials 2635 2641 +6
... and 3 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Keep HostedControlPlane label mirroring aligned with the shared actionable label prefix list so the watch predicate and reconciliation logic stay in sync. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)
1191-1218: 💤 Low valueOptional: gate the
validateReleaseImagecall onskipReleaseImageValidationfor clarity.When the skip annotation is present,
r.validateReleaseImageis still invoked but its result is discarded by thecase skipReleaseImageValidation:branch. The function happens to short-circuit internally (line 3708), so this is functionally correct, but the caller reads as if validation always runs. Guarding the call makes intent explicit and avoids relying on the callee's internal guard surviving future edits.♻️ Suggested refactor
if shouldValidateReleaseImage(hcluster, condition) { condition := metav1.Condition{ Type: string(hyperv1.ValidReleaseImage), ObservedGeneration: hcluster.Generation, } skipReleaseImageValidation := hasSkipReleaseImageValidationAnnotation(hcluster) - err := r.validateReleaseImage(ctx, hcluster, releaseProvider) + var err error + if !skipReleaseImageValidation { + err = r.validateReleaseImage(ctx, hcluster, releaseProvider) + } switch { case skipReleaseImageValidation: condition.Status = metav1.ConditionTrue condition.Message = "Release image validation is skipped by annotation" condition.Reason = releaseImageValidationSkippedReason🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` around lines 1191 - 1218, The call to r.validateReleaseImage should be avoided when the skip annotation is present to make the intent explicit: check hasSkipReleaseImageValidationAnnotation(hcluster) (skipReleaseImageValidation) before invoking r.validateReleaseImage and, if true, set the metav1.Condition (Type string(hyperv1.ValidReleaseImage), Status True, Message "Release image validation is skipped by annotation", Reason releaseImageValidationSkippedReason) and call meta.SetStatusCondition without calling validateReleaseImage; otherwise call r.validateReleaseImage and handle err as currently done (setting Status/Message/Reason or AsExpectedReason) so the skip path no longer relies on validateReleaseImage's internal short‑circuit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 1191-1218: The call to r.validateReleaseImage should be avoided
when the skip annotation is present to make the intent explicit: check
hasSkipReleaseImageValidationAnnotation(hcluster) (skipReleaseImageValidation)
before invoking r.validateReleaseImage and, if true, set the metav1.Condition
(Type string(hyperv1.ValidReleaseImage), Status True, Message "Release image
validation is skipped by annotation", Reason
releaseImageValidationSkippedReason) and call meta.SetStatusCondition without
calling validateReleaseImage; otherwise call r.validateReleaseImage and handle
err as currently done (setting Status/Message/Reason or AsExpectedReason) so the
skip path no longer relies on validateReleaseImage's internal short‑circuit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f30f4f99-4b2a-4808-a4c7-d870c58c50b9
📒 Files selected for processing (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
|
@tuxerrante: This pull request references ARO-26764 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the enhancement to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Replace the dynamic platform override subtest titles with explicit static names so the HostedCluster predicate tests satisfy deterministic naming checks without changing the existing assertion style used in this package. Signed-off-by: Alessandro Affinito <aaffinit@redhat.com> Commit-Message-Assisted-by: Claude (via Claude Code) Co-authored-by: Cursor <cursoragent@cursor.com>
|
@tuxerrante: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
I now have all the evidence needed to produce the report. Here is the complete analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThis is a transient Konflux infrastructure failure, not a code defect. The Root CauseThe root cause is a transient container image pull failure in the Konflux CI infrastructure. Specifically, the Key evidence this is infrastructure-only:
Recommendations
Evidence
|
What this PR does / why we need it:
The
HostedClustercontroller currently receives reconcile requests from its primary watch for updates that do not represent new desired state, especially status-only churn. That makes the controller noisier than it needs to be and makes it harder to reason about which reconciliations are actually driven by user intent versus controller-written status.This change narrows only the primary
HostedClusterwatch so that it continues to reconcile on meaningful inputs, while avoiding self-induced reconciles from status-only updates.The key design point is that this cannot be implemented as a pure generation-only filter. For
HostedCluster, some metadata-only updates are already real control inputs for reconciliation and must continue to enqueue even when the object generation does not change. Examples include specific annotations consumed by reconciliation, mirroredapi.openshift.com/*labels, scope transitions, and deletion start.To keep the behavior correct while reducing noise, this PR introduces a dedicated primary predicate that allows:
and filters out status-only updates from the primary watch.
The change is intentionally narrow:
ObservedGenerationstatus model remains in placeHostedCluster, so stale queued requests continue to act on the latest available object stateAlongside the watch change, this PR fixes two same-area behaviors that became important once metadata-only updates were treated more explicitly:
ValidReleaseImageis re-evaluated whenSkipReleaseImageValidationis added or removed without a generation bump, so a previously cachedTruecondition cannot remain stale.api.openshift.com/*labels are actively removed fromHostedControlPlanewhen they are removed from theHostedCluster, instead of only being copied on add/update.Overall, the goal is to reduce unnecessary reconciliations without weakening any existing metadata-driven control path.
Which issue(s) this PR fixes:
Fixes ARO-26764
Notes:
This PR is easier to review if you read it as a queueing-boundary change rather than as a reconcile algorithm rewrite.
The new predicate does not try to guarantee one reconcile per generation. Instead, it makes the enqueue boundary stricter and more explicit. The "latest object wins" behavior still comes from the existing reconcile entrypoint, which fetches the current
HostedClusterbyNamespacedNamewhen the request is processed. That means older queued requests still reconcile against the latest stored object state, not the old event payload.Because of that, the review focus should be:
HostedClusterprimary watch and adjacent regression fixesExpected impact:
HostedClusterreconciliations from status-only updatesTest plan:
make verifyHostedClustergeneration and latest actionable metadataSkipReleaseImageValidationtogglesapi.openshift.com/*labels fromHostedControlPlaneChecklist:
Summary by CodeRabbit
New Features
Bug Fixes
Tests