OCPBUGS-85192: Add retry logic for Azure DNS cleanup in hcp destroy #8454
bryan-cox wants to merge 1 commit into openshift:main
Conversation
@bryan-cox: This pull request references Jira Issue OCPBUGS-85192, which is valid. The bug has been moved to the POST state. 3 validations were run on this bug. The bug has been updated to refer to the pull request using the external bug tracker.
📝 Walkthrough

Adds a retrying cleanup path for Azure infra destruction when --preserve-resource-group is set.

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI (user)
    participant Cmd as Destroy Command
    participant Retry as retryDeleteClusterResources
    participant Deleter as resourceDeleter
    participant ARM as Azure Resource Manager
    CLI->>Cmd: invoke destroy with --preserve-resource-group + --azure-infra-grace-period
    Cmd->>Retry: start retry loop (use AzureInfraGracePeriod, retryInterval)
    loop until timeout or success
        Retry->>Deleter: ListByResourceGroup(resourceGroup)
        Deleter->>ARM: Query resources
        ARM-->>Deleter: resources list
        Deleter-->>Retry: resources (filtered by isClusterResource)
        alt no resources
            Retry-->>Cmd: success (stop)
        else resources exist
            Retry->>Deleter: DeleteByID(resourceID) for each resource (ordered)
            Deleter->>ARM: Begin delete & poll until finished
            ARM-->>Deleter: deletion result (success / transient / non-retriable / 429)
            Deleter-->>Retry: per-resource results (errors aggregated)
            alt any non-retriable error
                Retry-->>Cmd: abort immediately with error
            else transient or 429 errors
                Retry->>Retry: wait retryInterval then retry (re-list on next pass)
            end
        end
    end
    Retry-->>Cmd: final status (success or aggregated error)
    Cmd-->>CLI: return exit status
```
🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (2 warnings)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bryan-cox.
Branch updated: 214da2e → def78cf
Codecov Report

❌ Patch coverage is …

```
@@ Coverage Diff @@
##             main    #8454      +/-   ##
==========================================
+ Coverage   37.49%   37.57%   +0.08%
==========================================
  Files         751      751
  Lines       91984    92056      +72
==========================================
+ Hits        34487    34593     +106
+ Misses      54854    54816      -38
- Partials     2643     2647       +4
```

… and 1 file with indirect coverage changes
🧹 Nitpick comments (4)
cmd/infra/azure/destroy_test.go (1)
28-75: 💤 Low value — Good basic coverage; consider adding a test for the ResourceNotFound skip behavior.
The tests cover the success and failure paths. The `deleteClusterResourcesInGroup` function has special handling for `ResourceNotFound` errors (it skips them silently), which isn't covered by these tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy_test.go` around lines 28-75: add a test case for `deleteClusterResourcesInGroup` that verifies `ResourceNotFound` errors are skipped: create a new `t.Run` that uses `mockResourceDeleter` (a `listFunc` returning a resource, as in the existing tests) and a `deleteFunc` that returns the same `ResourceNotFound` error type the code checks for, then call `opts.deleteClusterResourcesInGroup` with a `DestroyInfraOptions` instance and assert that no error is returned and that `deleteFunc` was invoked.

cmd/infra/azure/destroy.go (3)
278-291: 💤 Low value — Consider logging retry attempts with a count for better observability.
The retry loop logs warnings on each failure but doesn't indicate how many attempts have been made or how much time remains. This could help operators understand cleanup progress.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 278 - 291, The retry loop using wait.PollUntilContextCancel around deleteClusterResourcesInGroup should track and log the attempt count (and optionally remaining time) on each retry to improve observability; introduce a counter variable outside the closure, increment it on each invocation where err != nil, and update the logger.Info call inside the closure to include fields like "attempt", attempt, and if infraCtx has a deadline compute and include "remaining" time, so the pollErr/log output shows attempt number and time-left along with the existing error and retry message.
405-411: 💤 Low value — Consider using `sort.Slice` instead of a manual bubble sort.
The current O(n²) bubble sort works, but `sort.Slice` is more idiomatic and efficient Go.
Idiomatic sort implementation:

```diff
+import "sort"

-	// Sort by priority (lower number = delete first)
-	for i := 0; i < len(resources); i++ {
-		for j := i + 1; j < len(resources); j++ {
-			if priority(resources[i].resourceType) > priority(resources[j].resourceType) {
-				resources[i], resources[j] = resources[j], resources[i]
-			}
-		}
-	}
+	sort.Slice(resources, func(i, j int) bool {
+		return priority(resources[i].resourceType) < priority(resources[j].resourceType)
+	})
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 405-411: replace the manual O(n²) bubble sort over the resources slice with Go's idiomatic `sort.Slice`: import "sort" and call `sort.Slice(resources, func(i, j int) bool { return priority(resources[i].resourceType) < priority(resources[j].resourceType) })` so ordering uses the same `priority(...)` comparator; this removes the nested for-loops and yields clearer, more efficient sorting.
263-266: 💤 Low value — A zero grace period defaults to 5 minutes, which may be surprising if the user explicitly sets `--azure-infra-grace-period=0`.
If a user explicitly sets the flag to 0, they might expect immediate failure or no timeout, but they'll get the 5-minute default instead. Consider documenting the default in the flag help text or logging when the default is applied.
Suggested flag description update:

```diff
-	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", opts.AzureInfraGracePeriod, "Timeout for destroying infrastructure resources when using --preserve-resource-group")
+	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", 5*time.Minute, "Timeout for destroying infrastructure resources when using --preserve-resource-group (default 5m)")
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 263-266: the code currently treats any zero value of `o.AzureInfraGracePeriod` as the 5-minute default, which can surprise users who explicitly passed `--azure-infra-grace-period=0`. Update the flag help for `--azure-infra-grace-period` to clearly state the default and how zero is handled, and log when the default is applied so users see that their zero was overridden to 5m. Alternatively, if zero should mean "no timeout" rather than the default, use a nullable/sentinel value (e.g., a pointer or negative duration) instead of zero as the unset indicator, and adjust the `gracePeriod` assignment accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 82747af0-0003-4963-abd6-3db71a25843b
📒 Files selected for processing (2)
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
Branch updated: def78cf → 7a851fb
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster/azure/destroy.go`:
- Line 46: The flag binding for "azure-infra-grace-period" currently uses the
zero-value of opts.AzurePlatform.AzureInfraGracePeriod so the help shows 0s
instead of the documented 5m; set the default to 5 minutes explicitly by either
initializing opts.AzurePlatform.AzureInfraGracePeriod = 5*time.Minute before the
call or by passing a literal 5*time.Minute as the default argument to
cmd.Flags().DurationVar, and ensure the time package is imported if needed so
the help and actual default match the "default 5m" text.
In `@cmd/infra/azure/destroy.go`:
- Around line 311-313: The diagnostic call after detecting errors.Is(pollErr,
context.DeadlineExceeded) currently calls deleter.ListByResourceGroup with the
parent ctx which may be long-lived; replace that call to use a short bounded
context (e.g., ctxShort created via context.WithTimeout(ctx, shortDuration) with
defer cancel()) and pass ctxShort to
deleter.ListByResourceGroup(resourceGroupName) so the post-timeout listing
cannot block indefinitely; ensure you handle cancel and propagate listErr as
before.
- Around line 275-278: The code sets gracePeriod := o.AzureInfraGracePeriod and
only defaults when it equals zero, but negative values should be rejected;
update the validation in the Azure destroy command (where gracePeriod and
o.AzureInfraGracePeriod are used) to check if o.AzureInfraGracePeriod < 0 and
return a user-facing error (e.g., fmt.Errorf or Cobra flag validation) instead
of accepting it, otherwise keep the existing zero->defaultInfraGracePeriod
logic; reference the gracePeriod variable, o.AzureInfraGracePeriod input, and
defaultInfraGracePeriod constant when adding this validation.
ℹ️ Review info
Run ID: 78139657-6dab-4ef3-8650-ca92cb91eeaa
📒 Files selected for processing (4)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
Branch updated: d31e385 → 687b77d
🧹 Nitpick comments (1)
cmd/infra/azure/destroy_test.go (1)
220-259: 💤 Low value — Consider adding `t.Parallel()` to subtests for consistency.
The subtests in `TestRetryDeleteClusterResources` don't have `t.Parallel()`, unlike the other test functions in this file. Since each subtest has its own mock and options, they should be safe to run in parallel.
♻️ Suggested change:

```diff
 t.Run("When DNS zone deletion fails on first pass it should succeed on retry", func(t *testing.T) {
+	t.Parallel()
 	g := NewWithT(t)
```

Apply similarly to the other subtests at lines 261, 290, and 318.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy_test.go` around lines 220 - 259, Add t.Parallel() at the start of each t.Run subtest within TestRetryDeleteClusterResources so the subtests run concurrently like the rest of the file; locate the t.Run blocks in TestRetryDeleteClusterResources and insert t.Parallel() as the first statement inside each subtest closure (the closures that set up the mockResourceDeleter and DestroyInfraOptions and call opts.retryDeleteClusterResources), and do the same for the other sibling subtests in that test.
ℹ️ Review info
Run ID: a0076f33-d239-4dc5-af71-9012bbb034f9
📒 Files selected for processing (4)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/cluster/azure/destroy.go
Branch updated: 687b77d → f216edf
Add retry loop with grace period for deleting cluster-specific resources when using --preserve-resource-group on Azure. This mirrors the AWS destroy path's PollUntilContextCancel pattern to handle async Azure-side deletion of virtualNetworkLinks and DNS zones.

- Add resourceDeleter interface for testable resource deletion
- Wrap deleteClusterResourcesInGroup in wait.PollUntilContextCancel with 10s interval and configurable grace period (default 5m)
- Detect NonRetriable errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)
- Re-list resources on each retry pass to pick up async deletions
- Return joined errors instead of swallowing deletion failures
- Add --azure-infra-grace-period flag for timeout configuration
- Add 6 unit tests covering retry, NonRetriable, timeout, and re-listing behavior

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Branch updated: f216edf → 9b717a0
Summary
- Wrap Azure resource cleanup in `wait.PollUntilContextCancel` when using `--preserve-resource-group`, mirroring the AWS destroy path's retry pattern
- Add a `resourceDeleter` interface for testable resource deletion, with mock-based unit tests
- Detect `NonRetriable` errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)

Test plan
- Unit tests pass (`go test -v ./cmd/infra/azure/...`), including 6 new retry behavior tests
- Commit lint passes (`make run-gitlint`)
- `/test e2e-aks` — verify no regression on Azure E2E
- Manual: `hcp destroy azure --preserve-resource-group` with a DNS zone having active vnet links

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests
AI Assistance