Skip to content

OCPBUGS-85192: Add retry logic for Azure DNS cleanup in hcp destroy#8454

Draft
bryan-cox wants to merge 1 commit intoopenshift:mainfrom
bryan-cox:investigate-azure-cli
Draft

OCPBUGS-85192: Add retry logic for Azure DNS cleanup in hcp destroy#8454
bryan-cox wants to merge 1 commit intoopenshift:mainfrom
bryan-cox:investigate-azure-cli

Conversation

@bryan-cox
Copy link
Copy Markdown
Member

@bryan-cox bryan-cox commented May 6, 2026

Summary

  • Add retry loop with wait.PollUntilContextCancel for Azure resource cleanup when using --preserve-resource-group, mirroring the AWS destroy path's retry pattern
  • Introduce resourceDeleter interface for testable resource deletion with mock-based unit tests
  • Detect NonRetriable errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)
  • Return joined errors instead of silently swallowing deletion failures, and log remaining resources on timeout

Test plan

  • 19 unit tests pass (go test -v ./cmd/infra/azure/...), including 6 new retry behavior tests
  • Gitlint validation passes (make run-gitlint)
  • /test e2e-aks — verify no regression on Azure E2E
  • Manual test: hcp destroy azure --preserve-resource-group with DNS zone having active vnet links

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added --azure-infra-grace-period (default 5m) to control a retrying grace period for Azure infra destruction; negative values rejected, zero disables retries.
    • Preserve-resource-group cleanup now retries transient failures until success or timeout.
  • Bug Fixes

    • Aggregates per-resource deletion errors and stops retries on non-retriable failures for more reliable cleanup.
  • Tests

    • Expanded tests for retries, error handling, re-listing, ordering, and validation.

AI Assistance

Tool: Claude Code 2.1.131
Model: claude-opus-4-6

Skills used:
- personal-claude-skills:tdd@0.1.0 (bryan-cox/personal-claude-skills@cd96dea7) — Test-driven development with vertical slicing
- superpowers:verification-before-completion@5.1.0 (anthropics/claude-plugins-official@a01a135f) — Evidence-based completion verification
- superpowers:finishing-a-development-branch@5.1.0 (anthropics/claude-plugins-official@a01a135f) — Branch completion and PR workflow
- ai-helpers:code-review:pre-commit-review@0.0.7 (openshift-eng/ai-helpers@74735316) — Multi-dimensional code review with Go and HyperShift profile
- ai-helpers:ai-sbom:generate@0.0.1 (openshift-eng/ai-helpers@b56bad09) — AI SBOM declaration generation
- repo:git-commit-format — Conventional commit formatting
- user:restructure-hypershift-commits — Component-based commit restructuring

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 6, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-85192, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Add retry loop with wait.PollUntilContextCancel for Azure resource cleanup when using --preserve-resource-group, mirroring the AWS destroy path's retry pattern
  • Introduce resourceDeleter interface for testable resource deletion with mock-based unit tests
  • Detect NonRetriable errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)
  • Return joined errors instead of silently swallowing deletion failures, and log remaining resources on timeout

Test plan

  • 19 unit tests pass (go test -v ./cmd/infra/azure/...), including 6 new retry behavior tests
  • Gitlint validation passes (make run-gitlint)
  • /test e2e-aks — verify no regression on Azure E2E
  • Manual test: hcp destroy azure --preserve-resource-group with DNS zone having active vnet links

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a retrying cleanup path for Azure infra destruction when --preserve-resource-group is used. Introduces exported DestroyInfraOptions.AzureInfraGracePeriod (CLI flag --azure-infra-grace-period, default DefaultInfraGracePeriod, validated non-negative) and a resourceDeleter abstraction with azureResourceDeleter to list and delete resources. The new retryDeleteClusterResources loop re-lists and retries deletions until success or grace-period timeout, aborts on nonRetriableError, treats HTTP 429 throttling as retryable, and performs a final listing on timeout. deleteClusterResourcesInGroup now returns aggregated errors (errors.Join) for the retry loop to evaluate. Deletion ordering, API-version fallback constant, helper renaming (GetResourceGroupNamegetResourceGroupName), CLI flag wiring, and tests were updated/refactored to support this behavior.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI (user)
    participant Cmd as Destroy Command
    participant Retry as retryDeleteClusterResources
    participant Deleter as resourceDeleter
    participant ARM as Azure Resource Manager

    CLI->>Cmd: invoke destroy with --preserve-resource-group + --azure-infra-grace-period
    Cmd->>Retry: start retry loop (use AzureInfraGracePeriod, retryInterval)
    loop until timeout or success
        Retry->>Deleter: ListByResourceGroup(resourceGroup)
        Deleter->>ARM: Query resources
        ARM-->>Deleter: resources list
        Deleter-->>Retry: resources (filtered by isClusterResource)
        alt no resources
            Retry-->>Cmd: success (stop)
        else resources exist
            Retry->>Deleter: DeleteByID(resourceID) for each resource (ordered)
            Deleter->>ARM: Begin delete & poll until finished
            ARM-->>Deleter: deletion result (success / transient / non-retriable / 429)
            Deleter-->>Retry: per-resource results (errors aggregated)
            alt any non-retriable error
                Retry-->>Cmd: abort immediately with error
            else transient or 429 errors
                Retry->>Retry: wait retryInterval then retry (re-list on next pass)
            end
        end
    end
    Retry-->>Cmd: final status (success or aggregated error)
    Cmd-->>CLI: return exit status
Loading
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title references a Jira issue but does not accurately reflect the main changeset, which introduces a comprehensive retry mechanism for cluster resource cleanup with a grace period configuration, not just DNS cleanup. Update the title to reflect the actual scope: consider 'Add retry logic with grace period for Azure cluster resource cleanup in destroy' or similar to accurately represent the changes across all modified files.
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No Ginkgo test patterns found. PR uses standard Go testing with t.Run(). All test titles are static, descriptive strings with no dynamic values.
Test Structure And Quality ✅ Passed Tests meet all criteria: single responsibility per subtest; proper cleanup with mocks; timeout control via grace period; assertions with helpful context; consistent with codebase patterns.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All tests in destroy_test.go are standard Go unit tests using testing.T. The custom check only applies to Ginkgo e2e tests, making it inapplicable here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds standard Go unit tests only, not Ginkgo e2e tests. The custom check applies only to Ginkgo e2e tests, so it is not applicable here.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only CLI command files for Azure infrastructure destruction. No deployment manifests, operator code, or pod scheduling constraints are introduced. Check is not applicable.
Ote Binary Stdout Contract ✅ Passed No stdout contract violations. All logging uses zap logger (stderr). No fmt.Print/log.Print/klog to stdout. No main/init functions. Proper Cobra setup with SilenceUsage.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds standard Go unit tests (using *testing.T), not Ginkgo e2e tests. Custom check applies only to Ginkgo tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/platform/azure PR/issue for Azure (AzurePlatform) platform and removed do-not-merge/needs-area labels May 6, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2026
@bryan-cox bryan-cox force-pushed the investigate-azure-cli branch from 214da2e to def78cf Compare May 6, 2026 21:48
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-85192, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Add retry loop with wait.PollUntilContextCancel for Azure resource cleanup when using --preserve-resource-group, mirroring the AWS destroy path's retry pattern
  • Introduce resourceDeleter interface for testable resource deletion with mock-based unit tests
  • Detect NonRetriable errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)
  • Return joined errors instead of silently swallowing deletion failures, and log remaining resources on timeout

Test plan

  • 19 unit tests pass (go test -v ./cmd/infra/azure/...), including 6 new retry behavior tests
  • Gitlint validation passes (make run-gitlint)
  • /test e2e-aks — verify no regression on Azure E2E
  • Manual test: hcp destroy azure --preserve-resource-group with DNS zone having active vnet links

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

  • Added grace period configuration option for Azure infrastructure destruction, now exposed as a CLI flag and configuration parameter.

  • Enhanced cluster resource cleanup with retry-driven mechanism supporting configurable retry intervals and improved error aggregation.

  • Tests

  • Added comprehensive test coverage for Azure destroy workflows, including retry behavior, error handling, and resource re-enumeration scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 66.94915% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.57%. Comparing base (b0a10c5) to head (9b717a0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
cmd/infra/azure/destroy.go 68.10% 36 Missing and 1 partial ⚠️
cmd/cluster/azure/destroy.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8454      +/-   ##
==========================================
+ Coverage   37.49%   37.57%   +0.08%     
==========================================
  Files         751      751              
  Lines       91984    92056      +72     
==========================================
+ Hits        34487    34593     +106     
+ Misses      54854    54816      -38     
- Partials     2643     2647       +4     
Files with missing lines Coverage Δ
cmd/cluster/core/destroy.go 8.91% <ø> (ø)
cmd/cluster/azure/destroy.go 17.46% <0.00%> (-0.29%) ⬇️
cmd/infra/azure/destroy.go 46.25% <68.10%> (+24.21%) ⬆️

... and 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 32.90% <66.94%> (+0.27%) ⬆️
cpo-hostedcontrolplane 36.77% <ø> (ø)
cpo-other 37.73% <ø> (ø)
hypershift-operator 47.93% <ø> (ø)
other 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
cmd/infra/azure/destroy_test.go (1)

28-75: 💤 Low value

Good basic coverage, consider adding test for ResourceNotFound skip behavior.

The tests cover success and failure paths. The deleteClusterResourcesInGroup function has special handling for ResourceNotFound errors (skips them silently), which isn't covered by these tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/azure/destroy_test.go` around lines 28 - 75, Add a test case for
deleteClusterResourcesInGroup that verifies ResourceNotFound errors are skipped:
create a new t.Run that uses mockResourceDeleter (listFunc returning a resource
as in existing tests) and a deleteFunc that returns the same ResourceNotFound
error type your code checks for, then call opts.deleteClusterResourcesInGroup
with a DestroyInfraOptions instance and assert no error is returned and that
deleteFunc was invoked; reference the deleteClusterResourcesInGroup function,
mockResourceDeleter, and DestroyInfraOptions to locate and implement the test.
cmd/infra/azure/destroy.go (3)

278-291: 💤 Low value

Consider logging retry attempts with count for better observability.

The retry loop logs warnings on each failure but doesn't indicate how many attempts have been made or how much time remains. This could help operators understand cleanup progress.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/azure/destroy.go` around lines 278 - 291, The retry loop using
wait.PollUntilContextCancel around deleteClusterResourcesInGroup should track
and log the attempt count (and optionally remaining time) on each retry to
improve observability; introduce a counter variable outside the closure,
increment it on each invocation where err != nil, and update the logger.Info
call inside the closure to include fields like "attempt", attempt, and if
infraCtx has a deadline compute and include "remaining" time, so the pollErr/log
output shows attempt number and time-left along with the existing error and
retry message.

405-411: 💤 Low value

Consider using sort.Slice instead of manual bubble sort.

The current O(n²) bubble sort works but sort.Slice is more idiomatic and efficient for Go.

Idiomatic sort implementation
+import "sort"

-	// Sort by priority (lower number = delete first)
-	for i := 0; i < len(resources); i++ {
-		for j := i + 1; j < len(resources); j++ {
-			if priority(resources[i].resourceType) > priority(resources[j].resourceType) {
-				resources[i], resources[j] = resources[j], resources[i]
-			}
-		}
-	}
+	sort.Slice(resources, func(i, j int) bool {
+		return priority(resources[i].resourceType) < priority(resources[j].resourceType)
+	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/azure/destroy.go` around lines 405 - 411, Replace the manual O(n²)
bubble sort loop over the resources slice with Go's idiomatic sort.Slice: import
"sort" and call sort.Slice(resources, func(i, j int) bool { return
priority(resources[i].resourceType) < priority(resources[j].resourceType) }) so
ordering uses the same priority(...) comparator; this removes the nested
for-loops and yields clearer, more efficient sorting for the resources slice.

263-266: 💤 Low value

Zero grace period defaults to 5 minutes, which may be surprising if user explicitly sets --azure-infra-grace-period=0.

If a user explicitly sets the flag to 0, they might expect immediate failure or no timeout, but they'll get the 5-minute default instead. Consider documenting the default in the flag help text or logging when the default is applied.

Suggested flag description update
-	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", opts.AzureInfraGracePeriod, "Timeout for destroying infrastructure resources when using --preserve-resource-group")
+	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", 5*time.Minute, "Timeout for destroying infrastructure resources when using --preserve-resource-group (default 5m)")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/azure/destroy.go` around lines 263 - 266, The code currently treats
any zero value of o.AzureInfraGracePeriod as the 5-minute default which can
surprise users who explicitly passed --azure-infra-grace-period=0; update the
flag help/description for --azure-infra-grace-period to clearly state the
default and how zero is handled, and change the runtime behavior to log when the
default is applied (i.e., in destroy.go where gracePeriod is computed from
o.AzureInfraGracePeriod and the local variable gracePeriod is set) so users see
that their zero was overridden to 5m; alternatively, if you want zero to mean
“no timeout” rather than default, modify the flag parsing to use a nullable/
sentinel value (e.g., pointer or negative duration) instead of using zero as the
unset indicator and adjust the gracePeriod assignment accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cmd/infra/azure/destroy_test.go`:
- Around line 28-75: Add a test case for deleteClusterResourcesInGroup that
verifies ResourceNotFound errors are skipped: create a new t.Run that uses
mockResourceDeleter (listFunc returning a resource as in existing tests) and a
deleteFunc that returns the same ResourceNotFound error type your code checks
for, then call opts.deleteClusterResourcesInGroup with a DestroyInfraOptions
instance and assert no error is returned and that deleteFunc was invoked;
reference the deleteClusterResourcesInGroup function, mockResourceDeleter, and
DestroyInfraOptions to locate and implement the test.

In `@cmd/infra/azure/destroy.go`:
- Around line 278-291: The retry loop using wait.PollUntilContextCancel around
deleteClusterResourcesInGroup should track and log the attempt count (and
optionally remaining time) on each retry to improve observability; introduce a
counter variable outside the closure, increment it on each invocation where err
!= nil, and update the logger.Info call inside the closure to include fields
like "attempt", attempt, and if infraCtx has a deadline compute and include
"remaining" time, so the pollErr/log output shows attempt number and time-left
along with the existing error and retry message.
- Around line 405-411: Replace the manual O(n²) bubble sort loop over the
resources slice with Go's idiomatic sort.Slice: import "sort" and call
sort.Slice(resources, func(i, j int) bool { return
priority(resources[i].resourceType) < priority(resources[j].resourceType) }) so
ordering uses the same priority(...) comparator; this removes the nested
for-loops and yields clearer, more efficient sorting for the resources slice.
- Around line 263-266: The code currently treats any zero value of
o.AzureInfraGracePeriod as the 5-minute default which can surprise users who
explicitly passed --azure-infra-grace-period=0; update the flag help/description
for --azure-infra-grace-period to clearly state the default and how zero is
handled, and change the runtime behavior to log when the default is applied
(i.e., in destroy.go where gracePeriod is computed from o.AzureInfraGracePeriod
and the local variable gracePeriod is set) so users see that their zero was
overridden to 5m; alternatively, if you want zero to mean “no timeout” rather
than default, modify the flag parsing to use a nullable/ sentinel value (e.g.,
pointer or negative duration) instead of using zero as the unset indicator and
adjust the gracePeriod assignment accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 82747af0-0003-4963-abd6-3db71a25843b

📥 Commits

Reviewing files that changed from the base of the PR and between b0a10c5 and def78cf.

📒 Files selected for processing (2)
  • cmd/infra/azure/destroy.go
  • cmd/infra/azure/destroy_test.go

@bryan-cox bryan-cox force-pushed the investigate-azure-cli branch from def78cf to 7a851fb Compare May 6, 2026 22:06
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/cluster/azure/destroy.go`:
- Line 46: The flag binding for "azure-infra-grace-period" currently uses the
zero-value of opts.AzurePlatform.AzureInfraGracePeriod so the help shows 0s
instead of the documented 5m; set the default to 5 minutes explicitly by either
initializing opts.AzurePlatform.AzureInfraGracePeriod = 5*time.Minute before the
call or by passing a literal 5*time.Minute as the default argument to
cmd.Flags().DurationVar, and ensure the time package is imported if needed so
the help and actual default match the "default 5m" text.

In `@cmd/infra/azure/destroy.go`:
- Around line 311-313: The diagnostic call after detecting errors.Is(pollErr,
context.DeadlineExceeded) currently calls deleter.ListByResourceGroup with the
parent ctx which may be long-lived; replace that call to use a short bounded
context (e.g., ctxShort created via context.WithTimeout(ctx, shortDuration) with
defer cancel()) and pass ctxShort to
deleter.ListByResourceGroup(resourceGroupName) so the post-timeout listing
cannot block indefinitely; ensure you handle cancel and propagate listErr as
before.
- Around line 275-278: The code sets gracePeriod := o.AzureInfraGracePeriod and
only defaults when it equals zero, but negative values should be rejected;
update the validation in the Azure destroy command (where gracePeriod and
o.AzureInfraGracePeriod are used) to check if o.AzureInfraGracePeriod < 0 and
return a user-facing error (e.g., fmt.Errorf or Cobra flag validation) instead
of accepting it, otherwise keep the existing zero->defaultInfraGracePeriod
logic; reference the gracePeriod variable, o.AzureInfraGracePeriod input, and
defaultInfraGracePeriod constant when adding this validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 78139657-6dab-4ef3-8650-ca92cb91eeaa

📥 Commits

Reviewing files that changed from the base of the PR and between def78cf and 7a851fb.

📒 Files selected for processing (4)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • cmd/infra/azure/destroy.go
  • cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/cluster/core/destroy.go

Comment thread cmd/cluster/azure/destroy.go Outdated
Comment thread cmd/infra/azure/destroy.go Outdated
Comment thread cmd/infra/azure/destroy.go
@bryan-cox bryan-cox force-pushed the investigate-azure-cli branch 2 times, most recently from d31e385 to 687b77d Compare May 7, 2026 01:37
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cmd/infra/azure/destroy_test.go (1)

220-259: 💤 Low value

Consider adding t.Parallel() to subtests for consistency.

The subtests in TestRetryDeleteClusterResources don't have t.Parallel() unlike other test functions in this file. Since each subtest has its own mock and options, they should be safe to run in parallel.

♻️ Suggested change
 	t.Run("When DNS zone deletion fails on first pass it should succeed on retry", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)

Apply similarly to other subtests at lines 261, 290, and 318.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/azure/destroy_test.go` around lines 220 - 259, Add t.Parallel() at
the start of each t.Run subtest within TestRetryDeleteClusterResources so the
subtests run concurrently like the rest of the file; locate the t.Run blocks in
TestRetryDeleteClusterResources and insert t.Parallel() as the first statement
inside each subtest closure (the closures that set up the mockResourceDeleter
and DestroyInfraOptions and call opts.retryDeleteClusterResources), and do the
same for the other sibling subtests in that test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cmd/infra/azure/destroy_test.go`:
- Around line 220-259: Add t.Parallel() at the start of each t.Run subtest
within TestRetryDeleteClusterResources so the subtests run concurrently like the
rest of the file; locate the t.Run blocks in TestRetryDeleteClusterResources and
insert t.Parallel() as the first statement inside each subtest closure (the
closures that set up the mockResourceDeleter and DestroyInfraOptions and call
opts.retryDeleteClusterResources), and do the same for the other sibling
subtests in that test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a0076f33-d239-4dc5-af71-9012bbb034f9

📥 Commits

Reviewing files that changed from the base of the PR and between d31e385 and 687b77d.

📒 Files selected for processing (4)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • cmd/infra/azure/destroy.go
  • cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/cluster/azure/destroy.go

@bryan-cox bryan-cox force-pushed the investigate-azure-cli branch from 687b77d to f216edf Compare May 7, 2026 01:45
Add retry loop with grace period for deleting cluster-specific
resources when using --preserve-resource-group on Azure. This mirrors
the AWS destroy path's PollUntilContextCancel pattern to handle async
Azure-side deletion of virtualNetworkLinks and DNS zones.

- Add resourceDeleter interface for testable resource deletion
- Wrap deleteClusterResourcesInGroup in wait.PollUntilContextCancel
  with 10s interval and configurable grace period (default 5m)
- Detect NonRetriable errors via structural typing to stop retrying
  on permanent failures (auth errors, invalid requests)
- Re-list resources on each retry pass to pick up async deletions
- Return joined errors instead of swallowing deletion failures
- Add --azure-infra-grace-period flag for timeout configuration
- Add 6 unit tests covering retry, NonRetriable, timeout, and
  re-listing behavior

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/platform/azure PR/issue for Azure (AzurePlatform) platform do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants