OCPBUGS-85192: Add retry logic for Azure DNS cleanup in hcp destroy #8454
bryan-cox wants to merge 1 commit into openshift:main
Conversation
@bryan-cox: This pull request references Jira Issue OCPBUGS-85192, which is valid. The bug has been moved to the POST state. 3 validations were run on this bug. The bug has been updated to refer to the pull request using the external bug tracker.
📝 Walkthrough

Adds a retrying cleanup path for Azure infra destruction when --preserve-resource-group is set.

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI (user)
    participant Cmd as Destroy Command
    participant Retry as retryDeleteClusterResources
    participant Deleter as resourceDeleter
    participant ARM as Azure Resource Manager
    CLI->>Cmd: invoke destroy with --preserve-resource-group + --azure-infra-grace-period
    Cmd->>Retry: start retry loop (use AzureInfraGracePeriod, retryInterval)
    loop until timeout or success
        Retry->>Deleter: ListByResourceGroup(resourceGroup)
        Deleter->>ARM: Query resources
        ARM-->>Deleter: resources list
        Deleter-->>Retry: resources (filtered by isClusterResource)
        alt no resources
            Retry-->>Cmd: success (stop)
        else resources exist
            Retry->>Deleter: DeleteByID(resourceID) for each resource (ordered)
            Deleter->>ARM: Begin delete & poll until finished
            ARM-->>Deleter: deletion result (success / transient / non-retriable / 429)
            Deleter-->>Retry: per-resource results (errors aggregated)
            alt any non-retriable error
                Retry-->>Cmd: abort immediately with error
            else transient or 429 errors
                Retry->>Retry: wait retryInterval then retry (re-list on next pass)
            end
        end
    end
    Retry-->>Cmd: final status (success or aggregated error)
    Cmd-->>CLI: return exit status
```
🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (2 warnings)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bryan-cox.
Branch updated: 214da2e → def78cf
Codecov Report

❌ Patch coverage is …

```
@@ Coverage Diff @@
##             main    #8454      +/-   ##
==========================================
+ Coverage   37.49%   37.57%   +0.08%
==========================================
  Files         751      751
  Lines       91984    92056      +72
==========================================
+ Hits        34487    34593     +106
+ Misses      54854    54816      -38
- Partials     2643     2647       +4
```

… and 1 file with indirect coverage changes
🧹 Nitpick comments (4)
cmd/infra/azure/destroy_test.go (1)
28-75: 💤 Low value — Good basic coverage; consider adding a test for the ResourceNotFound skip behavior.
The tests cover the success and failure paths. The `deleteClusterResourcesInGroup` function has special handling for `ResourceNotFound` errors (it skips them silently), which isn't covered by these tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy_test.go` around lines 28-75: add a test case for `deleteClusterResourcesInGroup` that verifies `ResourceNotFound` errors are skipped: create a new `t.Run` that uses `mockResourceDeleter` (a `listFunc` returning a resource, as in the existing tests) and a `deleteFunc` that returns the same `ResourceNotFound` error type the code checks for, then call `opts.deleteClusterResourcesInGroup` with a `DestroyInfraOptions` instance and assert that no error is returned and that `deleteFunc` was invoked.

cmd/infra/azure/destroy.go (3)
278-291: 💤 Low value — Consider logging retry attempts with a count for better observability.
The retry loop logs warnings on each failure but doesn't indicate how many attempts have been made or how much time remains. This could help operators understand cleanup progress.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 278 - 291, The retry loop using wait.PollUntilContextCancel around deleteClusterResourcesInGroup should track and log the attempt count (and optionally remaining time) on each retry to improve observability; introduce a counter variable outside the closure, increment it on each invocation where err != nil, and update the logger.Info call inside the closure to include fields like "attempt", attempt, and if infraCtx has a deadline compute and include "remaining" time, so the pollErr/log output shows attempt number and time-left along with the existing error and retry message.
405-411: 💤 Low value — Consider using `sort.Slice` instead of a manual bubble sort.
The current O(n²) bubble sort works, but `sort.Slice` is more idiomatic and efficient Go.
Idiomatic sort implementation:

```diff
+import "sort"

-	// Sort by priority (lower number = delete first)
-	for i := 0; i < len(resources); i++ {
-		for j := i + 1; j < len(resources); j++ {
-			if priority(resources[i].resourceType) > priority(resources[j].resourceType) {
-				resources[i], resources[j] = resources[j], resources[i]
-			}
-		}
-	}
+	sort.Slice(resources, func(i, j int) bool {
+		return priority(resources[i].resourceType) < priority(resources[j].resourceType)
+	})
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 405-411: replace the manual O(n²) bubble sort over the resources slice with Go's idiomatic `sort.Slice`: import "sort" and call `sort.Slice(resources, func(i, j int) bool { return priority(resources[i].resourceType) < priority(resources[j].resourceType) })` so ordering uses the same `priority(...)` comparator; this removes the nested for-loops and yields clearer, more efficient sorting.
263-266: 💤 Low value — A zero grace period defaults to 5 minutes, which may be surprising if the user explicitly sets `--azure-infra-grace-period=0`.
If a user explicitly sets the flag to 0, they might expect immediate failure or no timeout, but they'll get the 5-minute default instead. Consider documenting the default in the flag help text or logging when the default is applied.
Suggested flag description update:

```diff
-	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", opts.AzureInfraGracePeriod, "Timeout for destroying infrastructure resources when using --preserve-resource-group")
+	cmd.Flags().DurationVar(&opts.AzureInfraGracePeriod, "azure-infra-grace-period", 5*time.Minute, "Timeout for destroying infrastructure resources when using --preserve-resource-group (default 5m)")
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy.go` around lines 263-266: the code currently treats any zero value of `o.AzureInfraGracePeriod` as the 5-minute default, which can surprise users who explicitly passed `--azure-infra-grace-period=0`. Update the flag help for `--azure-infra-grace-period` to clearly state the default and how zero is handled, and log when the default is applied so users see that their zero was overridden to 5m. Alternatively, if zero should mean "no timeout" rather than the default, use a nullable/sentinel value (e.g., a pointer or negative duration) instead of zero as the unset indicator, and adjust the `gracePeriod` assignment accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 82747af0-0003-4963-abd6-3db71a25843b
📒 Files selected for processing (2)
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
Branch updated: def78cf → 7a851fb
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster/azure/destroy.go`:
- Line 46: The flag binding for "azure-infra-grace-period" currently uses the
zero-value of opts.AzurePlatform.AzureInfraGracePeriod so the help shows 0s
instead of the documented 5m; set the default to 5 minutes explicitly by either
initializing opts.AzurePlatform.AzureInfraGracePeriod = 5*time.Minute before the
call or by passing a literal 5*time.Minute as the default argument to
cmd.Flags().DurationVar, and ensure the time package is imported if needed so
the help and actual default match the "default 5m" text.
In `@cmd/infra/azure/destroy.go`:
- Around line 311-313: The diagnostic call after detecting errors.Is(pollErr,
context.DeadlineExceeded) currently calls deleter.ListByResourceGroup with the
parent ctx which may be long-lived; replace that call to use a short bounded
context (e.g., ctxShort created via context.WithTimeout(ctx, shortDuration) with
defer cancel()) and pass ctxShort to
deleter.ListByResourceGroup(resourceGroupName) so the post-timeout listing
cannot block indefinitely; ensure you handle cancel and propagate listErr as
before.
- Around line 275-278: The code sets gracePeriod := o.AzureInfraGracePeriod and
only defaults when it equals zero, but negative values should be rejected;
update the validation in the Azure destroy command (where gracePeriod and
o.AzureInfraGracePeriod are used) to check if o.AzureInfraGracePeriod < 0 and
return a user-facing error (e.g., fmt.Errorf or Cobra flag validation) instead
of accepting it, otherwise keep the existing zero->defaultInfraGracePeriod
logic; reference the gracePeriod variable, o.AzureInfraGracePeriod input, and
defaultInfraGracePeriod constant when adding this validation.
ℹ️ Review info
Run ID: 78139657-6dab-4ef3-8650-ca92cb91eeaa
📒 Files selected for processing (4)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
Branch updated: d31e385 → 687b77d
🧹 Nitpick comments (1)
cmd/infra/azure/destroy_test.go (1)
220-259: 💤 Low value — Consider adding `t.Parallel()` to subtests for consistency.
The subtests in `TestRetryDeleteClusterResources` don't have `t.Parallel()`, unlike the other test functions in this file. Since each subtest has its own mock and options, they should be safe to run in parallel.
♻️ Suggested change:

```diff
 t.Run("When DNS zone deletion fails on first pass it should succeed on retry", func(t *testing.T) {
+	t.Parallel()
 	g := NewWithT(t)
```

Apply similarly to the other subtests at lines 261, 290, and 318.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/infra/azure/destroy_test.go` around lines 220 - 259, Add t.Parallel() at the start of each t.Run subtest within TestRetryDeleteClusterResources so the subtests run concurrently like the rest of the file; locate the t.Run blocks in TestRetryDeleteClusterResources and insert t.Parallel() as the first statement inside each subtest closure (the closures that set up the mockResourceDeleter and DestroyInfraOptions and call opts.retryDeleteClusterResources), and do the same for the other sibling subtests in that test.
ℹ️ Review info
Run ID: a0076f33-d239-4dc5-af71-9012bbb034f9
📒 Files selected for processing (4)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy.go
- cmd/infra/azure/destroy_test.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/cluster/azure/destroy.go
Branch updated: 687b77d → f216edf
Add retry loop with grace period for deleting cluster-specific resources when using --preserve-resource-group on Azure. This mirrors the AWS destroy path's PollUntilContextCancel pattern to handle async Azure-side deletion of virtualNetworkLinks and DNS zones.

- Add resourceDeleter interface for testable resource deletion
- Wrap deleteClusterResourcesInGroup in wait.PollUntilContextCancel with 10s interval and configurable grace period (default 5m)
- Detect NonRetriable errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)
- Re-list resources on each retry pass to pick up async deletions
- Return joined errors instead of swallowing deletion failures
- Add --azure-infra-grace-period flag for timeout configuration
- Add 6 unit tests covering retry, NonRetriable, timeout, and re-listing behavior

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Branch updated: f216edf → 9b717a0
Summary
- Wrap Azure resource cleanup in `wait.PollUntilContextCancel` when using `--preserve-resource-group`, mirroring the AWS destroy path's retry pattern
- Add a `resourceDeleter` interface for testable resource deletion, with mock-based unit tests
- Detect `NonRetriable` errors via structural typing to stop retrying on permanent failures (auth errors, invalid requests)

Test plan
- Unit tests pass (`go test -v ./cmd/infra/azure/...`), including 6 new retry behavior tests
- Commit lint passes (`make run-gitlint`)
- `/test e2e-aks` — verify no regression on Azure E2E
- Manual: `hcp destroy azure --preserve-resource-group` with a DNS zone having active vnet links

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests
AI Assistance