
handleReconcileErr swallows context.Canceled without requeueing, permanently dropping CRs from work queue #1962

@thejuan

Describe the bug

When a controller's Reconcile function returns a context.Canceled error, handleReconcileErr in internal/controller/operator/controllers.go silently discards it and returns (result, nil) — which tells controller-runtime not to requeue the item. If the cancellation interrupted reconciliation before completion (e.g., finalizer not yet added, child resources not yet created), the CR is permanently dropped from the work queue and never reconciled again.

This affects all controllers that use handleReconcileErr, but we hit it specifically on the VMCluster controller in a VMDistributed deployment where az2's VMCluster CR was created but never reconciled — no finalizer, no last-applied-spec annotation, no child StatefulSets/Deployments, no status. The CR existed as a zombie indefinitely.

Root cause

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return originResult, nil  // ← returns nil error = no requeue
Returning nil tells controller-runtime the reconciliation succeeded. The item is removed from the work queue. If the CR was only partially reconciled (e.g., created but finalizer not added), no future event will re-trigger reconciliation unless an external change (annotation, spec update) generates a new watch event.

How to reproduce

  1. Deploy a VMDistributed CR with 2+ zones on a busy operator instance (multiple CI namespaces sharing the same operator)
  2. The VMDistributed controller creates zone VMCluster CRs sequentially
  3. If the VMCluster controller's reconciliation of a newly created CR is interrupted by context cancellation (e.g., manager restart, leader election change, or high controller-runtime queue pressure), the CR is permanently orphaned
  4. The VMDistributed controller enters an infinite loop: it tries to waitForStatus on the orphaned VMCluster, times out after ReadyTimeout (default 5m), requeues, and repeats forever

Observed behavior

  • VMCluster CR az2 existed with generation: 1 but:
    • No apps.victoriametrics.com/finalizer
    • No operator.victoriametrics/last-applied-spec annotation
    • No .status (empty)
    • No child workloads (no StatefulSets, Deployments, or Pods)
  • VMCluster controller processed az2 in other namespaces (staging, other CI branches) but never in the affected namespace
  • VMDistributed controller cycled every ~6 minutes: reconcile az1 (operational) → wait for az2 status → 5min timeout → requeue
  • Workaround: Adding an annotation to the VMCluster CR (kubectl annotate vmcluster az2 operator.victoriametrics.com/trigger-reconcile=$(date -u +%s)) generated a new watch event, re-enqueued the CR, and the VMCluster controller reconciled it successfully within seconds; all pods came up and status reached operational
    The operator was also restarted around the same time, so it is unclear which of the two actions resolved the issue

Expected behavior

A context.Canceled error during reconciliation should requeue with backoff, not silently drop the item. The CR should eventually be reconciled to completion.

Suggested fix

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return ctrl.Result{RequeueAfter: 5 * time.Second}, nil

Or alternatively, return the error to let the rate limiter handle requeueing:

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return originResult, err  // let controller-runtime requeue with backoff

The same pattern applies to handleReconcileErrWithoutStatus (around line 198), which has the identical issue.

Environment

  • Kubernetes: EKS (AWS)
  • Trigger: VMDistributed with 2 zones, operator shared across multiple CI namespaces
  • CRD affected: VMCluster (but bug exists in shared handleReconcileErr used by all controllers)

Additional context

The handleReconcileErrWithoutStatus function at L185-200 has the same bug — context.Canceled returns nil without requeueing. Both should be fixed.

This is particularly impactful for VMDistributed deployments because the VMDistributed controller creates child CRs (VMCluster, VMAgent) and then blocks in waitForStatus — if the child CR is dropped from the VMCluster controller's queue, the VMDistributed controller enters a permanent timeout loop with no self-healing path.
