Describe the bug
When a controller's `Reconcile` function returns a `context.Canceled` error, `handleReconcileErr` in `internal/controller/operator/controllers.go` silently discards it and returns `(result, nil)` — which tells controller-runtime not to requeue the item. If the cancellation interrupted reconciliation before completion (e.g., finalizer not yet added, child resources not yet created), the CR is permanently dropped from the work queue and never reconciled again.
This affects all controllers that use `handleReconcileErr`, but we hit it specifically on the VMCluster controller in a VMDistributed deployment where `az2`'s VMCluster CR was created but never reconciled — no finalizer, no `last-applied-spec` annotation, no child StatefulSets/Deployments, no status. The CR existed as a zombie indefinitely.
Root cause
In `internal/controller/operator/controllers.go`, lines 127 to 129 in 8db45b7:

```go
case errors.Is(err, context.Canceled):
	contextCancelErrorsTotal.Inc()
	return originResult, nil // ← returns nil error = no requeue
```
Returning `nil` tells controller-runtime the reconciliation succeeded: the item is removed from the work queue. If the CR was only partially reconciled (e.g., created but finalizer not added), no future event will re-trigger reconciliation unless an external change (annotation, spec update) generates a new watch event.
How to reproduce
- Deploy a `VMDistributed` CR with 2+ zones on a busy operator instance (multiple CI namespaces sharing the same operator)
- The VMDistributed controller creates zone VMCluster CRs sequentially
- If the VMCluster controller's reconciliation of a newly created CR is interrupted by context cancellation (e.g., manager restart, leader election change, or high controller-runtime queue pressure), the CR is permanently orphaned
- The VMDistributed controller enters an infinite loop: it calls `waitForStatus` on the orphaned VMCluster, times out after `ReadyTimeout` (default 5m), requeues, and repeats forever
Observed behavior
- VMCluster CR `az2` existed with `generation: 1` but had:
  - No `apps.victoriametrics.com/finalizer`
  - No `operator.victoriametrics/last-applied-spec` annotation
  - No `.status` (empty)
  - No child workloads (no StatefulSets, Deployments, or Pods)
- The VMCluster controller processed `az2` in other namespaces (staging, other CI branches) but never in the affected namespace
- The VMDistributed controller cycled every ~6 minutes: reconcile `az1` (operational) → wait for `az2` status → 5-minute timeout → requeue
- Workaround: adding an annotation to the VMCluster CR (`kubectl annotate vmcluster az2 operator.victoriametrics.com/trigger-reconcile=$(date -u +%s)`) generated a new watch event, re-enqueued the CR, and the VMCluster controller reconciled it successfully within seconds; all pods came up and status reached `operational`. The operator was also restarted around the same time, so it is unclear which of the two actually resolved the issue.
Expected behavior
A `context.Canceled` error during reconciliation should requeue with backoff, not silently drop the item. The CR should eventually be reconciled to completion.
Suggested fix
```go
case errors.Is(err, context.Canceled):
	contextCancelErrorsTotal.Inc()
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
```
Or alternatively, return the error to let the rate limiter handle requeueing:
```go
case errors.Is(err, context.Canceled):
	contextCancelErrorsTotal.Inc()
	return originResult, err // let controller-runtime requeue with backoff
```
The same pattern applies to `handleReconcileErrWithoutStatus` at line ~198, which has the identical issue.
Environment
- Kubernetes: EKS (AWS)
- Trigger: VMDistributed with 2 zones, operator shared across multiple CI namespaces
- CRD affected: VMCluster (but the bug exists in the shared `handleReconcileErr` used by all controllers)
Additional context
The `handleReconcileErrWithoutStatus` function at L185-200 has the same bug — `context.Canceled` returns `nil` without requeueing. Both should be fixed.
This is particularly impactful for VMDistributed deployments because the VMDistributed controller creates child CRs (VMCluster, VMAgent) and then blocks in `waitForStatus` — if the child CR is dropped from the VMCluster controller's queue, the VMDistributed controller enters a permanent timeout loop with no self-healing path.