
handleReconcileErr swallows context.Canceled without requeueing, permanently dropping CRs from work queue #1962

@thejuan

Describe the bug

When a controller's Reconcile function returns a context.Canceled error, handleReconcileErr in internal/controller/operator/controllers.go silently discards it and returns (result, nil) — which tells controller-runtime not to requeue the item. If the cancellation interrupted reconciliation before completion (e.g., finalizer not yet added, child resources not yet created), the CR is permanently dropped from the work queue and never reconciled again.

This affects all controllers that use handleReconcileErr, but we hit it specifically on the VMCluster controller in a VMDistributed deployment where az2's VMCluster CR was created but never reconciled — no finalizer, no last-applied-spec annotation, no child StatefulSets/Deployments, no status. The CR existed as a zombie indefinitely.

Root cause

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return originResult, nil  // ← returns nil error = no requeue
Returning nil tells controller-runtime the reconciliation succeeded. The item is removed from the work queue. If the CR was only partially reconciled (e.g., created but finalizer not added), no future event will re-trigger reconciliation unless an external change (annotation, spec update) generates a new watch event.

How to reproduce

  1. Deploy a VMDistributed CR with 2+ zones on a busy operator instance (multiple CI namespaces sharing the same operator)
  2. The VMDistributed controller creates zone VMCluster CRs sequentially
  3. If the VMCluster controller's reconciliation of a newly created CR is interrupted by context cancellation (e.g., manager restart, leader election change, or high controller-runtime queue pressure), the CR is permanently orphaned
  4. The VMDistributed controller enters an infinite loop: it tries to waitForStatus on the orphaned VMCluster, times out after ReadyTimeout (default 5m), requeues, and repeats forever

Observed behavior

  • VMCluster CR az2 existed with generation: 1 but:
    • No apps.victoriametrics.com/finalizer
    • No operator.victoriametrics/last-applied-spec annotation
    • No .status (empty)
    • No child workloads (no StatefulSets, Deployments, or Pods)
  • VMCluster controller processed az2 in other namespaces (staging, other CI branches) but never in the affected namespace
  • VMDistributed controller cycled every ~6 minutes: reconcile az1 (operational) → wait for az2 status → 5min timeout → requeue
  • Workaround: Adding an annotation to the VMCluster CR (kubectl annotate vmcluster az2 operator.victoriametrics.com/trigger-reconcile=$(date -u +%s)) generated a new watch event, re-enqueued the CR, and the VMCluster controller reconciled it successfully within seconds; all pods came up and status reached operational
    The operator was also restarted around the same time, so it is unclear which of the two actions resolved the issue

Expected behavior

A context.Canceled error during reconciliation should requeue with backoff, not silently drop the item. The CR should eventually be reconciled to completion.

Suggested fix

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return ctrl.Result{RequeueAfter: 5 * time.Second}, nil

Or alternatively, return the error to let the rate limiter handle requeueing:

case errors.Is(err, context.Canceled):
    contextCancelErrorsTotal.Inc()
    return originResult, err  // let controller-runtime requeue with backoff

The same pattern applies to handleReconcileErrWithoutStatus (around line 198), which has the identical issue.

Environment

  • Kubernetes: EKS (AWS)
  • Trigger: VMDistributed with 2 zones, operator shared across multiple CI namespaces
  • CRD affected: VMCluster (but bug exists in shared handleReconcileErr used by all controllers)

Additional context

The handleReconcileErrWithoutStatus function at L185-200 has the same bug — context.Canceled returns nil without requeueing. Both should be fixed.

This is particularly impactful for VMDistributed deployments because the VMDistributed controller creates child CRs (VMCluster, VMAgent) and then blocks in waitForStatus — if the child CR is dropped from the VMCluster controller's queue, the VMDistributed controller enters a permanent timeout loop with no self-healing path.
