## Problem

When a Hindsight worker pod is terminated (restart, deploy, OOM, node eviction), it does not release its claimed tasks in the `async_operations` table. The replacement pod gets a new hostname and will never pick up the old pod's tasks, so the stuck task remains in the `processing` state forever.
## Observed behavior

- Worker `31acaa9db8aa` had a `consolidation` task claimed since Feb 10, 2026 (stuck for 38 days)
- Worker stats logged `global: pending=1 | others: 31acaa9db8aa:1 | my_active: none` every 30 seconds for the entire duration
- The current pod had all 10 slots available but would not pick up the task because it was claimed by another worker
- 7 dead worker pods have historical task claims in the `async_operations` table from previous deployments
## Impact

- Any `file_convert_retain` or `batch_retain` task claimed by a dead worker is permanently lost
- Uploaded documents never get processed: users see documents stuck in `pending` until the client-side polling times out
- `global: pending=N` grows over time with each pod restart
Evidence
2026-03-19 21:50:17,200 - INFO - hindsight_api.worker.poller - [WORKER_STATS]
worker=hindsight-robin-kb-service-hindsight-6f985468d6-h55lr
slots=0/10 (consolidation=0/2) | available=10 (consolidation=2) |
global: pending=1 (schemas: hindsight) | others: 31acaa9db8aa:1 | my_active: none
-- Stuck task in async_operations
SELECT operation_id, operation_type, status, worker_id, claimed_at
FROM hindsight.async_operations
WHERE status = 'processing' AND worker_id = '31acaa9db8aa';
-- Result: consolidation task claimed 2026-02-10, still "processing" 38 days later
## Current workaround

Manual cancellation via the API:

```
DELETE /v1/default/banks/{bank_id}/operations/{stuck_operation_id}
```
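Until a reclaim mechanism exists, the workaround can be scripted. A minimal stdlib-only sketch, assuming a locally reachable API (the `BASE_URL` is a placeholder, not a real endpoint of the deployment; the path is the one shown above):

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your Hindsight deployment

def operation_url(bank_id: str, operation_id: str, base_url: str = BASE_URL) -> str:
    """Build the cancellation URL for a stuck operation."""
    return f"{base_url}/v1/default/banks/{bank_id}/operations/{operation_id}"

def cancel_operation(bank_id: str, operation_id: str) -> int:
    """Issue the manual-cancellation DELETE and return the HTTP status code."""
    req = urllib.request.Request(
        operation_url(bank_id, operation_id), method="DELETE"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

This is only a stopgap: someone still has to notice the stuck operation and find its id before calling it.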
## Suggested fix

Add a stale-task reclaim mechanism, which is standard in production task queues (Celery's `visibility_timeout`, SQS's message visibility timeout, Sidekiq's death handler):

- Option A (simplest): a periodic sweeper (cron job or background task) that runs every 5-10 minutes:

  ```sql
  UPDATE async_operations
  SET status = 'pending', worker_id = NULL, claimed_at = NULL
  WHERE status = 'processing'
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  ```

- Option B (robust): a worker heartbeat table. Workers update their heartbeat every 30 seconds; a sweeper releases all tasks held by any `worker_id` that has not heartbeated in 2 minutes.
- Option C (Kubernetes-aware): on startup, query for tasks claimed by workers that are no longer running and release them.
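To make Option A's semantics concrete, here is an in-memory sketch of the sweeper logic. It is illustrative only: the real implementation would simply run the `UPDATE` above on a schedule, and the `Op` dataclass, field names, and 30-minute threshold are assumptions modeled on the columns shown in the Evidence section:

```python
import datetime as dt
from dataclasses import dataclass
from typing import List, Optional

STALE_AFTER = dt.timedelta(minutes=30)  # assumed threshold, matching the SQL above

@dataclass
class Op:
    """Illustrative stand-in for a row in async_operations."""
    operation_id: str
    status: str
    worker_id: Optional[str]
    claimed_at: Optional[dt.datetime]

def sweep(ops: List[Op], now: dt.datetime,
          stale_after: dt.timedelta = STALE_AFTER) -> List[str]:
    """Release any 'processing' task whose claim is older than stale_after.

    Mirrors the Option A UPDATE: reset status to 'pending', clear
    worker_id and claimed_at, and report which operations were reclaimed.
    """
    reclaimed = []
    for op in ops:
        if (op.status == "processing"
                and op.claimed_at is not None
                and now - op.claimed_at > stale_after):
            op.status, op.worker_id, op.claimed_at = "pending", None, None
            reclaimed.append(op.operation_id)
    return reclaimed
```

Run on a 5-10 minute schedule, reclaimed tasks simply reappear as `pending` and get picked up by any live worker's normal polling loop; the threshold must exceed the longest legitimate task runtime, or healthy tasks will be double-executed.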
## Environment

- Hindsight deployed as a single-pod StatefulSet in Kubernetes
- `hindsight-client` SDK v0.4.14
- PostgreSQL-backed `async_operations` task queue