fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep by t03i · Pull Request #30 · t03i/protifer

t03i · 2026-06-17T11:28:56Z

Problem

Two coupled bugs in the gateway's admission-control (request-shedding) accounting:

pending_residues leaks permanently. Decrements came only from BullMQ QueueEvents, which have no replay — a missed terminal event (worker crash, reconnect, restart) left residues pinned forever. estimated_wait = pending / throughput then drifts upward and over-sheds.
throughput was dimensionally wrong. It was a per-job, embedding-only completion rate, divided into a both-queue pending total — inconsistent units, and blind to worker concurrency.

Fix

Leader-sweep reconciliation. The existing leader-elected cleanup sweep now recomputes pending_residues from real queue state (sumQueueResidues over waiting/active/waiting-children on both queues) via an authoritative absolute setPending. Event-driven inc/decrement stays as a between-sweeps fast-path hint; any drift self-heals each sweep.
Conservation-based aggregate drain rate. A monotonic admitted-residues counter + a prior-sweep snapshot yield departures = max(0, arrivals − Δpending), drain_rate = departures / Δt, fed into the existing EWMA. This is concurrency-aware and dimensionally consistent with both-queue pending.
Leak-detector alert (SheddingPendingResidueLeak): fires when shedding_pending_residues > 0 is sustained 15m while queues are idle. promtool-validated.
Idle gauge refresh (found during load verification): the shedding gauges were written only by the request middleware, so at idle they froze at their last request-time value — which would make the leak alert false-fire after a drain. Added a per-instance poller (mirroring queue-depth polling) that refreshes pending/throughput/wait from reconciled Redis state every 15s.

Verification

Unit (state.test.ts, cleanup.test.ts, event-subscriber.test.ts, metrics.test.ts): conservation math, clamps, skip-cases, route-agnostic residue sum, sweep wiring, drift-override, idle gauge refresh. api-gateway suite 327 ✓.
Integration (tests/backend-e2e, real Redis/queues/Triton-stub): 17 ✓. Aligned the "decrements after completion" test with sweep-driven throughput (timestamp/EWMA populate on the sweep, not on completion).
Load (tests/load/shedding.js against the live stack): shedding_pending_residues builds under load and returns to 0 at idle; shedding_residues_per_second is sweep-sampled aggregate (≈170–390 r/s, not the cold-start default); estimated-wait tracks pending/rps. Note: deep sustained overload is throttled by the gateway submission rate limiter (429) before the controller saturates, so calibration-depth shedding wasn't exercised via that path.

Closes the fix-shedding-residue-leak OpenSpec change (all tasks complete).

🤖 Generated with Claude Code

…eader sweep pending_residues leaked permanently: incremented at admission, decremented only via replay-less embedding QueueEvents — any missed event leaked forever, pinning the wait estimate. throughput was a per-job, embedding-only rate, dimensionally inconsistent with both-queue pending. - state.ts: add authoritative setPending (absolute SET, clamped >=0), a monotonic incrAdmitted counter, and a sweep-driven sampleThroughput deriving drain rate by conservation (arrivals − Δpending); drop per-job recordCompletion. - cleanup.ts: sum residues over waiting/active/waiting-children across both queues on the leader sweep, setPending + sampleThroughput (observe-and-set). - middleware/shedding.ts: incrAdmitted alongside incrementPending in both paths. - event-subscriber.ts: drop recordCompletion; keep decrementPending as hint. - monitoring: SheddingPendingResidueLeak alert (pending>0 while queues idle). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d counter - state.test.ts: setPending overwrite/clamp; sampleThroughput conservation math, negative-departure clamp, no-snapshot/zero-Δt skips, cold-start floor. - cleanup.test.ts: sumQueueResidues route-agnostic parent+child sum, drains to zero, queried states; leader-sweep overwrites stale counter, samples throughput sweep-side, and survives a reconciliation failure. - event-subscriber.test.ts: terminal event no longer feeds throughput; sweep reconciliation overrides accumulated event drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lete Code, tests, and the leak-detector alert rule are done (tasks 1–5, 6.1). 6.2 (test:int) and 6.3 (load verification) require the docker stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…self-evident key comments The "decrements after completion" E2E asserted last-sample timestamp and EWMA were written on job completion. Throughput is now leader-sweep-driven, so those fields populate on the sweep (60s default), not on the fast-path decrement — assert only the pending decrement here (sampler math is unit-covered). A fast test sweep is not viable: it would trip the sibling cache-hit test via the intentional parent+child double-count. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…alert clears The shedding gauges were written only by the request middleware, so at idle they froze at their last request-time value. Load verification exposed the gap: after a drain, shedding_pending_residues stayed pinned high while queues sat empty — the exact condition the SheddingPendingResidueLeak alert fires on, i.e. a false positive at idle. Add a per-instance poller (mirroring queue-depth polling) that refreshes the three gauges from reconciled Redis state every 15s. Verified live: the gauge now converges to 0 within one tick after load stops. Also corrects the two gauge help strings to the post-change aggregate, both-queue semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… cadence The staleness guard read `lastCompletionTimestampMs` from the throughput sampler's `last_sample_ms`, which the leader sweep only refreshes once per reconcile interval. With both defaults at 60s, the timestamp's age oscillates 0→60s every sweep while pending>0 under healthy load, so the guard could false-fire UPSTREAM_DOWN and shed everything (including enterprise SLO=0) on a working pipeline. Restore completion-driven liveness: stamp `shedding:last_completion_ms` on each terminal embedding event and read that for the guard, distinct from the sampler's sweep-cadence timestamp. decide.ts is unchanged. Also add deep-overload calibration tests for decideAdmission — the regime the gateway submission rate limiter masks from HTTP load tests — covering the SLO boundary, retry-after growth, and enterprise-vs-pro under sustained overload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…key config Submission ceilings: add a RATE_LIMIT_SUBMISSIONS_{FREE,PRO,ENTERPRISE} config section (defaulting to PLAN_LIMITS) and thread it into the submission rate limiter as an optional per-plan override. Raising the env in a load-test deployment lets traffic reach the shedding controller instead of bouncing off the limiter — without a privileged no-throttle plan in prod code. Default behaviour is unchanged. ops-key decoupling: the CLI loaded the full gateway config (storage/redis/ triton/model) just to mint an API key, so it failed wherever that infra env was absent — including the ops-key integration test, which seeded only the auth/db vars it actually uses. Add a narrow loadOpsKeyConfig (auth/cors/database only) and point ops-key at it; an operator no longer needs Garage secrets to revoke a key. Seed the int test's auth env to match the narrow loader. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The staleness-guard fix moved the completion-liveness timestamp out of the throughput hash's `last_sample_ms` into the dedicated `shedding:last_completion_ms` key, but the backend-e2e helpers still seeded/read/reset the old location — so `lastCompletionTimestamp` came back null and the admin-state E2E failed. Point seedSheddingState/readSheddingRedis/resetSheddingState at the new key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rationale already lives on recordCompletion(); the key name is self-evident. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-event redis - rate-limit: drop DEFAULT_SUBMISSIONS_PER_MINUTE (a second PLAN_LIMITS flattening) and fall back to PLAN_LIMITS inline in the limit callback, so the config section is the single source of the per-plan defaults. Param stays optional; behaviour unchanged. - event-subscriber: recordCompletion() and getJob() are independent, so run them with Promise.all instead of serially — one fewer Redis round-trip per drained job on the leader's hot path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ding spec Move the completed change to archive/2026-06-19 and promote its delta into a new request-shedding capability spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

t03i and others added 11 commits June 14, 2026 21:27

style(shedding): trim verbose comment on LAST_COMPLETION_KEY

777e4a8

Rationale already lives on recordCompletion(); the key name is self-evident. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(openspec): archive fix-shedding-residue-leak; sync request-shed…

266ed1c

…ding spec Move the completed change to archive/2026-06-19 and promote its delta into a new request-shedding capability spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

t03i merged commit e3d1e55 into main Jun 19, 2026
9 checks passed

t03i deleted the fix/shedding-residue-leak branch June 19, 2026 09:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep#30

fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep#30
t03i merged 11 commits into
mainfrom
fix/shedding-residue-leak

t03i commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

t03i commented Jun 17, 2026

Problem

Fix

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant