fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep#30
Merged
Conversation
…eader sweep pending_residues leaked permanently: incremented at admission, decremented only via replay-less embedding QueueEvents — any missed event leaked forever, pinning the wait estimate. throughput was a per-job, embedding-only rate, dimensionally inconsistent with both-queue pending. - state.ts: add authoritative setPending (absolute SET, clamped >=0), a monotonic incrAdmitted counter, and a sweep-driven sampleThroughput deriving drain rate by conservation (arrivals − Δpending); drop per-job recordCompletion. - cleanup.ts: sum residues over waiting/active/waiting-children across both queues on the leader sweep, setPending + sampleThroughput (observe-and-set). - middleware/shedding.ts: incrAdmitted alongside incrementPending in both paths. - event-subscriber.ts: drop recordCompletion; keep decrementPending as hint. - monitoring: SheddingPendingResidueLeak alert (pending>0 while queues idle). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d counter - state.test.ts: setPending overwrite/clamp; sampleThroughput conservation math, negative-departure clamp, no-snapshot/zero-Δt skips, cold-start floor. - cleanup.test.ts: sumQueueResidues route-agnostic parent+child sum, drains to zero, queried states; leader-sweep overwrites stale counter, samples throughput sweep-side, and survives a reconciliation failure. - event-subscriber.test.ts: terminal event no longer feeds throughput; sweep reconciliation overrides accumulated event drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lete Code, tests, and the leak-detector alert rule are done (tasks 1–5, 6.1). 6.2 (test:int) and 6.3 (load verification) require the docker stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…self-evident key comments The "decrements after completion" E2E asserted last-sample timestamp and EWMA were written on job completion. Throughput is now leader-sweep-driven, so those fields populate on the sweep (60s default), not on the fast-path decrement — assert only the pending decrement here (sampler math is unit-covered). A fast test sweep is not viable: it would trip the sibling cache-hit test via the intentional parent+child double-count. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alert clears The shedding gauges were written only by the request middleware, so at idle they froze at their last request-time value. Load verification exposed the gap: after a drain, shedding_pending_residues stayed pinned high while queues sat empty — the exact condition the SheddingPendingResidueLeak alert fires on, i.e. a false positive at idle. Add a per-instance poller (mirroring queue-depth polling) that refreshes the three gauges from reconciled Redis state every 15s. Verified live: the gauge now converges to 0 within one tick after load stops. Also corrects the two gauge help strings to the post-change aggregate, both-queue semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cadence The staleness guard read `lastCompletionTimestampMs` from the throughput sampler's `last_sample_ms`, which the leader sweep only refreshes once per reconcile interval. With both defaults at 60s, the timestamp's age oscillates 0→60s every sweep while pending>0 under healthy load, so the guard could false-fire UPSTREAM_DOWN and shed everything (including enterprise SLO=0) on a working pipeline. Restore completion-driven liveness: stamp `shedding:last_completion_ms` on each terminal embedding event and read that for the guard, distinct from the sampler's sweep-cadence timestamp. decide.ts is unchanged. Also add deep-overload calibration tests for decideAdmission — the regime the gateway submission rate limiter masks from HTTP load tests — covering the SLO boundary, retry-after growth, and enterprise-vs-pro under sustained overload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…key config
Submission ceilings: add a RATE_LIMIT_SUBMISSIONS_{FREE,PRO,ENTERPRISE} config
section (defaulting to PLAN_LIMITS) and thread it into the submission rate
limiter as an optional per-plan override. Raising the env in a load-test
deployment lets traffic reach the shedding controller instead of bouncing off
the limiter — without a privileged no-throttle plan in prod code. Default
behaviour is unchanged.
ops-key decoupling: the CLI loaded the full gateway config (storage/redis/
triton/model) just to mint an API key, so it failed wherever that infra env was
absent — including the ops-key integration test, which seeded only the auth/db
vars it actually uses. Add a narrow loadOpsKeyConfig (auth/cors/database only)
and point ops-key at it; an operator no longer needs Garage secrets to revoke a
key. Seed the int test's auth env to match the narrow loader.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The staleness-guard fix moved the completion-liveness timestamp out of the throughput hash's `last_sample_ms` into the dedicated `shedding:last_completion_ms` key, but the backend-e2e helpers still seeded/read/reset the old location — so `lastCompletionTimestamp` came back null and the admin-state E2E failed. Point seedSheddingState/readSheddingRedis/resetSheddingState at the new key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rationale already lives on recordCompletion(); the key name is self-evident. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-event redis - rate-limit: drop DEFAULT_SUBMISSIONS_PER_MINUTE (a second PLAN_LIMITS flattening) and fall back to PLAN_LIMITS inline in the limit callback, so the config section is the single source of the per-plan defaults. Param stays optional; behaviour unchanged. - event-subscriber: recordCompletion() and getJob() are independent, so run them with Promise.all instead of serially — one fewer Redis round-trip per drained job on the leader's hot path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ding spec Move the completed change to archive/2026-06-19 and promote its delta into a new request-shedding capability spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Two coupled bugs in the gateway's admission-control (request-shedding) accounting:
pending_residuesleaks permanently. Decrements came only from BullMQQueueEvents, which have no replay — a missed terminal event (worker crash, reconnect, restart) left residues pinned forever.estimated_wait = pending / throughputthen drifts upward and over-sheds.throughputwas dimensionally wrong. It was a per-job, embedding-only completion rate, divided into a both-queue pending total — inconsistent units, and blind to worker concurrency.Fix
pending_residuesfrom real queue state (sumQueueResiduesoverwaiting/active/waiting-childrenon both queues) via an authoritative absolutesetPending. Event-driven inc/decrement stays as a between-sweeps fast-path hint; any drift self-heals each sweep.departures = max(0, arrivals − Δpending),drain_rate = departures / Δt, fed into the existing EWMA. This is concurrency-aware and dimensionally consistent with both-queue pending.SheddingPendingResidueLeak): fires whenshedding_pending_residues > 0is sustained 15m while queues are idle. promtool-validated.Verification
state.test.ts,cleanup.test.ts,event-subscriber.test.ts,metrics.test.ts): conservation math, clamps, skip-cases, route-agnostic residue sum, sweep wiring, drift-override, idle gauge refresh. api-gateway suite 327 ✓.tests/backend-e2e, real Redis/queues/Triton-stub): 17 ✓. Aligned the "decrements after completion" test with sweep-driven throughput (timestamp/EWMA populate on the sweep, not on completion).tests/load/shedding.jsagainst the live stack):shedding_pending_residuesbuilds under load and returns to 0 at idle;shedding_residues_per_secondis sweep-sampled aggregate (≈170–390 r/s, not the cold-start default); estimated-wait tracks pending/rps. Note: deep sustained overload is throttled by the gateway submission rate limiter (429) before the controller saturates, so calibration-depth shedding wasn't exercised via that path.Closes the
fix-shedding-residue-leakOpenSpec change (all tasks complete).🤖 Generated with Claude Code