Skip to content

fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep#30

Merged
t03i merged 11 commits into
mainfrom
fix/shedding-residue-leak
Jun 19, 2026
Merged

fix(shedding): reconcile pending residues + aggregate drain-rate on leader sweep#30
t03i merged 11 commits into
mainfrom
fix/shedding-residue-leak

Conversation

@t03i

@t03i t03i commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Problem

Two coupled bugs in the gateway's admission-control (request-shedding) accounting:

  1. pending_residues leaks permanently. Decrements came only from BullMQ QueueEvents, which have no replay — a missed terminal event (worker crash, reconnect, restart) left residues pinned forever. estimated_wait = pending / throughput then drifts upward and over-sheds.
  2. throughput was dimensionally wrong. It was a per-job, embedding-only completion rate, divided into a both-queue pending total — inconsistent units, and blind to worker concurrency.

Fix

  • Leader-sweep reconciliation. The existing leader-elected cleanup sweep now recomputes pending_residues from real queue state (sumQueueResidues over waiting/active/waiting-children on both queues) via an authoritative absolute setPending. Event-driven inc/decrement stays as a between-sweeps fast-path hint; any drift self-heals each sweep.
  • Conservation-based aggregate drain rate. A monotonic admitted-residues counter + a prior-sweep snapshot yield departures = max(0, arrivals − Δpending), drain_rate = departures / Δt, fed into the existing EWMA. This is concurrency-aware and dimensionally consistent with both-queue pending.
  • Leak-detector alert (SheddingPendingResidueLeak): fires when shedding_pending_residues > 0 is sustained 15m while queues are idle. promtool-validated.
  • Idle gauge refresh (found during load verification): the shedding gauges were written only by the request middleware, so at idle they froze at their last request-time value — which would make the leak alert false-fire after a drain. Added a per-instance poller (mirroring queue-depth polling) that refreshes pending/throughput/wait from reconciled Redis state every 15s.

Verification

  • Unit (state.test.ts, cleanup.test.ts, event-subscriber.test.ts, metrics.test.ts): conservation math, clamps, skip-cases, route-agnostic residue sum, sweep wiring, drift-override, idle gauge refresh. api-gateway suite 327 ✓.
  • Integration (tests/backend-e2e, real Redis/queues/Triton-stub): 17 ✓. Aligned the "decrements after completion" test with sweep-driven throughput (timestamp/EWMA populate on the sweep, not on completion).
  • Load (tests/load/shedding.js against the live stack): shedding_pending_residues builds under load and returns to 0 at idle; shedding_residues_per_second is sweep-sampled aggregate (≈170–390 r/s, not the cold-start default); estimated-wait tracks pending/rps. Note: deep sustained overload is throttled by the gateway submission rate limiter (429) before the controller saturates, so calibration-depth shedding wasn't exercised via that path.

Closes the fix-shedding-residue-leak OpenSpec change (all tasks complete).

🤖 Generated with Claude Code

t03i and others added 11 commits June 14, 2026 21:27
…eader sweep

pending_residues leaked permanently: incremented at admission, decremented
only via replay-less embedding QueueEvents — any missed event leaked forever,
pinning the wait estimate. throughput was a per-job, embedding-only rate,
dimensionally inconsistent with both-queue pending.

- state.ts: add authoritative setPending (absolute SET, clamped >=0), a
  monotonic incrAdmitted counter, and a sweep-driven sampleThroughput deriving
  drain rate by conservation (arrivals − Δpending); drop per-job recordCompletion.
- cleanup.ts: sum residues over waiting/active/waiting-children across both
  queues on the leader sweep, setPending + sampleThroughput (observe-and-set).
- middleware/shedding.ts: incrAdmitted alongside incrementPending in both paths.
- event-subscriber.ts: drop recordCompletion; keep decrementPending as hint.
- monitoring: SheddingPendingResidueLeak alert (pending>0 while queues idle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d counter

- state.test.ts: setPending overwrite/clamp; sampleThroughput conservation
  math, negative-departure clamp, no-snapshot/zero-Δt skips, cold-start floor.
- cleanup.test.ts: sumQueueResidues route-agnostic parent+child sum, drains to
  zero, queried states; leader-sweep overwrites stale counter, samples
  throughput sweep-side, and survives a reconciliation failure.
- event-subscriber.test.ts: terminal event no longer feeds throughput; sweep
  reconciliation overrides accumulated event drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lete

Code, tests, and the leak-detector alert rule are done (tasks 1–5, 6.1).
6.2 (test:int) and 6.3 (load verification) require the docker stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…self-evident key comments

The "decrements after completion" E2E asserted last-sample timestamp and EWMA
were written on job completion. Throughput is now leader-sweep-driven, so those
fields populate on the sweep (60s default), not on the fast-path decrement —
assert only the pending decrement here (sampler math is unit-covered). A fast
test sweep is not viable: it would trip the sibling cache-hit test via the
intentional parent+child double-count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alert clears

The shedding gauges were written only by the request middleware, so at idle they
froze at their last request-time value. Load verification exposed the gap: after a
drain, shedding_pending_residues stayed pinned high while queues sat empty — the
exact condition the SheddingPendingResidueLeak alert fires on, i.e. a false
positive at idle. Add a per-instance poller (mirroring queue-depth polling) that
refreshes the three gauges from reconciled Redis state every 15s. Verified live:
the gauge now converges to 0 within one tick after load stops.

Also corrects the two gauge help strings to the post-change aggregate, both-queue
semantics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cadence

The staleness guard read `lastCompletionTimestampMs` from the throughput
sampler's `last_sample_ms`, which the leader sweep only refreshes once per
reconcile interval. With both defaults at 60s, the timestamp's age oscillates
0→60s every sweep while pending>0 under healthy load, so the guard could
false-fire UPSTREAM_DOWN and shed everything (including enterprise SLO=0) on a
working pipeline.

Restore completion-driven liveness: stamp `shedding:last_completion_ms` on each
terminal embedding event and read that for the guard, distinct from the
sampler's sweep-cadence timestamp. decide.ts is unchanged.

Also add deep-overload calibration tests for decideAdmission — the regime the
gateway submission rate limiter masks from HTTP load tests — covering the SLO
boundary, retry-after growth, and enterprise-vs-pro under sustained overload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…key config

Submission ceilings: add a RATE_LIMIT_SUBMISSIONS_{FREE,PRO,ENTERPRISE} config
section (defaulting to PLAN_LIMITS) and thread it into the submission rate
limiter as an optional per-plan override. Raising the env in a load-test
deployment lets traffic reach the shedding controller instead of bouncing off
the limiter — without a privileged no-throttle plan in prod code. Default
behaviour is unchanged.

ops-key decoupling: the CLI loaded the full gateway config (storage/redis/
triton/model) just to mint an API key, so it failed wherever that infra env was
absent — including the ops-key integration test, which seeded only the auth/db
vars it actually uses. Add a narrow loadOpsKeyConfig (auth/cors/database only)
and point ops-key at it; an operator no longer needs Garage secrets to revoke a
key. Seed the int test's auth env to match the narrow loader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The staleness-guard fix moved the completion-liveness timestamp out of the
throughput hash's `last_sample_ms` into the dedicated `shedding:last_completion_ms`
key, but the backend-e2e helpers still seeded/read/reset the old location — so
`lastCompletionTimestamp` came back null and the admin-state E2E failed. Point
seedSheddingState/readSheddingRedis/resetSheddingState at the new key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rationale already lives on recordCompletion(); the key name is self-evident.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-event redis

- rate-limit: drop DEFAULT_SUBMISSIONS_PER_MINUTE (a second PLAN_LIMITS
  flattening) and fall back to PLAN_LIMITS inline in the limit callback, so the
  config section is the single source of the per-plan defaults. Param stays
  optional; behaviour unchanged.
- event-subscriber: recordCompletion() and getJob() are independent, so run
  them with Promise.all instead of serially — one fewer Redis round-trip per
  drained job on the leader's hot path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ding spec

Move the completed change to archive/2026-06-19 and promote its delta
into a new request-shedding capability spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@t03i t03i merged commit e3d1e55 into main Jun 19, 2026
9 checks passed
@t03i t03i deleted the fix/shedding-residue-leak branch June 19, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant