t03i · t03i · Jun 19, 2026 · Jun 14, 2026 · Jun 14, 2026 · Jun 14, 2026
diff --git a/infra/monitoring/protifer.rules.yml b/infra/monitoring/protifer.rules.yml
@@ -120,6 +120,28 @@ groups:
             engaging (requests_shed_total). Sustained backlog with healthy
             workers means demand exceeds capacity.
 
+      - alert: SheddingPendingResidueLeak
+        expr: |
+          shedding_pending_residues > 0
+          and on()
+          sum(bullmq_queue_jobs{state=~"waiting|active"}) == 0
+        for: 15m
+        labels:
+          severity: warning
+          app: protifer
+        annotations:
+          summary: 'protifer: shedding pending-residues nonzero while queues idle'
+          description: |
+            shedding_pending_residues has stayed above zero for 15m with no
+            waiting or active jobs on any queue. Pending residues should
+            reconcile to zero on the job-cleanup leader sweep once work
+            drains — a stuck nonzero value at idle means admission accounting
+            drifted (a release path missed its decrement). Admission keeps
+            counting this phantom load against the SLO budget, so it will shed
+            real requests too early. Check the api-gateway logs and /metrics
+            (docker compose logs api-gateway on the app-tier host); restarting
+            the gateway resets the in-process counter as a stopgap.
+
       - alert: HighJobLatency
         expr: |
           histogram_quantile(0.95, sum by (le) (rate(bullmq_job_total_duration_seconds_bucket{queue="prediction"}[15m]))) > 1200

diff --git a/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/.openspec.yaml b/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-06-14
diff --git a/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/design.md b/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/design.md
@@ -0,0 +1,102 @@
+## Context
+
+The shedding admission controller estimates wait time as `pendingResidues / throughput` and sheds requests when that exceeds the per-plan SLO. `pendingResidues` is maintained as a free-running Redis counter (`shedding:pending_residues`): `incrementPending` runs at admission for both `/v1/predictions` and `/v1/embeddings` (`middleware/shedding.ts:195`), and `decrementPending` runs only from live BullMQ `QueueEvents` on the embedding queue, processed by a single elected accounting leader (`event-subscriber.ts:79,109`).
+
+`QueueEvents` is live-only with no replay. Any terminal event not observed by the leader at the instant it fires — leader lock handoff (30s TTL, 10s renew), events outpacing a single leader during a burst, a deduped embedding child (deterministic `jobId`), or a job evicted before `getJob` (`removeOnFail: { count: 100 }`) — drops its decrement permanently. The counter only ratchets up. A 5-minute load test drove it to 400,000 and it stayed dead-flat after load ended, pinning `estimated_wait` at its 5-minute cap. The bug is masked today only because `SHED_MODE` defaults to `shadow`.
+
+The existing cleanup module (`cleanup.ts`) already runs a leader-elected periodic sweep that holds both `predictionQueue` and `embeddingQueue`, exposes a `staleChildrenScan` hook and a `reconcileNow()` handle — the natural home for an authoritative recompute.
+
+The second input, `throughput`, is equally wrong. `recordCompletion(residues, durationSeconds)` (`state.ts:79`) feeds the EWMA a sample of `residues / (finishedOn − processedOn)` per terminal **embedding** event — the processing rate of a single job. The system actually drains `WORKER_CONCURRENCY` (4) jobs per worker across both queues in parallel, so the single-job rate under-states true throughput by roughly the concurrency factor. And it is embedding-only, while reconciled `pending` spans both queues — so `estimated_wait = pending / throughput` divides a both-queue residue total by an embedding-only single-job rate. Fixing the leak alone leaves this dimensional mismatch, so `enforce` would still over-shed. Both inputs must be corrected together.
+
+## Goals / Non-Goals
+
+**Goals:**
+
+- Make `pending_residues` self-healing: a missed event must not leak past the next sweep.
+- Derive the authoritative value from real queue state, route-agnostically (fixes the prediction-side blind spot for free).
+- Make `throughput` an aggregate, concurrency-aware drain rate in the same residue units as `pending`, so `estimated_wait` is dimensionally meaningful and reflects how fast work actually leaves the queues.
+- Derive throughput from reliable signals, not the replay-less event stream, so a missed completion no longer corrupts the estimate.
+- Keep admission responsive between sweeps (no regression in decision latency); keep the existing EWMA smoothing and the `decide.ts` formula unchanged — only its inputs change.
+- Detect accounting drift via an alert so this class of bug can never again sit invisible.
+- Make a future `SHED_MODE=enforce` rollout actually safe (both inputs correct, not just one).
+
+**Non-Goals:**
+
+- Reworking load-testing methodology (honest e2e measurement, rate-limit exemption) — deferred.
+- Per-Triton-call / per-job latency metrics (the separate `prediction-latency-observability` change) — not needed for the aggregate residue drain rate here.
+- Changing the SLO thresholds, plan priorities, retry-after jitter, or the shadow/enforce default.
+
+## Decisions
+
+### Decision 1: Reconcile on the cleanup leader sweep, not a new loop
+
+Recompute `pending_residues` inside the existing cleanup sweep, next to `staleChildrenScan`. The sweep is already leader-elected (single writer), already holds both queues, and already runs on an interval — so reconciliation inherits correct concurrency semantics with no new lock or timer.
+
+- **Alternative — dedicated reconciliation loop/leader:** rejected; duplicates leader election and risks two writers racing on the same Redis key.
+- **Alternative — fix event capture (durable stream / ack):** rejected; turning `QueueEvents` into a replayable log is far more complex and still leaves drift on edge cases. Reconciliation makes correctness independent of event delivery.
+
+### Decision 2: Authoritative recompute = sum of residues over waiting/active/waiting-children
+
+Compute pending as the sum of per-job residues across `QUEUE_NAMES.PREDICTION` and `QUEUE_NAMES.EMBEDDING` for jobs in `waiting`, `active`, and `waiting-children`. Residue per job = sequence length, read from job data (matching how `incrementPending` and the subscriber already derive residues). Write the result with an authoritative setter (`setPending`) — an absolute `SET`, not `incrby`/`decrby`.
+
+- Including `active` keeps in-flight work counted; including `waiting-children` counts prediction parents whose embedding child is still running. Excluding terminal states is what drives the value to zero on drain.
+- **Alternative — reconcile only embedding queue:** rejected; that preserves the prediction blind spot.
+
+### Decision 3: Event counter stays as a between-sweeps fast path
+
+Keep `incrementPending`/`decrementPending` so admission reflects sub-sweep-interval activity. Reconciliation overwrites (not adjusts) the value each sweep, so accumulated drift is discarded wholesale. This bounds worst-case staleness to one sweep interval while keeping decisions current.
+
+- **Alternative — remove the event counter, read live each admission:** rejected; summing queue residues on every request is too expensive for the hot path. Sweep-interval reconciliation + cheap incr/decr hint is the right cost balance.
+
+### Decision 4: Leak detector as a metric invariant alert
+
+Add a rule to `infra/monitoring/protifer.rules.yml`: fire when `shedding_pending_residues > 0` is sustained while `bullmq_queue_jobs{state=~"waiting|active"}` is zero. This encodes the invariant "pending must trend to zero when there is no work." `promtool` validates it in CI; `mimirtool rules sync` deploys on `main`.
+
+### Decision 5: Throughput = aggregate drain rate measured on the sweep by conservation
+
+Measure how fast residues actually leave the queues, on the same leader sweep, using flow conservation rather than per-job timing:
+
+```
+departures(Δt) = arrivals(Δt) − (pending_now − pending_prev)
+drain_rate     = max(0, departures) / Δt_seconds        // residues/sec, aggregate
+```
+
+- `arrivals(Δt)` comes from a new **monotonic** `shedding:admitted_residues_total` counter, `INCRBY`-ed synchronously at admission (in both shadow and enforce paths, alongside the existing `incrementPending`). It is reliable because it is in the request path, not the event stream.
+- `pending_now` / `pending_prev` are the authoritative reconciled values from Decision 2 (this sweep and the prior one).
+- The sweep records `drain_rate` as the EWMA sample (reusing the existing `THROUGHPUT_KEY` Lua update), then snapshots `admitted_total` and the timestamp for the next interval's delta.
+
+This is the same architectural thesis as the pending reconciliation: derive the number from reliable, replay-independent signals. It is automatically **concurrency-aware** (it counts real departures, however many workers/slots produced them) and **route-agnostic** (both queues' work flows through the same pending/arrivals bookkeeping), so it is in the same residue units as `pending` — the division in `decide.ts` becomes dimensionally sound. `decide.ts` itself is unchanged.
+
+- **Alternative — count terminal events into a completed-residues counter and rate it:** rejected as the source — it inherits exactly the replay-less `QueueEvents` fragility (and embedding-only scope) this change exists to escape; a missed completion would understate drain and over-shed.
+- **Alternative — multiply the existing single-job rate by a fixed `WORKER_CONCURRENCY × worker_count`:** rejected — the gateway does not know the live worker count, slots are not always full, and it hard-codes a deployment assumption. Measuring real departures needs none of that.
+- **Edge cases:** clamp `departures` at ≥ 0 (snapshot timing can make pending momentarily rise mid-drain); skip the EWMA update when `Δt` is ~0 or the prior snapshot is absent (first sweep after boot); `readState` continues to seed `residuesPerSecondEwma` to `config.initialResiduesPerSecond` until the first valid sample, so cold-start behavior is unchanged.
+
+### Decision 6: Retire the per-job throughput sample; subscriber keeps only `decrementPending`
+
+`recordCompletion` is removed from the throughput path — the event subscriber (`event-subscriber.ts:99`) stops calling it, since the per-job, embedding-only sample is exactly the wrong rate. The subscriber retains only `decrementPending` as the between-sweeps fast-path hint (Decision 3). Throughput is now owned entirely by the sweep (Decision 5).
+
+- This keeps a single writer for the EWMA (the leader sweep), mirroring the single-writer property the pending reconciliation relies on, and removes the last place the replay-less event stream fed a decision input.
+
+## Risks / Trade-offs
+
+- **Reconcile reads many jobs under deep backlog** → keep it observe-and-set (no per-job mutation), reuse the existing sweep cadence, and prefer count/aggregate reads; if job-data reads prove heavy, batch them. The sweep already iterates queue state, so marginal cost is bounded.
+- **Sub-sweep drift window** → between sweeps the fast-path counter can still be wrong, but it self-corrects each sweep and can no longer leak unboundedly. Acceptable; sweep interval is the staleness bound.
+- **Two writers if leadership flaps** → mitigated by reusing the existing single-leader sweep election; the authoritative `SET` is idempotent, so a brief overlap converges rather than corrupts.
+- **Alert false negatives if `bullmq_queue_jobs` lags** → the alert uses a sustained `for:` duration so transient drain-vs-counter skew does not fire; tune the window during rollout.
+- **Drain-rate noise on short/empty intervals** → a sweep with little traffic yields a tiny, noisy `departures`. Mitigated by the existing EWMA smoothing, clamping `departures ≥ 0`, and skipping the update when `Δt` is ~0; the cold-start floor (`initialResiduesPerSecond`) covers the no-sample window.
+- **Sweep cadence couples to estimate freshness** → throughput now updates once per sweep rather than per completion. The EWMA already smooths over multiple samples, so a sweep-cadence sample rate is sufficient for an admission heuristic; if the estimate proves laggy, shorten the sweep, not the design.
+
+## Migration Plan
+
+1. Land reconciliation + `setPending`, the aggregate drain-rate sampler, the admitted-residues counter, and the alert rule behind the existing shadow default (no behavior change for users).
+2. Verify on the next real load test that (a) `shedding_pending_residues` returns to zero after load drains, and (b) `shedding_residues_per_second` reflects realistic aggregate throughput (order-of-`WORKER_CONCURRENCY` higher than the old single-job value) so `shedding_estimated_wait_seconds` tracks observed end-to-end latency rather than pinning.
+3. Confirm the leak-detector alert is green at idle and does not false-fire under load.
+4. Only after (2)–(3) hold is a future `SHED_MODE=enforce` rollout safe — that rollout is a separate operational change.
+
+Rollback: revert the cleanup-sweep reconciliation and drain-rate sampler; restore the `recordCompletion` call in the subscriber. The event-driven counter and per-job throughput resume their prior (leaky, over-estimating) behavior, which is no worse than today's shadow-mode state.
+
+## Open Questions
+
+- Exact reconcile cadence: reuse the cleanup sweep interval as-is, or run reconciliation on a subset of sweeps if job-data reads are costly? Default: reuse as-is, revisit if metrics show cost.
+- Idle definition for the alert (`waiting|active` == 0) vs. also requiring `waiting-children` == 0 — confirm against how `bullmq_queue_jobs` labels are exported.
+- Drain-rate measurement source — **resolved: conservation (`arrivals − Δpending`)**, per Decision 5. Chosen over a completed-residues event counter because it is fully event-independent and unit-consistent by construction.
diff --git a/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/proposal.md b/openspec/changes/archive/2026-06-19-fix-shedding-residue-leak/proposal.md
@@ -0,0 +1,50 @@
+## Why
+
+The admission controller's wait estimate — `estimated_wait = pending_residues / throughput`, shed when it exceeds the per-plan SLO — is wrong on **both** of its inputs, and neither alone is safe to enable.
+
+**`pending_residues` leaks permanently.** It is incremented on every submission but decremented only via live BullMQ `QueueEvents` on the embedding queue, handled by a single elected leader — and `QueueEvents` has no replay, so any missed event (leader handoff, events outpacing the leader during a burst, a deduped embedding child, or a job evicted before `getJob`) leaks residues forever. A single 5-minute load test ramped `shedding_pending_residues` to 400,000 where it stayed dead-flat after load ended, pinning `estimated_wait` indefinitely.
+
+**`throughput` is the wrong rate.** It is an EWMA of `residues / (finishedOn − processedOn)` recorded per terminal embedding event (`state.ts:84`, `event-subscriber.ts:99`) — i.e. the processing rate of a **single** job, measured on the **embedding queue only**. The real system drains `WORKER_CONCURRENCY` jobs in parallel per worker across both queues, so this under-states true throughput by roughly the concurrency factor. Worse, after the leak fix `pending_residues` sums residues across **both** queues while `throughput` still reflects embedding-only single-job service — the ratio is dimensionally inconsistent. The estimate over-states wait and would over-shed.
+
+Today both bugs are masked because `SHED_MODE` defaults to `shadow` (requests still admitted). The moment anyone sets `SHED_MODE=enforce`, the gateway returns 503 to all free/pro users — permanently from the leak, and excessively from the throughput error. Fixing the leak alone does **not** make `enforce` safe, because the wait estimate is still wrong. The two fixes are coupled and must land together; this change does both.
+
+## What Changes
+
+**Accounting (`pending_residues`):**
+
+- Add periodic **reconciliation** of `pending_residues` from actual queue state: on the existing cleanup leader sweep, recompute the counter as the sum of residues over jobs currently in `waiting` / `active` / `waiting-children` across the prediction and embedding queues, and write the authoritative value to Redis.
+- Make reconciliation the **source of truth**; keep the existing increment/decrement only as a fast-path hint between sweeps. Event drift is now self-healing — a missed event is corrected on the next sweep instead of leaking forever.
+- Close the prediction-side blind spot as a side effect: summing real queued work is route-agnostic.
+
+**Estimation (`throughput` / `estimated_wait`):**
+
+- Replace per-job, embedding-only throughput sampling with an **aggregate drain rate** measured on the same leader sweep: residues actually leaving the queues per wall-clock second, in the **same residue units** as `pending_residues`, so the wait estimate is dimensionally consistent and concurrency-aware.
+- Derive the drain rate from **reliable signals** (a synchronous monotonic admitted-residues counter and the reconciled pending value), not from the replay-less event stream — a missed completion event no longer corrupts the estimate. The EWMA smoothing is retained for stability; only its input sample changes.
+- Stop feeding the per-job `recordCompletion` sample into the throughput EWMA; the event subscriber keeps only its fast-path `decrementPending` role.
+
+**Observability:**
+
+- Add a **leak-detector alert** to `infra/monitoring/protifer.rules.yml`: fire when `shedding_pending_residues` stays above zero while the queues are idle — the invariant that pending must trend to zero when there is no work.
+
+Out of scope (deferred to follow-up changes):
+
+- Load-testing methodology rework (honest end-to-end measurement, rate-limit exemption for pipeline tests).
+- Per-Triton-call and per-job latency metrics (the separate `prediction-latency-observability` change). That change improves _human-facing_ latency visibility; it is not required for the aggregate drain-rate estimate here, which needs only residue throughput, not per-model timing.
+- Changing the SLO thresholds, plan priorities, retry-after jitter, or the shadow/enforce default.
+
+## Capabilities
+
+### New Capabilities
+
+- `request-shedding`: Admission-control accounting and wait estimation — how `pending_residues` is maintained (event fast-path plus authoritative leader-sweep reconciliation), how `throughput` / `estimated_wait` is derived (aggregate, concurrency-aware drain rate in units consistent with pending), and the observability invariant that detects accounting drift.
+
+### Modified Capabilities
+
+<!-- None — no existing shedding spec; this change introduces the capability. -->
+
+## Impact
+
+- **Code**: `services/api-gateway/src/cleanup.ts` (add reconciliation + drain-rate sampling to the leader sweep, alongside the existing stale-children scan), `services/api-gateway/src/shedding/state.ts` (add an authoritative `setPending`, a monotonic admitted-residues counter, and a sweep-fed throughput sampler; the per-job `recordCompletion` path is removed from the throughput EWMA), `services/api-gateway/src/shedding/event-subscriber.ts` (drops the `recordCompletion` call; keeps `decrementPending` as a hint), `services/api-gateway/src/shedding/decide.ts` (unchanged formula; now fed correct, unit-consistent inputs). Reconciliation reads residues for jobs across `QUEUE_NAMES.PREDICTION` and `QUEUE_NAMES.EMBEDDING`.
+- **Redis**: `shedding:pending_residues` becomes a leader-managed derived value; a new monotonic `shedding:admitted_residues_total` feeds the drain-rate computation; `shedding:throughput_ewma` is now driven by the sweep, not per-job events.
+- **Monitoring**: new alert rule in `infra/monitoring/protifer.rules.yml` (validated by `promtool` in CI).
+- **Behavior**: no user-facing API change. Shadow-mode metrics (`shedding_pending_residues`, `shedding_residues_per_second`, `shedding_estimated_wait_seconds`) become trustworthy, and a future `SHED_MODE=enforce` rollout no longer over-sheds or carries the permanent-503 landmine.