feat(prediction-worker): bound Triton fan-out concurrency + transient retry #32

Open
t03i wants to merge 2 commits into main from refactor/bound-prediction-fanout

Conversation

@t03i t03i commented Jun 19, 2026

Owner

Why

Under load, the prediction worker self-DoSes Triton: jobs fail with "All prediction models failed" ("Connection dropped" / "Bandwidth exhausted or memory limit exceeded") while the GPU sits idle. Cause: unbounded gRPC concurrency on a single connection. Each worker holds one Triton client (one HTTP/2 connection), and each of the WORKER_CONCURRENCY=4 jobs fans out all 8 adapters at once via Promise.allSettled, so up to ~32 streams are open simultaneously (expanding to ~80 single-instance model executions). The transport saturates and resets before inference reaches steady state; one RST fails the whole job, and BullMQ's whole-job retry re-stampedes.
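The fan-out arithmetic can be reproduced with a small simulation. This is a sketch only: fakeModelInfer, runJob, and the two constants are hypothetical stand-ins for the real adapters and worker settings, and the 10 ms delay is arbitrary.

```typescript
// Hypothetical stand-ins for the worker settings described above.
const WORKER_CONCURRENCY = 4;
const ADAPTERS_PER_JOB = 8;

let inFlight = 0;
let peak = 0;

// Stand-in for one modelInfer stream: counts itself as in flight,
// waits a little, then finishes.
async function fakeModelInfer(): Promise<void> {
  inFlight += 1;
  peak = Math.max(peak, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  inFlight -= 1;
}

// One job fans out every adapter at once, with no limit.
function runJob() {
  return Promise.allSettled(
    Array.from({ length: ADAPTERS_PER_JOB }, () => fakeModelInfer()),
  );
}

// All jobs start together, as they would under WORKER_CONCURRENCY=4.
const allJobs = Promise.all(
  Array.from({ length: WORKER_CONCURRENCY }, () => runJob()),
);

// Every call has already started by this point, so peak is 4 * 8 = 32.
```

Nothing between the job and the transport limits how many calls start, so the peak always equals jobs times adapters.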

What changed

  • Bounded fan-out: a process-wide FIFO async semaphore (services/prediction-worker/src/semaphore.ts) caps concurrent in-flight modelInfer calls per worker and is shared across all jobs (not per-job). It is constructed once in index.ts and injected through processor.ts → dispatch.ts; the permit wraps only the gRPC call and is released in finally (on success, throw, or timeout).
  • Configurable cap: TRITON_MAX_INFLIGHT_INFERS (typed configField, env-wins, conservative default 8), plus retry tunables TRITON_RETRY_MAX_ATTEMPTS (default 3) and TRITON_RETRY_BASE_BACKOFF_MS (default 100).
  • Transient transport retry (packages/triton-client/src/client.ts): bounded jittered retry that fires only on UNAVAILABLE and transport-signature INTERNAL (bandwidth/parse/connection), never on INVALID_ARGUMENT, NOT_FOUND, or DEADLINE_EXCEEDED. The retry loop lives inside the client call, so it stays within the caller's held permit and never widens concurrency.
  • Channel keepalive: conservative keepalive_time_ms: 30000 / keepalive_timeout_ms: 10000 / keepalive_permit_without_calls: 0 (pings only during active calls, avoiding Triton's ENHANCE_YOUR_CALM enforcement).
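For readers without the diff open, here is a minimal sketch of a FIFO async semaphore with the properties listed above (FIFO hand-off, release in finally, idempotent release). It is not the contents of semaphore.ts: the class shape and the names acquire, run, and availablePermits are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real semaphore.ts may differ in shape and naming.
type Release = () => void;

class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (!Number.isInteger(permits) || permits < 1) {
      throw new RangeError("permits must be a positive integer");
    }
    this.available = permits;
  }

  /** Number of permits not currently held. */
  get availablePermits(): number {
    return this.available;
  }

  /** Resolves with a release function once a permit is granted (FIFO). */
  acquire(): Promise<Release> {
    return new Promise<Release>((resolve) => {
      const grant = () => resolve(this.makeRelease());
      if (this.available > 0) {
        this.available -= 1;
        grant();
      } else {
        this.waiters.push(grant); // oldest waiter is served first
      }
    });
  }

  private makeRelease(): Release {
    let released = false;
    return () => {
      if (released) return; // idempotent: a second call is a no-op
      released = true;
      const next = this.waiters.shift();
      if (next) next(); // hand the permit straight to the oldest waiter
      else this.available += 1;
    };
  }

  /** Runs task under a permit and always releases, even if task throws. */
  async run<T>(task: () => Promise<T>): Promise<T> {
    const release = await this.acquire();
    try {
      return await task();
    } finally {
      release();
    }
  }
}
```

A single instance shared by every job, for example `new Semaphore(8)` wrapping each modelInfer call in `run`, is what makes the cap apply per worker rather than per job. Handing a released permit directly to the next waiter, instead of incrementing the counter first, prevents a late caller from jumping the queue.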

No API/schema change; no behavioral change at low load.
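The retry classification described under "What changed" might look roughly like the sketch below. It is not the code in client.ts. The numeric status values are the standard gRPC codes, but everything else is an assumption: the TRANSPORT_SIGNATURE pattern, the names isTransient and withTransientRetry, the exponential full-jitter backoff, and the reading of "max attempts" as total attempts rather than retries.

```typescript
// Standard gRPC status code values for the classes named in this PR.
const Status = {
  INVALID_ARGUMENT: 3,
  DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

interface GrpcLikeError {
  code: number;
  details: string;
}

// Assumed pattern for "transport-signature" INTERNAL errors.
const TRANSPORT_SIGNATURE = /bandwidth|parse|connection/i;

function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, details } = err as Partial<GrpcLikeError>;
  if (code === Status.UNAVAILABLE) return true;
  return code === Status.INTERNAL && TRANSPORT_SIGNATURE.test(details ?? "");
}

interface RetryOptions {
  maxAttempts: number; // assumed to mean total attempts, including the first
  baseBackoffMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests
}

async function withTransientRetry<T>(
  call: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const random = opts.random ?? Math.random;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Deterministic and deadline errors, or an exhausted budget, surface as-is.
      if (attempt >= opts.maxAttempts || !isTransient(err)) throw err;
      // Assumed scheme: full jitter over an exponentially growing ceiling.
      const ceilingMs = opts.baseBackoffMs * 2 ** (attempt - 1);
      await sleep(random() * ceilingMs);
    }
  }
}
```

Because the loop awaits inside a single call, a caller that holds one semaphore permit keeps exactly one permit across all attempts, which is why retries cannot widen concurrency.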

Tests

  • Semaphore unit tests (FIFO, no-leak, idempotent release).
  • dispatch.test.ts — concurrency never exceeds the bound across simultaneous dispatchAlls; thrown modelInfer leaks no permit; excess calls wait.
  • client.test.ts — retries on transient classes up to the cap; no retry on deterministic/deadline classes; success-after-retry; exhausted retries surface the classified error.
  • Config defaults + override parsing.
  • Gates green: typecheck, lint, format, unit tests, build.
  • E2E: backend-e2e suite (17 tests incl. full prediction pipeline) passes against docker-compose.test.yml with the worker built from source — jobs process cleanly, no transport storm, 0 worker restarts.
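As an illustration of the defaults and override parsing covered by the config tests, the sketch below resolves the three tunables with the environment taking precedence. It does not reproduce the PR's typed configField helper. readPositiveInt, loadTritonTuning, the fileValues parameter, and the reading of "env-wins" as "environment beats any other configured value" are all assumptions; only the variable names and defaults come from the description.

```typescript
type Env = Record<string, string | undefined>;

interface TritonTuning {
  maxInflightInfers: number;
  retryMaxAttempts: number;
  retryBaseBackoffMs: number;
}

// Parses a positive integer from env, or returns fallback when unset or blank.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Assumed precedence: environment variable, then fileValues, then the default.
function loadTritonTuning(
  env: Env,
  fileValues: Partial<TritonTuning> = {},
): TritonTuning {
  return {
    maxInflightInfers: readPositiveInt(
      env,
      "TRITON_MAX_INFLIGHT_INFERS",
      fileValues.maxInflightInfers ?? 8,
    ),
    retryMaxAttempts: readPositiveInt(
      env,
      "TRITON_RETRY_MAX_ATTEMPTS",
      fileValues.retryMaxAttempts ?? 3,
    ),
    retryBaseBackoffMs: readPositiveInt(
      env,
      "TRITON_RETRY_BASE_BACKOFF_MS",
      fileValues.retryBaseBackoffMs ?? 100,
    ),
  };
}
```

In a worker this would be called once at startup with the process environment. Rejecting malformed values at that point, rather than silently falling back, keeps a typo in the cap from quietly restoring the default.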

Follow-up (operational, not in this PR)

  • Load verification on real GPU Triton (no transport storm, GPU busy, retries drop).
  • Tune TRITON_MAX_INFLIGHT_INFERS from that run and record value/rationale in the deploy runbook.

🤖 Generated with Claude Code

t03i and others added 2 commits June 19, 2026 11:13
… retry

Cap concurrent in-flight modelInfer streams per worker via a process-wide
shared semaphore (TRITON_MAX_INFLIGHT_INFERS), so WORKER_CONCURRENCY jobs
draw from one permit pool instead of bursting ~32 streams onto a single
HTTP/2 connection. Add bounded jittered retry in the Triton client on
transient transport errors (UNAVAILABLE, transport-signature INTERNAL) held
inside the caller's permit, plus conservative channel keepalive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backend E2E suite (17 tests incl. full prediction pipeline) passes against
the docker-compose.test.yml stack with the semaphore-bounded prediction
worker built from source: jobs process cleanly, no transport storm, 0
worker restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>