feat(prediction-worker): bound Triton fan-out concurrency + transient retry by t03i · Pull Request #32 · t03i/protifer

t03i · 2026-06-19T09:21:31Z

Why

Under load, the prediction worker self-DoSes Triton: jobs fail with "All prediction models failed" (Connection dropped / Bandwidth exhausted or memory limit exceeded) while the GPU sits idle. Cause: unbounded gRPC concurrency on a single connection. Each worker holds one Triton client (one HTTP/2 connection). With WORKER_CONCURRENCY=4, each of the 4 concurrent jobs fans out to all 8 adapters at once via Promise.allSettled, giving up to ~32 simultaneous streams (which expand to ~80 single-instance model executions). The transport saturates and resets before inference reaches steady state; a single RST fails the whole job, and BullMQ's whole-job retry then re-stampedes the same connection.
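To make the stream count concrete, here is a minimal sketch of the unbounded fan-out. `fakeInfer`, `runJob`, `runWorker`, and `tracker` are hypothetical names for illustration only; the real worker calls modelInfer on the Triton client, and the 5 ms delay merely simulates latency.

```typescript
// Hypothetical stand-in for modelInfer; it only tracks how many calls overlap.
const tracker = { inFlight: 0, peak: 0 };

async function fakeInfer(): Promise<void> {
  tracker.inFlight += 1;
  tracker.peak = Math.max(tracker.peak, tracker.inFlight);
  try {
    // Simulated network/inference latency.
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
  } finally {
    tracker.inFlight -= 1;
  }
}

// One job fans out to every adapter at once.
async function runJob(adapterCount: number): Promise<void> {
  await Promise.allSettled(
    Array.from({ length: adapterCount }, () => fakeInfer()),
  );
}

// WORKER_CONCURRENCY jobs share one process, so their fan-outs overlap.
function runWorker(workerConcurrency: number, adapterCount: number): Promise<void[]> {
  return Promise.all(
    Array.from({ length: workerConcurrency }, () => runJob(adapterCount)),
  );
}

const allDone = runWorker(4, 8);
// Every call starts before the first one finishes, so the peak is 4 * 8.
allDone.then(() => console.log(`peak in-flight: ${tracker.peak}`)); // prints "peak in-flight: 32"
```

Nothing in this shape limits how many calls reach the shared connection at once, which is the gap the PR closes.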

What changed

Bounded fan-out — a process-wide FIFO async semaphore (services/prediction-worker/src/semaphore.ts) caps concurrent in-flight modelInfer calls per worker, shared across all jobs (not per-job). Constructed once in index.ts, injected through processor.ts → dispatch.ts; the permit wraps only the gRPC call and releases in finally (success, throw, timeout).
Configurable cap — TRITON_MAX_INFLIGHT_INFERS (typed configField, env-wins, conservative default 8), plus retry tunables TRITON_RETRY_MAX_ATTEMPTS (3) / TRITON_RETRY_BASE_BACKOFF_MS (100).
Transient transport retry (packages/triton-client/src/client.ts) — bounded jittered retry firing only on UNAVAILABLE and transport-signature INTERNAL (bandwidth/parse/connection), never on INVALID_ARGUMENT/NOT_FOUND/DEADLINE_EXCEEDED. The retry loop lives inside the client call, so it stays within the caller's held permit and never widens concurrency.
Channel keepalive — conservative keepalive_time_ms: 30000 / keepalive_timeout_ms: 10000 / keepalive_permit_without_calls: 0 (pings only during active calls, avoiding Triton's ENHANCE_YOUR_CALM enforcement).
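The bounded fan-out item above can be sketched roughly as follows. This is not the contents of semaphore.ts; it is a minimal FIFO async semaphore written under the assumptions stated in the PR text (FIFO ordering, idempotent release, release in finally). The class name `FifoSemaphore` and the members `free`, `waiting`, and `withPermit` are hypothetical.

```typescript
type Release = () => void;

// Illustrative FIFO async semaphore; names and shape are assumptions.
class FifoSemaphore {
  private available: number;
  private readonly waiters: Array<(release: Release) => void> = [];

  constructor(maxInFlight: number) {
    if (!Number.isInteger(maxInFlight) || maxInFlight < 1) {
      throw new RangeError("maxInFlight must be a positive integer");
    }
    this.available = maxInFlight;
  }

  get free(): number {
    return this.available;
  }

  get waiting(): number {
    return this.waiters.length;
  }

  // Resolves with a release function once a permit is held.
  acquire(): Promise<Release> {
    if (this.available > 0) {
      this.available -= 1;
      return Promise.resolve(this.makeRelease());
    }
    // No permit free: queue in arrival order.
    return new Promise<Release>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  private makeRelease(): Release {
    let released = false;
    return () => {
      if (released) return; // idempotent: a second call is a no-op
      released = true;
      const next = this.waiters.shift();
      if (next) {
        next(this.makeRelease()); // hand the permit to the oldest waiter
      } else {
        this.available += 1;
      }
    };
  }

  // Holds a permit only for the duration of fn, releasing on success or throw.
  async withPermit<T>(fn: () => Promise<T>): Promise<T> {
    const release = await this.acquire();
    try {
      return await fn();
    } finally {
      release();
    }
  }
}
```

Under this sketch, a dispatch path would construct one instance per process from TRITON_MAX_INFLIGHT_INFERS and wrap only the gRPC call, for example `limiter.withPermit(() => client.modelInfer(request))`, where `limiter`, `client`, and `request` are placeholders.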

No API/schema change; no behavioral change at low load.
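The transient-retry classification described above might look roughly like the sketch below. The numeric status codes are the standard gRPC values; everything else is an assumption for illustration: the `TRANSPORT_HINTS` pattern, the exponential full-jitter formula, and the reading of the attempt cap as total attempts are not confirmed by the PR text, and `isTransient`, `backoffMs`, and `withRetry` are hypothetical names.

```typescript
// Standard gRPC status code values.
const Status = {
  INVALID_ARGUMENT: 3,
  DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

// Assumed transport signatures; the real matcher in client.ts may differ.
const TRANSPORT_HINTS = /bandwidth|parse|connection/i;

function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as { code?: unknown; message?: unknown };
  if (code === Status.UNAVAILABLE) return true;
  // INTERNAL is retried only when the message looks like a transport fault.
  return (
    code === Status.INTERNAL &&
    typeof message === "string" &&
    TRANSPORT_HINTS.test(message)
  );
}

// Assumed policy: full jitter, uniform in [0, baseMs * 2^(attempt - 1)).
function backoffMs(
  attempt: number,
  baseMs: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * 2 ** (attempt - 1));
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3, // assumed to mean total attempts, not extra retries
  baseMs = 100,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await call();
    } catch (err) {
      // Deterministic and deadline errors, or an exhausted budget, surface as-is.
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      await sleep(backoffMs(attempt, baseMs));
    }
  }
}
```

Because the loop awaits inside the call it wraps, a caller that holds a semaphore permit around `withRetry` keeps that single permit for all attempts, which matches the PR's point that retries never widen concurrency.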

Tests

Semaphore unit tests (FIFO, no-leak, idempotent release).
dispatch.test.ts — concurrency never exceeds the bound across simultaneous dispatchAlls; thrown modelInfer leaks no permit; excess calls wait.
client.test.ts — retries on transient classes up to the cap; no retry on deterministic/deadline classes; success-after-retry; exhausted retries surface the classified error.
Config defaults + override parsing.
Gates green: typecheck, lint, format, unit tests, build.
E2E: backend-e2e suite (17 tests incl. full prediction pipeline) passes against docker-compose.test.yml with the worker built from source — jobs process cleanly, no transport storm, 0 worker restarts.

Follow-up (operational, not in this PR)

Load verification on real GPU Triton (no transport storm, GPU busy, retries drop).
Tune TRITON_MAX_INFLIGHT_INFERS from that run and record value/rationale in the deploy runbook.

🤖 Generated with Claude Code

… retry

Cap concurrent in-flight modelInfer streams per worker via a process-wide shared semaphore (TRITON_MAX_INFLIGHT_INFERS), so WORKER_CONCURRENCY jobs draw from one permit pool instead of bursting ~32 streams onto a single HTTP/2 connection. Add bounded jittered retry in the Triton client on transient transport errors (UNAVAILABLE, transport-signature INTERNAL) held inside the caller's permit, plus conservative channel keepalive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Backend E2E suite (17 tests incl. full prediction pipeline) passes against the docker-compose.test.yml stack with the semaphore-bounded prediction worker built from source: jobs process cleanly, no transport storm, 0 worker restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

t03i and others added 2 commits June 19, 2026 11:13

t03i wants to merge 2 commits into main from refactor/bound-prediction-fanout
