feat(prediction-worker): bound Triton fan-out concurrency + transient retry #32
Open
t03i wants to merge 2 commits into
… retry Cap concurrent in-flight modelInfer streams per worker via a process-wide shared semaphore (TRITON_MAX_INFLIGHT_INFERS), so WORKER_CONCURRENCY jobs draw from one permit pool instead of bursting ~32 streams onto a single HTTP/2 connection. Add bounded jittered retry in the Triton client on transient transport errors (UNAVAILABLE, transport-signature INTERNAL) held inside the caller's permit, plus conservative channel keepalive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backend E2E suite (17 tests incl. full prediction pipeline) passes against the docker-compose.test.yml stack with the semaphore-bounded prediction worker built from source: jobs process cleanly, no transport storm, 0 worker restarts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
Under load the prediction worker self-DoSes Triton: jobs fail with "All prediction models failed — Connection dropped / Bandwidth exhausted or memory limit exceeded" while the GPU sits idle. Cause: unbounded gRPC concurrency on a single connection. Each worker holds one Triton client (one HTTP/2 connection); WORKER_CONCURRENCY=4 jobs each fan out all 8 adapters at once via Promise.allSettled → up to ~32 simultaneous streams (expanding to ~80 single-instance model executions). The transport saturates and resets before steady-state inference; one RST fails the whole job, and BullMQ whole-job retry re-stampedes.

What changed
- Shared semaphore (services/prediction-worker/src/semaphore.ts) caps concurrent in-flight modelInfer calls per worker, shared across all jobs (not per-job). Constructed once in index.ts, injected through processor.ts → dispatch.ts; the permit wraps only the gRPC call and releases in finally (success, throw, timeout).
- Config: TRITON_MAX_INFLIGHT_INFERS (typed configField, env-wins, conservative default 8), plus retry tunables TRITON_RETRY_MAX_ATTEMPTS (3) / TRITON_RETRY_BASE_BACKOFF_MS (100).
- Triton client retry (packages/triton-client/src/client.ts): bounded jittered retry firing only on UNAVAILABLE and transport-signature INTERNAL (bandwidth/parse/connection), never on INVALID_ARGUMENT / NOT_FOUND / DEADLINE_EXCEEDED. The retry loop lives inside the client call, so it stays within the caller's held permit and never widens concurrency.
- Channel keepalive: keepalive_time_ms: 30000 / keepalive_timeout_ms: 10000 / keepalive_permit_without_calls: 0 (pings only during active calls, avoiding Triton's ENHANCE_YOUR_CALM enforcement).

No API/schema change; no behavioral change at low load.
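For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of how a shared permit pool and an in-permit retry loop can compose. It is an illustration under assumptions, not the code in this PR: the names Semaphore, isTransient, backoffMs, inferWithRetry, and boundedInfer are hypothetical, the transport-signature regex is a guess based on the error text quoted above, and full jitter is one possible jitter strategy.

```typescript
// Illustrative sketch only; all names here are hypothetical, not the PR's code.

/** Counting semaphore: one instance is shared by every job in the process. */
class Semaphore {
  readonly limit: number;
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.limit = limit;
    this.available = limit;
  }

  get inFlight(): number { return this.limit - this.available; }
  get waiting(): number { return this.waiters.length; }

  acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return Promise.resolve();
    }
    // No permit free: queue FIFO until release() hands one over.
    return new Promise<void>((resolve) => { this.waiters.push(resolve); });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // the permit passes straight to the oldest waiter
    else if (this.available < this.limit) this.available += 1;
  }

  /** Runs fn under a permit; finally releases on success, throw, or timeout. */
  async withPermit<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try { return await fn(); } finally { this.release(); }
  }
}

// Numeric gRPC status codes.
const INTERNAL = 13;
const UNAVAILABLE = 14;

interface RpcError { code: number; message: string; }

// Assumed message signatures for transport-level INTERNAL errors.
const TRANSPORT_SIGNATURE = /bandwidth exhausted|connection dropped|parse/i;

function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as Partial<RpcError>;
  if (code === UNAVAILABLE) return true;
  return code === INTERNAL && typeof message === "string"
    && TRANSPORT_SIGNATURE.test(message);
}

/** Full jitter: uniform in [0, base * 2^(attempt-1)). */
function backoffMs(attempt: number, baseMs: number,
                   rand: () => number = Math.random): number {
  return Math.floor(rand() * baseMs * Math.pow(2, attempt - 1));
}

async function inferWithRetry<T>(call: () => Promise<T>,
                                 maxAttempts = 3, baseMs = 100): Promise<T> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await call();
    } catch (err) {
      // Deterministic and deadline errors, or an exhausted budget, surface as-is.
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      await new Promise<void>((r) => setTimeout(r, backoffMs(attempt, baseMs)));
    }
  }
}

/** Retries run inside the held permit, so they never add concurrency. */
function boundedInfer<T>(sem: Semaphore, call: () => Promise<T>): Promise<T> {
  return sem.withPermit(() => inferWithRetry(call));
}
```

With a single `new Semaphore(8)` created at startup and passed to every job, each adapter call would go through `boundedInfer(sem, () => client.modelInfer(request))`, where `client` and `request` stand in for the real Triton client and payload. A retrying call keeps its permit while it backs off, which trades some throughput for a hard ceiling on simultaneous streams.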
Tests
- dispatch.test.ts: concurrency never exceeds the bound across simultaneous dispatchAlls; thrown modelInfer leaks no permit; excess calls wait.
- client.test.ts: retries on transient classes up to the cap; no retry on deterministic/deadline classes; success-after-retry; exhausted retries surface the classified error.
- Backend E2E suite against docker-compose.test.yml with the worker built from source: jobs process cleanly, no transport storm, 0 worker restarts.

Follow-up (operational, not in this PR)
TRITON_MAX_INFLIGHT_INFERS from that run and record value/rationale in the deploy runbook.

🤖 Generated with Claude Code