A production-style model inference platform — the serving-infrastructure layer most ML portfolios skip: a versioned model registry, dynamic micro-batching, canary and shadow deployments with instant rollback, load shedding under overload, and Prometheus metrics — plus a load-test harness that measures what each mechanism actually buys over real HTTP.
uvicorn serving.api:app # run the server
serving-loadtest # boot server + measure all four mechanisms
serving-loadtest --requests 2000 --concurrency 128 --json

| mechanism | module | what it buys |
|---|---|---|
| Dynamic micro-batching | serving/batcher.py | Requests queue; a worker greedily drains what's queued (plus up to batch_window_ms for stragglers) and runs one vectorized predict, amortizing the model's fixed per-call cost. |
| Versioned registry | serving/registry.py | Models registered as (name, version) with stable/candidate aliases. Two real trained versions ship (v1 RandomForest-300, v2 HistGradientBoosting) so version comparisons are meaningful. |
| Canary deployment | serving/router.py | A configured fraction of live traffic serves from the candidate; per-version metering judges it. Promote or roll back by repointing stable: no reload, instant. |
| Shadow deployment | serving/router.py | 100% of users get stable; traffic is mirrored to the candidate and its answers compared + logged, never returned. Full production-traffic signal at zero user risk. |
| Load shedding | serving/batcher.py | Bounded queue: past max_queue_depth, requests get an immediate 429 + Retry-After instead of unbounded queueing that collapses p99 for everyone. |
| Metrics | serving/metrics.py | Prometheus text at /metrics: QPS, p50/p95/p99, batch-size histogram, shed count, per-version routing, shadow disagreements. stdlib-only. |

Ops surface: POST /deploy/{model} (candidate / promote / rollback / clear_candidate),
GET /status, GET /healthz.
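
The canary and shadow rows in the table above can be pictured as one routing decision per request. The following is a minimal synchronous sketch, not the code in serving/router.py: the `Router` class, its field names, and its counters are hypothetical, and a real server would presumably run the shadow call off the request path rather than inline.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Router:
    """Hypothetical router: stable by default, optional canary split and shadow mirror."""

    stable: Callable[[Any], Any]
    candidate: Optional[Callable[[Any], Any]] = None
    canary_fraction: float = 0.0   # share of live traffic answered by the candidate
    shadow: bool = False           # mirror stable traffic to the candidate, never return it
    rng: random.Random = field(default_factory=random.Random)
    served: Dict[str, int] = field(default_factory=lambda: {"stable": 0, "candidate": 0})
    shadow_calls: int = 0
    shadow_disagreements: int = 0

    def predict(self, x: Any) -> Any:
        # Canary: a configured fraction of requests is answered by the candidate.
        if self.candidate is not None and self.rng.random() < self.canary_fraction:
            self.served["candidate"] += 1
            return self.candidate(x)

        # Everyone else gets the stable answer.
        self.served["stable"] += 1
        answer = self.stable(x)

        # Shadow: the candidate sees the same input; its answer is only compared and counted.
        if self.candidate is not None and self.shadow:
            self.shadow_calls += 1
            if self.candidate(x) != answer:
                self.shadow_disagreements += 1
        return answer
```

The disagreement rate is then `shadow_disagreements / shadow_calls`, and the `served` counters give the per-version split that a canary would be judged on.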
serving-loadtest — real uvicorn over real HTTP, 800 requests at concurrency 64:
| batching | RPS | p50 | p95 | p99 | mean batch |
|---|---|---|---|---|---|
| off | 121 | 511 ms | 589 ms | 599 ms | 1.0 |
| on | 966 | 50 ms | 147 ms | 163 ms | 30.8 |
One mechanism: 121 → 966 RPS and p99 599 → 163 ms, because ~31 requests share each RandomForest call instead of paying its fixed cost 31 times.
3. Shadow: 400 mirrored calls, 3.0% disagreement between stable (RF) and candidate (HistGB) — the number you'd use to decide whether the candidate is safe to promote.
4. Overload (queue capped at 8): 238 accepted at p99 296 ms, 562 shed with fast 429s — bounded latency for admitted work instead of everyone timing out.
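
The overload behaviour in item 4 comes down to a non-blocking enqueue on a bounded queue. Here is a minimal stdlib-only sketch of that admission check; `admit_or_shed` and the fixed `retry_after_s` value are assumptions for illustration, not names from serving/batcher.py.

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


def admit_or_shed(
    queue: asyncio.Queue, item: Any, retry_after_s: int = 1
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Try to enqueue without waiting.

    Returns None when the item was queued, or (429, headers) when the
    bounded queue is full and the request should be rejected immediately.
    """
    try:
        queue.put_nowait(item)  # never blocks; raises QueueFull at maxsize
    except asyncio.QueueFull:
        return 429, {"Retry-After": str(retry_after_s)}
    return None
```

With `asyncio.Queue(maxsize=8)` and no consumer running, the first 8 offers are queued and every later one gets the 429 tuple. An HTTP handler would turn that tuple into the response status and headers.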
The first load-test run showed batching at 0.45× — it made things worse. Two real causes, both worth knowing:
- The payload was too cheap. The original stable model was a LogisticRegression (~100 µs per predict); an 8 ms batch window only added latency around a call with nothing to amortize. Batching pays when the model call has real fixed cost — swapping stable to a RandomForest-300 (a realistic serving payload) is what unlocked the gain.
- Naive windowing waits even under load. The first batcher awaited the window per batch. The fix is Triton-style adaptive batching: greedily drain whatever is already queued (zero added latency under load), and only wait out the window when the batch is still small — the window matters at low load only.
Same code path, honest measurement first, 0.45× → 8.0×. The lesson generalizes: batching is not free — measure it against your actual payload before turning it on.
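
The greedy-drain fix described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's batcher: the `MicroBatcher` class, the `min_batch` threshold used to decide that a batch "is still small", and the default values are all hypothetical.

```python
import asyncio
from typing import Any, Callable, List, Sequence, Tuple


class MicroBatcher:
    """Illustrative adaptive micro-batcher; every name here is hypothetical."""

    def __init__(
        self,
        predict_batch: Callable[[Sequence[Any]], Sequence[Any]],
        max_batch: int = 32,
        batch_window_ms: float = 8.0,
        min_batch: int = 4,  # assumed threshold below which the window is waited out
    ) -> None:
        self._predict_batch = predict_batch
        self._max_batch = max_batch
        self._window_s = batch_window_ms / 1000.0
        self._min_batch = min_batch
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []

    async def submit(self, item: Any) -> Any:
        """Queue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    def _drain(self, batch: List[Tuple[Any, asyncio.Future]]) -> None:
        # Greedy drain: take whatever is already queued, adding no latency.
        while len(batch) < self._max_batch and not self._queue.empty():
            batch.append(self._queue.get_nowait())

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until there is work
            self._drain(batch)
            if len(batch) < self._min_batch:
                # Low load: wait up to the window for stragglers.
                deadline = loop.time() + self._window_s
                while len(batch) < self._max_batch:
                    remaining = deadline - loop.time()
                    if remaining <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                    except asyncio.TimeoutError:
                        break
                    self._drain(batch)
            await self._execute(batch)

    async def _execute(self, batch: List[Tuple[Any, asyncio.Future]]) -> None:
        self.batch_sizes.append(len(batch))
        items = [item for item, _ in batch]
        try:
            # One vectorized call in a worker thread, so the event loop stays free.
            results = await asyncio.to_thread(self._predict_batch, items)
        except Exception as exc:
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
            return
        for (_, fut), result in zip(batch, results):
            if not fut.done():
                fut.set_result(result)
```

Under load the queue is rarely empty, so the drain alone fills the batch and the window is never waited out; the timeout path only runs when fewer than `min_batch` requests are pending.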
client ──POST /predict/{model}──► Router (stable | canary% | +shadow mirror)
│ version
▼
bounded asyncio queue ──full──► 429 + Retry-After
│
Batcher worker: greedy drain (≤ max_batch)
│ one vectorized predict (thread pool)
▼
futures resolved per request ──► response
METRICS: latency/batch/shed/version counters ──► /metrics
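
As a rough picture of what a stdlib-only /metrics endpoint involves, the sketch below renders a few counters and latency quantiles in the Prometheus text exposition format. The metric names (`serving_requests_total` and so on), the `render_metrics` function, and the nearest-rank percentile are assumptions for illustration; the names actually exported by serving/metrics.py may differ.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 1]; NaN when there are no samples."""
    if not samples:
        return float("nan")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def render_metrics(
    latencies_ms: List[float],
    shed_total: int,
    served_by_version: Dict[str, int],
) -> str:
    """Render counters and a latency summary as Prometheus text."""
    lines = [
        "# TYPE serving_requests_shed_total counter",
        f"serving_requests_shed_total {shed_total}",
        "# TYPE serving_requests_total counter",
    ]
    for version, count in sorted(served_by_version.items()):
        lines.append(f'serving_requests_total{{version="{version}"}} {count}')
    lines.append("# TYPE serving_latency_ms summary")
    for q in (0.5, 0.95, 0.99):
        lines.append(f'serving_latency_ms{{quantile="{q}"}} {percentile(latencies_ms, q)}')
    lines.append(f"serving_latency_ms_sum {sum(latencies_ms)}")
    lines.append(f"serving_latency_ms_count {len(latencies_ms)}")
    return "\n".join(lines) + "\n"
```

Keeping every raw latency sample in a list is only reasonable for a sketch; a long-running server would bound that memory, for example with a fixed-size window or histogram buckets.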
Rollback story: POST /deploy/digits {"action":"rollback","version":"v1"} repoints the stable
alias — the very next request serves v1. No process restart, no model reload.
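
Rollback is instant because the alias is only a pointer to a version that is already loaded. A minimal sketch of that idea follows, assuming an in-memory registry; the `ModelRegistry` class and its method names are hypothetical and are not taken from serving/registry.py.

```python
from typing import Any, Dict, Tuple


class ModelRegistry:
    """Hypothetical registry: loaded models keyed by (name, version), plus aliases."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, str], Any] = {}
        self._aliases: Dict[Tuple[str, str], str] = {}  # (name, alias) -> version

    def register(self, name: str, version: str, model: Any) -> None:
        self._models[(name, version)] = model

    def set_alias(self, name: str, alias: str, version: str) -> None:
        # Refuse to point an alias at a version that was never registered.
        if (name, version) not in self._models:
            raise KeyError(f"unknown model version: {name}/{version}")
        self._aliases[(name, alias)] = version

    def resolve(self, name: str, alias: str = "stable") -> Tuple[str, Any]:
        version = self._aliases[(name, alias)]
        return version, self._models[(name, version)]

    def promote(self, name: str) -> None:
        # The candidate becomes stable; the candidate alias is cleared.
        self._aliases[(name, "stable")] = self._aliases.pop((name, "candidate"))

    def rollback(self, name: str, version: str) -> None:
        # Repoint stable at an already-loaded version; nothing is reloaded.
        self.set_alias(name, "stable", version)


registry = ModelRegistry()
registry.register("digits", "v1", "random-forest placeholder")
registry.register("digits", "v2", "hist-gradient-boosting placeholder")
registry.set_alias("digits", "stable", "v1")
registry.set_alias("digits", "candidate", "v2")
registry.promote("digits")          # stable now resolves to v2
registry.rollback("digits", "v1")   # stable resolves to v1 again
```

Because `resolve` is looked up per request, the next request after `rollback` already sees the old version.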
pip install -e ".[dev]"
pytest -q # 7 passed — registry promote/rollback, batch formation, canary split,
# shadow mirroring, 429+Retry-After shedding, deploy endpoints (in-process ASGI)

FastAPI + uvicorn, asyncio micro-batcher (futures + bounded queue), scikit-learn models as real payloads, httpx async load-test harness, stdlib Prometheus-format metrics. Docker + CI included.
MIT