tkarim45/model-serving

model-serving

A production-style model inference platform — the serving-infrastructure layer most ML portfolios skip: a versioned model registry, dynamic micro-batching, canary and shadow deployments with instant rollback, load shedding under overload, and Prometheus metrics — plus a load-test harness that measures what each mechanism actually buys over real HTTP.

uvicorn serving.api:app                # run the server
serving-loadtest                       # boot server + measure all four mechanisms
serving-loadtest --requests 2000 --concurrency 128 --json

What it does

| Mechanism | Module | What it buys |
| --- | --- | --- |
| Dynamic micro-batching | serving/batcher.py | Requests queue; a worker greedily drains what's queued (plus up to batch_window_ms for stragglers) and runs one vectorized predict, amortizing the model's fixed per-call cost. |
| Versioned registry | serving/registry.py | Models registered as (name, version) with stable/candidate aliases. Two real trained versions ship (v1 RandomForest-300, v2 HistGradientBoosting) so version comparisons are meaningful. |
| Canary deployment | serving/router.py | A configured fraction of live traffic serves from the candidate; per-version metering judges it. Promote or roll back by repointing stable: no reload, instant. |
| Shadow deployment | serving/router.py | 100% of users get stable; traffic is mirrored to the candidate and its answers compared + logged, never returned. Full production-traffic signal at zero user risk. |
| Load shedding | serving/batcher.py | Bounded queue: past max_queue_depth, requests get an immediate 429 + Retry-After instead of unbounded queueing that collapses p99 for everyone. |
| Metrics | serving/metrics.py | Prometheus text at /metrics: QPS, p50/p95/p99, batch-size histogram, shed count, per-version routing, shadow disagreements. stdlib-only. |

Ops surface: POST /deploy/{model} (actions: candidate / promote / rollback / clear_candidate), GET /status, GET /healthz.
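As an illustration of the Metrics row, here is a minimal stdlib-only sketch of rendering latency percentiles and a shed counter in the Prometheus text format. It is not the code in serving/metrics.py: the metric names (serving_requests_shed_total, serving_latency_ms), the helper names, and the nearest-rank percentile method are all assumptions made for the example.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, so no float rounding is involved.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def render_metrics(latencies_ms, shed_total):
    """Build a Prometheus text-format payload (hypothetical metric names)."""
    lines = [
        "# TYPE serving_requests_shed_total counter",
        f"serving_requests_shed_total {shed_total}",
        "# TYPE serving_latency_ms gauge",
    ]
    for q in (50, 95, 99):
        value = percentile(latencies_ms, q)
        lines.append(f'serving_latency_ms{{quantile="{q / 100}"}} {value}')
    return "\n".join(lines) + "\n"


# Synthetic latencies of 1..100 ms, purely for demonstration.
text = render_metrics(list(range(1, 101)), shed_total=562)
print(text)
```

Each sample line is just a metric name, an optional label set in braces, and a value, which is why the format is easy to emit without a client library.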

Measured results

serving-loadtest — real uvicorn over real HTTP, 800 requests at concurrency 64:

1. Dynamic batching: 8.0× throughput

| Batching | RPS | p50 | p95 | p99 | Mean batch |
| --- | --- | --- | --- | --- | --- |
| off | 121 | 511 ms | 589 ms | 599 ms | 1.0 |
| on | 966 | 50 ms | 147 ms | 163 ms | 30.8 |

One mechanism: 121 → 966 RPS and p99 599 → 163 ms, because ~31 requests share each RandomForest call instead of paying its fixed cost 31 times.
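A quick sanity check on these numbers uses Little's law (in-flight requests = throughput × mean latency). The sketch below assumes the harness keeps all 64 connections busy for the whole run, which the README does not state explicitly; the function name is made up for the example.

```python
def littles_law_latency_ms(concurrency, rps):
    # Little's law rearranged: mean latency = in-flight requests / throughput.
    return 1000.0 * concurrency / rps


CONCURRENCY = 64  # from the load-test description
off_ms = littles_law_latency_ms(CONCURRENCY, 121)  # batching off
on_ms = littles_law_latency_ms(CONCURRENCY, 966)   # batching on
speedup = 966 / 121

print(f"off: {off_ms:.0f} ms, on: {on_ms:.0f} ms, speedup: {speedup:.2f}x")
# prints "off: 529 ms, on: 66 ms, speedup: 7.98x"
```

The implied mean latencies (about 529 ms and 66 ms) sit close to the measured p50 values of 511 ms and 50 ms, which is what a closed-loop test at fixed concurrency should look like.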

2. Canary: configured 10% → observed 11.0% (800 requests: v1 712 / v2 88)
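Whether 11.0% is "close enough" to 10% depends on sample size. The sketch below assumes the router picks the candidate independently at random per request (the README does not say whether the split is random or hash-based); pick_version and expected_spread are hypothetical names.

```python
import math
import random


def pick_version(canary_fraction, rng):
    # Each request independently goes to the candidate with this probability.
    return "v2" if rng.random() < canary_fraction else "v1"


def expected_spread(n, p):
    # Mean and standard deviation of the candidate count (binomial model).
    return n * p, math.sqrt(n * p * (1 - p))


rng = random.Random(0)
counts = {"v1": 0, "v2": 0}
for _ in range(800):
    counts[pick_version(0.10, rng)] += 1

mean, sd = expected_spread(800, 0.10)
z_observed = (88 - mean) / sd  # 88 is the candidate count reported above
print(counts, f"expected {mean:.0f} +/- {sd:.1f}, observed z = {z_observed:.2f}")
```

Under that assumption the expected candidate count is 80 with a standard deviation of about 8.5, so the observed 88 is within one standard deviation of the configured split.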

3. Shadow: 400 mirrored calls, 3.0% disagreement between stable (RF) and candidate (HistGB) — the number you'd use to decide whether the candidate is safe to promote.
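The shadow pattern itself is small. This is a simplified, synchronous sketch with stand-in models; serve_with_shadow and the toy stable/candidate functions are assumptions, and a real server would presumably mirror the call off the request path rather than inline.

```python
def serve_with_shadow(x, stable, candidate, agreements):
    answer = stable(x)  # the only value the caller ever receives
    try:
        agreements.append(candidate(x) == answer)
    except Exception:
        # Assumption: a crashing candidate is logged as a disagreement
        # and never affects the user-facing answer.
        agreements.append(False)
    return answer


def disagreement_rate(agreements):
    return agreements.count(False) / len(agreements) if agreements else 0.0


# Toy models: the candidate differs from stable on inputs 25 and 75 only.
def stable(x):
    return x % 10


def candidate(x):
    return 0 if x % 25 == 0 else x % 10


agreements = []
answers = [serve_with_shadow(x, stable, candidate, agreements) for x in range(100)]
rate = disagreement_rate(agreements)
print(f"disagreement rate: {rate:.1%}")
```

The key property is that the candidate's output only ever reaches the log. At the 3.0% rate reported above, 400 mirrored calls would correspond to 12 disagreements.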

4. Overload (queue capped at 8): 238 accepted at p99 296 ms, 562 shed with fast 429s — bounded latency for admitted work instead of everyone timing out.
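The shedding decision can be sketched with a bounded asyncio.Queue. This is an illustration only: try_admit and the Retry-After value of 1 second are assumptions, and no consumer drains the queue here, so everything past the cap is rejected.

```python
import asyncio

MAX_QUEUE_DEPTH = 8  # matches the "queue capped at 8" run above
RETRY_AFTER_S = 1    # assumed hint; the value the server sends is not stated


def try_admit(queue, item):
    """Return None if queued, else a (status, headers) rejection."""
    try:
        queue.put_nowait(item)  # never blocks; raises QueueFull at capacity
    except asyncio.QueueFull:
        return 429, {"Retry-After": str(RETRY_AFTER_S)}
    return None


async def demo(n_requests=20):
    queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)
    outcomes = [try_admit(queue, i) for i in range(n_requests)]
    admitted = sum(1 for o in outcomes if o is None)
    shed = [o for o in outcomes if o is not None]
    return admitted, shed


admitted, shed = asyncio.run(demo())
print(admitted, len(shed))  # prints "8 12"
```

Using put_nowait rather than an awaited put is what makes the rejection immediate: a full queue costs the caller one exception, not a wait.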

The honest finding: batching hurt until the workload was realistic

The first load-test run showed batching at 0.45× — it made things worse. Two real causes, both worth knowing:

  1. The payload was too cheap. The original stable model was a LogisticRegression (~100 µs per predict); an 8 ms batch window only added latency around a call with nothing to amortize. Batching pays when the model call has real fixed cost — swapping stable to a RandomForest-300 (a realistic serving payload) is what unlocked the gain.
  2. Naive windowing waits even under load. The first batcher awaited the window per batch. The fix is Triton-style adaptive batching: greedily drain whatever is already queued (zero added latency under load), and only wait out the window when the batch is still small — the window matters at low load only.

Measuring honestly first, then applying both fixes, took the same benchmark from 0.45× to 8.0×. The lesson generalizes: batching is not free, so measure it against your actual payload before turning it on.
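The greedy-drain idea from the second point can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in serving/batcher.py: MAX_BATCH, MIN_BATCH (the threshold for "still small"), the doubling predict_batch stand-in, and all function names are hypothetical.

```python
import asyncio

MAX_BATCH = 32          # assumed cap on batch size
BATCH_WINDOW_S = 0.008  # the 8 ms window mentioned above
MIN_BATCH = 4           # assumed threshold below which we wait for stragglers


def predict_batch(inputs):
    # Stand-in for one vectorized model call.
    return [x * 2 for x in inputs]


async def batch_worker(queue, batch_sizes):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until one request arrives
        # Greedy drain: take whatever is already queued, without waiting.
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        # Only wait out the window when the batch is still small.
        if len(batch) < MIN_BATCH:
            deadline = loop.time() + BATCH_WINDOW_S
            while len(batch) < MAX_BATCH:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
        inputs = [x for x, _ in batch]
        # Run the blocking model call off the event loop (default thread pool).
        outputs = await loop.run_in_executor(None, predict_batch, inputs)
        batch_sizes.append(len(batch))
        for (_, fut), y in zip(batch, outputs):
            if not fut.done():
                fut.set_result(y)


async def submit(queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut


async def demo(n=20):
    queue = asyncio.Queue()
    batch_sizes = []
    worker = asyncio.create_task(batch_worker(queue, batch_sizes))
    results = await asyncio.gather(*(submit(queue, i) for i in range(n)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results[:5], batch_sizes)
```

Under load the drain loop fills the batch immediately and the window branch is skipped, so the window adds latency only when traffic is sparse.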

Design

client ──POST /predict/{model}──► Router (stable | canary% | +shadow mirror)
                                      │ version
                                      ▼
                          bounded asyncio queue ──full──► 429 + Retry-After
                                      │
                          Batcher worker: greedy drain (≤ max_batch)
                                      │ one vectorized predict (thread pool)
                                      ▼
                          futures resolved per request ──► response
                          METRICS: latency/batch/shed/version counters ──► /metrics

Rollback story: POST /deploy/digits {"action":"rollback","version":"v1"} repoints the stable alias — the very next request serves v1. No process restart, no model reload.
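Why rollback can be instant is easiest to see in a toy registry where both versions stay loaded and an alias is just a dictionary entry. The class and method names below are assumptions for illustration, not the API of serving/registry.py.

```python
class Registry:
    """Toy alias-based registry (hypothetical API)."""

    def __init__(self):
        self._models = {}   # (name, version) -> loaded model object
        self._aliases = {}  # (name, alias) -> version

    def register(self, name, version, model):
        self._models[(name, version)] = model

    def point(self, name, alias, version):
        # Repointing an alias is a single dict write: no reload, no restart.
        if (name, version) not in self._models:
            raise KeyError(f"{name}:{version} is not registered")
        self._aliases[(name, alias)] = version

    def resolve(self, name, alias="stable"):
        version = self._aliases[(name, alias)]
        return version, self._models[(name, version)]


registry = Registry()
registry.register("digits", "v1", "random-forest-placeholder")
registry.register("digits", "v2", "hist-gb-placeholder")

registry.point("digits", "stable", "v1")
registry.point("digits", "stable", "v2")  # promote
promoted = registry.resolve("digits")[0]

registry.point("digits", "stable", "v1")  # rollback
rolled_back = registry.resolve("digits")[0]
print(promoted, rolled_back)  # prints "v2 v1"
```

Because every request resolves the alias at call time, the request after the repoint already sees the old version again.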

Install & test

pip install -e ".[dev]"
pytest -q          # 7 passed — registry promote/rollback, batch formation, canary split,
                   # shadow mirroring, 429+Retry-After shedding, deploy endpoints (in-process ASGI)

Stack

FastAPI + uvicorn, asyncio micro-batcher (futures + bounded queue), scikit-learn models as real payloads, httpx async load-test harness, stdlib Prometheus-format metrics. Docker + CI included.

License

MIT
