model-serving

A production-style model inference platform: the serving-infrastructure layer most ML portfolios skip. It includes a versioned model registry, dynamic micro-batching, canary and shadow deployments with instant rollback, load shedding under overload, and Prometheus metrics, plus a load-test harness that measures what each mechanism actually buys over real HTTP.

uvicorn serving.api:app                # run the server
serving-loadtest                       # boot server + measure all four mechanisms
serving-loadtest --requests 2000 --concurrency 128 --json

What it does

| Mechanism | Module | What it buys |
| --- | --- | --- |
| Dynamic micro-batching | `serving/batcher.py` | Requests queue; a worker greedily drains what's queued (plus up to `batch_window_ms` for stragglers) and runs one vectorized predict, amortizing the model's fixed per-call cost. |
| Versioned registry | `serving/registry.py` | Models registered as `(name, version)` with `stable`/`candidate` aliases. Two real trained versions ship (v1 RandomForest-300, v2 HistGradientBoosting) so version comparisons are meaningful. |
| Canary deployment | `serving/router.py` | A configured fraction of live traffic serves from the candidate; per-version metering judges it. Promote or roll back by repointing `stable`: no reload, instant. |
| Shadow deployment | `serving/router.py` | 100% of users get stable; traffic is mirrored to the candidate and its answers are compared and logged, never returned. Full production-traffic signal at zero user risk. |
| Load shedding | `serving/batcher.py` | Bounded queue: past `max_queue_depth`, requests get an immediate 429 + Retry-After instead of unbounded queueing that collapses p99 for everyone. |
| Metrics | `serving/metrics.py` | Prometheus text at `/metrics`: QPS, p50/p95/p99, batch-size histogram, shed count, per-version routing, shadow disagreements. Stdlib-only. |
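The Metrics row describes stdlib-only Prometheus text output. As a rough illustration of what that can look like (not the code in `serving/metrics.py`), here is a minimal sketch that renders counters and nearest-rank latency quantiles in the Prometheus text exposition format. The metric names (`serving_requests_total`, `serving_shed_total`, `serving_latency_ms`) and both helper functions are assumptions made for this example.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 1]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def render_metrics(counters, latencies_ms):
    """Render counters and latency quantiles as Prometheus text (hypothetical names)."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    if latencies_ms:
        lines.append("# TYPE serving_latency_ms summary")
        for q in (0.5, 0.95, 0.99):
            value = percentile(latencies_ms, q)
            lines.append(f'serving_latency_ms{{quantile="{q}"}} {value}')
        lines.append(f"serving_latency_ms_sum {sum(latencies_ms)}")
        lines.append(f"serving_latency_ms_count {len(latencies_ms)}")
    return "\n".join(lines) + "\n"


text = render_metrics(
    {"serving_requests_total": 10, "serving_shed_total": 2},
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
)
print(text)
```

Keeping raw samples and sorting on scrape is only reasonable for small windows; a long-running server would more likely use fixed histogram buckets.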

Ops surface: POST /deploy/{model} (candidate / promote / rollback / clear_candidate), GET /status, GET /healthz.

Measured results

serving-loadtest — real uvicorn over real HTTP, 800 requests at concurrency 64:

1. Dynamic batching: 8.0× throughput

| Batching | RPS | p50 | p95 | p99 | Mean batch size |
| --- | --- | --- | --- | --- | --- |
| off | 121 | 511 ms | 589 ms | 599 ms | 1.0 |
| on | 966 | 50 ms | 147 ms | 163 ms | 30.8 |

One mechanism: 121 → 966 RPS and p99 599 → 163 ms, because ~31 requests share each RandomForest call instead of paying its fixed cost 31 times.

2. Canary: configured 10% → observed 11.0% (800 requests: v1 712 / v2 88)
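An observed 11.0% against a configured 10% is what random per-request routing should produce at this sample size. The sketch below assumes the router draws one random number per request (the actual logic in `serving/router.py` may differ) and works out the binomial noise for 800 requests. `pick_version` is a hypothetical helper.

```python
import math
import random


def pick_version(canary_fraction, rng):
    """Route one request: candidate with probability canary_fraction, else stable."""
    return "candidate" if rng.random() < canary_fraction else "stable"


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n, fraction = 800, 0.10
counts = {"stable": 0, "candidate": 0}
for _ in range(n):
    counts[pick_version(fraction, rng)] += 1
observed = counts["candidate"] / n

# Sampling noise expected for a 10% split over 800 requests (binomial).
expected = n * fraction
std_dev = math.sqrt(n * fraction * (1 - fraction))
z_reported = (88 - expected) / std_dev  # 88 = candidate count reported above

print(f"simulated candidate share: {observed:.1%}")
print(f"expected {expected:.0f} +/- {std_dev:.1f}; reported 88 is {z_reported:.2f} sd away")
```

With a standard deviation of about 8.5 requests, the reported 88 candidate requests sit within one standard deviation of the expected 80.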

3. Shadow: 400 mirrored calls, 3.0% disagreement between stable (RF) and candidate (HistGB) — the number you'd use to decide whether the candidate is safe to promote.
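To make the disagreement number concrete, here is a minimal sketch of shadow mirroring, assuming models are plain callables. `serve_with_shadow` and both stand-in models are hypothetical, and this version runs the candidate inline for brevity, whereas a real server would likely mirror off the request path.

```python
def serve_with_shadow(x, stable, candidate, shadow_log):
    """Return the stable answer; run the candidate on the side and log the comparison."""
    answer = stable(x)
    try:
        mirrored = candidate(x)
        shadow_log.append({"input": x, "agree": mirrored == answer})
    except Exception as exc:  # a broken candidate must never affect the user
        shadow_log.append({"input": x, "agree": False, "error": repr(exc)})
    return answer


def stable_model(x):
    """Hypothetical stand-in for the stable model."""
    return x % 10


def candidate_model(x):
    """Hypothetical candidate that disagrees on every 20th input."""
    return -1 if x % 20 == 0 else x % 10


shadow_log = []
answers = [
    serve_with_shadow(x, stable_model, candidate_model, shadow_log)
    for x in range(100)
]
disagreement = sum(not entry["agree"] for entry in shadow_log) / len(shadow_log)
print(f"{len(shadow_log)} mirrored calls, {disagreement:.1%} disagreement")
```

Users only ever see `answers`, which come from the stable model; the candidate's output exists only in the log.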

4. Overload (queue capped at 8): 238 accepted at p99 296 ms, 562 shed with fast 429s — bounded latency for admitted work instead of everyone timing out.
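The shedding decision itself can be very small. The sketch below assumes the bounded queue is an `asyncio.Queue` and that admission uses a non-blocking put; `try_admit` and the response dictionary shape are hypothetical, not the project's API.

```python
import asyncio


def try_admit(queue, item, retry_after_s=1):
    """Enqueue if there is room; otherwise return an immediate 429-style rejection."""
    try:
        queue.put_nowait(item)  # never blocks: raises QueueFull at capacity
    except asyncio.QueueFull:
        return {"status": 429, "headers": {"Retry-After": str(retry_after_s)}}
    return None  # admitted


async def demo(max_queue_depth=8, burst=20):
    # No worker drains the queue here, so everything past the cap is shed.
    queue = asyncio.Queue(maxsize=max_queue_depth)
    rejections = [try_admit(queue, i) for i in range(burst)]
    shed = [r for r in rejections if r is not None]
    return queue.qsize(), len(shed), shed[0]


admitted, shed_count, first_rejection = asyncio.run(demo())
print(admitted, shed_count, first_rejection)
```

Because the rejection happens before any work is queued, a shed request costs almost nothing, which is what keeps latency bounded for the requests that are admitted.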

The honest finding: batching hurt until the workload was realistic

The first load-test run showed batching at 0.45× — it made things worse. Two real causes, both worth knowing:

The payload was too cheap. The original stable model was a LogisticRegression (~100 µs per predict); an 8 ms batch window only added latency around a call with nothing to amortize. Batching pays when the model call has real fixed cost — swapping stable to a RandomForest-300 (a realistic serving payload) is what unlocked the gain.
Naive windowing waits even under load. The first batcher waited out the full window for every batch, even when requests were already queued. The fix is Triton-style adaptive batching: greedily drain whatever is already queued (zero added latency under load), and wait out the window only when the batch is still small, so the window matters only at low load.
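The greedy-drain idea can be sketched in a few dozen lines of asyncio. This is an illustration under stated assumptions, not the code in `serving/batcher.py`: `MicroBatcher`, `predict_batch`, and the `min_batch` threshold for "still small" are hypothetical, while `max_batch` and `batch_window_ms` follow the names used above.

```python
import asyncio


class MicroBatcher:
    """Greedy-drain micro-batcher: take what is queued, wait only if the batch is small."""

    def __init__(self, predict_batch, max_batch=32, batch_window_ms=8.0, min_batch=4):
        self.predict_batch = predict_batch  # hypothetical: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.window_s = batch_window_ms / 1000.0
        self.min_batch = min_batch          # assumed threshold for "still small"
        self.queue = asyncio.Queue()
        self.batch_sizes = []

    async def submit(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until there is work
            # 1) Greedy drain: take everything already queued, no added latency.
            while len(batch) < self.max_batch and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            # 2) Only a small batch waits out the window for stragglers.
            if len(batch) < self.min_batch:
                deadline = loop.time() + self.window_s
                while len(batch) < self.max_batch:
                    remaining = deadline - loop.time()
                    if remaining <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                    except asyncio.TimeoutError:
                        break
            inputs = [x for x, _ in batch]
            # One vectorized call for the whole batch, off the event loop.
            outputs = await loop.run_in_executor(None, self.predict_batch, inputs)
            self.batch_sizes.append(len(batch))
            for (_, fut), y in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(y)


async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch=16)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(40)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(batch_sizes)
```

Under a burst like the one in `demo`, the window is never consulted: every batch is filled from requests that are already waiting. The window only applies when a request arrives alone.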

With both fixes in place and the measurement rerun honestly, the same code path went from 0.45× to 8.0×. The lesson generalizes: batching is not free, so measure it against your actual payload before turning it on.

Design

client ──POST /predict/{model}──► Router (stable | canary% | +shadow mirror)
                                      │ version
                                      ▼
                          bounded asyncio queue ──full──► 429 + Retry-After
                                      │
                          Batcher worker: greedy drain (≤ max_batch)
                                      │ one vectorized predict (thread pool)
                                      ▼
                          futures resolved per request ──► response
                          METRICS: latency/batch/shed/version counters ──► /metrics

Rollback story: POST /deploy/digits {"action":"rollback","version":"v1"} repoints the stable alias — the very next request serves v1. No process restart, no model reload.
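Rollback is instant because an alias is just a pointer to a version that is already loaded. A minimal sketch, assuming an in-memory registry keyed by `(name, version)`; the `ModelRegistry` class and its method names are hypothetical and not the interface of `serving/registry.py`.

```python
class ModelRegistry:
    """In-memory registry: models keyed by (name, version), aliases point at versions."""

    def __init__(self):
        self._models = {}   # (name, version) -> loaded model
        self._aliases = {}  # (name, alias) -> version

    def register(self, name, version, model):
        self._models[(name, version)] = model

    def point(self, name, alias, version):
        """Repoint an alias; refuse versions that were never registered."""
        if (name, version) not in self._models:
            raise KeyError(f"{name}:{version} is not registered")
        self._aliases[(name, alias)] = version

    def resolve(self, name, alias="stable"):
        version = self._aliases[(name, alias)]
        return version, self._models[(name, version)]


registry = ModelRegistry()
registry.register("digits", "v1", lambda x: "v1-answer")
registry.register("digits", "v2", lambda x: "v2-answer")
registry.point("digits", "stable", "v1")
registry.point("digits", "candidate", "v2")

registry.point("digits", "stable", "v2")  # promote the candidate
promoted, _ = registry.resolve("digits")

registry.point("digits", "stable", "v1")  # roll back: one dictionary write
rolled_back, model = registry.resolve("digits")
print(promoted, rolled_back, model(0))
```

Both versions stay in memory throughout, so the next `resolve` call after a rollback already returns v1.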

Install & test

pip install -e ".[dev]"
pytest -q          # 7 passed — registry promote/rollback, batch formation, canary split,
                   # shadow mirroring, 429+Retry-After shedding, deploy endpoints (in-process ASGI)

Stack

FastAPI + uvicorn, asyncio micro-batcher (futures + bounded queue), scikit-learn models as real payloads, httpx async load-test harness, stdlib Prometheus-format metrics. Docker + CI included.

License

MIT

