Hardware-specific LLM serving configurations. Model cards for serving.
Searching for optimal vLLM parameters returns Reddit threads, not config files. Every time someone gets a new GPU and wants to serve a model, they spend hours reading scattered GitHub issues, Discord messages, and blog posts to find the right combination of tensor parallelism, quantization, speculative decoding, and memory utilization settings. Then they run benchmarks, tweak, and repeat. The knowledge lives in their head and their shell history.
No standard exists that combines model identity, hardware profile, framework parameters, and verified benchmarks in a single portable file. HuggingFace model cards tell you what a model is -- its architecture, training data, evaluation scores. They never tell you how to serve it on your specific hardware. Docker Compose files capture launch commands but not benchmark results. Shell scripts work for one person but are not shareable knowledge.
A serving card is a YAML file that captures everything needed to reproduce a serving configuration and the proof that it works. It is the serving counterpart to a model card: one file per model-hardware-framework combination, with required benchmarks so that configs without performance data are explicitly not serving cards. The format is human-readable, machine-parseable, and designed for community sharing.
- Hardware-specific -- same model, different configs for RTX 4090 vs A100 vs GB10
- Benchmarks required -- configs without benchmarks are guesses, not serving cards
- Framework-aware -- vLLM, TGI, SGLang, llama.cpp params in one standard
- One-command apply -- servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3
- Benchmark-first -- servingcard benchmark runs PawBench and produces a card in one step
- Autoresearch-compatible -- auto-tuning tools export directly to serving cards
- Community registry -- share and discover optimized configs
- Transform documentation -- model output quirks (think tags, float coercion) captured alongside the config
- Readiness checks -- warmup sequences and health endpoints in the card itself
Serving cards are not the right tool for every situation. Be honest about scope:
- Cloud-managed inference (Fireworks, Together, Bedrock) -- these providers optimize serving for you. A serving card adds nothing when you are calling an API endpoint.
- One-off experiments -- if you are running a quick test and will never use the config again, writing a serving card is overkill. Just use your shell history.
- Training configurations -- serving cards are inference-only. Training has different concerns (learning rates, batch sizes, gradient accumulation) that deserve their own format.
- Model evaluation -- serving cards tell you how fast a model runs, not how good its outputs are. Use eval harnesses (lm-eval, HELM) for quality assessment.
- Multi-model pipelines -- a serving card describes one model on one hardware configuration. If you need to orchestrate multiple models, use a higher-level tool and reference individual serving cards.
pip install -e packages/python # or: pip install servingcard

servingcard benchmark \
  --model qwen3-coder \
  --hardware nvidia-gb10 \
  --endpoint http://localhost:8000
# Produces: qwen3-coder-nvidia-gb10.yaml
# Uses PawBench if installed, otherwise prompts for manual entry

# From the registry (shorthand)
servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3
# From a local file
servingcard apply ./my-config.yaml
# From a URL
servingcard apply --url https://raw.githubusercontent.com/.../config.yaml
# Outputs the vllm serve command with optimized params -- copy and run

servingcard validate registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
servingcard info registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
servingcard search qwen3-coder
servingcard search --hardware nvidia-gb10

A serving card is a YAML file with structured sections. Here is the production Eagle3 config with annotations explaining each part:
# ---- Identity ----
servingcard: "1.0" # Spec version (required)
model: qwen3-coder # Model identifier (required)
variant: fp8-eagle3-spec3 # Config variant name (required)
hardware: nvidia-gb10 # Target hardware (required)
framework: vllm # Inference framework (required)
framework_version: ">=0.8.0" # Framework version constraint

# ---- Accountability ----
author: zenprocess # Who created and verified this (required)
created: "2026-03-26" # When (required)
method: autoresearch # How it was tuned (required)
method_detail: "378 iterations, Karpathy pattern"

# ---- Serving Parameters ----
# Framework-specific, passed directly to the engine.
# Keys correspond to vLLM CLI flags / config options.
serving:
  tensor_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_len: 131072
  max_num_seqs: 8
  quantization: fp8
  speculative_decoding:
    method: eagle3
    draft_model: aurora-spec-qwen3-coder
    num_speculative_tokens: 3
  enforce_eager: false
  enable_prefix_caching: true

# ---- Sampling Constraints ----
# What sampling parameters this config supports and their valid ranges.
sampling:
  temperature: {min: 0, max: 2, default: 0.2}
  top_p: {min: 0, max: 1, default: 1}
  logit_bias: unsupported # Not supported under speculative decoding
  min_p: unsupported

# ---- Capacity ----
capacity:
  context_limit: 131072 # Tested max context length
  max_concurrent: 8 # Tested max concurrent requests
  parallel_tool_calls: {max_reliable: 3}

# ---- Benchmarks (required) ----
# At minimum, single_stream.tok_s must be present.
benchmark:
  single_stream:
    tok_s: 69.0 # Output tokens/sec, single request
    ttft_ms: 1541 # Time to first token
    p99_latency_ms: 2200
  parallel:
    peak_tok_s: 469 # Aggregate tok/s at peak concurrency
    concurrency: 8
  methodology:
    tool: benchmark_serving
    prompt_distribution: coding-tasks
    num_runs: 10
    confidence_interval: 0.95

# ---- Transforms ----
# Known model output quirks and how to fix them.
transforms:
  - type: regex_strip
    pattern: "<think>.*?</think>"
    description: "Strip Qwen3-coder reasoning tags from output"
  - type: coerce_float_to_int
    scope: tool_call_arguments
    description: "Fix float tool args (42.0 -> 42)"

# ---- Readiness ----
readiness:
  warmup_requests: 3
  warmup_prompt: "Say ok."
  warmup_max_tokens: 5
  health_endpoint: /health

# ---- Prerequisites ----
prerequisites:
  models:
    - path: ~/models/aurora-spec-qwen3-coder
      description: "Eagle3 draft head for speculative decoding"
  gpu_memory_gb: 110

# ---- Notes ----
notes:
  - "CUDA graphs enabled. First 2-3 requests after restart are 3-5x slower."
  - "Draft head uses 32K/151K vocab. Non-Latin tokens fall back to baseline speed."

Curated, validated configurations with reproducible benchmarks.
| Model | Hardware | Variant | tok/s | TTFT | Context | Method |
|---|---|---|---|---|---|---|
| qwen3-coder | NVIDIA GB10 | FP8 + Eagle3 (spec=3) | 69 | 1541ms | 131K | autoresearch |
| qwen3-coder | NVIDIA GB10 | FP8 baseline | 42 | 780ms | 131K | autoresearch |
| qwen3-coder | NVIDIA GB10 | NVFP4 | 42 | 650ms | 262K | autoresearch |
Registry path convention: registry/{model}/{hardware}-{quant}-{variant}.yaml
Your config here -- contribute
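The path convention above can be written down as a small helper. This is only a sketch: registry_path and resolve_shorthand are hypothetical names, not functions from the servingcard package, and it assumes the shorthand accepted by servingcard apply maps to a registry file by prefixing registry/ and appending .yaml.

```python
from pathlib import PurePosixPath


def registry_path(model: str, hardware: str, quant: str, variant: str) -> str:
    """Build registry/{model}/{hardware}-{quant}-{variant}.yaml."""
    filename = f"{hardware}-{quant}-{variant}.yaml"
    return str(PurePosixPath("registry") / model / filename)


def resolve_shorthand(shorthand: str) -> str:
    """Map 'model/config-name' to a registry file path (assumed mapping)."""
    model, sep, name = shorthand.partition("/")
    if not sep or not model or not name:
        raise ValueError(f"expected 'model/config-name', got {shorthand!r}")
    return str(PurePosixPath("registry") / model / f"{name}.yaml")


# Both calls point at the same Eagle3 card used throughout this README.
print(registry_path("qwen3-coder", "gb10", "fp8", "eagle3-spec3"))
print(resolve_shorthand("qwen3-coder/gb10-fp8-eagle3-spec3"))
```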
Any tool that manages LLM inference can use serving cards to make runtime decisions:
import re

from servingcard.schema import ServingCard

card = ServingCard.from_yaml("registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml")

# Placeholder inputs for illustration; in practice these come from your request and response
output_tokens = 512
output = "<think>plan the change</think>def add(a, b): return a + b"

# Capacity planning: don't exceed tested limits
if card.capacity:
    max_context = card.capacity.context_limit # 131072
    max_concurrent = card.capacity.max_concurrent # 8

# Benchmark data: estimate completion time in seconds
if card.benchmark and card.benchmark.single_stream:
    estimated_time = output_tokens / card.benchmark.single_stream.tok_s

# Transforms: apply known model quirks
for transform in (card.transforms or []):
    if transform.type == "regex_strip":
        output = re.sub(transform.pattern, "", output, flags=re.DOTALL)

This replaces hardcoded constants scattered across config files with a single source of truth that is versioned, validated, and benchmarked.
All benchmarks in the registry follow these principles:
- Tool: benchmarks use benchmark_serving from the vLLM project with the sonnet dataset, unless otherwise noted in the card's methodology section.
- Warmup: 3 warmup requests are sent before measurement begins. CUDA graph compilation makes the first few requests 3-5x slower -- these are excluded.
- Single-stream baseline: every card includes at minimum single_stream.tok_s measured at concurrency=1. This is the universal comparison point.
- Parallel throughput: cards that include parallel benchmarks specify the concurrency level. Peak throughput numbers without concurrency context are meaningless.
- Reproducibility: the methodology section documents the benchmarking tool, prompt distribution, number of runs, and confidence interval. Cards without methodology are still valid but carry less weight.
- Environment pinning: driver version, CUDA version, and framework version are captured so that benchmark numbers can be reproduced on matching hardware.
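As a rough illustration of the warmup and reproducibility principles, the sketch below drops the warmup samples and reports a mean with a confidence half-width. It is not the registry's actual tooling: summarize_runs is a hypothetical helper, the sample values are made up, and the interval uses a normal approximation (with only 10 runs, a t-distribution would give a somewhat wider interval).

```python
import statistics


def summarize_runs(tok_s_samples, warmup=3, confidence=0.95):
    """Return (mean, half_width) of tok/s after discarding warmup samples."""
    measured = tok_s_samples[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warmup")
    mean = statistics.fmean(measured)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(measured) / len(measured) ** 0.5
    # Two-sided z-score for the requested confidence level (normal approximation).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


# Illustrative numbers only: 3 slow warmup requests, then 10 measured runs.
samples = [22.1, 30.4, 41.0, 68.2, 69.5, 70.1, 68.8, 69.0, 69.9, 68.5, 69.3, 70.0, 68.7]
mean, half_width = summarize_runs(samples)
print(f"{mean:.1f} tok/s +/- {half_width:.1f} (95% CI, {len(samples) - 3} runs)")
```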
- tok/s is output tokens per second. Higher is better for throughput.
- TTFT (time to first token) matters for interactive use. Eagle3 speculative decoding accepts a higher TTFT (1541ms vs 780ms for the FP8 baseline) in exchange for higher throughput (69 tok/s vs 42 tok/s).
- p99 latency captures worst-case behavior under normal load. Important for SLA-bound deployments.
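To make the TTFT-versus-throughput tradeoff concrete, here is a back-of-the-envelope sketch using the two GB10 registry entries. It assumes a simplified latency model (total time = TTFT + output tokens / tok_s, with a constant decode rate), which real workloads only approximate; the function names are hypothetical.

```python
def total_seconds(ttft_ms: float, tok_s: float, output_tokens: int) -> float:
    """Simplified end-to-end time: wait for the first token, then decode at a steady rate."""
    return ttft_ms / 1000.0 + output_tokens / tok_s


def break_even_tokens(fast_ttft_ms, fast_tok_s, base_ttft_ms, base_tok_s) -> float:
    """Output length at which the higher-throughput config catches up with the baseline."""
    extra_wait_s = (fast_ttft_ms - base_ttft_ms) / 1000.0
    saved_per_token_s = 1.0 / base_tok_s - 1.0 / fast_tok_s
    return extra_wait_s / saved_per_token_s


# Numbers from the registry table: FP8 + Eagle3 vs FP8 baseline on GB10.
n = break_even_tokens(1541, 69.0, 780, 42.0)
print(f"break-even at about {n:.0f} output tokens")
for tokens in (20, 500):
    eagle3 = total_seconds(1541, 69.0, tokens)
    baseline = total_seconds(780, 42.0, tokens)
    print(f"{tokens} tokens: eagle3 {eagle3:.2f}s, baseline {baseline:.2f}s")
```

Under this model the Eagle3 config only pays off for responses longer than roughly 82 output tokens; shorter replies finish sooner on the baseline.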
- Specification -- full YAML schema with JSON Schema appendix
- Getting Started -- quick intro to creating your first serving card
- Format Overview -- detailed guide to every section
- Benchmark Guide -- how to run benchmarks and report results
- Contributing -- how to submit a serving card
- Examples -- integration examples
Submit a serving card in 4 steps:
1. Benchmark your model on your hardware using a standard workload. We recommend benchmark_serving from vLLM with the sonnet dataset at concurrency=1 for the single-stream baseline.
2. Write the YAML following the specification. Copy an existing card from registry/ as a starting point. Include at minimum the required fields and benchmark.single_stream.tok_s.
3. Validate your card:

   # Clone the repo (PyPI package coming soon)
   git clone https://github.com/zenprocess/servingcard
   cd servingcard/packages/python && pip install -e .
   servingcard validate your-config.yaml

4. Submit a PR placing the file at registry/{model}/{hardware}-{quant}-{variant}.yaml. CI validates automatically. Benchmark claims are spot-checked by maintainers with matching hardware when possible.
See CONTRIBUTING.md for the full guide, including benchmark methodology requirements, PR template, and review criteria.
- Reproducible -- someone with the same GPU can follow it exactly
- Benchmarked -- real numbers from real runs, not estimates
- Contextualized -- notes explain tradeoffs (e.g., "2x TTFT for 64% more throughput")
- Versioned -- pin the engine version and driver so results can be reproduced
- Honest -- document limitations and failure modes, not just best-case numbers
Serving cards are written and read by humans first. YAML supports comments, multi-line strings, and readable nested structures. JSON Schema is provided in the specification for programmatic validation, but YAML is the source format because people need to edit these files by hand.
Optimal serving parameters are hardware-dependent. gpu_memory_utilization: 0.95
may work on a 128 GB GB10 but OOM on a 24 GB RTX 4090.
tensor_parallel_size: 4 is meaningless on a single-GPU system. Benchmark
numbers without hardware context are not actionable.
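A quick arithmetic check shows why the same parameters do not transfer between GPUs. The sketch assumes that prerequisites.gpu_memory_gb in the example card (110) is the memory the configuration needs, and that gpu_memory_utilization is the fraction of total GPU memory the engine may claim; both helper names are hypothetical.

```python
def memory_budget_gb(total_gb: float, gpu_memory_utilization: float) -> float:
    """Memory the engine is allowed to use on a device of the given size."""
    return total_gb * gpu_memory_utilization


def fits(required_gb: float, total_gb: float, gpu_memory_utilization: float) -> bool:
    """True if the allowed budget covers what the configuration needs."""
    return memory_budget_gb(total_gb, gpu_memory_utilization) >= required_gb


REQUIRED_GB = 110   # prerequisites.gpu_memory_gb from the example card
UTILIZATION = 0.90  # serving.gpu_memory_utilization from the example card

for name, total_gb in [("GB10", 128), ("RTX 4090", 24)]:
    budget = memory_budget_gb(total_gb, UTILIZATION)
    verdict = "fits" if fits(REQUIRED_GB, total_gb, UTILIZATION) else "does not fit"
    print(f"{name}: budget {budget:.1f} GB, needs {REQUIRED_GB} GB -> {verdict}")
```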
The gap between "this config starts without errors" and "this config performs
well" is enormous. Speculative decoding can halve throughput if misconfigured.
Quantization can degrade quality silently. Without benchmarks, a serving card
is just a guess with formatting. The benchmark section is the difference
between "I think this works" and "I measured this working."
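The "no benchmark, no card" rule is easy to enforce mechanically. The following is a minimal sketch of such a check over an already-parsed card (a plain dict); it is not the validator shipped with the servingcard package, and validation_errors is a hypothetical name. The required fields are the ones marked (required) in the annotated example.

```python
REQUIRED_FIELDS = (
    "servingcard", "model", "variant", "hardware",
    "framework", "author", "created", "method",
)


def validation_errors(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the card passes this check."""
    errors = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in card]
    single_stream = (card.get("benchmark") or {}).get("single_stream") or {}
    tok_s = single_stream.get("tok_s")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(tok_s, (int, float)) or isinstance(tok_s, bool) or tok_s <= 0:
        errors.append("benchmark.single_stream.tok_s must be a positive number")
    return errors


card = {
    "servingcard": "1.0", "model": "qwen3-coder", "variant": "fp8-eagle3-spec3",
    "hardware": "nvidia-gb10", "framework": "vllm", "author": "zenprocess",
    "created": "2026-03-26", "method": "autoresearch",
    "benchmark": {"single_stream": {"tok_s": 69.0}},
}
print(validation_errors(card))

# The same config without measurements is rejected.
unbenchmarked = {k: v for k, v in card.items() if k != "benchmark"}
print(validation_errors(unbenchmarked))
```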
If your serving params are in one file and your sampling constraints in another, they will drift apart. One file per model-hardware-framework combination means one atomic unit of truth that can be versioned, diffed, reviewed, and shared.
Inference frameworks have genuinely different parameter surfaces. Attempting
to standardize tensor_parallel_size across vLLM, TGI, SGLang, llama.cpp,
and TensorRT-LLM would either miss framework-specific features or create a
leaky abstraction. Instead, the serving section is framework-specific by
design, while sampling, capacity, and benchmark use universal concepts.
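Because the serving keys mirror vLLM option names, a vLLM-specific consumer can map them to CLI flags almost mechanically. The sketch below illustrates the idea and is not the implementation behind servingcard apply; to_cli_flags is a hypothetical helper. Nested sections such as speculative_decoding are skipped on purpose, since how they are passed differs between engines and versions.

```python
def to_cli_flags(serving: dict) -> list[str]:
    """Turn flat serving keys into '--kebab-case value' CLI arguments."""
    flags: list[str] = []
    for key, value in serving.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, dict):
            # Nested config needs engine-specific handling; left out of this sketch.
            continue
        if isinstance(value, bool):
            # Boolean switches are emitted only when enabled.
            if value:
                flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags


serving = {
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,
    "max_model_len": 131072,
    "max_num_seqs": 8,
    "quantization": "fp8",
    "speculative_decoding": {"method": "eagle3", "num_speculative_tokens": 3},
    "enforce_eager": False,
    "enable_prefix_caching": True,
}
print(" ".join(to_cli_flags(serving)))
```

A TGI or SGLang consumer would need its own mapping, which is the reason the serving section stays framework-specific.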
Many models have known output quirks: Qwen3-coder emits <think> tags, some
models return floats where integers are expected in tool calls, others produce
malformed JSON. Documenting quirks alongside the serving config means consumers
apply the right fixes without rediscovering them through trial and error.
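The two transforms in the example card could be applied along these lines. This is a sketch of the idea rather than the package's own code; the function names are hypothetical, and the regex is the pattern from the card's regex_strip entry.

```python
import re


def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> spans, including ones that cover several lines."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)


def coerce_float_to_int(value):
    """Recursively turn whole-number floats (42.0) into ints (42) in tool-call arguments."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, dict):
        return {key: coerce_float_to_int(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce_float_to_int(item) for item in value]
    return value


print(strip_think_tags("<think>step 1\nstep 2</think>done"))
print(coerce_float_to_int({"count": 42.0, "ratio": 0.5, "ids": [1.0, 2.5]}))
```

Fractional values such as 0.5 are left alone, so only arguments that were integers to begin with change type.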
Can I use a serving card without the Python package?
Yes. A serving card is just a YAML file. You can parse it with any YAML library in any language. The Python package provides validation and CLI convenience, but the format is language-agnostic.
What if my hardware is not in the registry?
Create a new serving card for your hardware. That is the whole point -- the registry grows as people contribute configs for new hardware. Start by copying an existing card and adjusting the parameters and benchmarks.
Can a serving card describe a multi-GPU setup?
Yes, but treat the multi-GPU setup as a single hardware configuration. Use a
hardware slug like nvidia-h100-8x and set tensor_parallel_size: 8 in the
serving section. The card still describes one serving configuration.
How do I compare two serving cards?
Look at benchmark.single_stream.tok_s for throughput comparison and
benchmark.single_stream.ttft_ms for latency comparison. The registry
table in this README shows these side by side. For deeper comparison,
servingcard info shows a formatted summary.
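A side-by-side comparison comes down to two ratios. The sketch below works on the single_stream sections of two parsed cards (plain dicts here); compare is a hypothetical helper, and the numbers are the FP8 + Eagle3 and FP8 baseline rows from the registry table.

```python
def compare(candidate: dict, baseline: dict) -> dict:
    """Relative throughput gain and TTFT ratio of candidate vs baseline."""
    return {
        "throughput_gain_pct": (candidate["tok_s"] / baseline["tok_s"] - 1.0) * 100.0,
        "ttft_ratio": candidate["ttft_ms"] / baseline["ttft_ms"],
    }


eagle3 = {"tok_s": 69.0, "ttft_ms": 1541}
fp8_baseline = {"tok_s": 42.0, "ttft_ms": 780}

result = compare(eagle3, fp8_baseline)
print(f"{result['throughput_gain_pct']:.0f}% more throughput, "
      f"{result['ttft_ratio']:.1f}x TTFT")
```

This prints "64% more throughput, 2.0x TTFT", the kind of summary a card's notes can state outright.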
What about quality benchmarks (MMLU, HumanEval, etc.)?
Serving cards focus on serving performance, not model quality. Quality benchmarks belong in model cards. A serving card tells you how fast a model runs on specific hardware, not how good its outputs are.
| Tool | Status | Description |
|---|---|---|
| vLLM | servingcard apply | Generate vLLM CLI from a serving card |
| Multi-agent dispatchers | Compatible | Any dispatcher can read serving cards for routing and capacity |
| auto-tuning-vllm | Planned | Export tuning results as serving cards |
| TGI | Planned | servingcard apply --engine tgi param mapping |
| SGLang | Planned | SGLang param mapping |
| Your tool here | -- | PRs welcome |
The long-term vision: auto-tuning tools like
autoresearch run parameter sweeps
and export the best configurations as serving cards. The registry grows
organically as people share what works on their hardware. Inference frameworks
add native --serving-card flags to consume cards directly. The community
builds a searchable index of verified, benchmarked configurations.
Today: three configs for Qwen3-Coder on NVIDIA GB10. Tomorrow: every model, every GPU, every framework -- with real benchmarks attached to every claim.
servingcard/
registry/ # Curated configs by model
qwen3-coder/ # One directory per model
spec/ # Schema specification
packages/python/ # pip-installable CLI + library
examples/ # Integration examples
docs/ # Guides and reference
.github/ # CI workflows and issue templates
Created by zenprocess.
Inspired by HuggingFace Model Cards, TOON format, and Karpathy's autoresearch.