
Commit 9326703

Test and claude committed
feat: CLI benchmark + apply commands — the two things that matter
benchmark: runs PawBench (or a manual fallback) and produces a servingcard YAML. Pluggable backend system — PawBenchBackend tries the CLI, then the Python import, and falls back to ManualBackend with interactive prompts.

apply: pulls a config from a local path, a registry shorthand (model/variant), or a URL, then generates a framework-specific launch command (vLLM, TGI). The registry fetches from GitHub raw URLs.

PawBenchResults added to the schema — quality_score, cacp_compliance, useful_token_ratio, tokens_per_turn, adaptability_score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7a469b3 commit 9326703

File tree

7 files changed: +548 −26 lines


README.md

Lines changed: 37 additions & 20 deletions
@@ -35,7 +35,8 @@ for community sharing.
 - **Hardware-specific** -- same model, different configs for RTX 4090 vs A100 vs GB10
 - **Benchmarks required** -- configs without benchmarks are guesses, not serving cards
 - **Framework-aware** -- vLLM, TGI, SGLang, llama.cpp params in one standard
-- **One-command apply** -- `servingcard launch qwen3-coder/gb10-fp8-eagle3-spec3.yaml`
+- **One-command apply** -- `servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3`
+- **Benchmark-first** -- `servingcard benchmark` runs PawBench and produces a card in one step
 - **Autoresearch-compatible** -- auto-tuning tools export directly to serving cards
 - **Community registry** -- share and discover optimized configs
 - **Transform documentation** -- model output quirks (think tags, float coercion) captured alongside the config

@@ -54,27 +55,43 @@ Serving cards are not the right tool for every situation. Be honest about scope:
 ## Quick Start
 
 ```bash
-# Clone the repo (PyPI package coming soon)
-git clone https://github.com/zenprocess/servingcard
-cd servingcard/packages/python
-pip install -e .
+pip install -e packages/python  # or: pip install servingcard
+```
 
-# Validate a config
-servingcard validate registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
+### Benchmark your model
 
-# Show summary info
-servingcard info registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
+```bash
+servingcard benchmark \
+  --model qwen3-coder \
+  --hardware nvidia-gb10 \
+  --endpoint http://localhost:8000
+
+# Produces: qwen3-coder-nvidia-gb10.yaml
+# Uses PawBench if installed, otherwise prompts for manual entry
+```
+
+### Apply a community config
+
+```bash
+# From the registry (shorthand)
+servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3
+
+# From a local file
+servingcard apply ./my-config.yaml
+
+# From a URL
+servingcard apply --url https://raw.githubusercontent.com/.../config.yaml
+
+# Outputs the vllm serve command with optimized params — copy and run
+```
 
-# Search the registry
-servingcard search --model qwen3-coder --hardware nvidia-gb10
+### Validate and inspect
 
-# Launch vLLM from a serving card
-servingcard launch registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
-# Expands to: vllm serve Qwen/Qwen3-Coder-480B-A35B-FP8 \
-#   --tensor-parallel-size 1 --max-model-len 131072 \
-#   --gpu-memory-utilization 0.90 --quantization fp8 \
-#   --speculative-model aurora-spec-qwen3-coder \
-#   --num-speculative-tokens 3
+```bash
+servingcard validate registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
+servingcard info registry/qwen3-coder/gb10-fp8-eagle3-spec3.yaml
+servingcard search qwen3-coder
+servingcard search --hardware nvidia-gb10
 ```

@@ -366,10 +383,10 @@ runs on specific hardware, not how good its outputs are.
 
 | Tool | Status | Description |
 |------|--------|-------------|
-| vLLM | `servingcard launch` | Generate vLLM CLI from a serving card |
+| vLLM | `servingcard apply` | Generate vLLM CLI from a serving card |
 | Multi-agent dispatchers | Compatible | Any dispatcher can read serving cards for routing and capacity |
 | [auto-tuning-vllm](https://github.com/zenprocess/auto-tuning-vllm) | Planned | Export tuning results as serving cards |
-| TGI | Planned | `servingcard launch --engine tgi` param mapping |
+| TGI | Planned | `servingcard apply --engine tgi` param mapping |
 | SGLang | Planned | SGLang param mapping |
 | Your tool here | -- | PRs welcome |
packages/python/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
+benchmark = ["pawbench"]
 dev = [
     "pytest>=7.0",
     "pytest-cov",

packages/python/servingcard/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
     BenchmarkSection,
     CapacitySection,
     HardwareDetails,
+    PawBenchResults,
     QuantizationSection,
     ServingCard,
     ServingSection,
@@ -19,6 +20,7 @@
     "BenchmarkSection",
     "CapacitySection",
     "HardwareDetails",
+    "PawBenchResults",
     "QuantizationSection",
     "ServingCard",
     "ServingSection",
Lines changed: 118 additions & 0 deletions (new file)

"""Generate framework-specific launch commands from a servingcard."""

from __future__ import annotations

from servingcard.schema import ServingCard

REGISTRY_BASE_URL = (
    "https://raw.githubusercontent.com/zenprocess/servingcard/main/registry"
)


def resolve_source(source: str) -> str:
    """Resolve a config source to a fetchable URL or local path.

    Supports:
    - Local file path: ./my-config.yaml, /abs/path.yaml
    - Full URL: https://...
    - Registry shorthand: model/variant -> GitHub raw URL
    """
    # Full URL
    if source.startswith("http://") or source.startswith("https://"):
        return source

    # Local file: a path containing "/" that is relative (./), absolute (/),
    # or carries a .yaml/.yml extension
    if "/" in source and (
        source.startswith(".")
        or source.startswith("/")
        or source.endswith(".yaml")
        or source.endswith(".yml")
    ):
        return source

    # Registry shorthand: model/variant
    if "/" in source:
        return f"{REGISTRY_BASE_URL}/{source}.yaml"

    # Bare name -- assume it is a local file
    return source


def generate_vllm_command(card: ServingCard) -> str:
    """Generate a vLLM serve command from a servingcard."""
    if not card.serving or not card.serving.engine_args:
        return (
            "# No engine_args in servingcard -- cannot generate vllm command\n"
            f"vllm serve {card.model}"
        )

    args = card.serving.engine_args.copy()
    model_id = args.pop("model", card.model)

    parts = [f"vllm serve {model_id}"]

    # Map engine_args keys to CLI flags
    for key, value in args.items():
        flag = f"--{key.replace('_', '-')}"
        if isinstance(value, bool):
            if value:
                parts.append(f"    {flag}")
        else:
            parts.append(f"    {flag} {value}")

    # TODO: add tensor-parallel-size if not in engine_args but inferable
    # TODO: add enable-prefix-caching if not explicitly set
    return " \\\n".join(parts)


def generate_tgi_command(card: ServingCard) -> str:
    """Generate a TGI launch command from a servingcard."""
    if not card.serving or not card.serving.engine_args:
        return (
            "# No engine_args in servingcard -- cannot generate TGI command\n"
            f"text-generation-launcher --model-id {card.model}"
        )

    args = card.serving.engine_args.copy()
    model_id = args.pop("model", card.model)

    parts = [f"text-generation-launcher --model-id {model_id}"]

    # Map common vLLM args to TGI equivalents
    tgi_map = {
        "quantization": "quantize",
        "max_model_len": "max-input-length",
        "max_num_seqs": "max-batch-size",
    }

    for key, value in args.items():
        tgi_key = tgi_map.get(key, key.replace("_", "-"))
        # Speculative decoding args: TGI doesn't support them the same way,
        # so emit them commented out for manual adjustment
        if key in ("speculative_model", "num_speculative_tokens"):
            parts.append(
                f"    # --{tgi_key} {value}  # speculative decoding: adjust for TGI"
            )
            continue
        if key in ("gpu_memory_utilization",):
            continue  # TGI doesn't have this flag
        if isinstance(value, bool):
            if value:
                parts.append(f"    --{tgi_key}")
        else:
            parts.append(f"    --{tgi_key} {value}")

    return " \\\n".join(parts)


def generate_launch_command(card: ServingCard, engine: str | None = None) -> str:
    """Generate a launch command for the given engine.

    If engine is None, infer from card.framework.
    """
    if engine is None:
        framework = card.framework.lower()
        if "vllm" in framework:
            engine = "vllm"
        elif "tgi" in framework or "text-generation" in framework:
            engine = "tgi"
        else:
            engine = "vllm"  # default

    if engine == "vllm":
        return generate_vllm_command(card)
    elif engine == "tgi":
        return generate_tgi_command(card)
    else:
        return f"# Unsupported engine: {engine}\n# Use --engine vllm or --engine tgi"
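The engine_args-to-flag expansion in generate_vllm_command can be exercised without the ServingCard schema; a standalone sketch (the args dict here is illustrative, echoing the README's example config, not pulled from a real card):

```python
# Standalone sketch of the engine_args -> vLLM CLI flag expansion.
# The dict is illustrative; real values come from a card's serving.engine_args.
engine_args = {
    "tensor_parallel_size": 1,
    "max_model_len": 131072,
    "gpu_memory_utilization": 0.9,
    "quantization": "fp8",
    "enable_prefix_caching": True,  # true booleans become bare flags
    "enforce_eager": False,         # false booleans are dropped entirely
}

parts = ["vllm serve Qwen/Qwen3-Coder-480B-A35B-FP8"]
for key, value in engine_args.items():
    flag = f"--{key.replace('_', '-')}"
    if isinstance(value, bool):
        if value:
            parts.append(f"    {flag}")
    else:
        parts.append(f"    {flag} {value}")

command = " \\\n".join(parts)
print(command)
```

Booleans map to presence/absence of a flag rather than `--flag True`, which is why the loop branches on `isinstance(value, bool)` before formatting.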
Lines changed: 139 additions & 0 deletions (new file)

"""Pluggable benchmark backends for servingcard."""

from __future__ import annotations

import json
import shutil
import subprocess
from abc import ABC, abstractmethod


class BenchmarkBackend(ABC):
    """Interface for benchmark harnesses."""

    @abstractmethod
    def run(self, endpoint: str, model: str, **kwargs: object) -> dict:
        """Run benchmarks, return results dict.

        Returns a dict with keys:
        single_stream_tok_s, ttft_ms, quality_score, cacp_compliance,
        parallel_peak_tok_s (optional), peak_concurrency (optional),
        useful_token_ratio (optional), tokens_per_turn (optional),
        adaptability_score (optional), suite (optional).
        """
        raise NotImplementedError


class PawBenchBackend(BenchmarkBackend):
    """PawBench integration -- subprocess first, then Python import."""

    @staticmethod
    def is_available() -> bool:
        """Check if PawBench is installed."""
        if shutil.which("pawbench"):
            return True
        try:
            import importlib

            importlib.import_module("pawbench")
            return True
        except ImportError:
            return False

    def run(self, endpoint: str, model: str, **kwargs: object) -> dict:
        """Run PawBench against an endpoint."""
        # Try subprocess first (works if pawbench is a CLI tool)
        pawbench_bin = shutil.which("pawbench")
        if pawbench_bin:
            return self._run_subprocess(endpoint, model, **kwargs)
        # Fall back to Python import
        return self._run_python(endpoint, model, **kwargs)

    def _run_subprocess(self, endpoint: str, model: str, **kwargs: object) -> dict:
        """Run PawBench via subprocess."""
        cmd = [
            "pawbench",
            "run",
            "--endpoint",
            endpoint,
            "--model",
            model,
            "--output-json",
            "-",
        ]
        suite = kwargs.get("suite")
        if suite:
            cmd.extend(["--suite", str(suite)])

        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        if result.returncode != 0:
            raise RuntimeError(
                f"PawBench failed (exit {result.returncode}): {result.stderr}"
            )
        return json.loads(result.stdout)  # type: ignore[no-any-return]

    def _run_python(self, endpoint: str, model: str, **kwargs: object) -> dict:
        """Run PawBench via Python API."""
        try:
            from pawbench import run_benchmark  # type: ignore[import-untyped]
        except ImportError:
            raise RuntimeError(
                "PawBench not found. Install: pip install pawbench"
            ) from None

        results: dict = run_benchmark(endpoint=endpoint, model=model, **kwargs)
        return results


class ManualBackend(BenchmarkBackend):
    """Manual entry -- user provides benchmark numbers interactively."""

    def run(self, endpoint: str, model: str, **kwargs: object) -> dict:
        """Prompt the user for benchmark results."""
        print("\nPawBench not found. Enter benchmark results manually:\n")

        tok_s = self._prompt_float("  Single-stream tok/s: ")
        ttft_ms = self._prompt_float("  TTFT (ms): ")
        quality = self._prompt_float("  Quality score (0-1): ", min_val=0, max_val=1)
        cacp = self._prompt_float("  CACP compliance (0-1): ", min_val=0, max_val=1)

        parallel_tok_s_str = input("  Parallel peak tok/s (Enter to skip): ").strip()
        concurrency_str = input("  Peak concurrency (Enter to skip): ").strip()

        result: dict = {
            "single_stream_tok_s": tok_s,
            "ttft_ms": ttft_ms,
            "quality_score": quality,
            "cacp_compliance": cacp,
            "suite": "manual",
        }
        if parallel_tok_s_str:
            result["parallel_peak_tok_s"] = float(parallel_tok_s_str)
        if concurrency_str:
            result["peak_concurrency"] = int(concurrency_str)

        return result

    @staticmethod
    def _prompt_float(
        prompt: str, min_val: float | None = None, max_val: float | None = None
    ) -> float:
        """Prompt for a float value with optional range validation."""
        while True:
            try:
                val = float(input(prompt))
                if min_val is not None and val < min_val:
                    print(f"  Must be >= {min_val}")
                    continue
                if max_val is not None and val > max_val:
                    print(f"  Must be <= {max_val}")
                    continue
                return val
            except ValueError:
                print("  Enter a number.")


def get_backend() -> BenchmarkBackend:
    """Return the best available benchmark backend."""
    if PawBenchBackend.is_available():
        return PawBenchBackend()
    return ManualBackend()
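The selection in get_backend() reduces to two probes: a CLI binary on PATH, then an importable module. A self-contained sketch of the same check (the fallback names here are deliberately unresolvable, mirroring the path that selects ManualBackend):

```python
import importlib.util
import shutil


def harness_available(cli_name: str, module_name: str) -> bool:
    """True if a CLI binary is on PATH or the named module is importable."""
    if shutil.which(cli_name):
        return True
    # find_spec probes importability without actually importing the module
    return importlib.util.find_spec(module_name) is not None


# If the pawbench probe fails, fall through to manual entry -- the same
# decision get_backend() makes between PawBenchBackend and ManualBackend.
backend = "pawbench" if harness_available("pawbench", "pawbench") else "manual"
print(backend)
```

Using `importlib.util.find_spec` instead of a bare `import` keeps the probe side-effect free, though the commit's `is_available()` uses `importlib.import_module` with an `ImportError` guard, which behaves equivalently for this check.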
