Because your model deserves a benchmark with more bark than bite.
4-dimensional LLM inference benchmark.
Multi-turn, multi-agent, parallel dispatch with tool calling.
PawBench tests LLMs with realistic coding agent workloads — not synthetic single-turn completions.
It simulates what actually happens when you deploy coding agents: multi-turn conversations, parallel tool calling, mid-task steering events, and cross-agent coordination. Then it measures four dimensions: throughput, quality, efficiency, and adaptability.
Works against any OpenAI-compatible endpoint — vLLM, TGI, OpenAI, Ollama, LMStudio.
Inspired by Fabian Wesner's One-Shot Shop Challenge (announcement) — the study that showed orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model). PawBench's orchestration × complexity matrix operationalizes that finding inside a reproducible benchmark.
PawBench is inspired by Lola (@_justlolathings) — the most fashionable pup on Instagram.
The built-in scenarios revolve around building her boutique dog apparel store, PawStyle by Lola. Every product, every size guide, every "Lola's Pick" badge traces back to this style icon on four legs.
Follow Lola: instagram.com/_justlolathings
```bash
pip install pawbench
# or
uv pip install pawbench
```

```bash
# Benchmark your local vLLM
pawbench --endpoint http://localhost:8000

# Against any OpenAI-compatible endpoint
pawbench --endpoint https://api.openai.com/v1 --tag gpt4o

# Just throughput saturation (no scenarios)
pawbench --saturation-only --concurrency 1,2,4,8,16

# JSON output for CI/autoresearch
pawbench --json --output results/

# Custom scenario
pawbench --scenario my_scenario.json
```

| Dimension | Metrics |
|---|---|
| Throughput | Single-agent tok/s, parallel saturation curve (1->N), TTFT, peak concurrency |
| Quality | Tool call accuracy, instruction following, format compliance, keyword matching |
| Efficiency | Useful token ratio (code in tool args vs filler preamble), tokens per turn |
| Adaptability | Steering event response, mid-conversation context injection, nudge quality delta |
| Artifact Quality (spec 009) | Static analysis over changed files (ruff/mypy/radon for Python, generic fallback otherwise). Orthogonal to AC pass. |
| Complexity Tier (spec 009) | Per-task tagging — display / crud / transactional / cross_cutting — with stratified quality_by_tier reporting. |
| Orchestration Shape (spec 009) | Same scenario × 5 shapes (flat / waves / scatter-gather / team-mode / subagents) → orchestration_dqs_spread SLI. |
| DQS (spec 009) | Composite Dispatch Quality Score v1.0.0 with auditable weights + post-hoc ablation matrix. |
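To make the efficiency dimension concrete, here is a rough sketch of a useful-token ratio. The whitespace tokenizer and the function itself are hypothetical stand-ins, not PawBench's actual implementation:

```python
# Hypothetical sketch of a useful-token ratio: the share of generated
# tokens that land inside tool-call arguments (code) rather than
# conversational filler. Not PawBench's real metric.

def useful_token_ratio(message: dict) -> float:
    """message is an OpenAI-style assistant message with optional tool_calls."""
    def rough_tokens(text: str) -> int:
        # Crude whitespace tokenizer; a real harness would use the model tokenizer.
        return len(text.split())

    useful = sum(
        rough_tokens(call["function"]["arguments"])
        for call in message.get("tool_calls", [])
    )
    filler = rough_tokens(message.get("content") or "")
    total = useful + filler
    return useful / total if total else 0.0

msg = {
    "content": "Sure! I'll write the file now.",
    "tool_calls": [
        {"function": {"name": "write_file",
                      "arguments": '{"path": "app.py", "content": "from flask import Flask"}'}}
    ],
}
print(round(useful_token_ratio(msg), 2))
```

A model that answers with a paragraph of preamble before every tool call scores low here even when the final artifact is correct, which is exactly the behavior this dimension is meant to surface.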
```bash
pawbench --orchestration flat,waves,scatter-gather,team-mode,subagents
pawbench --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
pawbench --context-tier manifest-only
pawbench --verification-runs 2
pawbench --no-quality-analysis
```

The orchestration matrix scenario (`pawstyle-orchestration-matrix`) is designed to differentiate shapes: four independent feature blocks, one per complexity tier. Inspired by Fabian Wesner's One-Shot Shop Challenge (orchestration beats model choice).
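The `orchestration_dqs_spread` SLI can be pictured with a small sketch. This is purely illustrative: the real formula lives in spec 009, and the sketch assumes spread is simply max minus min of per-shape DQS:

```python
# Illustrative sketch: derive an orchestration_dqs_spread-style SLI from
# per-shape Dispatch Quality Scores. Spec 009 defines the real formula;
# here spread is assumed to be max(DQS) - min(DQS) across shapes.

dqs_by_shape = {
    "flat": 0.71,
    "waves": 0.78,
    "scatter-gather": 0.74,
    "team-mode": 0.85,  # example numbers, echoing the Team Mode > Sub-Agents finding
    "subagents": 0.57,
}

spread = max(dqs_by_shape.values()) - min(dqs_by_shape.values())
best = max(dqs_by_shape, key=dqs_by_shape.get)
print(f"spread={spread:.2f}, best shape={best}")
```

A large spread on the same scenario and model is the signal that orchestration shape, not model choice, dominates the outcome.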
The implementation follows an internal design doc colloquially called "spec 009" that defines the orchestration shape vocabulary, the complexity tier vocabulary, the `fixture_gap` terminal status, `verification_runs[]`, and the `artifact_quality` schema. The vocabularies are normative in Axiom §17; PawBench is the reference implementation.
Two parallel agents build Lola's boutique dog apparel e-commerce store — "Where every pup is a fashionista":
- `pawstyle-independent` — Frontend and backend work independently on Lola's shop. Pure parallel throughput + quality baseline.
- `pawstyle` — Backend gets a steering event mid-task ("frontend added a Size Guide button — implement Lola's breed-specific sizing endpoint").
- `pawstyle-nudge` — Frontend adds Lola's Favorites (wishlist) and Compare features that require backend changes. Backend receives nudges and adapts.
Each scenario is 3 turns × 2 agents, with tool calls (`write_file`, `read_file`, `run_command`) and injected tool results. Products include Lola's Signature Bandana, Cozy Knit Sweater, Rainy Day Raincoat, Adventure Booties, Dapper Bow Tie, and Walk-in-Style Harness — with "Lola's Pick" badges on her personal favorites.
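Each turn can declare an `expect` block (`tool_calls_min`, `tool_name_any`, `output_mentions`; see the custom scenario format below). A hypothetical checker, not PawBench's actual scoring, might look like:

```python
# Hypothetical sketch of checking a turn's `expect` block against a model
# response. PawBench's real scoring is richer; this only shows the shape.

def check_expect(expect: dict, tool_calls: list[dict], output_text: str) -> dict:
    results = {}
    if "tool_calls_min" in expect:
        # At least N tool calls were emitted this turn.
        results["tool_calls_min"] = len(tool_calls) >= expect["tool_calls_min"]
    if "tool_name_any" in expect:
        # At least one of the expected tools was actually called.
        names = {c["name"] for c in tool_calls}
        results["tool_name_any"] = bool(names & set(expect["tool_name_any"]))
    if "output_mentions" in expect:
        # Case-insensitive keyword matching on the assistant text.
        lower = output_text.lower()
        results["output_mentions"] = all(kw in lower for kw in expect["output_mentions"])
    return results

expect = {"tool_calls_min": 1, "tool_name_any": ["write_file"],
          "output_mentions": ["flask", "api"]}
calls = [{"name": "write_file", "arguments": "{...}"}]
print(check_expect(expect, calls, "Created a Flask API in app.py"))
```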
If the endpoint exposes /metrics (vLLM, TGI), PawBench scrapes:
- KV cache usage and prefix cache hit rate
- Speculative decoding acceptance rate
- GPU cache pressure
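A scrape of that endpoint can be sketched in a few lines. Illustrative only: the metric names below are assumptions and vary across vLLM/TGI versions, so adapt them to what your server actually exposes:

```python
# Sketch: parse a Prometheus-format /metrics payload and pick out
# cache-related gauges. The metric names are assumptions and differ
# between vLLM/TGI versions.
import re

def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines from Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blanks
        m = re.match(r'([a-zA-Z_:][\w:]*)(?:\{[^}]*\})?\s+([-\d.eE+]+)', line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage
vllm:gpu_cache_usage_perc{model="qwen3-coder"} 0.43
vllm:spec_decode_draft_acceptance_rate 0.72
"""
parsed = parse_metrics(sample)
print(parsed["vllm:gpu_cache_usage_perc"])
```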
Scenarios are JSON files:

```json
{
  "id": "my-scenario",
  "name": "My Custom Scenario",
  "agents": [
    {
      "id": "agent-1",
      "name": "My Agent",
      "turns": [
        {
          "turn": 1,
          "role": "user",
          "content": "Build a REST API with Flask...",
          "tools": ["write_file"],
          "expect": {
            "tool_calls_min": 1,
            "tool_name_any": ["write_file"],
            "output_mentions": ["flask", "api"]
          }
        }
      ]
    }
  ],
  "tools_schema": [...]
}
```

```bash
pawbench --tag baseline --output results/
# ... change model config ...
pawbench --tag eagle3 --output results/
pawbench-compare results/pawbench_baseline_*.json results/pawbench_eagle3_*.json
```

JSON results include a full model card (architecture, quantization, GPU, serving params) for reproducibility:
```json
{
  "tag": "fp8-eagle3-spec3",
  "model_card": {
    "model_name": "qwen3-coder",
    "model_config": {"architectures": ["Qwen3NextForCausalLM"], "num_experts": 512},
    "tuning": {"kv_cache_dtype": "fp8_e4m3", "speculative_config": "eagle3"},
    "gpu": {"name": "NVIDIA GB10"}
  },
  "dim1_throughput": {"avg_single_tok_s": 69.0, "raw_peak_tok_s": 469.3},
  "dim2_quality": {"avg_quality": 0.81, "tool_accuracy": 0.96},
  "saturation_curve": [{"concurrency": 1, "tok_s": 69.3}, {"concurrency": 8, "tok_s": 469.3}],
  "server_metrics": {"spec_acceptance_rate": 0.72, "gpu_prefix_cache_hit_rate": 0.92}
}
```

Karpathy's autoresearch showed that an AI agent can autonomously run ML experiments overnight — modify, train, evaluate, repeat. PawBench extends that idea to inference serving: what if an agent could autonomously tune your model config, benchmark it, and keep the best result?
The problem is that LLM serving optimization is gatekept. The best configs — speculative decoding heads, MoE kernel tuning, KV cache quantization strategies — live in private Discord channels and undocumented tribal knowledge. A team with an H100 cluster can spend weeks finding the right settings. A solo dev with a single GPU doesn't have that luxury.
PawBench is the benchmark harness for that loop. Run it, change your config, run it again, compare. The Serving Card initiative takes it further — standardizing how model serving configs are documented and shared, so the community can build on each other's work instead of rediscovering the same optimizations in isolation.
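That run-change-compare loop reduces to diffing two result JSONs. A minimal sketch, assuming the field names from the example result above:

```python
# Sketch: compare two PawBench-style result dicts and report deltas.
# Field names mirror the example result JSON; everything else is assumed.

baseline = {"tag": "baseline",
            "dim1_throughput": {"avg_single_tok_s": 69.0},
            "dim2_quality": {"avg_quality": 0.81}}
candidate = {"tag": "eagle3",
             "dim1_throughput": {"avg_single_tok_s": 112.4},
             "dim2_quality": {"avg_quality": 0.79}}

def delta(a: dict, b: dict, dim: str, key: str) -> float:
    """Signed change from result a to result b for one metric."""
    return b[dim][key] - a[dim][key]

tok_gain = delta(baseline, candidate, "dim1_throughput", "avg_single_tok_s")
q_drop = delta(baseline, candidate, "dim2_quality", "avg_quality")
print(f"{candidate['tag']} vs {baseline['tag']}: {tok_gain:+.1f} tok/s, {q_drop:+.2f} quality")
```

Reporting throughput and quality deltas side by side is the point: a config that buys tokens per second by losing tool-call accuracy is not a win.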
Democratize the configs. Benchmark everything. Share what works.
This project has been entirely vibe coded. Two humans, several AI agents, one very fashionable dog, and a mass of mass energy that mass produced some mass code. If something breaks, it was probably the cat's fault (see commit history).
MIT