All notable changes to the open-source CommonRouterBench Python distribution (import package `main`) are documented in this file.
- Headline scores `scores_v2` in the eval summary (`compute_v2_scores`, exported from `main.eval`), producing four orthogonal component scores plus their arithmetic mean:
  - `case_pass_rate_percent` — per-row `pred_tier_id >= gold_tier_id` over total rows.
  - `case_exact_match_percent` — per-row `pred_tier_id == gold_tier_id` over total rows.
  - `trajectory_pass_rate_percent` — case-weighted trajectory pass: a row counts toward the numerator iff its entire trajectory passes; the denominator is total rows (same scope as metric 1). Guarantees `trajectory_pass_rate <= case_pass_rate`.
  - `cost_savings_score_percent` — full-cost savings under a trajectory-level natural-accounting user-bill model. All gold tiers are included; `D_b += baseline_cost` (the total always-high bill). Numerator accumulation is trajectory-level: passed trajectories contribute `N_b += baseline_cost - pred_cost` per step; failed trajectories (any step with `error` or `pred_tier_id < gold_tier_id`) contribute `N_b -= pred_cost` per step — encoding the failed-trajectory user bill as `Σ pred_cost` (the router's wasted chain) plus `Σ baseline_cost` (one full-high re-run of the chain), without an extra retry-penalty coefficient. Macro-weighted across benchmarks by total row count.
  - `combined_score_percent` — arithmetic mean of metrics 1–4.
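The trajectory-level accounting above can be sketched in a few lines. This is a hedged illustration, not the package's actual implementation: the function name and the step field names (`baseline_cost`, `pred_cost`, `pred_tier_id`, `gold_tier_id`, `error`) are assumptions.

```python
import math

def v2_cost_savings(trajectories):
    """Illustrative sketch of the natural-accounting user-bill model.

    trajectories: list of trajectories, each a list of step dicts with
    assumed keys baseline_cost, pred_cost, pred_tier_id, gold_tier_id,
    and an optional error flag.
    """
    D = 0.0  # denominator: total always-high bill
    N = 0.0  # numerator: accumulated savings
    for traj in trajectories:
        # A trajectory passes only if every step is error-free and
        # routed at or above the gold tier.
        passed = all(
            not step.get("error") and step["pred_tier_id"] >= step["gold_tier_id"]
            for step in traj
        )
        for step in traj:
            D += step["baseline_cost"]
            if passed:
                # Realized savings vs. the always-high baseline.
                N += step["baseline_cost"] - step["pred_cost"]
            else:
                # Failed trajectory: the router's chain is wasted spend,
                # and the user re-runs the whole chain at baseline cost,
                # so net savings per step is -pred_cost.
                N -= step["pred_cost"]
    return 100.0 * N / D if D else float("nan")
```

A fully passed trajectory with `baseline_cost=10`, `pred_cost=4` per step scores 60% savings; the same trajectory failed scores -40%, since the baseline re-run cancels the denominator term and the routed spend counts against the user.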
- Per-benchmark breakdown `scores_v2.by_benchmark.<b>` with `row_count`, `step_count`, `failed_trajectory_count`, `failed_retry_baseline_usd` (Σ baseline over failed-trajectory evaluable steps, informational), `D_usd`, `N_usd`, `cost_savings_score_percent`, and `weight_in_global_cost_savings`.
- Shared helpers `_build_trajectory_status` and `_iter_trajectory_step_costs` factored out so that `compute_router_accounting_metrics` and `compute_v2_scores` walk trajectories through the same code path.
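A minimal sketch of what such a shared trajectory-status walk can look like, assuming rows carry a `trajectory_id` plus the per-step prediction fields; the real helpers are private to the package and may differ:

```python
from itertools import groupby

def trajectory_status(rows):
    """Hypothetical sketch: map trajectory_id -> passed (bool).

    A trajectory passes only if every one of its steps is error-free
    and routed at or above the gold tier. groupby requires the rows
    to be sorted by the grouping key first.
    """
    key = lambda r: r["trajectory_id"]
    return {
        traj_id: all(
            not step.get("error") and step["pred_tier_id"] >= step["gold_tier_id"]
            for step in steps
        )
        for traj_id, steps in groupby(sorted(rows, key=key), key=key)
    }
```

Factoring the walk out like this guarantees both metric families agree on which trajectories count as failed.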
- Documentation: both READMEs promote `scores_v2` as the headline; the legacy `section_11`/`router_accounting` sections are explicitly marked as retained for backward compatibility.
- PinchBench data rebuild: replaced the baseline (gpt-5.4) conversation context with validated mixed-model context from actual cascade search runs. Messages now reflect the real optimal-path model responses at each step.
- PinchBench reduced from 16 tasks / 88 rows to 12 tasks / 48 rows: removed 4 tasks with incomplete mixed-model data (task_10_workflow, task_17_email_search, task_20_eli5_pdf_summary, task_21_openclaw_comprehension).
- Corrected 3 PinchBench GT tier labels after last-step downgrade validation: task_05_summary step 4 (high→low), task_11_clawdhub step 4 (mid→low), task_12_skill_search step 6 (high→low). These final steps are text-reply summaries where low-tier models score equivalently.
- SWE-bench last-step 3-model validation: tested all 33 non-low last steps with 3 models per tier (full cascade). Downgraded 13 additional last-step GT labels (28 total counterexamples, 5 confirmed correct). New SWE-bench tier distribution: low 94, mid 33, mid_high 41, high 168.
- Total question bank: 970 rows (was 1010).
- `build_open_data.py` now reads PinchBench from `test/pinchbench/mixed_model_data/` instead of the upstream baseline-only export.
- Updated README/README.zh distribution tables to match the new counts.
- LICENSE: Apache License 2.0 full text, bundled in wheel metadata (`*.dist-info/licenses/`).
- Tier-only routing supervision corpus (`data/question_bank.jsonl`) and per-source counts (`data/manifest.json`), included in wheels when those files are present at build time.
- Import package `main`: dataset iterators, nominal RouterBench v2 §11.2-style step metrics, the `main.eval` question-bank runner (`run_question_bank_eval`, `evaluate_question_bank_rows`, `FunctionPredictor`, `LlmDigitClassifierPredictor`), and an OpenAI-compatible chat helper for digit tier classification.
- Eval summary `router_accounting`: `pass_rate_percent`, `exact_match_rate_percent`, `accounting_savings_score_percent`, `overall_score_percent` (arithmetic mean of those three; NaN if any component is NaN), plus the underlying `D_nominal_usd`/`N_mixed` (evaluable rows only).
- Console entry point `CommonRouterBench` → `main.cli:main` (also `python -m main.cli`).
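The `overall_score_percent` rule — arithmetic mean of the three components, NaN if any component is NaN — reduces to a small helper. A sketch with an assumed function name, not the package's actual code:

```python
import math

def overall_score_percent(pass_rate, exact_match, savings):
    # NaN in any component makes the overall score NaN, rather than
    # silently averaging over whichever components remain.
    parts = (pass_rate, exact_match, savings)
    if any(math.isnan(p) for p in parts):
        return float("nan")
    return sum(parts) / len(parts)
```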
- Publishing scope: `tests/` and `scripts/` are `.gitignore`d and pruned from sdists (`MANIFEST.in`). Public releases contain `main`, `data/` (when present), and docs only — no pytest suite or HTTP smoke harness in the repository or on PyPI.
- Documentation: the README describes the data distribution (per-`benchmark` rows, gold `target_tier` counts) instead of model score tables.
- PyPI / pip name is `CommonRouterBench`; Python imports use `import main` (see README).