Add benchflow evaluation framework #2139
End-to-end demo that the new framework key threads through every consumer surface.
1. Built artefacts carry the entry (see the smoke-test sketch below)
2. Downstream
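To make step 1 concrete, here is a minimal smoke-test sketch. It assumes the built package is published as `@huggingface/tasks` and re-exports `EVALUATION_FRAMEWORKS` from its entry point; both are assumptions, not confirmed by this thread.

```ts
// Hypothetical smoke test for step 1: the built artefact carries the entry.
// ASSUMPTION: packages/tasks builds/publishes as "@huggingface/tasks" and
// exports EVALUATION_FRAMEWORKS; adjust the import to the real entry point.
import { EVALUATION_FRAMEWORKS } from "@huggingface/tasks";

if (!("benchflow" in EVALUATION_FRAMEWORKS)) {
	throw new Error("built artefact is missing the benchflow entry");
}
console.log("benchflow entry:", EVALUATION_FRAMEWORKS.benchflow);
```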
Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
Summary
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` constant in `packages/tasks/src/eval.ts`. BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.
This entry follows the same shape as the recent `claw-eval` (#2129), `parsebench` (#2096), and `exgentic` (#2079) additions: 6 lines added, no behaviour or data-flow changes; it just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.
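For reference, a sketch of what the ~6-line entry could look like, assuming `EVALUATION_FRAMEWORKS` is a keyed record and using the name/description/url fields called out in the automated review below. The field values, including the URL guessed from the cited arXiv ID, are illustrative and not copied from the diff.

```ts
// packages/tasks/src/eval.ts (sketch; existing entries elided).
// Values are illustrative; the real diff may differ.
export const EVALUATION_FRAMEWORKS = {
	// ...existing entries such as "claw-eval", "parsebench", "exgentic"...
	benchflow: {
		name: "BenchFlow",
		description:
			"Containerized agent trials with paired with-skills / without-skills runs; backs SkillsBench.",
		url: "https://arxiv.org/abs/2602.12670", // guessed from the cited arXiv ID
	},
} as const;
```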
Traction
SkillsBench is an active benchmark with material adoption.
Tested locally
Followup
After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).
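For illustration, a hedged sketch of the consumer side this unblocks: an `eval.yaml` names the framework by key, and a lookup resolves it against the allowlist. The parsed-config shape and the `resolveFramework` helper are hypothetical, not an existing API.

```ts
// ASSUMPTION: the registry is importable as below; the helper is hypothetical.
import { EVALUATION_FRAMEWORKS } from "@huggingface/tasks";

type FrameworkMeta = { name: string; description: string; url: string };

// What benchflow/skillsbench's eval.yaml might declare once published:
//   framework: benchflow
const parsed = { framework: "benchflow" };

function resolveFramework(key: string): FrameworkMeta {
	// Widen the const registry so unknown keys resolve to undefined.
	const registry = EVALUATION_FRAMEWORKS as Record<string, FrameworkMeta>;
	const meta = registry[key];
	if (!meta) {
		throw new Error(`eval.yaml references unknown evaluation framework: ${key}`);
	}
	return meta;
}

const meta = resolveFramework(parsed.framework);
console.log(`${meta.name}: ${meta.url}`);
```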
Note
Low risk: this only adds a new entry to the `EVALUATION_FRAMEWORKS` allowlist with metadata (name/description/url) and does not change evaluation logic or data flow.
Overview
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` registry in `packages/tasks/src/eval.ts`, enabling `eval.yaml` benchmark configs to reference BenchFlow by key and surface its metadata (name/description/url).
Reviewed by Cursor Bugbot for commit 578f61b.