Add benchflow evaluation framework #2139
End-to-end demo that the new framework key threads through every consumer surface.
1. Built artefacts carry the entry (see the smoke-test sketch below)
2. Downstream
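To make step 1 concrete, here is a minimal smoke-test sketch. It assumes the built package is published as `@huggingface/tasks` and re-exports `EVALUATION_FRAMEWORKS` from its entry point; both are assumptions, not confirmed by this thread.

```ts
// Hypothetical smoke test for step 1: the built artefact carries the entry.
// ASSUMPTION: packages/tasks builds/publishes as "@huggingface/tasks" and
// exports EVALUATION_FRAMEWORKS; adjust the import to the real entry point.
import { EVALUATION_FRAMEWORKS } from "@huggingface/tasks";

if (!("benchflow" in EVALUATION_FRAMEWORKS)) {
	throw new Error("built artefact is missing the benchflow entry");
}
console.log("benchflow entry:", EVALUATION_FRAMEWORKS.benchflow);
```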
Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
Summary
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` constant in `packages/tasks/src/eval.ts`. BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.
This entry follows the same shape as the recent `claw-eval` (#2129), `parsebench` (#2096), and `exgentic` (#2079) additions: 6 lines added, no behaviour or data-flow changes; it just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.
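For reference, a sketch of what the ~6-line entry could look like, assuming `EVALUATION_FRAMEWORKS` is a keyed record and using the name/description/url fields called out in the automated review below. The field values, including the URL guessed from the cited arXiv ID, are illustrative and not copied from the diff.

```ts
// packages/tasks/src/eval.ts (sketch; existing entries elided).
// Values are illustrative; the real diff may differ.
export const EVALUATION_FRAMEWORKS = {
	// ...existing entries such as "claw-eval", "parsebench", "exgentic"...
	benchflow: {
		name: "BenchFlow",
		description:
			"Containerized agent trials with paired with-skills / without-skills runs; backs SkillsBench.",
		url: "https://arxiv.org/abs/2602.12670", // guessed from the cited arXiv ID
	},
} as const;
```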
Traction
SkillsBench is an active benchmark with material adoption.
Tested locally
Followup
After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).
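For illustration, a hedged sketch of the consumer side this unblocks: an `eval.yaml` names the framework by key, and a lookup resolves it against the allowlist. The parsed-config shape and the `resolveFramework` helper are hypothetical, not an existing API.

```ts
// ASSUMPTION: the registry is importable as below; the helper is hypothetical.
import { EVALUATION_FRAMEWORKS } from "@huggingface/tasks";

type FrameworkMeta = { name: string; description: string; url: string };

// What benchflow/skillsbench's eval.yaml might declare once published:
//   framework: benchflow
const parsed = { framework: "benchflow" };

function resolveFramework(key: string): FrameworkMeta {
	// Widen the const registry so unknown keys resolve to undefined.
	const registry = EVALUATION_FRAMEWORKS as Record<string, FrameworkMeta>;
	const meta = registry[key];
	if (!meta) {
		throw new Error(`eval.yaml references unknown evaluation framework: ${key}`);
	}
	return meta;
}

const meta = resolveFramework(parsed.framework);
console.log(`${meta.name}: ${meta.url}`);
```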
Note
Low risk: this only adds a new entry to the `EVALUATION_FRAMEWORKS` allowlist with metadata (name/description/url) and does not change evaluation logic or data flow.
Overview
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` registry in `packages/tasks/src/eval.ts`, enabling `eval.yaml` benchmark configs to reference BenchFlow by key and surface its metadata (name/description/url).
Reviewed by Cursor Bugbot for commit 578f61b.