BenchFlow uses a resource-verb pattern: `bench <resource> <verb>`.
List all registered agents with their protocol and auth requirements.

```bash
bench agent list
```

Show details for a specific agent.

```bash
bench agent show gemini
```

Run one task directory with one agent. This is the most direct command for single-task local, Daytona, or Modal checks.
```bash
# Single task with Gemini on Daytona
bench run tasks/regex-log \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --backend daytona
```

```bash
# Single task with mounted skills and the recommended skill nudge
bench run tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --backend daytona \
  --skills-dir tasks/pdf-fix/environment/skills \
  --ae BENCHFLOW_SKILL_NUDGE=name
```

| Flag | Default | Description |
|---|---|---|
| `TASK_DIR` | — | Task directory containing `task.toml` |
| `--agent, -a` | `claude-agent-acp` | Agent name from the registry |
| `--model, -m` | Agent default | Model ID |
| `--backend, -b` | `docker` | Backend: `docker`, `daytona`, or `modal` |
| `--prompt, -p` | `instruction.md` | Prompt text; repeat for multi-turn |
| `--jobs-dir, -o` | `jobs` | Output directory |
| `--agent-env, --ae` | — | Agent environment variable as `KEY=VALUE`; repeatable |
| `--skills-dir, -s` | — | Skills directory to deploy into the sandbox |
| `--sandbox-user` | `agent` | Non-root sandbox user; pass `none` for root |
When mounting skills, the recommended default in the docs is
`--ae BENCHFLOW_SKILL_NUDGE=name`. It prepends a short hint telling the agent
which skills are available and where to read them. The more verbose modes are
`description` and `full`. Omit the variable entirely to leave the nudge off,
which is BenchFlow's runtime default.
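The `--agent-env`/`--ae` flag is repeatable and takes `KEY=VALUE` pairs. As an illustration of that format only (a generic sketch, not BenchFlow's actual parser; the `HTTP_PROXY` example is hypothetical):

```python
def parse_agent_env(pairs):
    """Collect repeated --agent-env KEY=VALUE flags into a dict.

    A later occurrence of the same KEY overrides an earlier one;
    a pair without '=' is rejected.
    """
    env = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        env[key] = value
    return env

# e.g. bench run ... --ae BENCHFLOW_SKILL_NUDGE=name --ae HTTP_PROXY=
print(parse_agent_env(["BENCHFLOW_SKILL_NUDGE=name", "HTTP_PROXY="]))
# → {'BENCHFLOW_SKILL_NUDGE': 'name', 'HTTP_PROXY': ''}
```

Note that an empty value (`HTTP_PROXY=`) is valid and sets the variable to the empty string.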
`bench eval create` creates and runs an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.
```bash
# From YAML config
bench eval create -f benchmarks/tb2-gemini-baseline.yaml
```

```bash
# Inline
bench eval create \
  -t .ref/terminal-bench-2 \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  -e daytona \
  -c 64 \
  --sandbox-setup-timeout 300
```

| Flag | Default | Description |
|---|---|---|
| `--config, -f` | — | YAML config file |
| `--tasks-dir, -t` | — | Task dir (a single task with `task.toml`, or the parent of many tasks) |
| `--agent, -a` | `claude-agent-acp` | Agent name |
| `--model, -m` | Agent default | Model ID |
| `--env, -e` | `docker` | Environment: `docker`, `daytona`, or `modal` |
| `--concurrency, -c` | `4` | Max concurrent tasks (batch mode only) |
| `--jobs-dir, -o` | `jobs` | Output directory |
| `--sandbox-user` | `agent` | Sandbox user (`null` for root) |
| `--sandbox-setup-timeout` | `120` | Timeout in seconds for sandbox user setup |
| `--skills-dir, -s` | — | Skills directory to deploy into each task sandbox |
List completed evaluations from a jobs directory.

```bash
bench eval list jobs/
```

Evaluate a skill against its `evals.json` test cases.

```bash
bench skills eval skills/my-skill/ \
  -a gemini \
  -m gemini-3.1-flash-lite-preview \
  --env daytona
```

Scaffold a new benchmark task.

```bash
bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/
```

Validate a task directory (`Dockerfile`, `instruction.md`, `tests/`).

```bash
bench tasks check tasks/my-task
bench tasks check tasks/my-task --rubric rubrics/quality.md
```

Run a reward-based training sweep.
```bash
bench train create \
  -t tasks/ \
  -a gemini \
  --sweeps 5 \
  --export ./training-data
```

Create an environment from a task directory (spins up a sandbox).

```bash
bench environment create tasks/my-task --backend daytona
```

List active Daytona sandboxes.

```bash
bench environment list
```

A YAML config for `bench eval create -f` sets the same options as the CLI flags:

```yaml
tasks_dir: .ref/terminal-bench-2
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2
```

Use the Python API for multi-scene experiments. `bench eval create -f` is for
batch job configs; scene configs are loaded with `benchflow.trial_yaml` or built
directly in Python.
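Built directly in Python, a scene config is plain nested data. A minimal sanity check of the role/turn wiring (a hypothetical helper, not part of BenchFlow's API) might look like:

```python
def validate_scenes(config):
    """Check that every turn in each scene references a declared role."""
    for scene in config.get("scenes", []):
        declared = {role["name"] for role in scene.get("roles", [])}
        for turn in scene.get("turns", []):
            if turn["role"] not in declared:
                raise ValueError(
                    f"scene {scene['name']!r}: turn role {turn['role']!r} "
                    f"not in declared roles {sorted(declared)}"
                )

# Same shape as the YAML scene config, expressed as Python dicts
config = {
    "task_dir": "tasks/my-task",
    "environment": "daytona",
    "sandbox_setup_timeout": 300,
    "scenes": [
        {
            "name": "skill-gen",
            "roles": [{"name": "creator", "agent": "gemini"}],
            "turns": [{"role": "creator", "prompt": "Analyze the task"}],
        },
        {
            "name": "solve",
            "roles": [{"name": "solver", "agent": "gemini"}],
            "turns": [{"role": "solver"}],
        },
    ],
}

validate_scenes(config)  # no error: both turns reference declared roles
```

Catching a typo in a turn's `role` this way is cheaper than discovering it after a sandbox has already been provisioned.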
```yaml
task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300
scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"
  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver
```

The legacy `benchflow` commands still work but are hidden from `--help`:
| Old command | Replacement |
|---|---|
| `benchflow run` | `bench run <task>` |
| `benchflow job` | `bench eval create -f <yaml>` |
| `benchflow agents` | `bench agent list` |
| `benchflow eval` | `bench skills eval` |
| `benchflow metrics` | `bench eval list --detail` |
| `benchflow view` | (planned: `bench trajectory show`) |
| `benchflow cleanup` | `bench environment list` + `delete` |
| `benchflow skills install` | Skills are folders, not packages |