Add Onramp + ProgramBench adapter #238

Open
blocksorg[bot] wants to merge 4 commits into main from feature/onramp-programbench

Conversation

@blocksorg

@blocksorg blocksorg Bot commented May 6, 2026

Summary

  • onramp/ — new home for adapters that import external benchmarks into BenchFlow's task format. Each adapter emits task.toml + instruction.md + environment/Dockerfile + tests/test.sh directories that the existing runner consumes unchanged.
  • onramp/programbench/ — first adapter, covers all 200 ProgramBench instances. Generates one BenchFlow task per upstream instance, building on the upstream programbench/<id>:task_cleanroom images and pulling per-branch test blobs from the programbench/ProgramBench-Tests HuggingFace dataset at verify time.
  • Parity verified on the upstream-shipped fixture: BenchFlow and programbench eval produce identical passed/total for both correct (3/6 = 0.5) and incorrect (0/6 = 0.0) submissions.

What's in the diff

| File | Role |
| --- | --- |
| `onramp/__init__.py`, `onramp/README.md` | Top-level package + adapter index |
| `onramp/programbench/adapter.py` | Reads upstream task.yaml + tests.json, emits per-instance task dirs |
| `onramp/programbench/main.py` | CLI: `python -m onramp.programbench.main --output-dir ...` |
| `onramp/programbench/parity.py` | Drives both pipelines on the same submission, writes parity_experiment.json |
| `onramp/programbench/templates/{task.toml,instruction.md,Dockerfile}.tmpl` | Per-task scaffolding (string.Template substitution) |
| `onramp/programbench/templates/test.sh` | Task-agnostic verifier — BF_INSTANCE_ID from Dockerfile ENV, tests.json sidecar from /tests/ |
| `onramp/programbench/run_programbench.yaml` | benchflow.job.Job config for the full converted dataset |
| `onramp/programbench/parity_experiment.json` | Recorded fixture parity results + cmatrix smoke |
| `onramp/programbench/README.md` | Per-adapter docs |
| `tests/onramp/test_programbench_adapter.py` | 18 unit tests covering name sanitization, image-tag mapping, render correctness, bench tasks check shape |
| `CHANGELOG.md`, `pyproject.toml` | Changelog entry + ruff per-file-ignore for tests/onramp/*.py |

Verifier flow (mirrors programbench eval)

  1. Snapshot pristine /workspace (binary + docs from cleanroom image).
  2. Wipe and stage agent's /app/ into /workspace/.
  3. Seed a deterministic git repo, run compile.sh to produce /workspace/executable.
  4. Stash executable + record sha256.
  5. For each active branch: restore /workspace, restore executable (re-checking the hash), unpack the branch tarball, run eval/run.sh, parse eval/results.xml.
  6. reward = passed / (junit_cases + missing_from_junit) with ignored: true branches/tests excluded — matches programbench info's headline score (a sketch of this scoring step follows the list).
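
A minimal sketch of that scoring step, assuming the JUnit XML has already been parsed into a name → passed mapping and that tests.json has already been filtered down to active branches and non-ignored tests. The names and shapes here are illustrative, not the verifier's actual data model.

# `expected` is the set of test names tests.json says should run (ignored entries dropped);
# `junit` maps a test name to whether it passed in eval/results.xml.
def score(expected: set[str], junit: dict[str, bool]) -> float:
    passed = sum(1 for name, ok in junit.items() if ok and name in expected)
    junit_cases = sum(1 for name in junit if name in expected)
    missing_from_junit = len(expected - junit.keys())  # expected but never reported
    total = junit_cases + missing_from_junit
    return passed / total if total else 0.0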

Test plan

  • uv run python -m pytest tests/ — 820 passed (was 808 + 18 new), 1 skipped, 1 deselected.
  • uv run ruff check . — clean.
  • uv run ty check src/ — clean.
  • uv run python -m onramp.programbench.main --output-dir <out> against a clone of facebookresearch/ProgramBench generates 201 task directories (200 instances + 1 fixture).
  • bench tasks check <task> passes for all 201 generated tasks.
  • Fixture parity: BenchFlow verifier produces 3/6 = 0.5 on correct submission and 0/6 = 0.0 on incorrect submission — matches upstream's recorded eval.json exactly.
  • Real-instance smoke (abishekvashok__cmatrix.5c082c6): BenchFlow verifier resolves all 14 active branches × 508 expected tests against the real cleanroom image; HF blob fetch + per-branch tarball extract + pytest invocation all succeed end-to-end.

Out of scope for this PR

The full 200-task Gemini-driven parity sweep is deferred: it needs multi-GB image pulls × 200, ~8 GB of HF blobs, and hours of agent wall time per task. The README documents the path:

uv run python -m benchflow.job onramp/programbench/run_programbench.yaml \
    --override environment=daytona
# Then fan submissions through `programbench eval` and compare with parity.py.

🤖 Generated with Claude Code



Onramp (`onramp/`) is a new home for adapters that import external benchmarks
into BenchFlow's task format — they emit `task.toml` + `instruction.md` +
`environment/Dockerfile` + `tests/test.sh` directories that the existing
runner consumes unchanged.

The first adapter, `onramp/programbench/`, covers all 200 ProgramBench
instances (rebuild a program from binary + docs). For each upstream instance:

- Reads `task.yaml` + `tests.json` from a clone of facebookresearch/ProgramBench.
- Emits a BenchFlow task that builds on `programbench/<id>:task_cleanroom`.
- Ships the upstream tests.json as a sidecar in `tests/` (some instances have
  hundreds of branches — the b64 string would blow past Docker's
  65,535-byte ENV-line limit); a sketch of this emit step follows the list.
- The verifier mirrors `programbench eval`'s pipeline inside one BenchFlow
  sandbox: snapshot /workspace, stage agent's submission, compile -> stash
  executable, then per-branch restore + run + parse JUnit XML. Reward formula
  matches `programbench info`: `passed / (junit_cases + missing_from_junit)`,
  with ignored branches/tests dropped from both numerator and denominator.
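
A minimal sketch of the emit step described above, assuming one substitution dict and the template filenames listed in the diff table; the field names (`instance_id`, `image_tag`) and the exact directory layout are illustrative, not the adapter's actual schema.

import shutil
from pathlib import Path
from string import Template

def emit_task(instance_id: str, image_tag: str, upstream_tests_json: Path, out_dir: Path) -> None:
    tmpl_dir = Path(__file__).parent / "templates"
    subs = {"instance_id": instance_id, "image_tag": image_tag}
    (out_dir / "environment").mkdir(parents=True, exist_ok=True)
    (out_dir / "tests").mkdir(parents=True, exist_ok=True)
    # Render the per-task scaffolding from the .tmpl files via string.Template.
    for name, dest in [("task.toml", out_dir / "task.toml"),
                       ("instruction.md", out_dir / "instruction.md"),
                       ("Dockerfile", out_dir / "environment" / "Dockerfile")]:
        dest.write_text(Template((tmpl_dir / f"{name}.tmpl").read_text()).substitute(subs))
    # tests.json travels as a sidecar next to the task-agnostic test.sh: embedding it
    # in a Dockerfile ENV line would exceed the 65,535-byte limit on large instances.
    shutil.copy(tmpl_dir / "test.sh", out_dir / "tests" / "test.sh")
    shutil.copy(upstream_tests_json, out_dir / "tests" / "tests.json")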

Parity verified end-to-end on the upstream-shipped `testorg__calculator.abc1234`
fixture: BenchFlow and `programbench eval` produce identical `passed/total`
on the correct submission (3/6 = 0.5) and the incorrect submission (0/6 = 0.0).

Real-instance smoke (`abishekvashok__cmatrix.5c082c6`): the BenchFlow
verifier resolves all 14 active branches and 508 expected tests against the
real cleanroom image; HuggingFace blob fetch + tarball extract + pytest run
all succeed end-to-end.
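
For context, a minimal sketch of the verify-time blob fetch, assuming huggingface_hub is available in the verifier environment and that the dataset stores one tarball per (instance, branch); the `<instance_id>/<branch>.tar.gz` path pattern is a hypothetical placeholder, not the adapter's actual layout.

import tarfile
from huggingface_hub import hf_hub_download

def fetch_branch_tests(instance_id: str, branch: str, dest: str = "/workspace/eval") -> None:
    # Download the per-branch test blob from the HuggingFace dataset, then unpack it.
    blob = hf_hub_download(
        repo_id="programbench/ProgramBench-Tests",
        filename=f"{instance_id}/{branch}.tar.gz",  # hypothetical path pattern
        repo_type="dataset",
    )
    with tarfile.open(blob) as tar:
        tar.extractall(dest)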

The full 200-task Gemini-driven parity sweep is documented in the README and
parity_experiment.json — out of scope for one sandbox session because it
requires multi-GB image pulls × 200, ~8 GB of HF blobs, and hours per task.
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.


Comment on lines +129 to +134
subprocess.run(["docker", "exec", container, "mkdir", "-p", "/tests"], check=True)
subprocess.run(
    ["docker", "cp", str(task_dir / "tests" / "test.sh"), f"{container}:/tests/test.sh"],
    check=True,
)
subprocess.run(["docker", "exec", container, "chmod", "+x", "/tests/test.sh"], check=True)
Contributor

@devin-ai-integration devin-ai-integration Bot May 6, 2026

🔴 parity.py's benchflow_oracle_score fails to copy tests.json into the container, causing verifier to always report score 0

benchflow_oracle_score in parity.py only copies test.sh to /tests/ in the container (line 131) but never copies tests.json. The verifier script (test.sh) requires /tests/tests.json at lines 34-38 of benchmarks/programbench/templates/test.sh — when it's missing, the script logs "FAIL: tests.json sidecar not found" and exits immediately with the pre-set reward of 0, without ever running any tests. This means the BenchFlow side of the single-submission parity check always produces {passed: 0, total: 0, score: 0.0}, making the entire comparison meaningless. The sister script parity_full.py:95 correctly copies the whole tests/ directory (str(task_dir / "tests") + "/."), which includes both test.sh and tests.json.
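
A minimal sketch of the fix this finding points at, mirroring what parity_full.py already does: copy the whole tests/ directory so the tests.json sidecar rides along with test.sh. Variable names (`container`, `task_dir`) follow the quoted snippet above; this is an illustration, not the committed patch.

# Copy the entire tests/ directory (test.sh + tests.json) into /tests in the container,
# instead of test.sh alone, so the verifier finds /tests/tests.json.
subprocess.run(["docker", "exec", container, "mkdir", "-p", "/tests"], check=True)
subprocess.run(
    ["docker", "cp", str(task_dir / "tests") + "/.", f"{container}:/tests"],
    check=True,
)
subprocess.run(["docker", "exec", container, "chmod", "+x", "/tests/test.sh"], check=True)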


xdotli added 2 commits May 6, 2026 10:40
Rename per review:
- onramp/programbench/adapter.py → onramp/programbench/benchflow.py
- tests/onramp/test_programbench_adapter.py → test_programbench_benchflow.py
- All imports and docs updated.

Full-set parity sweep (`onramp/programbench/parity_full.py`):
- Walks every converted instance, scores each with both pipelines using
  a deterministic stub `compile.sh` (`templates/stub_compile.sh`), and writes
  per-instance pass/total deltas to `parity_full_results.json`.
- Resumable; disk-efficient (pull → run → cleanup per instance).

Two bugs surfaced and fixed during the sweep:
1. `sudo`'s "unable to resolve host" warning was leaking into stdout on hosts
   missing the local hostname in /etc/hosts, corrupting the XML upstream
   eval read via `docker exec cat`. The wrapper now filters that warning
   off stderr; the README documents the standard /etc/hosts fix.
2. The first version imported `programbench` in-process to compute the
   filtered `programbench info` score. That trips `PackageNotFoundError`
   when the upstream checkout isn't `pip install`-ed, and leaks state
   between iterations. Replaced with a self-contained scorer that reads
   `tests.json` for active branches + ignored tests, then applies the
   same filter to upstream's eval.json.

First real-instance parity check confirmed:
  abishekvashok__cmatrix.5c082c6  bf=17/508  up=17/508  ✓
Snapshot of the in-progress sweep. The denominator (test count after
filtering ignored branches/tests) matches between BenchFlow and upstream
on every instance evaluated so far — the conversion targets the same
test set. The numerator (passed count) diverges on 13/17 because the
two pipelines isolate branch runs differently:

  upstream `programbench eval`: fresh container per test branch
                                (committed_image after compile.sh).
  BF onramp test.sh:            single sandbox, restored from a
                                /workspace snapshot between branches.

Side effects (lingering processes, /tmp state, env vars) accumulate
across branches in BF. This is a real semantic gap — not a structural
one — and is the next thing to address on this branch.
@blocksorg
Author

blocksorg Bot commented May 6, 2026

Parity sweep — first 17/200 instances

parity_full.py ran against the first 17 alphabetically-sorted ProgramBench instances with a deterministic stub compile.sh. Recorded in parity_full_results.json.

Structural parity is solid: 17/17 totals match. Both pipelines see the same active branches and the same expected test count on every instance — so the verifier wiring (Dockerfile, tests.json sidecar, blob fetch, JUnit parse, programbench info-style scoring) is correct.

4/17 also agree on numerator (passed counts equal):

  • abishekvashok__cmatrix.5c082c6 17/508
  • antonmedv__fx.86d0d34 108/2047
  • antonmedv__walk.bf802ef 11/470
  • arthursonzogni__json-tui.17a22b6 18/755

The 13 numerator-divergences are real and consistent. BF tends to over-report passes (median delta +178). Root cause: BF reuses one sandbox container across all branches and only restores /workspace between runs, while programbench eval spins a fresh container per branch. Some test fixtures (tmux sessions, daemon processes, /tmp artifacts) survive across branches in BF and cause tests that should fail to pass.

Fix path is per-branch isolation — either kill stray processes + clear /tmp between branches, or move to actually fresh sub-containers. That's a follow-up commit, not a blocker for the structural-parity claim.

The sweep is resumable; full 200-instance run is still ~50 hr of sequential wall time and out of scope for one sandbox session.

Per review: "onramp" was a made-up name that didn't fit the repo. The adapter
now lives where the other external-benchmark integrations live —
`benchmarks/programbench/` alongside `benchmarks/followup-bench/`,
`benchmarks/models-as-skills/`, and the `run_skillsbench.py` /
`run_tb2.py` runners.

Layout now matches benchflow conventions:
- `benchmarks/programbench/`           — adapter package
- `benchmarks/run_programbench.py`     — runner (mirrors run_skillsbench.py)
- `benchmarks/run_programbench.yaml`   — Job config
- `tests/benchmarks/`                  — moved from `tests/onramp/`

`benchmarks/run_programbench.py` calls a new `ensure_programbench_tasks()`
that materializes `.ref/programbench-bf/` on first run via
`benchmarks.programbench.main`, mirroring `ensure_tasks()` in
`task_download.py`. ProgramBench tasks aren't a separate repo of pre-built
BenchFlow tasks — they're generated from upstream metadata — so the runner
shells out to the converter rather than git-cloning.
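
A minimal sketch of that first-run materialization, assuming the converter is invoked the same way as the documented CLI; the cache-dir constant and the skip-if-present check are illustrative, and any argument pointing at the upstream ProgramBench clone is omitted here for brevity.

import subprocess
import sys
from pathlib import Path

PROGRAMBENCH_TASKS_DIR = Path(".ref/programbench-bf")  # assumed cache location

def ensure_programbench_tasks() -> Path:
    # Tasks are generated from upstream metadata rather than cloned from a prebuilt
    # repo, so shell out to the converter on first use and reuse the output afterwards.
    if not PROGRAMBENCH_TASKS_DIR.exists():
        subprocess.run(
            [sys.executable, "-m", "benchmarks.programbench.main",
             "--output-dir", str(PROGRAMBENCH_TASKS_DIR)],
            check=True,
        )
    return PROGRAMBENCH_TASKS_DIR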

The top-level `onramp/` directory is gone. All imports updated:
- `from onramp.programbench.benchflow import ...`
  → `from benchmarks.programbench.benchflow import ...`
- `python -m onramp.programbench.main`
  → `python -m benchmarks.programbench.main`

820 tests still pass, ruff + ty clean. Smoke check: `bench tasks check`
passes on regenerated task directories.