Add Onramp + ProgramBench adapter#238
Conversation
Onramp (`onramp/`) is a new home for adapters that import external benchmarks into BenchFlow's task format — they emit `task.toml` + `instruction.md` + `environment/Dockerfile` + `tests/test.sh` directories that the existing runner consumes unchanged. The first adapter, `onramp/programbench/`, covers all 200 ProgramBench instances (rebuild a program from binary + docs). For each upstream instance: - Reads `task.yaml` + `tests.json` from a clone of facebookresearch/ProgramBench. - Emits a BenchFlow task that builds on `programbench/<id>:task_cleanroom`. - Ships the upstream tests.json as a sidecar in `tests/` (some instances have hundreds of branches — the b64 string would blow past Docker's 65 535-byte ENV-line limit). - The verifier mirrors `programbench eval`'s pipeline inside one BenchFlow sandbox: snapshot /workspace, stage agent's submission, compile -> stash executable, then per-branch restore + run + parse JUnit XML. Reward formula matches `programbench info`: `passed / (junit_cases + missing_from_junit)`, with ignored branches/tests dropped from both numerator and denominator. Parity verified end-to-end on the upstream-shipped `testorg__calculator.abc1234` fixture: BenchFlow and `programbench eval` produce identical `passed/total` on the correct submission (3/6 = 0.5) and the incorrect submission (0/6 = 0.0). Real-instance smoke (`abishekvashok__cmatrix.5c082c6`): the BenchFlow verifier resolves all 14 active branches and 508 expected tests against the real cleanroom image; HuggingFace blob fetch + tarball extract + pytest run all succeed end-to-end. The full 200-task Gemini-driven parity sweep is documented in the README and parity_experiment.json — out of scope for one sandbox session because it requires multi-GB image pulls × 200, ~8 GB of HF blobs, and hours per task.
| subprocess.run(["docker", "exec", container, "mkdir", "-p", "/tests"], check=True) | ||
| subprocess.run( | ||
| ["docker", "cp", str(task_dir / "tests" / "test.sh"), f"{container}:/tests/test.sh"], | ||
| check=True, | ||
| ) | ||
| subprocess.run(["docker", "exec", container, "chmod", "+x", "/tests/test.sh"], check=True) |
There was a problem hiding this comment.
🔴 parity.py's benchflow_oracle_score fails to copy tests.json into the container, causing verifier to always report score 0
benchflow_oracle_score in parity.py only copies test.sh to /tests/ in the container (line 131) but never copies tests.json. The verifier script (test.sh) requires /tests/tests.json at line 34-38 of benchmarks/programbench/templates/test.sh — when it's missing, the script logs "FAIL: tests.json sidecar not found" and exits immediately with the pre-set reward of 0, without ever running any tests. This means the BenchFlow side of the single-submission parity check always produces {passed: 0, total: 0, score: 0.0}, making the entire comparison meaningless. The sister script parity_full.py:95 correctly copies the whole tests/ directory (str(task_dir / "tests") + "/.") which includes both test.sh and tests.json.
Was this helpful? React with 👍 or 👎 to provide feedback.
Rename per review: - onramp/programbench/adapter.py → onramp/programbench/benchflow.py - tests/onramp/test_programbench_adapter.py → test_programbench_benchflow.py - All imports and docs updated. Full-set parity sweep (`onramp/programbench/parity_full.py`): - Walks every converted instance, scores each with both pipelines using a deterministic stub `compile.sh` (`templates/stub_compile.sh`), and writes per-instance pass/total deltas to `parity_full_results.json`. - Resumable; disk-efficient (pull → run → cleanup per instance). Two bugs surfaced and fixed during the sweep: 1. `sudo`'s "unable to resolve host" warning was leaking into stdout on hosts missing the local hostname in /etc/hosts, corrupting the XML upstream eval read via `docker exec cat`. The wrapper now filters that warning off stderr; the README documents the standard /etc/hosts fix. 2. The first version imported `programbench` in-process to compute the filtered `programbench info` score. That trips `PackageNotFoundError` when the upstream checkout isn't `pip install`-ed, and leaks state between iterations. Replaced with a self-contained scorer that reads `tests.json` for active branches + ignored tests, then applies the same filter to upstream's eval.json. First real-instance parity check confirmed: abishekvashok__cmatrix.5c082c6 bf=17/508 up=17/508 ✓
Snapshot of the in-progress sweep. The denominator (test count after
filtering ignored branches/tests) matches between BenchFlow and upstream
on every instance evaluated so far — the conversion targets the same
test set. The numerator (passed count) diverges on 13/17 because the
two pipelines isolate branch runs differently:
upstream `programbench eval`: fresh container per test branch
(committed_image after compile.sh).
BF onramp test.sh: single sandbox, restored from a
/workspace snapshot between branches.
Side effects (lingering processes, /tmp state, env vars) accumulate
across branches in BF. This is a real semantic gap — not a structural
one — and is the next thing to address on this branch.
Parity sweep — first 17/200 instances
Structural parity is solid: 17/17 totals match. Both pipelines see the same active branches and the same expected test count on every instance — so the verifier wiring (Dockerfile, 4/17 also agree on numerator (passed counts equal):
The 13 numerator-divergences are real and consistent. BF tends to over-report passes (median delta +178). Root cause: BF reuses one sandbox container across all branches and only restores Fix path is per-branch isolation — either kill stray processes + clear The sweep is resumable; full 200-instance run is still ~50 hr of sequential wall time and out of scope for one sandbox session. |
Per review: "onramp" was a made-up name that didn't fit the repo. The adapter now lives where the other external-benchmark integrations live — `benchmarks/programbench/` alongside `benchmarks/followup-bench/`, `benchmarks/models-as-skills/`, and the `run_skillsbench.py` / `run_tb2.py` runners. Layout now matches benchflow conventions: - `benchmarks/programbench/` — adapter package - `benchmarks/run_programbench.py` — runner (mirrors run_skillsbench.py) - `benchmarks/run_programbench.yaml` — Job config - `tests/benchmarks/` — moved from `tests/onramp/` `benchmarks/run_programbench.py` calls a new `ensure_programbench_tasks()` that materializes `.ref/programbench-bf/` on first run via `benchmarks.programbench.main`, mirroring `ensure_tasks()` in `task_download.py`. ProgramBench tasks aren't a separate repo of pre-built BenchFlow tasks — they're generated from upstream metadata — so the runner shells out to the converter rather than git-cloning. The top-level `onramp/` directory is gone. All imports updated: - `from onramp.programbench.benchflow import ...` → `from benchmarks.programbench.benchflow import ...` - `python -m onramp.programbench.main` → `python -m benchmarks.programbench.main` 820 tests still pass, ruff + ty clean. Smoke check: `bench tasks check` passes on regenerated task directories.
Summary
onramp/— new home for adapters that import external benchmarks into BenchFlow's task format. Each adapter emitstask.toml+instruction.md+environment/Dockerfile+tests/test.shdirectories that the existing runner consumes unchanged.onramp/programbench/— first adapter, covers all 200 ProgramBench instances. Generates one BenchFlow task per upstream instance, building on the upstreamprogrambench/<id>:task_cleanroomimages and pulling per-branch test blobs from theprogrambench/ProgramBench-TestsHuggingFace dataset at verify time.programbench evalproduce identicalpassed/totalfor both correct (3/6 = 0.5) and incorrect (0/6 = 0.0) submissions.What's in the diff
onramp/__init__.py,onramp/README.mdonramp/programbench/adapter.pytask.yaml+tests.json, emits per-instance task dirsonramp/programbench/main.pypython -m onramp.programbench.main --output-dir ...onramp/programbench/parity.pyparity_experiment.jsononramp/programbench/templates/{task.toml,instruction.md,Dockerfile}.tmplonramp/programbench/templates/test.shBF_INSTANCE_IDfrom Dockerfile ENV,tests.jsonsidecar from/tests/onramp/programbench/run_programbench.yamlbenchflow.job.Jobconfig for the full converted datasetonramp/programbench/parity_experiment.jsononramp/programbench/README.mdtests/onramp/test_programbench_adapter.pybench tasks checkshapeCHANGELOG.md,pyproject.tomltests/onramp/*.pyVerifier flow (mirrors
programbench eval)/workspace(binary + docs from cleanroom image)./app/into/workspace/.compile.sh→/workspace/executable./workspace, restore executable (re-checking the hash), unpack the branch tarball, runeval/run.sh, parseeval/results.xml.reward = passed / (junit_cases + missing_from_junit)withignored: truebranches/tests excluded — matchesprogrambench info's headline score.Test plan
uv run python -m pytest tests/— 820 passed (was 808 + 18 new), 1 skipped, 1 deselected.uv run ruff check .— clean.uv run ty check src/— clean.uv run python -m onramp.programbench.main --output-dir <out>against a clone offacebookresearch/ProgramBenchgenerates 201 task directories (200 instances + 1 fixture).bench tasks check <task>passes for all 201 generated tasks.eval.jsonexactly.abishekvashok__cmatrix.5c082c6): BenchFlow verifier resolves all 14 active branches × 508 expected tests against the real cleanroom image; HF blob fetch + per-branch tarball extract + pytest invocation all succeed end-to-end.Out of scope for this PR
The full 200-task Gemini-driven parity sweep — multi-GB image pulls × 200, ~8 GB of HF blobs, hours of agent wall time per task. The README documents the path:
uv run python -m benchflow.job onramp/programbench/run_programbench.yaml \ --override environment=daytona # Then fan submissions through `programbench eval` and compare with parity.py.🤖 Generated with Claude Code