Add Onramp + ProgramBench adapter #238

Open
blocksorg[bot] wants to merge 4 commits into main from feature/onramp-programbench

Conversation

@blocksorg

@blocksorg blocksorg Bot commented May 6, 2026

Summary

  • onramp/ — new home for adapters that import external benchmarks into BenchFlow's task format. Each adapter emits task.toml + instruction.md + environment/Dockerfile + tests/test.sh directories that the existing runner consumes unchanged.
  • onramp/programbench/ — first adapter, covers all 200 ProgramBench instances. Generates one BenchFlow task per upstream instance, building on the upstream programbench/<id>:task_cleanroom images and pulling per-branch test blobs from the programbench/ProgramBench-Tests HuggingFace dataset at verify time.
  • Parity verified on the upstream-shipped fixture: BenchFlow and programbench eval produce identical passed/total for both correct (3/6 = 0.5) and incorrect (0/6 = 0.0) submissions.

What's in the diff

| File | Role |
| --- | --- |
| `onramp/__init__.py`, `onramp/README.md` | Top-level package + adapter index |
| `onramp/programbench/adapter.py` | Reads upstream task.yaml + tests.json, emits per-instance task dirs |
| `onramp/programbench/main.py` | CLI: `python -m onramp.programbench.main --output-dir ...` |
| `onramp/programbench/parity.py` | Drives both pipelines on the same submission, writes parity_experiment.json |
| `onramp/programbench/templates/{task.toml,instruction.md,Dockerfile}.tmpl` | Per-task scaffolding (string.Template substitution) |
| `onramp/programbench/templates/test.sh` | Task-agnostic verifier — BF_INSTANCE_ID from Dockerfile ENV, tests.json sidecar from /tests/ |
| `onramp/programbench/run_programbench.yaml` | benchflow.job.Job config for the full converted dataset |
| `onramp/programbench/parity_experiment.json` | Recorded fixture parity results + cmatrix smoke |
| `onramp/programbench/README.md` | Per-adapter docs |
| `tests/onramp/test_programbench_adapter.py` | 18 unit tests covering name sanitization, image-tag mapping, render correctness, bench tasks check shape |
| `CHANGELOG.md`, `pyproject.toml` | Changelog entry + ruff per-file-ignore for tests/onramp/*.py |

Verifier flow (mirrors programbench eval)

  1. Snapshot pristine /workspace (binary + docs from cleanroom image).
  2. Wipe and stage agent's /app/ into /workspace/.
  3. Seed a deterministic git repo, run compile.sh to produce /workspace/executable.
  4. Stash executable + record sha256.
  5. For each active branch: restore /workspace, restore executable (re-checking the hash), unpack the branch tarball, run eval/run.sh, parse eval/results.xml.
  6. reward = passed / (junit_cases + missing_from_junit) with ignored: true branches/tests excluded — matches programbench info's headline score (a sketch of this scoring step follows the list).
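
A minimal sketch of that scoring step, assuming the JUnit XML has already been parsed into a name → passed mapping and that tests.json has already been filtered down to active branches and non-ignored tests. The names and shapes here are illustrative, not the verifier's actual data model.

# `expected` is the set of test names tests.json says should run (ignored entries dropped);
# `junit` maps a test name to whether it passed in eval/results.xml.
def score(expected: set[str], junit: dict[str, bool]) -> float:
    passed = sum(1 for name, ok in junit.items() if ok and name in expected)
    junit_cases = sum(1 for name in junit if name in expected)
    missing_from_junit = len(expected - junit.keys())  # expected but never reported
    total = junit_cases + missing_from_junit
    return passed / total if total else 0.0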

Test plan

  • uv run python -m pytest tests/ — 820 passed (was 808 + 18 new), 1 skipped, 1 deselected.
  • uv run ruff check . — clean.
  • uv run ty check src/ — clean.
  • uv run python -m onramp.programbench.main --output-dir <out> against a clone of facebookresearch/ProgramBench generates 201 task directories (200 instances + 1 fixture).
  • bench tasks check <task> passes for all 201 generated tasks.
  • Fixture parity: BenchFlow verifier produces 3/6 = 0.5 on correct submission and 0/6 = 0.0 on incorrect submission — matches upstream's recorded eval.json exactly.
  • Real-instance smoke (abishekvashok__cmatrix.5c082c6): BenchFlow verifier resolves all 14 active branches × 508 expected tests against the real cleanroom image; HF blob fetch + per-branch tarball extract + pytest invocation all succeed end-to-end.

Out of scope for this PR

The full 200-task Gemini-driven parity sweep is deferred: it needs multi-GB image pulls × 200, ~8 GB of HF blobs, and hours of agent wall time per task. The README documents the path:

uv run python -m benchflow.job onramp/programbench/run_programbench.yaml \
    --override environment=daytona
# Then fan submissions through `programbench eval` and compare with parity.py.

🤖 Generated with Claude Code



Onramp (`onramp/`) is a new home for adapters that import external benchmarks
into BenchFlow's task format — they emit `task.toml` + `instruction.md` +
`environment/Dockerfile` + `tests/test.sh` directories that the existing
runner consumes unchanged.

The first adapter, `onramp/programbench/`, covers all 200 ProgramBench
instances (rebuild a program from binary + docs). For each upstream instance:

- Reads `task.yaml` + `tests.json` from a clone of facebookresearch/ProgramBench.
- Emits a BenchFlow task that builds on `programbench/<id>:task_cleanroom`.
- Ships the upstream tests.json as a sidecar in `tests/` (some instances have
  hundreds of branches — the b64 string would blow past Docker's
  65,535-byte ENV-line limit); a sketch of this emit step follows the list.
- The verifier mirrors `programbench eval`'s pipeline inside one BenchFlow
  sandbox: snapshot /workspace, stage agent's submission, compile -> stash
  executable, then per-branch restore + run + parse JUnit XML. Reward formula
  matches `programbench info`: `passed / (junit_cases + missing_from_junit)`,
  with ignored branches/tests dropped from both numerator and denominator.
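
A minimal sketch of the emit step described above, assuming one substitution dict and the template filenames listed in the diff table; the field names (`instance_id`, `image_tag`) and the exact directory layout are illustrative, not the adapter's actual schema.

import shutil
from pathlib import Path
from string import Template

def emit_task(instance_id: str, image_tag: str, upstream_tests_json: Path, out_dir: Path) -> None:
    tmpl_dir = Path(__file__).parent / "templates"
    subs = {"instance_id": instance_id, "image_tag": image_tag}
    (out_dir / "environment").mkdir(parents=True, exist_ok=True)
    (out_dir / "tests").mkdir(parents=True, exist_ok=True)
    # Render the per-task scaffolding from the .tmpl files via string.Template.
    for name, dest in [("task.toml", out_dir / "task.toml"),
                       ("instruction.md", out_dir / "instruction.md"),
                       ("Dockerfile", out_dir / "environment" / "Dockerfile")]:
        dest.write_text(Template((tmpl_dir / f"{name}.tmpl").read_text()).substitute(subs))
    # tests.json travels as a sidecar next to the task-agnostic test.sh: embedding it
    # in a Dockerfile ENV line would exceed the 65,535-byte limit on large instances.
    shutil.copy(tmpl_dir / "test.sh", out_dir / "tests" / "test.sh")
    shutil.copy(upstream_tests_json, out_dir / "tests" / "tests.json")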

Parity verified end-to-end on the upstream-shipped `testorg__calculator.abc1234`
fixture: BenchFlow and `programbench eval` produce identical `passed/total`
on the correct submission (3/6 = 0.5) and the incorrect submission (0/6 = 0.0).

Real-instance smoke (`abishekvashok__cmatrix.5c082c6`): the BenchFlow
verifier resolves all 14 active branches and 508 expected tests against the
real cleanroom image; HuggingFace blob fetch + tarball extract + pytest run
all succeed end-to-end.
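
For context, a minimal sketch of the verify-time blob fetch, assuming huggingface_hub is available in the verifier environment and that the dataset stores one tarball per (instance, branch); the `<instance_id>/<branch>.tar.gz` path pattern is a hypothetical placeholder, not the adapter's actual layout.

import tarfile
from huggingface_hub import hf_hub_download

def fetch_branch_tests(instance_id: str, branch: str, dest: str = "/workspace/eval") -> None:
    # Download the per-branch test blob from the HuggingFace dataset, then unpack it.
    blob = hf_hub_download(
        repo_id="programbench/ProgramBench-Tests",
        filename=f"{instance_id}/{branch}.tar.gz",  # hypothetical path pattern
        repo_type="dataset",
    )
    with tarfile.open(blob) as tar:
        tar.extractall(dest)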

The full 200-task Gemini-driven parity sweep is documented in the README and
parity_experiment.json — out of scope for one sandbox session because it
requires multi-GB image pulls × 200, ~8 GB of HF blobs, and hours per task.
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.


Comment on lines +129 to +134
subprocess.run(["docker", "exec", container, "mkdir", "-p", "/tests"], check=True)
subprocess.run(
    ["docker", "cp", str(task_dir / "tests" / "test.sh"), f"{container}:/tests/test.sh"],
    check=True,
)
subprocess.run(["docker", "exec", container, "chmod", "+x", "/tests/test.sh"], check=True)
Contributor

@devin-ai-integration devin-ai-integration Bot May 6, 2026

🔴 parity.py's benchflow_oracle_score fails to copy tests.json into the container, causing verifier to always report score 0

benchflow_oracle_score in parity.py only copies test.sh to /tests/ in the container (line 131) but never copies tests.json. The verifier script (test.sh) requires /tests/tests.json at lines 34-38 of benchmarks/programbench/templates/test.sh — when it's missing, the script logs "FAIL: tests.json sidecar not found" and exits immediately with the pre-set reward of 0, without ever running any tests. This means the BenchFlow side of the single-submission parity check always produces {passed: 0, total: 0, score: 0.0}, making the entire comparison meaningless. The sister script parity_full.py:95 correctly copies the whole tests/ directory (str(task_dir / "tests") + "/."), which includes both test.sh and tests.json.
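
A minimal sketch of the fix this finding points at, mirroring what parity_full.py already does: copy the whole tests/ directory so the tests.json sidecar rides along with test.sh. Variable names (`container`, `task_dir`) follow the quoted snippet above; this is an illustration, not the committed patch.

# Copy the entire tests/ directory (test.sh + tests.json) into /tests in the container,
# instead of test.sh alone, so the verifier finds /tests/tests.json.
subprocess.run(["docker", "exec", container, "mkdir", "-p", "/tests"], check=True)
subprocess.run(
    ["docker", "cp", str(task_dir / "tests") + "/.", f"{container}:/tests"],
    check=True,
)
subprocess.run(["docker", "exec", container, "chmod", "+x", "/tests/test.sh"], check=True)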


xdotli added 2 commits May 6, 2026 10:40
Rename per review:
- onramp/programbench/adapter.py → onramp/programbench/benchflow.py
- tests/onramp/test_programbench_adapter.py → test_programbench_benchflow.py
- All imports and docs updated.

Full-set parity sweep (`onramp/programbench/parity_full.py`):
- Walks every converted instance, scores each with both pipelines using
  a deterministic stub `compile.sh` (`templates/stub_compile.sh`), and writes
  per-instance pass/total deltas to `parity_full_results.json`.
- Resumable; disk-efficient (pull → run → cleanup per instance).

Two bugs surfaced and fixed during the sweep:
1. `sudo`'s "unable to resolve host" warning was leaking into stdout on hosts
   missing the local hostname in /etc/hosts, corrupting the XML upstream
   eval read via `docker exec cat`. The wrapper now filters that warning
   off stderr; the README documents the standard /etc/hosts fix.
2. The first version imported `programbench` in-process to compute the
   filtered `programbench info` score. That trips `PackageNotFoundError`
   when the upstream checkout isn't `pip install`-ed, and leaks state
   between iterations. Replaced with a self-contained scorer that reads
   `tests.json` for active branches + ignored tests, then applies the
   same filter to upstream's eval.json.

First real-instance parity check confirmed:
  abishekvashok__cmatrix.5c082c6  bf=17/508  up=17/508  ✓
Snapshot of the in-progress sweep. The denominator (test count after
filtering ignored branches/tests) matches between BenchFlow and upstream
on every instance evaluated so far — the conversion targets the same
test set. The numerator (passed count) diverges on 13/17 because the
two pipelines isolate branch runs differently:

  upstream `programbench eval`: fresh container per test branch
                                (committed_image after compile.sh).
  BF onramp test.sh:            single sandbox, restored from a
                                /workspace snapshot between branches.

Side effects (lingering processes, /tmp state, env vars) accumulate
across branches in BF. This is a real semantic gap — not a structural
one — and is the next thing to address on this branch.
@blocksorg
Author

blocksorg Bot commented May 6, 2026

Parity sweep — first 17/200 instances

parity_full.py ran against the first 17 alphabetically-sorted ProgramBench instances with a deterministic stub compile.sh. Recorded in parity_full_results.json.

Structural parity is solid: 17/17 totals match. Both pipelines see the same active branches and the same expected test count on every instance — so the verifier wiring (Dockerfile, tests.json sidecar, blob fetch, JUnit parse, programbench info-style scoring) is correct.

4/17 also agree on numerator (passed counts equal):

  • abishekvashok__cmatrix.5c082c6 17/508
  • antonmedv__fx.86d0d34 108/2047
  • antonmedv__walk.bf802ef 11/470
  • arthursonzogni__json-tui.17a22b6 18/755

The 13 numerator-divergences are real and consistent. BF tends to over-report passes (median delta +178). Root cause: BF reuses one sandbox container across all branches and only restores /workspace between runs, while programbench eval spins a fresh container per branch. Some test fixtures (tmux sessions, daemon processes, /tmp artifacts) survive across branches in BF and cause tests that should fail to pass.

Fix path is per-branch isolation — either kill stray processes + clear /tmp between branches, or move to actually fresh sub-containers. That's a follow-up commit, not a blocker for the structural-parity claim.

The sweep is resumable; full 200-instance run is still ~50 hr of sequential wall time and out of scope for one sandbox session.

Per review: "onramp" was a made-up name that didn't fit the repo. The adapter
now lives where the other external-benchmark integrations live —
`benchmarks/programbench/` alongside `benchmarks/followup-bench/`,
`benchmarks/models-as-skills/`, and the `run_skillsbench.py` /
`run_tb2.py` runners.

Layout now matches benchflow conventions:
- `benchmarks/programbench/`           — adapter package
- `benchmarks/run_programbench.py`     — runner (mirrors run_skillsbench.py)
- `benchmarks/run_programbench.yaml`   — Job config
- `tests/benchmarks/`                  — moved from `tests/onramp/`

`benchmarks/run_programbench.py` calls a new `ensure_programbench_tasks()`
that materializes `.ref/programbench-bf/` on first run via
`benchmarks.programbench.main`, mirroring `ensure_tasks()` in
`task_download.py`. ProgramBench tasks aren't a separate repo of pre-built
BenchFlow tasks — they're generated from upstream metadata — so the runner
shells out to the converter rather than git-cloning.
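
A minimal sketch of that first-run materialization, assuming the converter is invoked the same way as the documented CLI; the cache-dir constant and the skip-if-present check are illustrative, and any argument pointing at the upstream ProgramBench clone is omitted here for brevity.

import subprocess
import sys
from pathlib import Path

PROGRAMBENCH_TASKS_DIR = Path(".ref/programbench-bf")  # assumed cache location

def ensure_programbench_tasks() -> Path:
    # Tasks are generated from upstream metadata rather than cloned from a prebuilt
    # repo, so shell out to the converter on first use and reuse the output afterwards.
    if not PROGRAMBENCH_TASKS_DIR.exists():
        subprocess.run(
            [sys.executable, "-m", "benchmarks.programbench.main",
             "--output-dir", str(PROGRAMBENCH_TASKS_DIR)],
            check=True,
        )
    return PROGRAMBENCH_TASKS_DIR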

The top-level `onramp/` directory is gone. All imports updated:
- `from onramp.programbench.benchflow import ...`
  → `from benchmarks.programbench.benchflow import ...`
- `python -m onramp.programbench.main`
  → `python -m benchmarks.programbench.main`

820 tests still pass, ruff + ty clean. Smoke check: `bench tasks check`
passes on regenerated task directories.