feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/#237
feat: add ProgramBench + Harvey LAB integrations, remove TB2, migrate .ref/ → benchmarks/#237devin-ai-integration[bot] wants to merge 37 commits intomainfrom
Conversation
Generate BenchFlow task directories from ProgramBench's 200 program- reconstruction instances. Each task gives an agent a compiled binary and its documentation; the agent must re-implement the program from scratch. Files: - benchmarks/programbench/generate.py — reads ProgramBench task.yaml + tests.json, emits task.toml / instruction.md / Dockerfile / test.sh / verify.py per instance - benchmarks/programbench/main.py — CLI entry point for generation - benchmarks/run_programbench.py — Job runner (mirrors run_skillsbench.py) - benchmarks/programbench-gemini-flash-lite.yaml — default config - src/benchflow/task_download.py — extended to support generated benchmarks; clones ProgramBench upstream, runs the generator, caches under .ref/
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
- _ensure_generated now generates into a staging directory and renames atomically on success, preventing partial cache on failure - verify.py wraps tar extraction in try/except so a corrupt archive for one branch doesn't crash the entire verifier - Fix ruff format on task_download.py
Validates the full pipeline end-to-end: Docker image build, Gemini API query, compilation, and verifier execution. Uses a single-shot prompt (not multi-turn agent), so 0% scores are expected on these hard tasks.
ProgramBench cleanroom images don't need 4 CPUs — reducing to 2 makes the benchmark runnable on smaller machines.
…ch/tasks/ Generated tasks now live under benchmarks/ instead of .ref/ per project convention. Added benchmarks/programbench/tasks/ to .gitignore since these are generated at runtime.
Use fallback pattern (try without --break-system-packages first, then with) so the install works on both old and new pip.
Wraps the test run subprocess in try/except so a hanging test branch doesn't crash the verifier and lose results from completed branches.
…nt.json - Add [task] name field to generated task.toml (programbench/<instance_id>) - Add adapter_metadata.json with structured benchmark metadata - Add parity_experiment.json with results from 10 diverse tasks - 8/10 exact test count match, 2 minor variance (<0.5%) - Covers C, Rust, Go, C++, Java across easy/medium difficulties
The oracle checks out the original source code at the specified commit from the upstream repo — this is the gold answer for ProgramBench tasks. Each task now generates a solution/ directory with solve.sh.
Detailed tables covering directory structure, evaluation pipeline, field mappings, and what changes vs stays the same.
Add BenchFlow adapter for Harvey LAB — 1,251 legal tasks across 24 practice areas (M&A, insurance, IP, tax, real estate, etc.). - benchflow.py: Translates Harvey LAB task.json → BenchFlow task format (task.toml, instruction.md, Dockerfile, LLM-as-judge verifier) - evaluate.py: Gemini 3.1 Flash Lite judge grades deliverables against rubric criteria (PASS/FAIL per criterion, partial credit reward) - parity_test.py: Structural + eval parity tests - Structural: 1251/1251 tasks pass (all files, metadata, criteria match) - Eval: 5/5 tasks pass (Gemini judge pipeline works end-to-end) - run_harvey_lab.py + YAML config for running benchmarks - Register harvey-lab in task_download.py for auto-download
The runner now: 1. Downloads raw Harvey LAB data via ensure_tasks() 2. Runs benchflow.py adapter to convert task.json → task.toml format 3. Writes converted tasks to .ref/harvey-lab-benchflow/ 4. YAML config updated to use tasks_dir pointing to converted output
- Fix raw_dir.parent.parent → raw_dir.parent in run_harvey_lab.py (ensure_tasks returns .ref/harvey-lab/tasks, so one parent up is .ref/harvey-lab which is the correct harvey-root) - Replace str.format() with str.replace() in evaluate.py's judge prompt to prevent crashes when agent output or criteria contain curly braces (common in legal documents)
- Replace sequential .replace() chain with string.Template.safe_substitute() in both benchflow.py (generated evaluate.py) and parity_test.py - Prevents agent output containing literal placeholder strings from corrupting later substitutions - Add side-by-side parity test mode (Harbor Step 5): runs original Harvey LAB prompt template vs adapted BenchFlow prompt through the same Gemini judge on identical agent output - Results: 25/25 criteria agree (100% agreement rate) across 5 tasks - Add parity_experiment.json with detailed per-criterion results - Add adapter_metadata.json with benchmark metadata
- Remove all Harbor mentions from parity_test.py - Rewrite README with BenchFlow-native adapter convention table - Add step-by-step parity results table (all 9 steps documented) - Add side-by-side parity breakdown by practice area - Document BenchFlow adapter file structure
…luate.py The parity test's _ADAPTED_PROMPT had 8-space indentation that didn't match the actual generated evaluate.py (which goes through textwrap.dedent). Fixed to use no extra indentation. Re-ran side-by-side parity: still 25/25 (100% agreement).
- Rename adapter_metadata.json → benchmark_metadata.json - Replace 'adapter' with 'converter' in code/docs/comments - Update README title from 'Harvey LAB Adapter' to 'Harvey LAB' - Rename _run_adapter → _run_converter, _ADAPTER → _CONVERTER - Section renamed: 'Adapter Structure' → 'Directory Structure' - Convention renamed: 'BenchFlow Adapter Convention' → 'BenchFlow Benchmark Convention'
- Dockerfile now uses :task (not :task_cleanroom) matching ProgramBench eval's environment, with workspace reset to cleanroom state. - Anti-cheat hash check now runs BEFORE compile (matching ProgramBench eval order), preventing false positives on legitimately rebuilt executables. - Updated README comparison tables to reflect image change.
…ata.json Introduce benchmark.yaml as the standard benchmark descriptor for BenchFlow benchmarks. This replaces benchmark_metadata.json with a structured YAML format covering: - name, description, url, author - tasks (count, categories, tags) - conversion (script, source format, oracle solutions) - verification (method, judge model, reward type) - parity (structural, eval pipeline, side-by-side results) Job configs (how to run) remain in separate YAML files.
Shallow clone with --depth 1 always fetches HEAD, so the fallback block that checks out the specific commit never ran. Now always does full clone followed by git checkout at the task's commit.
| pb_tasks = clone_dir / "src" / "programbench" / "data" / "tasks" | ||
| if not pb_tasks.is_dir(): | ||
| raise FileNotFoundError( | ||
| f"ProgramBench tasks directory not found at {pb_tasks}" | ||
| ) | ||
|
|
||
| import importlib | ||
| import sys | ||
|
|
||
| gen_path = root / "benchmarks" / "programbench" | ||
| if str(gen_path.parent) not in sys.path: | ||
| sys.path.insert(0, str(gen_path.parent)) | ||
| generate = importlib.import_module("programbench.benchflow") |
There was a problem hiding this comment.
🟡 _ensure_generated is hardcoded for ProgramBench, breaking the _GENERATED_BENCHMARKS registry pattern
The _ensure_generated function accepts a benchmark parameter but ignores it for path construction and module import. Line 114 hardcodes clone_dir / "src" / "programbench" / "data" / "tasks" and line 126 hardcodes importlib.import_module("programbench.benchflow"). This means the _GENERATED_BENCHMARKS dict at src/benchflow/task_download.py:26-31 appears extensible but adding a second entry would silently use ProgramBench's directory layout and generator, causing a FileNotFoundError or wrong behavior.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Valid observation. This is intentionally hardcoded for now since programbench is the only generated benchmark. If/when a second generated benchmark is added, these paths should be parameterized via the _GENERATED_BENCHMARKS dict (e.g. adding "tasks_path" and "module" keys). Left as-is to avoid premature abstraction.
There was a problem hiding this comment.
Already acknowledged in the previous review cycle — this is intentionally hardcoded since programbench is the only generated benchmark. The learnings analysis (shared with the user) flagged this as a key divergence: _ensure_generated() should be generalized to read conversion config from benchmark.yaml when a second generator is added.
Agent parity results (same submission through both pipelines): - cmatrix (C, Gemini): PB 289/769 vs BF 289/769 — exact match - zoxide (Rust, oracle): PB 577/577 vs BF 577/577 — exact match - shellharden (Rust, oracle): PB 1291/1292 vs BF 1291/1292 — exact match - ditaa (Java, oracle): PB 6/681 vs BF 6/681 — exact match - chroma (Go, oracle): PB 0/531 vs BF 7/531 — minor variance (1.3%)
Covers TB2, ProgramBench, and generic benchmark workflow: task generation, CLI single/batch runs, YAML config, Python API, oracle verification, and adding new benchmarks. Dogfooded by running each command to verify correctness.
Documents the 9-step process for converting any new benchmark into BenchFlow format. Covers converter, parity testing, metadata, and publishing workflow.
…comparison table
- Convert _ORIGINAL_PROMPT from str.format() to string.Template.safe_substitute()
to prevent crashes when legal text contains { or } characters
- Enrich parity_experiment.json with reproducibility metadata: benchmark_name,
date, PR links, original_benchmark_repo, metrics array
- Add 'Comparison with Original Benchmark' table to README (matching Harbor
adapter README convention)
Port Harvey LAB's native harness (agent loop + 6 tools) as a BenchFlow agent via an ACP shim. The shim: - Speaks ACP on stdio, runs Harvey LAB's agent loop in-process - Uses DirectSandbox (filesystem-backed) instead of Podman, since BenchFlow's Docker container already provides sandboxing - Monkey-patches sandbox module so Harvey LAB's tools.py works unchanged - Emits ACP session/update notifications for full trajectory capture New files: - src/benchflow/agents/harvey_lab_acp_shim.py — ACP shim - benchmarks/harvey-lab/harvey-lab-harness-parity.yaml — parity config Registered as 'harvey-lab-harness' agent (alias: 'harvey-lab') in agents/registry.py. Enables true apples-to-apples parity testing: same agent logic + same model on both original and converted tasks.
The Dockerfile copies documents to /app/documents/, not /app/environment/documents/. Check /app/documents/ first.
- Delete run_tb2.py and all tb2-*.yaml config files - Remove terminal-bench-2 from TASK_REPOS in task_download.py - Change cloned benchmark path from .ref/ to benchmarks/ - Update .gitignore: remove .ref/, add benchmarks/skillsbench/ - Update docs (getting-started, running-benchmarks, README) to use benchmarks/ paths and remove TB2 references - Update Job docstring example to use benchmarks/ paths - Update test_task_download.py assertions for new paths
Harvey LAB adapters read provider-specific env vars directly (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY). auto_inherit_env propagates these into the container.
Fix skillsbench-claude-glm51.yaml pointing to stale .ref/ path. Update all docs, examples, notebooks, skills, and configs to use benchmarks/ paths. Only CHANGELOG.md retains .ref/ as historical.
Ran Harvey LAB's own harness (agent loop + 6 tools + system prompt) via DirectSandbox on 5 tasks in both original and BenchFlow-converted formats. Results (aggregate across 4 evaluated tasks): - Original: 64/261 (24.5%) - BenchFlow: 74/261 (28.4%) - Delta: +3.8% (within expected non-determinism range) Bug fix: harness read tool was failing because parse-doc command (used to parse .docx/.xlsx/.pdf inside sandbox) wasn't available outside the Podman container. DirectSandbox now requires parse-doc in PATH. Also updates parity config model to gemini-3.1-flash-lite-preview.
| def _to_host_path(self, sandbox_path: str) -> Path: | ||
| """Map a /workspace/... path to a real host path.""" | ||
| if sandbox_path.startswith(self.DOCUMENTS_PATH): | ||
| rel = sandbox_path[len(self.DOCUMENTS_PATH) :].lstrip("/") | ||
| return self.documents_dir / rel if rel else self.documents_dir | ||
| if sandbox_path.startswith(self.OUTPUT_PATH): | ||
| rel = sandbox_path[len(self.OUTPUT_PATH) :].lstrip("/") | ||
| return self.output_dir / rel if rel else self.output_dir | ||
| if sandbox_path.startswith(self.WORKSPACE_PATH): | ||
| rel = sandbox_path[len(self.WORKSPACE_PATH) :].lstrip("/") | ||
| return self.workspace_dir / rel if rel else self.workspace_dir | ||
| raise ValueError(f"Path outside sandbox: {sandbox_path}") |
There was a problem hiding this comment.
🟡 DirectSandbox._to_host_path uses startswith without path-boundary check, causing incorrect path routing
The path-routing logic in _to_host_path uses str.startswith() to match sandbox paths against prefixes like /workspace/documents and /workspace/output. Since startswith doesn't enforce a path separator boundary, a path like /workspace/documents_backup/file.txt incorrectly matches the /workspace/documents prefix (because "/workspace/documents_backup".startswith("/workspace/documents") is True). This would route the path to documents_dir / "_backup/file.txt" instead of workspace_dir / "documents_backup/file.txt". While Harvey LAB's standard workspace layout is unlikely to trigger this, the bug could surface if the agent creates files with names that happen to share these prefixes.
| def _to_host_path(self, sandbox_path: str) -> Path: | |
| """Map a /workspace/... path to a real host path.""" | |
| if sandbox_path.startswith(self.DOCUMENTS_PATH): | |
| rel = sandbox_path[len(self.DOCUMENTS_PATH) :].lstrip("/") | |
| return self.documents_dir / rel if rel else self.documents_dir | |
| if sandbox_path.startswith(self.OUTPUT_PATH): | |
| rel = sandbox_path[len(self.OUTPUT_PATH) :].lstrip("/") | |
| return self.output_dir / rel if rel else self.output_dir | |
| if sandbox_path.startswith(self.WORKSPACE_PATH): | |
| rel = sandbox_path[len(self.WORKSPACE_PATH) :].lstrip("/") | |
| return self.workspace_dir / rel if rel else self.workspace_dir | |
| raise ValueError(f"Path outside sandbox: {sandbox_path}") | |
| def _to_host_path(self, sandbox_path: str) -> Path: | |
| """Map a /workspace/... path to a real host path.""" | |
| if sandbox_path == self.DOCUMENTS_PATH or sandbox_path.startswith(self.DOCUMENTS_PATH + "/"): | |
| rel = sandbox_path[len(self.DOCUMENTS_PATH) :].lstrip("/") | |
| return self.documents_dir / rel if rel else self.documents_dir | |
| if sandbox_path == self.OUTPUT_PATH or sandbox_path.startswith(self.OUTPUT_PATH + "/"): | |
| rel = sandbox_path[len(self.OUTPUT_PATH) :].lstrip("/") | |
| return self.output_dir / rel if rel else self.output_dir | |
| if sandbox_path == self.WORKSPACE_PATH or sandbox_path.startswith(self.WORKSPACE_PATH + "/"): | |
| rel = sandbox_path[len(self.WORKSPACE_PATH) :].lstrip("/") | |
| return self.workspace_dir / rel if rel else self.workspace_dir | |
| raise ValueError(f"Path outside sandbox: {sandbox_path}") |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Good catch — the startswith without a path separator boundary is a real bug. A path like /workspace/documents_backup/file.txt would incorrectly route through documents_dir. The suggested fix (adding == path or .startswith(path + "/") checks) is correct. This code came from the Harvey LAB session (#239) — happy to apply the fix if desired.
Summary
Adds ProgramBench (200 program-reconstruction tasks) and Harvey LAB (1,251 legal tasks) as BenchFlow-compatible benchmarks. Removes TB2 and migrates all
.ref/paths tobenchmarks/for the new benchmark registry pattern. Harvey LAB was merged from #239 and harmonized with the.ref/→benchmarks/migration.ProgramBench (
benchmarks/programbench/):benchflow.py— task generator + embedded verifier (compile → anti-cheat → pytest → JUnit XML → reward)main.py— CLI (--output-dir,--limit,--overwrite,--task-ids)Harvey LAB (
benchmarks/harvey-lab/):benchflow.py— convertstask.json→ BenchFlow format (1,251 legal tasks, 24 practice areas)harvey_lab_acp_shim.py— Harvey LAB harness as ACP agent (harvey-lab-harnessin registry)TB2 removal +
.ref/migration: deleted all TB2 files, updated all code/docs/YAMLs/notebooks from.ref/tobenchmarks/.Agent parity — ProgramBench (same submission → both pipelines):
cmatrixzoxideshellhardenditaachromaAgent parity — Harvey LAB (original harness vs BenchFlow):
Review & Testing Checklist for Human
:tasknot:task_cleanroom— root cause of prior compile failures (ncurses.h missing)benchflow.pyVERIFY_PY: hash check (Step 1) before compile (Step 2)harvey_lab_acp_shim.py): reviewDirectSandboxpath mapping + monkey-patchingensure_tasks("skillsbench")— confirm clones tobenchmarks/skillsbench/tasks/(not.ref/)docs/running-benchmarks.mdend-to-end on a fresh checkoutSuggested test plan: Generate 2-3 ProgramBench tasks,
bench tasks checkeach, thenbench run <task> --agent oracle --backend docker. For Harvey LAB, runpython benchmarks/harvey-lab/run_harvey_lab.pywith aGOOGLE_API_KEY.Notes
046003a8). Auto-merge placedharvey-labin_GENERATED_BENCHMARKS; fixed toTASK_REPOS..ref/paths updated tobenchmarks/to match this PR's migration.benchmarks/programbench/tasks/,benchmarks/skillsbench/,benchmarks/harvey-lab-benchflow/).docs/running-benchmarks.mdwas dogfooded: every command tested during this session.CONVERT.md(from Harvey LAB session) provides a 9-step guide for adding new benchmarks.Link to Devin session: https://app.devin.ai/sessions/f3761955c99449d7a3e3c2380ed664da
Requested by: @xdotli