Add LAB → BenchFlow adapter (benchmarks/lab/) #240

Open

blocksorg[bot] wants to merge 1 commit into main from feature/lab-adapter

Conversation

@blocksorg blocksorg Bot commented May 6, 2026

Summary

Adds a benchmark adapter under benchmarks/lab/ that translates Harvey AI's
Legal Agent Bench (LAB) — 1,251
realistic legal-work tasks with binary-document workspaces and all-pass
LLM-judged rubrics — into BenchFlow's task format, complete with a
self-contained Gemini-judged verifier and a harbor-style parity validation.

  • Translator (benchflow.py + adapter/translate.py) emits, per task,
    task.toml + instruction.md + environment/Dockerfile (with pandoc /
    pdfplumber / pandas / markitdown / python-pptx) + tests/test.sh +
    tests/rubric_judge.py (Gemini-judged, all-pass scoring matching LAB's
    evaluation/scoring.py) + tests/criteria.json (the rubric, scoped out
    of /app/) + an empty solution/solve.sh stub; the resulting layout is
    sketched just after this list.

  • Parity (scripts/run_parity.py, scripts/parity_subset.txt,
    parity_experiment.json): single-shot Gemini parity arm validates
    translation fidelity end-to-end without needing podman/Docker on the
    parity host. Sanity-arm results on 8 tasks / 520 criteria:

    | Metric                                    | Value           |
    |-------------------------------------------|-----------------|
    | All-pass reward agreement                 | 8/8 tasks       |
    | Per-criterion verdict agreement           | 510/520 (98.1%) |
    | Range overlap (harbor matching criterion) |                 |
    | Wall clock                                | 191 s           |

    The 10 disagreements are Gemini temperature-0 sampling noise on
    borderline criteria, concentrated in 1–5 flips across 3 tasks; the
    remaining 5/8 tasks agree perfectly.

  • Tests (tests/test_lab_adapter.py): 13 unit tests on id
    sanitisation, task discovery (flat + scenario-nested), generated
    layout, document copy, rubric copy, instruction preamble, embedded
    judge syntax, idempotency / --force.

  • CHANGELOG entry under `[Unreleased]`.
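
A sketch of the per-task layout the translator emits (file names taken from the bullets above and the commit message below; exact contents vary per task):

```text
<output-dir>/<task-id>/
├── task.toml
├── instruction.md
├── environment/
│   └── Dockerfile
├── tests/
│   ├── test.sh
│   ├── rubric_judge.py
│   ├── criteria.json      # the rubric, scoped out of /app/
│   └── task_desc.txt
└── solution/
    └── solve.sh           # empty stub
```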

The adapter does not mention Harbor, per request. Naming uses LAB's own
moniker (benchmarks/lab/) — consistent with benchmarks/skillsbench,
benchmarks/tb2-… etc.
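
For concreteness, the all-pass semantics that tests/rubric_judge.py mirrors from LAB's evaluation/scoring.py reduce to a few lines. This is an illustrative sketch, not the adapter's actual code; per-criterion pass/fail verdicts come back from the Gemini judge described above:

```python
def all_pass_reward(verdicts: list[str]) -> float:
    """All-pass scoring: the task earns 1.0 only if every rubric
    criterion passes; a single failed criterion zeroes the task."""
    return 1.0 if verdicts and all(v == "pass" for v in verdicts) else 0.0
```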

What's NOT in this PR (next steps)

The harbor recipe's full agentic arm (steps 2–3: one full run + three
runs on both sides with bench run and harness.run) is wired but unrun:
running it needs a Docker-permitted host and a budget of $200–$1,500 in
Gemini calls plus 25–75 hours of wall clock. The script is
parameterised; flipping `--runs 1` to `--runs 3` and pointing it at
`bench run` is the only change needed. Tracked in
parity_experiment.json[next_steps].

Test plan

  • `uv run ruff check .` clean (verified locally)
  • `uv run ty check src/` clean (verified locally)
  • `uv run pytest tests/test_lab_adapter.py` 13/13 passing (verified locally)
  • `python benchmarks/lab/benchflow.py translate --output-dir /tmp/lab-tasks --limit 3` produces 3 tasks that `bench tasks check` accepts (verified locally)
  • One-shot parity arm finishes in <5 min on the 8-task subset (verified locally: 191 s)
  • Re-run with the full Daytona/Modal Docker backend on lab.yaml once infra is available (follow-up)

🤖 Generated with Claude Code


Translates harveyai/harvey-labs (1,251 legal-work tasks, all-pass
rubric) into BenchFlow's task format and ships the verifier that runs
the rubric judge inside the BenchFlow sandbox.

  - benchflow.py CLI: list / translate / check sub-commands.
  - adapter/translate.py: per-task task.toml + instruction.md +
    environment/Dockerfile + tests/{test.sh, rubric_judge.py,
    criteria.json, task_desc.txt} + solution/solve.sh.
  - rubric_judge.py: Gemini-judged, all-pass scoring, mirrors LAB's
    evaluation/scoring.py semantics; runs concurrently via
    LAB_JUDGE_PARALLEL.
  - lab.yaml: BenchFlow run config pinned to gemini-3.1-flash-lite-preview.
  - scripts/parity_subset.txt + scripts/run_parity.py: 8-task harbor-
    style sanity arm. One-shot Gemini agent + dual scoring (LAB-native
    rubric loaded from task.json vs the embedded BenchFlow verifier).
  - parity_experiment.json: 8/8 all-pass agreement, 510/520 (98.1%)
    per-criterion verdict agreement at gemini-3.1-flash-lite-preview;
    remaining flips are temperature-0 judge non-determinism.
  - tests/test_lab_adapter.py: 13 unit tests covering id sanitisation,
    discovery, layout generation, document copy, rubric copy,
    instruction preamble, embedded-judge syntax, idempotency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 4 potential issues.

View 4 additional findings in Devin Review.

Comment on lines +224 to +242
```python
_JUDGE_SYSTEM = (
    "You are an evaluator grading legal-work agent output against a "
    "single pass/fail rubric criterion. Reply strictly with a JSON "
    'object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
)

_JUDGE_USER = """\
TASK: {task_desc}

CRITERION TITLE: {criterion_title}

PASS/FAIL CRITERIA:
{match_criteria}

AGENT OUTPUT:
{agent_output}

Decide pass or fail for this single criterion only. JSON only.
"""
```

🔴 Judge prompts differ between LAB-native and BenchFlow scoring arms despite being claimed identical

The system and user prompts used by the LLM judge differ between the two parity arms, undermining the parity validation. The _JUDGE_SYSTEM in the embedded rubric_judge.py (via _RUBRIC_JUDGE in translate.py:373-377) evaluates to "...Read the criterion carefully and answer strictly...", while run_parity.py:224-228 uses "...Reply strictly...". Similarly, the _JUDGE_USER tail differs: "Respond with JSON only." (translate.py:390) vs "JSON only." (run_parity.py:241). Both the README (benchmarks/lab/README.md:84-85: "prompt template identical across the two scoring paths") and parity_experiment.json:30 ("verbatim copies") explicitly claim the prompts are identical. Since the only variable the parity experiment aims to measure is framework wiring, differing judge prompts introduce an uncontrolled confound — the 10 verdict disagreements reported in the parity experiment may partly stem from prompt differences rather than Gemini non-determinism.

Suggested change

```diff
-_JUDGE_SYSTEM = (
-    "You are an evaluator grading legal-work agent output against a "
-    "single pass/fail rubric criterion. Reply strictly with a JSON "
-    'object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
-)
-
-_JUDGE_USER = """\
-TASK: {task_desc}
-
-CRITERION TITLE: {criterion_title}
-
-PASS/FAIL CRITERIA:
-{match_criteria}
-
-AGENT OUTPUT:
-{agent_output}
-
-Decide pass or fail for this single criterion only. JSON only.
-"""
+_JUDGE_SYSTEM = (
+    "You are an evaluator grading legal-work agent output against a single "
+    "pass/fail rubric criterion. Read the criterion carefully and answer "
+    'strictly with a JSON object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
+)
+
+_JUDGE_USER = """\
+TASK: {task_desc}
+
+CRITERION TITLE: {criterion_title}
+
+PASS/FAIL CRITERIA:
+{match_criteria}
+
+AGENT OUTPUT:
+{agent_output}
+
+Decide pass or fail for this single criterion only. Respond with JSON only.\
+"""
```

Comment on lines +454 to +474
```python
fallback_sections = None  # cache full-output once

def _grade(c: dict) -> dict:
    nonlocal fallback_sections
    sections = []
    for name in c.get("deliverables", []):
        path = resolved.get(name)
        if path is None or not path.exists():
            sections.append(f"## Deliverable: {name}\\n(File not found)")
            continue
        sections.append(
            f"## Deliverable: {name}\\n{_read_file_as_text(path)}"
        )
    if not sections:
        if fallback_sections is None:
            fallback_sections = []
            for p in _list_outputs(args.output_dir):
                fallback_sections.append(
                    f"## File: {p.name}\\n{_read_file_as_text(p)}"
                )
        sections = list(fallback_sections)
```

🔴 Race condition on fallback_sections shared across threads in rubric judge

In the embedded _RUBRIC_JUDGE (translate.py:454-474), the fallback_sections variable is shared across threads via nonlocal and used inside a ThreadPoolExecutor (default 8 workers). When multiple criteria lack a deliverables field and hit the fallback path concurrently: Thread A sets fallback_sections = [] and begins appending; Thread B sees fallback_sections is not None (it's []), copies it via list(fallback_sections), gets an empty list, and proceeds to judge the criterion with "(no output)" as the agent text. This causes incorrect fail verdicts for those criteria. Since this is the production verifier code (written into every generated task's tests/rubric_judge.py), it affects real benchmark scoring.

Prompt for agents
The fallback_sections variable in _RUBRIC_JUDGE's _grade closure is shared across threads without synchronization. The fix needs to happen inside the _RUBRIC_JUDGE string constant (translate.py:232-509), which gets written to every generated task's tests/rubric_judge.py. The simplest fix is to compute the fallback_sections list BEFORE the ThreadPoolExecutor, outside the _grade function. Move the fallback computation to after the _resolve_deliverables call and before the ThreadPoolExecutor block, then reference it as a read-only variable in _grade. This eliminates the race entirely since all threads would only read a pre-built list. Be careful with the escaping since this code is inside a triple-single-quoted string.
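
A minimal sketch of that restructuring, assuming the helper names visible in the quoted snippet (`_list_outputs`, `_read_file_as_text`, `resolved`, `args.output_dir`); inside the _RUBRIC_JUDGE string constant the `\n` escapes would be doubled:

```python
# Build the fallback corpus once, before the ThreadPoolExecutor starts,
# so every worker thread only ever reads the finished list.
fallback_sections = [
    f"## File: {p.name}\n{_read_file_as_text(p)}"
    for p in _list_outputs(args.output_dir)
]

def _grade(c: dict) -> dict:
    sections = []
    for name in c.get("deliverables", []):
        path = resolved.get(name)
        if path is None or not path.exists():
            sections.append(f"## Deliverable: {name}\n(File not found)")
            continue
        sections.append(f"## Deliverable: {name}\n{_read_file_as_text(path)}")
    if not sections:
        # Read-only access: safe to share across threads without locks.
        sections = list(fallback_sections)
    ...  # judge `sections` against criterion `c` as before
```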

Comment on lines +569 to +571
```python
(run_dir / "tasks.jsonl").open("a").write(
    json.dumps(res.to_dict()) + "\n"
)
```

🟡 File handle leak: open() without close() in parity runner loop

At run_parity.py:569, (run_dir / "tasks.jsonl").open("a").write(...) creates a file object that is never assigned to a variable and never explicitly closed. This executes once per task per run (e.g. 8× per run). While CPython's reference-counting GC will close it promptly, other Python implementations (PyPy) may not, leading to accumulated open file descriptors.

Suggested change

```diff
-(run_dir / "tasks.jsonl").open("a").write(
-    json.dumps(res.to_dict()) + "\n"
-)
+with (run_dir / "tasks.jsonl").open("a") as _f:
+    _f.write(json.dumps(res.to_dict()) + "\n")
```

Comment on lines +367 to +368
```python
lab_score: float = math.nan
bench_score: float = math.nan
```

🟡 json.dumps(math.nan) produces invalid JSON in JSONL output

TaskResult defaults lab_score and bench_score to math.nan (run_parity.py:367-368). When a task errors early, these NaN defaults are serialized via json.dumps(res.to_dict()) at line 570. Python's json.dumps emits the JavaScript literal NaN, which is not valid JSON per RFC 8259. Python's own json.loads happens to accept NaN by default, but strict parsers (JavaScript's JSON.parse, jq, and most other languages' JSON libraries) will reject these lines.
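
To illustrate with standard-library behaviour (not project code):

```python
import json
import math

json.dumps({"lab_score": math.nan})
# -> '{"lab_score": NaN}'    not valid JSON per RFC 8259
json.dumps({"lab_score": math.nan}, allow_nan=False)
# -> raises ValueError instead of emitting NaN
json.dumps({"lab_score": None})
# -> '{"lab_score": null}'   valid, which is why None defaults are safer
```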

Suggested change

```diff
-lab_score: float = math.nan
-bench_score: float = math.nan
+lab_score: float | None = None
+bench_score: float | None = None
```
