Add LAB → BenchFlow adapter (benchmarks/lab/) #240

Open

blocksorg[bot] wants to merge 1 commit into main from feature/lab-adapter

Conversation

@blocksorg blocksorg Bot commented May 6, 2026

Summary

Adds a benchmark adapter under benchmarks/lab/ that translates Harvey AI's
Legal Agent Bench (LAB) — 1,251
realistic legal-work tasks with binary-document workspaces and all-pass
LLM-judged rubrics — into BenchFlow's task format, complete with a
self-contained Gemini-judged verifier and a harbor-style parity validation.

  • Translator (benchflow.py + adapter/translate.py) emits, per task,
    task.toml + instruction.md + environment/Dockerfile (with pandoc /
    pdfplumber / pandas / markitdown / python-pptx) + tests/test.sh +
    tests/rubric_judge.py (Gemini-judged, all-pass scoring matching LAB's
    evaluation/scoring.py) + tests/criteria.json (the rubric, scoped out
    of /app/) + an empty solution/solve.sh stub; the resulting layout is
    sketched just after this list.

  • Parity (scripts/run_parity.py, scripts/parity_subset.txt,
    parity_experiment.json): single-shot Gemini parity arm validates
    translation fidelity end-to-end without needing podman/Docker on the
    parity host. Sanity-arm results on 8 tasks / 520 criteria:

    | Metric                                    | Value           |
    |-------------------------------------------|-----------------|
    | All-pass reward agreement                 | 8/8 tasks       |
    | Per-criterion verdict agreement           | 510/520 (98.1%) |
    | Range overlap (harbor matching criterion) |                 |
    | Wall clock                                | 191 s           |

    The 10 disagreements are Gemini temperature-0 sampling noise on
    borderline criteria, concentrated in 1–5 flips across 3 tasks; the
    remaining 5/8 tasks agree perfectly.

  • Tests (tests/test_lab_adapter.py): 13 unit tests on id
    sanitisation, task discovery (flat + scenario-nested), generated
    layout, document copy, rubric copy, instruction preamble, embedded
    judge syntax, idempotency / --force.

  • CHANGELOG entry under `[Unreleased]`.
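
A sketch of the per-task layout the translator emits (file names taken from the bullets above and the commit message below; exact contents vary per task):

```text
<output-dir>/<task-id>/
├── task.toml
├── instruction.md
├── environment/
│   └── Dockerfile
├── tests/
│   ├── test.sh
│   ├── rubric_judge.py
│   ├── criteria.json      # the rubric, scoped out of /app/
│   └── task_desc.txt
└── solution/
    └── solve.sh           # empty stub
```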

The adapter does not mention Harbor, per request. Naming uses LAB's own
moniker (benchmarks/lab/) — consistent with benchmarks/skillsbench,
benchmarks/tb2-… etc.
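
For concreteness, the all-pass semantics that tests/rubric_judge.py mirrors from LAB's evaluation/scoring.py reduce to a few lines. This is an illustrative sketch, not the adapter's actual code; per-criterion pass/fail verdicts come back from the Gemini judge described above:

```python
def all_pass_reward(verdicts: list[str]) -> float:
    """All-pass scoring: the task earns 1.0 only if every rubric
    criterion passes; a single failed criterion zeroes the task."""
    return 1.0 if verdicts and all(v == "pass" for v in verdicts) else 0.0
```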

What's NOT in this PR (next steps)

The harbor recipe's full agentic arm (steps 2–3: one full run + three
runs on both sides with bench run and harness.run) is wired but unrun:
running it needs a Docker-permitted host and a budget of $200–$1,500 in
Gemini calls plus 25–75 hours of wall clock. The script is
parameterised; flipping `--runs 1` to `--runs 3` and pointing it at
`bench run` is the only change needed. Tracked in
parity_experiment.json[next_steps].

Test plan

  • `uv run ruff check .` clean (verified locally)
  • `uv run ty check src/` clean (verified locally)
  • `uv run pytest tests/test_lab_adapter.py` 13/13 passing (verified locally)
  • `python benchmarks/lab/benchflow.py translate --output-dir /tmp/lab-tasks --limit 3` produces 3 tasks that `bench tasks check` accepts (verified locally)
  • One-shot parity arm finishes in <5 min on the 8-task subset (verified locally: 191 s)
  • Re-run with the full Daytona/Modal Docker backend on lab.yaml once infra is available (follow-up)

🤖 Generated with Claude Code


Translates harveyai/harvey-labs (1,251 legal-work tasks, all-pass
rubric) into BenchFlow's task format and ships the verifier that runs
the rubric judge inside the BenchFlow sandbox.

  - benchflow.py CLI: list / translate / check sub-commands.
  - adapter/translate.py: per-task task.toml + instruction.md +
    environment/Dockerfile + tests/{test.sh, rubric_judge.py,
    criteria.json, task_desc.txt} + solution/solve.sh.
  - rubric_judge.py: Gemini-judged, all-pass scoring, mirrors LAB's
    evaluation/scoring.py semantics; runs concurrently via
    LAB_JUDGE_PARALLEL.
  - lab.yaml: BenchFlow run config pinned to gemini-3.1-flash-lite-preview.
  - scripts/parity_subset.txt + scripts/run_parity.py: 8-task harbor-
    style sanity arm. One-shot Gemini agent + dual scoring (LAB-native
    rubric loaded from task.json vs the embedded BenchFlow verifier).
  - parity_experiment.json: 8/8 all-pass agreement, 510/520 (98.1%)
    per-criterion verdict agreement at gemini-3.1-flash-lite-preview;
    remaining flips are temperature-0 judge non-determinism.
  - tests/test_lab_adapter.py: 13 unit tests covering id sanitisation,
    discovery, layout generation, document copy, rubric copy,
    instruction preamble, embedded-judge syntax, idempotency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 4 potential issues.

View 4 additional findings in Devin Review.

Comment on lines +224 to +242
```python
_JUDGE_SYSTEM = (
    "You are an evaluator grading legal-work agent output against a "
    "single pass/fail rubric criterion. Reply strictly with a JSON "
    'object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
)

_JUDGE_USER = """\
TASK: {task_desc}

CRITERION TITLE: {criterion_title}

PASS/FAIL CRITERIA:
{match_criteria}

AGENT OUTPUT:
{agent_output}

Decide pass or fail for this single criterion only. JSON only.
"""
```

🔴 Judge prompts differ between LAB-native and BenchFlow scoring arms despite being claimed identical

The system and user prompts used by the LLM judge differ between the two parity arms, undermining the parity validation. The _JUDGE_SYSTEM in the embedded rubric_judge.py (via _RUBRIC_JUDGE in translate.py:373-377) evaluates to "...Read the criterion carefully and answer strictly...", while run_parity.py:224-228 uses "...Reply strictly...". Similarly, the _JUDGE_USER tail differs: "Respond with JSON only." (translate.py:390) vs "JSON only." (run_parity.py:241). Both the README (benchmarks/lab/README.md:84-85: "prompt template identical across the two scoring paths") and parity_experiment.json:30 ("verbatim copies") explicitly claim the prompts are identical. Since the only variable the parity experiment aims to measure is framework wiring, differing judge prompts introduce an uncontrolled confound — the 10 verdict disagreements reported in the parity experiment may partly stem from prompt differences rather than Gemini non-determinism.

Suggested change

```diff
-_JUDGE_SYSTEM = (
-    "You are an evaluator grading legal-work agent output against a "
-    "single pass/fail rubric criterion. Reply strictly with a JSON "
-    'object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
-)
-
-_JUDGE_USER = """\
-TASK: {task_desc}
-
-CRITERION TITLE: {criterion_title}
-
-PASS/FAIL CRITERIA:
-{match_criteria}
-
-AGENT OUTPUT:
-{agent_output}
-
-Decide pass or fail for this single criterion only. JSON only.
-"""
+_JUDGE_SYSTEM = (
+    "You are an evaluator grading legal-work agent output against a single "
+    "pass/fail rubric criterion. Read the criterion carefully and answer "
+    'strictly with a JSON object: {"verdict": "pass"|"fail", "reasoning": "..."}.'
+)
+
+_JUDGE_USER = """\
+TASK: {task_desc}
+
+CRITERION TITLE: {criterion_title}
+
+PASS/FAIL CRITERIA:
+{match_criteria}
+
+AGENT OUTPUT:
+{agent_output}
+
+Decide pass or fail for this single criterion only. Respond with JSON only.\
+"""
```

Comment on lines +454 to +474
```python
fallback_sections = None  # cache full-output once

def _grade(c: dict) -> dict:
    nonlocal fallback_sections
    sections = []
    for name in c.get("deliverables", []):
        path = resolved.get(name)
        if path is None or not path.exists():
            sections.append(f"## Deliverable: {name}\\n(File not found)")
            continue
        sections.append(
            f"## Deliverable: {name}\\n{_read_file_as_text(path)}"
        )
    if not sections:
        if fallback_sections is None:
            fallback_sections = []
            for p in _list_outputs(args.output_dir):
                fallback_sections.append(
                    f"## File: {p.name}\\n{_read_file_as_text(p)}"
                )
        sections = list(fallback_sections)
```

🔴 Race condition on fallback_sections shared across threads in rubric judge

In the embedded _RUBRIC_JUDGE (translate.py:454-474), the fallback_sections variable is shared across threads via nonlocal and used inside a ThreadPoolExecutor (default 8 workers). When multiple criteria lack a deliverables field and hit the fallback path concurrently: Thread A sets fallback_sections = [] and begins appending; Thread B sees fallback_sections is not None (it's []), copies it via list(fallback_sections), gets an empty list, and proceeds to judge the criterion with "(no output)" as the agent text. This causes incorrect fail verdicts for those criteria. Since this is the production verifier code (written into every generated task's tests/rubric_judge.py), it affects real benchmark scoring.

Prompt for agents
The fallback_sections variable in _RUBRIC_JUDGE's _grade closure is shared across threads without synchronization. The fix needs to happen inside the _RUBRIC_JUDGE string constant (translate.py:232-509), which gets written to every generated task's tests/rubric_judge.py. The simplest fix is to compute the fallback_sections list BEFORE the ThreadPoolExecutor, outside the _grade function. Move the fallback computation to after the _resolve_deliverables call and before the ThreadPoolExecutor block, then reference it as a read-only variable in _grade. This eliminates the race entirely since all threads would only read a pre-built list. Be careful with the escaping since this code is inside a triple-single-quoted string.
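
A minimal sketch of that restructuring, assuming the helper names visible in the quoted snippet (`_list_outputs`, `_read_file_as_text`, `resolved`, `args.output_dir`); inside the _RUBRIC_JUDGE string constant the `\n` escapes would be doubled:

```python
# Build the fallback corpus once, before the ThreadPoolExecutor starts,
# so every worker thread only ever reads the finished list.
fallback_sections = [
    f"## File: {p.name}\n{_read_file_as_text(p)}"
    for p in _list_outputs(args.output_dir)
]

def _grade(c: dict) -> dict:
    sections = []
    for name in c.get("deliverables", []):
        path = resolved.get(name)
        if path is None or not path.exists():
            sections.append(f"## Deliverable: {name}\n(File not found)")
            continue
        sections.append(f"## Deliverable: {name}\n{_read_file_as_text(path)}")
    if not sections:
        # Read-only access: safe to share across threads without locks.
        sections = list(fallback_sections)
    ...  # judge `sections` against criterion `c` as before
```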

Comment on lines +569 to +571
```python
(run_dir / "tasks.jsonl").open("a").write(
    json.dumps(res.to_dict()) + "\n"
)
```

🟡 File handle leak: open() without close() in parity runner loop

At run_parity.py:569, (run_dir / "tasks.jsonl").open("a").write(...) creates a file object that is never assigned to a variable and never explicitly closed. This executes once per task per run (e.g. 8× per run). While CPython's reference-counting GC will close it promptly, other Python implementations (PyPy) may not, leading to accumulated open file descriptors.

Suggested change

```diff
-(run_dir / "tasks.jsonl").open("a").write(
-    json.dumps(res.to_dict()) + "\n"
-)
+with (run_dir / "tasks.jsonl").open("a") as _f:
+    _f.write(json.dumps(res.to_dict()) + "\n")
```

Comment on lines +367 to +368
```python
lab_score: float = math.nan
bench_score: float = math.nan
```

🟡 json.dumps(math.nan) produces invalid JSON in JSONL output

TaskResult defaults lab_score and bench_score to math.nan (run_parity.py:367-368). When a task errors early, these NaN defaults are serialized via json.dumps(res.to_dict()) at line 570. Python's json.dumps emits the JavaScript literal NaN, which is not valid JSON per RFC 8259. Python's own json.loads happens to accept NaN by default, but strict parsers (JavaScript's JSON.parse, jq, and most other languages' JSON libraries) will reject these lines.
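
To illustrate with standard-library behaviour (not project code):

```python
import json
import math

json.dumps({"lab_score": math.nan})
# -> '{"lab_score": NaN}'    not valid JSON per RFC 8259
json.dumps({"lab_score": math.nan}, allow_nan=False)
# -> raises ValueError instead of emitting NaN
json.dumps({"lab_score": None})
# -> '{"lab_score": null}'   valid, which is why None defaults are safer
```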

Suggested change

```diff
-lab_score: float = math.nan
-bench_score: float = math.nan
+lab_score: float | None = None
+bench_score: float | None = None
```
