diff --git a/content/docs/datasets/adapters-ai.mdx b/content/docs/datasets/adapters-ai.mdx new file mode 100644 index 0000000..6f986ac --- /dev/null +++ b/content/docs/datasets/adapters-ai.mdx @@ -0,0 +1,829 @@ +--- +title: Adapters (Agent Version) +description: Comprehensive adapter specification — structured for AI agents to create adapters effectively. +--- + +import { Callout } from 'fumadocs-ui/components/callout'; + +{/* This page is optimized for AI coding agents. Every requirement is explicit, every schema is typed, and every step has validation criteria. */} + +## Adapter Specification Summary + +Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and other benchmarks are already integrated into Harbor, and our team is actively working to adapt more to the framework. To add a new benchmark or dataset, you need to create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into the Harbor format. + +We welcome the open source community to contribute adapters for new benchmarks and datasets. If you have a benchmark or a dataset of tasks that you want to adapt (e.g., to use Harbor's evaluation harness), please follow the steps below to develop your adapter and get it merged. + +<Callout> +If you are thinking about adapting your benchmark or contributing one from our [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0), please join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out to [Lin Shi](mailto:ls2282@cornell.edu) in the `#adapters-announcements` channel. We will cover API costs for running parity experiments. +</Callout> + +See [this section](#terminal-bench-migration-reference) to learn about the requirements and differences between Terminal-Bench and Harbor. 
+ + +**Repositories involved:** +- **harbor** (adapter code + registry): `https://github.com/laude-institute/harbor` +- **harbor-datasets** (generated tasks): `https://github.com/laude-institute/harbor-datasets` +- **parity-experiments** (results upload): `https://huggingface.co/datasets/harborframework/parity-experiments` + +**Reference examples:** +- Example task: `harbor/examples/tasks/hello-world/` +- Example configs: [harbor/examples/configs/](https://github.com/laude-institute/harbor/tree/main/examples/configs) +- Existing adapters: `harbor/adapters/` +- README template: [harbor/docs/adapters/templates/README.md](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) + +## Quick Start + +```bash +# List available datasets +harbor datasets list + +# Start the interactive wizard to create a new adapter +harbor adapters init + +# Initialize with specific arguments (skipping some prompts) +harbor adapters init my-adapter --name "My Benchmark" +``` + +The `harbor adapters init` command will create starter code and template files. + +## Overview: 9-Step Workflow + +| Step | What to do | Outcome | +|------|-----------|---------| +| 1. [Understand the Original Benchmark](#step-1-understand-the-original-benchmark) | Analyze original benchmark for instructions, environments, tests, solutions | Clear understanding of the four key factors | +| 2. [Fork and Develop Adapter Code](#step-2-fork-and-develop-adapter-code) | Write Python adapter code that generates Harbor-format task directories | Working adapter under `adapters/{adapter-name}` | +| 3. [Verify Oracle Solutions](#step-3-verify-oracle-solutions) | Run Harbor harness and ensure 100% oracle pass | WIP PR with screenshot of 100% pass | +| 4. [Plan Parity and Implement Agents](#step-4-plan-parity-and-implement-agents) | Discuss plans with team, implement agents as needed | Confirmed agent setup on both sides | +| 5. 
[Run Parity Experiments](#step-5-run-parity-experiments) | Run same agents/models on both original and Harbor | Comparable scores demonstrating equivalence | +| 6. [Record Parity Results](#step-6-record-parity-results) | Document results in `parity_experiment.json` | Valid JSON with all fields populated | +| 7. [Upload Parity Results](#step-7-upload-parity-results) | Upload to HuggingFace parity-experiments dataset | PR created to HuggingFace repo | +| 8. [Register the Dataset](#step-8-register-the-dataset) | Add tasks to `harbor-datasets` and `registry.json` | Oracle passes via `--registry-path` | +| 9. [Document and Submit](#step-9-document-and-submit) | Write README and `adapter_metadata.json`, submit PR | PR title changed to `[Ready for Review]` | + +Since each benchmark is unique, the exact process and special requirements may vary slightly. Please contact our team to understand the specific requirements and considerations for your benchmark. + +--- + +## File Structures + +### Generated task directory (one per task) + +Each task must produce this exact structure under `datasets/{adapter-name}/{task-id}`: + +``` +{task-id}/ +├── task.toml # REQUIRED - Task configuration and metadata +├── instruction.md # REQUIRED - Task instructions for the agent +├── environment/ +│ └── Dockerfile # REQUIRED - Container environment definition +├── solution/ +│ └── solve.sh # REQUIRED - Oracle solution script +└── tests/ + ├── test.sh # REQUIRED - Test execution script, writes reward to /logs/verifier/reward.txt + └── test_*.py # OPTIONAL - Additional pytest test files +``` + +### Adapter code directory + +Your adapter code must live at `harbor/adapters/{adapter-name}` with this structure: + +``` +harbor/adapters/{adapter-name}/ +├── adapter.py # REQUIRED - Core logic: parse benchmark, generate task dirs +├── run_adapter.py # REQUIRED - CLI entry point +├── template/ # REQUIRED - Template files used to generate tasks +│ ├── task.toml +│ ├── instruction.md +│ ├── environment/ 
+│ │ └── Dockerfile +│ ├── solution/ +│ │ └── solve.sh +│ └── tests/ +│ └── test.sh +├── parity_experiment.json # REQUIRED - Filled after parity experiments (Step 6) +├── run_{adapter-name}.yaml # REQUIRED - Reference config for reproducibility +├── adapter_metadata.json # REQUIRED - Structured metadata (Step 9) +└── README.md # REQUIRED - Final documentation (Step 9) +``` + +The `template/` directory stores the template files used to generate tasks. It is recommended to include all of the files shown in the [generated task structure](#generated-task-directory-one-per-task); the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) is a good reference. Your adapter code uses these templates to generate the actual task directories. + +The `parity_experiment.json` file stores parity experiment results (a comparison between the original benchmark and the Harbor adapter). Details in [Step 6](#step-6-record-parity-results). + +The `README.md` is the last thing you should work on before PR submission. Details in [Step 9](#step-9-document-and-submit). Follow the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). + +--- + +## Step-by-Step Workflow + +### Step 1: Understand the Original Benchmark + +Before writing any adapter code, deeply understand the original benchmark. Your goal is to identify the four key factors required by Harbor: + +| Component | What to find | Example questions | +|-----------|-------------|-------------------| +| **Instructions** | How tasks are described | What info does the agent receive? Is it in a JSON field, a separate file, or embedded in code? | +| **Environments** | Docker/system setup | What base image? What dependencies? What files are pre-loaded? | +| **Tests** | Evaluation mechanism | Unit tests? LLM-as-judge? Exit codes? Custom scoring? What metrics? Deterministic or not? | +| **Solutions** | Oracle/reference solutions | Do reference solutions exist? 
If not, can they be created with an LLM? | + +Study the original benchmark's repository, documentation, and code structure. This understanding will guide your adapter development and ensure you capture all necessary information when converting tasks to Harbor format. + +**Validation:** You can describe all four components and know where each lives in the original repo. + +--- + +### Step 2: Fork and Develop Adapter Code + +#### 2.0 Read the README template first + +The [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) serves as both the template for your final README and a requirements checklist. It spells out the requirements up front, which streamlines both development and review. Reading it first will give you a sense of what to provide and will guide your code, experiments, and documentation. + +#### 2.1 Setup + +```bash +# Fork harbor on GitHub, then clone your fork +git clone https://github.com/{your-github-username}/harbor.git +cd harbor +git checkout -b {adapter-name}-adapter +``` + +#### 2.2 Requirements for `run_adapter.py` + +| Requirement | Details | +|-------------|---------| +| Clone + cleanup | Must support temporarily cloning the source benchmark, generating tasks, then cleaning up the clone | +| Use existing clone | Must also support generating tasks from an already-cloned benchmark repo without deleting it | +| Default output | Must default to `datasets/{adapter-name}` | +| Custom output | Must accept `--output-path` CLI argument for custom output directory | +| Subset support | If benchmark is large, should accept `--split parity` to generate only the parity subset | + +#### 2.3 Requirements for `task.toml` + +```toml +version = "1.0" + +[metadata] +author_email = "example@email.com" +author_name = "Author Name" +difficulty = "hard" # optional +category = "programming" # optional +tags = ["debugging", "python"] # optional + +[agent] +timeout_sec = 3000.0 + +[verifier] +timeout_sec 
= 3000.0 +``` + +#### 2.4 Requirements for `tests/test.sh` + +The test script MUST write a numeric reward (int or float) to `/logs/verifier/reward.txt`. Harbor mounts `/logs/verifier` for test outputs. + +```bash +#!/bin/bash +# Run your tests however you need +pytest /tests/test_*.py + +# REQUIRED: write reward to this exact path +if [ $? -eq 0 ]; then + echo 1 > /logs/verifier/reward.txt +else + echo 0 > /logs/verifier/reward.txt +fi +``` + +Harbor represents metrics as rewards so that datasets can also serve RL use cases. Rewards can be floats, and aggregation of metrics across a dataset (e.g., averages or custom aggregations) will be supported in the future. This lets you keep the same metrics as the original benchmark, whatever their type, and expose them in an RL-compatible format. + +#### 2.5 Allowed adaptations + +- **Prompt tweaks are OK:** Adding things like "write files in place without asking for approval" is acceptable. Constraint: **the same tweak must apply to both the forked original benchmark repository and the Harbor adapter.** +- **Partial adaptation is OK:** You may adapt only a subset of the benchmark (e.g., only SWE-Bench-Verified) or exclude tasks for valid reasons (e.g., GPU requirements). Constraint: **must be documented in README.** + +**Validation:** Running `python run_adapter.py --output-path /tmp/test-output` produces correctly structured task directories matching the schema above. + +--- + +### Step 3: Verify Oracle Solutions + +Run the oracle agent on your entire dataset and confirm **100% reward on all tasks**. 
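For binary pass/fail benchmarks, the `test.sh` pattern from Section 2.4 is enough; for fractional metrics, `test.sh` can delegate the reward computation to a small script. Below is a minimal sketch: the helper function is hypothetical, and only the `/logs/verifier/reward.txt` path is part of the Harbor contract.

```python
from pathlib import Path

# Hypothetical helper a task's test.sh could invoke after running pytest:
# turns a pass count into a float reward, per the reward.txt contract.
def write_reward(passed: int, total: int,
                 reward_path: str = "/logs/verifier/reward.txt") -> float:
    """Write the fraction of passing tests as the task reward."""
    reward = passed / total if total else 0.0
    path = Path(reward_path)
    # Harbor mounts /logs/verifier at runtime; creating it is a no-op there.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{reward}\n")
    return reward
```

An oracle run should then yield a reward of `1.0` on every task; anything lower points to a broken solution or test.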
+ +There are several ways to run the Harbor harness on your adapter: + +**Option 1: Using individual trials (for testing single tasks)** +```bash +# Run oracle agent on a single task +uv run harbor trials start -p datasets/{adapter-name}/{task-id} + +# Run with specific agent and model +uv run harbor trials start -p datasets/{adapter-name}/{task-id} -a {agent-name} -m {model-name} +``` + +**Option 2: Using jobs with local dataset path** +```bash +# Run on entire local dataset +uv run harbor jobs start -p datasets/{adapter-name} -a {agent-name} -m {model-name} +``` + +**Option 3: Using jobs with a configuration file.** Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility. +```bash +# Create a job config YAML (see harbor/examples/configs/ for examples) +uv run harbor jobs start -c adapters/{adapter-name}/{config}.yaml -a {agent-name} -m {model-name} +``` + +**Option 4: Using the local registry after your [dataset registry PR](#step-8-register-the-dataset) gets merged.** This step is required to check the correctness of (1) your registered dataset and (2) your updated `registry.json` in the Harbor repository. If this run passes all oracle tests, then after your adapter PR gets merged, users can run evaluation with `-d` alone, without `--registry-path` (the same way as Option 5). 
+```bash +# Run from registry +uv run harbor jobs start -d {adapter-name} --registry-path registry.json -a {agent-name} -m "{model-name}" +``` + +**Option 5: Using registry dataset (after registration and all PRs merged)** +```bash +# Run from registry +uv run harbor jobs start -d {adapter-name} -a {agent-name} -m "{model-name}" +``` + +You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). **Note that the order of these options is organized differently in the final adapter README.** This is because from the user's perspective, Option 5 is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction. + +#### 3.1 Verify Oracle Solutions Pass 100% + +Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset: + +```bash +uv run harbor jobs start -p datasets/{adapter-name} +``` + +After oracle passes, create a WIP pull request: +- **Title:** `[WIP] Adapter: {adapter_name}` +- **Include:** screenshot of terminal showing 100% oracle pass results + +This WIP PR allows the team to review your adapter structure early and provide feedback before you proceed with parity experiments. + +**Validation:** All oracle solutions pass with 100% reward. WIP PR is created with screenshot. + +--- + +### Step 4: Plan Parity and Implement Agents + +After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans **before** running them. 
They will determine: +- Which agents and models to use +- How many runs are needed +- API key provisioning + +#### Agent scenario decision table + +| Condition | Action | Example | +|-----------|--------|---------| +| Original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI) | No extra work. Run parity with identical agent+model on both sides. | Benchmark has a Codex integration | +| Original benchmark is LLM-based but has no Harbor-compatible agents | 1. Fork the original benchmark repo and create a branch (e.g., `harbor-adapter`). 2. Implement Harbor-compatible agents (e.g., Codex) in the fork. 3. Document the implementation in the fork's README. | See [EvoEval parity config](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json) | +| Original benchmark uses custom agents not in Harbor | 1. Implement custom agent in Harbor at `adapters/{adapter-name}/{agent}.py` (adapter-specific, doesn't need to be installed as a general Harbor agent). 2. Run parity experiments with this custom agent. 3. Additionally run experiments with standard Harbor agents (Codex, Claude-Code) to show that "using other supported agents to run the adapter makes sense." | Benchmark has proprietary agent | + +Keep a link to any forked repositories, and document your agent implementation approach in your adapter's README. + +#### Large/expensive benchmarks: parity on subset + +If running the full benchmark is too expensive, you may run parity on a representative subset. Please discuss with the team to confirm sampling and parity plans! + +| Requirement | Details | +|-------------|---------| +| Subset selection | Document how subset was selected (random seed, stratification, etc.) 
in README | +| Adapter flag | `run_adapter.py` must support `--split parity` to generate only the subset | +| Registry version | Use `"version": "parity"` instead of `"1.0"` so users can run `-d {name}@parity` while keeping the full dataset available separately | +| Team approval | Discuss and confirm sampling plan with team before running | + +Example commands for generating the parity subset vs. the full dataset: +```bash +# Generate only the parity subset +uv run run_adapter.py --split parity --output-path /path/to/output + +# Generate the full dataset +uv run run_adapter.py --output-path /path/to/output +``` + +In your adapter's README, you must clearly: +- State how the parity subset was selected +- Explicitly indicate that parity experiments were run on a subset +- Provide instructions for users on how to use the full dataset + +**Validation:** Agent implementation plan is confirmed with the team. If forking, the fork exists with the agent implementation. + +--- + +### Step 5: Run Parity Experiments + +Once any necessary agents are implemented, run parity experiments to verify your adapter. Use the Harbor harness (see [Step 3](#step-3-verify-oracle-solutions)) with the same set of agents and models that you used (or will use) on the original benchmark side. Ensure the config and parameter settings are identical as well (e.g., the codex version). Run them multiple times on each side to compare average scores and standard deviations. 
+ +```bash +# Harbor side +uv run harbor jobs start -p datasets/{adapter-name} -a {agent} -m {model} +``` + +Requirements: +- **Identical configs** on both sides (same agent version, model version, parameters, e.g., codex version) +- **Multiple runs** per side to compute mean and standard deviation +- **Comparable scores** between original and Harbor (demonstrates equivalence of adaptation) + +The average scores across multiple runs should be **comparable to demonstrate equivalence of adaptation** (i.e., running the benchmark with Harbor is equivalent to running it with the original harness). + +**Validation:** Average scores across runs are comparable between original benchmark and Harbor adapter. + +--- + +### Step 6: Record Parity Results + +To formally store and track the performance parity between the original benchmark and your adapter, create a `parity_experiment.json` file in your adapter directory. + +#### Schema: `parity_experiment.json` + +The file is a JSON array. Each entry represents one agent+model configuration: + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `adapter_name` | string | yes | Your adapter name | +| `agent` | string | yes | Agent name with version, e.g. `"codex@0.1"` | +| `model` | string | yes | Full model name with version, e.g. `"gpt-4o-2024-05-13"` | +| `date` | string | yes | Date experiments were run | +| `adapted_benchmark_size` | int | yes | Total number of tasks converted by adapter (full set) | +| `parity_benchmark_size` | int | yes | Number of tasks used for parity (equals `adapted_benchmark_size` if full set) | +| `number_of_runs` | int | yes | Number of runs per side. Unless special case, should be identical for original and Harbor runs. | +| `notes` | string | yes | Additional explanations on special treatments, etc. Use `null` if none | +| `original_parity_repo` | string | yes | URL to forked repo for reproducing original-side experiments. 
Usually a fork of the original benchmark repo whose README includes instructions + scripts for running the parity experiments. | +| `adapter_pr` | string[] | yes | Adapter PR link(s) in the `harbor` repo. Show all PR links related to the adapter, including later fixes. | +| `dataset_pr` | string[] | yes | All PR link(s) in `harbor-datasets` repo that are registering the adapter. | +| `parity_pr` | string[] | yes | All PR link(s) to the HuggingFace parity experiment dataset | +| `metrics` | array | yes | Array of metric comparison objects (see below) | + +#### Schema: each `metrics` entry + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `benchmark_name` | string | yes | Name of the original benchmark | +| `metric` | string | yes | Metric name (e.g. `"pass@1"`, `"accuracy"`) | +| `original` | string | yes | Mean +/- std error from original benchmark | +| `harbor` | string | yes | Mean +/- std error from Harbor adapter | +| `original_runs` | number[] | yes | Individual scores from each original run | +| `harbor_runs` | number[] | yes | Individual scores from each Harbor run | + +#### Example + +```json +[ + { + "adapter_name": "my-benchmark", + "agent": "codex@0.1", + "model": "gpt-4o-2024-05-13", + "date": "2026-03-20", + "adapted_benchmark_size": 500, + "parity_benchmark_size": 100, + "number_of_runs": 3, + "notes": "Parity run on 20% stratified sample", + "original_parity_repo": "https://github.com/user/benchmark-fork", + "adapter_pr": ["https://github.com/laude-institute/harbor/pull/123"], + "dataset_pr": ["https://github.com/laude-institute/harbor-datasets/pull/45"], + "parity_pr": ["https://huggingface.co/datasets/harborframework/parity-experiments/discussions/10"], + "metrics": [ + { + "benchmark_name": "my-benchmark", + "metric": "pass@1", + "original": "0.72 +/- 0.03", + "harbor": "0.71 +/- 0.02", + "original_runs": [0.70, 0.73, 0.74], + "harbor_runs": [0.69, 0.72, 0.71] + }, + { + "benchmark_name": "my-benchmark", + 
"metric": "accuracy", + "original": "0.85 +/- 0.02", + "harbor": "0.84 +/- 0.01", + "original_runs": [0.84, 0.86, 0.85], + "harbor_runs": [0.83, 0.85, 0.84] + } + ] + } +] +``` + +#### README parity table + +Also include a summary table in your adapter's README: + +```markdown +| Agent | Model | Metric | Number of Runs | Dataset Size | Original Benchmark Performance | Harbor Adapter Performance | +|-------|-------|--------|----------------|--------------|--------------------------------|----------------------------| +| claude-code@2.1.70 | claude-4-opus | pass@1 | 3 | 100 tasks (5% of full set) | Score +/- Std | Score +/- Std | +| codex@0.100.0 | gpt-5 | accuracy | 5 | 2000 tasks (100% of full set) | Score +/- Std | Score +/- Std | +``` + +And include these links in your README: +- The link to the original benchmark's GitHub repository +- The link to the forked repo of the original benchmark (if applicable, from [Step 4](#step-4-plan-parity-and-implement-agents)) +- The link to the dataset PR (from [Step 8](#step-8-register-the-dataset)) +- The link to the parity experiment PR to the HuggingFace parity experiment dataset (from [Step 7](#step-7-upload-parity-results)) +- The link to the adapter PR + +**Validation:** `parity_experiment.json` is valid JSON matching the schema above. All fields are populated. + +--- + +### Step 7: Upload Parity Results + +After recording your parity results, upload both parity experiment results and oracle results to the [Harbor Parity Experiments HuggingFace dataset](https://huggingface.co/datasets/harborframework/parity-experiments). This allows the community to track adapter quality and helps estimate costs for each adapter on diverse agents and models. + +Follow the README instructions in the HuggingFace dataset repository to upload your results. 
+ +#### Required directory structure + +``` +adapters/ + └── {adapter_name}/ + ├── README.md # Results overview, interpretation, notes + ├── config.yaml # YAML config to reproduce parity experiments in Harbor + ├── original_parity/ # Raw results from original benchmark runs + ├── harbor_parity/ # Raw results from Harbor adapter runs + ├── oracle/ # Oracle solution results + └── results_collection/ # Copy the valid result.json files from parity to this directory + ├── result_{original/harbor}_run1.json + ├── result_{original/harbor}_run2.json + ├── ... + └── result_{original/harbor}_run{N}.json +``` + +**Validation:** PR created to HuggingFace dataset repo with the above structure. + +--- + +### Step 8: Register the Dataset + +Once your adapter generates tasks correctly and your parity experiments check out, add the generated tasks to the official [Harbor datasets repository](https://github.com/laude-institute/harbor-datasets). + +#### 8.1 Add tasks to `harbor-datasets` + +```bash +git clone https://github.com/{your-github-username}/harbor-datasets.git + +# From your harbor clone, generate tasks directly into harbor-datasets +cd harbor/adapters/{adapter-name} +uv run run_adapter.py --output-path /path/to/harbor-datasets/datasets/{adapter-name} +``` + +Create a PR to `harbor-datasets`. It's recommended to link the original benchmark's GitHub repository in your PR. Request `@Slimshilin` for review, and he will merge it so you can try `--registry-path` with the `harbor` harness. You may always submit another PR to update the dataset registry. + +#### 8.2 Add entry to `registry.json` + +In the **harbor** repo (not the datasets repo), add a new entry to the `registry.json` file in the root. **Note:** Harbor's registry format uses task-level entries with Git URLs. + +**Version naming:** Use `"version": "1.0"` by default. If the original benchmark has named versions or splits (e.g., `"verified"`, `"lite"`), follow its naming. 
If you ran parity experiments on a subset (see [Step 4](#step-4-plan-parity-and-implement-agents)), use `"version": "parity"` for the parity subset to allow users to run `-d {adapter_name}@parity` for parity reproduction. + +#### Schema: `registry.json` entry + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `name` | string | yes | Adapter name | +| `version` | string | yes | `"1.0"` by default. Use original benchmark's split names if they exist (e.g. `"verified"`, `"lite"`). Use `"parity"` if registering a parity subset. | +| `description` | string | yes | Brief description including link to original benchmark and adapter README | +| `tasks` | array | yes | One entry per task (see below) | + +#### Schema: each `tasks` entry + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `name` | string | yes | Task ID | +| `git_url` | string | yes | Always `"https://github.com/laude-institute/harbor-datasets.git"` | +| `git_commit_id` | string | yes | Use `"head"` during development, pin to specific commit hash for production (for reproducibility) | +| `path` | string | yes | Path within repo: `"datasets/{adapter-name}/{task-id}"` | + +#### Example + +```json +[ + { + "name": "my-benchmark", + "version": "1.0", + "description": "My benchmark adapter. Original: https://github.com/org/benchmark. 
More details at adapter README URL.", + "tasks": [ + { + "name": "task-001", + "git_url": "https://github.com/laude-institute/harbor-datasets.git", + "git_commit_id": "abc123def456", + "path": "datasets/my-benchmark/task-001" + }, + { + "name": "task-002", + "git_url": "https://github.com/laude-institute/harbor-datasets.git", + "git_commit_id": "abc123def456", + "path": "datasets/my-benchmark/task-002" + } + ] + } +] +``` + +#### 8.3 Verify registry configuration + +**Important:** After your dataset registry PR is merged, you must verify that your registered dataset and `registry.json` are correctly configured. Run the following command to test oracle solutions using the registry: + +```bash +uv run harbor jobs start -d {adapter-name} --registry-path registry.json +``` + +Once the oracle tests pass successfully, **take a screenshot of your terminal** showing both: +- The command you ran +- The successful oracle test logs/results + +**Paste this screenshot in your adapter PR** to demonstrate that the registry configuration is correct and that the dataset can be successfully loaded from the registry. + +After all adapter PRs are merged, users can run without `--registry-path`: + +```bash +uv run harbor jobs start -d {adapter-name} -a {agent} -m {model} +``` + +**Validation:** Oracle passes via `--registry-path`. Screenshot posted in adapter PR showing both command and results. + +--- + +### Step 9: Document and Submit + +#### 9.1 Write README + +Follow the [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) to draft comprehensive documentation for your adapter. The documentation should be detailed enough for other community users to understand your adaptation choices and reproduce your work. 
+ +Must document: + +| Required section | What to include | +|-----------------|----------------| +| Benchmark bugs/issues | Any bugs discovered in the original benchmark and how they were handled | +| Agent adaptation details | Prompt modifications, environment adjustments, custom agent implementations | +| Deviations from original | What was changed and why | +| Parity results | Summary table + links to all relevant PRs and repos | +| Running instructions | How to run via registry (`-d`), local path (`-p`), and config file (`-c`) | +| Known limitations | Any constraints or unsupported features | + +#### 9.2 Write `adapter_metadata.json` + +#### Schema: `adapter_metadata.json` + +The file is a JSON array. Each entry has two sections: `original_benchmark` and `harbor_adapter`. + +**Top-level fields:** + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `adapter_name` | string | yes | Adapter name | +| `adapter_builders` | string[] | yes | Builder names with contact email, e.g. `["Jane Doe (jane@example.com)"]` | +| `original_benchmark` | array | yes | One entry per split/subset of the original benchmark | +| `harbor_adapter` | array | yes | One entry per split/subset of the Harbor adapter | + +**`original_benchmark` entry fields:** + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `split` | string | yes | Split name. Use `"full"` if no named splits exist | +| `size` | int | yes | Number of tasks in this split. Note: "task" may mean different things in different benchmarks; for term consistency, count tasks in Harbor context. | +| `harness` | string | yes | One of: `"agent"`, `"llm"`, or `null`, depending on whether the benchmark has scripts for agent/llm inference | +| `supported_agents` | string[] or null | yes | Agents supported by original harness (use `agent@version` format if version is available). 
`null` if none | +| `adaptable` | bool | yes | Whether this split can be converted to Harbor tasks with the provided adapter code | +| `notes` | string or null | yes | Additional clarification, e.g., term explanation, special task structures or requirements on machine or compute. `null` if not applicable | + +**`harbor_adapter` entry fields:** + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `split` | string | yes | Split name matching original. Use `"full"` if adapter works for all splits collectively. If different splits are registered/validated in different ways, split them out separately. | +| `adapted_benchmark_size` | int | yes | Number of tasks the adapter can convert. May differ from original if tasks were excluded for sufficient reasons documented in the README. | +| `parity_benchmark_size` | int | yes | Number of tasks used for parity. Equals `adapted_benchmark_size` if full set. | +| `parity_sampling_rate` | float | yes | `parity_benchmark_size / adapted_benchmark_size` (equals 1.0 when parity used the full set) | +| `registry_benchmark_size` | int | yes | Exact integer count of tasks in the registry. Must match either `adapted_benchmark_size` or `parity_benchmark_size` to determine whether the full set or parity set is being registered. | +| `added_agents` | string[] or null | yes | Custom agents added by the adapter to align with the original benchmark. `null` if none | +| `parity_matching_agents` | string[] | yes | Agents (including custom ones) used for the parity experiments AND achieved comparable scores to the original benchmark. Format: `agent@version+model` | +| `parity_unmatching_agents` | string[] or null | yes | Agents used for the parity experiments BUT didn't achieve comparable scores. This may happen with weaker models. `null` if there are no unmatching results. | +| `parity_costs` | string | yes | Total USD spent on parity experiments | +| `notes` | string or null | yes | Additional clarification, e.g., special treatment on the adapter. 
`null` if not applicable |


If the adapter ran parity among the Harbor adapter, the Terminal-Bench adapter, and the original benchmark, replace `"harbor_adapter"` with `"tb_adapter"` for the Terminal-Bench comparison, and add a separate `"harbor_adapter"` section for the Harbor comparison. Both sections use the same field schema above.


#### 9.3 Submit PR

Once everything is ready for review (all steps completed, documentation finalized, screenshots added), update your Harbor adapter PR:

1. **Change the PR title** from `[WIP] Adapter: {adapter_name}` to `[Ready for Review] Adapter: {adapter_name}`
2. **Request review** from `@Slimshilin` in the PR

This signals to the team that your adapter is complete and ready for final review and merge.

**Validation:** README follows the template. `adapter_metadata.json` matches the schema. PR title updated. Review requested.

---

## CLI Command Reference

| Purpose | Command |
|---------|---------|
| List datasets | `harbor datasets list` |
| Scaffold adapter | `harbor adapters init` |
| Scaffold with args | `harbor adapters init my-adapter --name "My Benchmark"` |
| Run single task | `uv run harbor runs start -p datasets/{adapter}/{task-id}` |
| Run single task with agent | `uv run harbor runs start -p datasets/{adapter}/{task-id} -a {agent} -m {model}` |
| Run full dataset | `uv run harbor jobs start -p datasets/{adapter} -a {agent} -m {model}` |
| Run with config | `uv run harbor jobs start -c adapters/{adapter}/{config}.yaml -a {agent} -m {model}` |
| Run from local registry | `uv run harbor jobs start -d {adapter} --registry-path registry.json` |
| Run from registry (post-merge) | `uv run harbor jobs start -d {adapter} -a {agent} -m {model}` |

---

## Terminal-Bench Migration Reference

If you have an existing [Terminal-Bench adapter](https://github.com/laude-institute/terminal-bench/tree/main/adapters) and want to convert it to Harbor format, this section outlines the key differences and
migration steps. Harbor keeps the same core principles as Terminal-Bench but uses a different file structure and configuration format.

Note that the Harbor adapter should be isolated from the Terminal-Bench repo. Write the adapter code following the same process as for Terminal-Bench rather than applying a direct translation script. Fortunately, starting from a solid Terminal-Bench adapter, creating a Harbor adapter is relatively easy: the main work is handling a slightly different task format.

### Key format differences

| Terminal-Bench | Harbor |
|---------------|--------|
| `task.yaml` (YAML format) | `task.toml` (TOML format) |
| Instruction embedded in `task.yaml` as `instruction` field | Separate `instruction.md` file |
| `Dockerfile` (root level) | `environment/Dockerfile` |
| `solution.sh` (root level) | `solution/solve.sh` |
| `run-tests.sh` + `tests/test_outputs.py` (root level) | `tests/test.sh` |
| Exit-code-based verification (pytest) | Reward-based: write to `/logs/verifier/reward.txt` |
| `docker-compose.yaml` in task root | Not typically used per-task |
| `tasks/` output directory | `datasets/` output directory |
| Dataset-level `dataset_path` in registry | Task-level `git_url` + `path` per task in registry |
| `tb run --dataset` / `tb run --dataset-path` | `harbor jobs start -d` / `harbor runs start -p` |
| Resolved rate (binary pass/fail per task) | Rewards supporting multiple metrics and float values |

**Important:** If the Terminal-Bench adapter used a tweaked metric (threshold-based scoring, ignored certain metrics), re-implement it to support the **original** benchmark metrics. Harbor supports multiple metrics as rewards.
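To illustrate a float-valued reward, a `tests/test.sh` can grant partial credit as the fraction of tests that pass instead of a binary 0/1. This is a sketch: the only convention taken from this document is writing the reward to `/logs/verifier/reward.txt`; the pytest summary parsing and the `/tests` and `/tmp` paths are assumptions.

```shell
#!/bin/bash
# Sketch: write a fractional reward (passed / total) rather than pass/fail.

mkdir -p /logs/verifier

# Run pytest and keep its output; the summary line looks like "3 passed, 1 failed".
pytest /tests/test_*.py > /tmp/pytest_out.txt 2>&1

# Extract the passed/failed counts, defaulting to 0 when a count is absent.
passed=$(grep -oE '[0-9]+ passed' /tmp/pytest_out.txt | grep -oE '[0-9]+' || echo 0)
failed=$(grep -oE '[0-9]+ failed' /tmp/pytest_out.txt | grep -oE '[0-9]+' || echo 0)
total=$((passed + failed))

if [ "$total" -eq 0 ]; then
  # No tests ran (e.g., collection error): zero reward.
  echo 0 > /logs/verifier/reward.txt
else
  # Float reward: fraction of tests that passed.
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.4f\n", p / t }' > /logs/verifier/reward.txt
fi
```

Whether partial credit is appropriate depends on the original benchmark's metric; if the original scores tasks as all-or-nothing, keep the binary 0/1 reward instead.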
+ +### File structure comparison + +**Terminal-Bench structure:** +``` +task-id/ +├── task.yaml +├── Dockerfile +├── docker-compose.yaml +├── run-tests.sh +├── solution.sh +└── tests/ + └── test_outputs.py +``` + +**Harbor structure:** +``` +task-id/ +├── task.toml +├── instruction.md +├── environment/ +│ └── Dockerfile +├── solution/ +│ └── solve.sh +└── tests/ + ├── test.sh + └── test_*.py (optional) +``` + +### Migration steps + +#### Step 1: Update task configuration format + +Convert `task.yaml` to `task.toml` and extract the instruction into a separate file: + +**Before (`task.yaml`):** +```yaml +instruction: | + Your task instruction here... + Multiple lines... +author_email: example@email.com +author_name: Author Name +difficulty: hard +category: programming +tags: + - debugging + - python +parser_name: swebench +max_agent_timeout_sec: 3000.0 +max_test_timeout_sec: 3000.0 +``` + +**After (`task.toml`):** +```toml +version = "1.0" + +[metadata] +author_email = "example@email.com" +author_name = "Author Name" +difficulty = "hard" +category = "programming" +tags = ["debugging", "python"] + +[agent] +timeout_sec = 3000.0 + +[verifier] +timeout_sec = 3000.0 +``` + +**And create `instruction.md`:** +```markdown +Your task instruction here... +Multiple lines... +``` + +#### Step 2: Reorganize files into subdirectories + +- Move `Dockerfile` -> `environment/Dockerfile` +- Move `solution.sh` -> `solution/solve.sh` +- Move `run-tests.sh` -> `tests/test.sh` +- Remove `docker-compose.yaml` (usually not needed per-task in Harbor) + +#### Step 3: Update test scripts for reward-based system + +**Before (`run-tests.sh` in Terminal-Bench):** +```bash +#!/bin/bash +# Run tests and create marker file +pytest tests/ > test_results.txt +if [ $? 
-eq 0 ]; then + echo "PASSED" > /tmp/test_marker.txt +else + echo "FAILED" > /tmp/test_marker.txt +fi +``` + +**After (`tests/test.sh` in Harbor):** +```bash +#!/bin/bash +# Install dependencies if needed +apt-get update && apt-get install -y python3-pip +pip3 install pytest + +# Run tests +pytest /tests/test_*.py + +# Write reward based on test results +if [ $? -eq 0 ]; then + echo 1 > /logs/verifier/reward.txt +else + echo 0 > /logs/verifier/reward.txt +fi +``` + +**Key changes:** +- Harbor mounts `/logs/verifier` for test outputs +- Write numeric reward (can be float type) to `/logs/verifier/reward.txt` +- Can still use pytest, but final output must be the reward file + +#### Step 4: Update adapter code + +- Change default output directory from `tasks/` to `datasets/` +- Update template directory to match Harbor structure +- Modify file generation logic to create subdirectories (`environment/`, `solution/`, `tests/`) +- Split instruction extraction into separate `instruction.md` file +- Convert YAML generation to TOML generation + +#### Step 5: Update registry format + +**Terminal-Bench `registry.json`:** +```json +{ + "name": "my-adapter", + "version": "head", + "description": "...", + "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git", + "dataset_path": "datasets/my-adapter", + "task_id_subset": null +} +``` + +**Harbor `registry.json`:** +```json +{ + "name": "my-adapter", + "version": "1.0", + "description": "...", + "tasks": [ + { + "name": "task-1", + "git_url": "https://github.com/laude-institute/harbor-datasets.git", + "git_commit_id": "abc123", + "path": "datasets/my-adapter/task-1" + } + ] +} +``` + +--- + +## Other Useful Resources + +- The [Harbor documentation](/docs/getting-started) provides detailed information about running tasks and jobs with Harbor. +- The [Harbor repository](https://github.com/laude-institute/harbor) contains example tasks and configurations. 
+- The [agent tutorial](/docs/agents) provides instructions on how to create and use your customized agent in Harbor. + +## Getting Help + +Thank you for your interest in Harbor and building an adapter! If you have any questions, please ask in the `#adapters-spam` channel in our [Discord](https://discord.com/invite/6xWPKhGDbA) or reach out to [Lin Shi](mailto:ls2282@cornell.edu). diff --git a/content/docs/datasets/adapters.mdx b/content/docs/datasets/adapters.mdx index 740f58a..923cbfb 100644 --- a/content/docs/datasets/adapters.mdx +++ b/content/docs/datasets/adapters.mdx @@ -1,11 +1,18 @@ --- title: Adapters -description: How to create a new adapter for a new benchmark using Harbor. +description: A concise guide for human readers to create a Harbor adapter for your benchmark. --- -import { Accordion, Accordions } from 'fumadocs-ui/components/accordion'; +import { Callout } from 'fumadocs-ui/components/callout'; import { File, Folder, Files } from 'fumadocs-ui/components/files'; + +AI agents should follow the spec at [Adapters (Agent Version)](/docs/datasets/adapters-ai) +instead of this page. That document contains the complete schema, +all edge cases, and machine-verifiable examples. +Do not use the tutorial below as your source of truth. + + Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and more benchmarks are integrated into Harbor, and our team is actively working to adapt additional benchmarks to the framework. To add a new benchmark or dataset, you need to create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into the Harbor format. 
@@ -15,8 +22,8 @@ We welcome the open source community to contribute adapters for new benchmarks a If you are thinking about adapting your benchmark or contributing one from our [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0), please join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out to [Lin Shi](mailto:ls2282@cornell.edu) from the `#adapters-announcements` channel. - -See [this section](#translating-terminal-bench-adapters-to-harbor) to learn about the requirements and differences between Terminal-Bench and Harbor. + +Join our [Discord](https://discord.com/invite/6xWPKhGDbA) and reach out in `#adapters-announcements`. Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. We cover API costs for parity experiments. ## Quick Start @@ -25,103 +32,59 @@ See [this section](#translating-terminal-bench-adapters-to-harbor) to learn abou # List available datasets harbor datasets list -# Start the interactive wizard to create a new adapter +# Scaffold a new adapter interactively harbor adapters init -# Initialize with specific arguments (skipping some prompts) +# Or with arguments harbor adapters init my-adapter --name "My Benchmark" ``` -Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapters init` command will create starter code and template files. - -For more details about what adapters are and how we ensure equivalance between the original benchmark and its harbor adapter, please continue reading. - -## Overview - -Adapting a benchmark to Harbor is a straightforward process designed to ensure consistency and quality. This guide will walk you through everything you need to know. However, since each benchmark is unique, the exact process and special requirements may vary slightly depending on the benchmark. 
Please contact our team to understand the specific requirements and considerations for your benchmark. We will support API costs for running parity experiments :-) - -Here's a quick look at the typical steps: - -1. **[Understand the Original Benchmark](#1-understand-the-original-benchmark):** First, you'll analyze the original benchmark to identify the task's four key factors required by Harbor: task instructions, environments, tests, and solutions. -2. **[Fork Harbor Repository and Develop Adapter Code](#2-fork-harbor-repository-and-develop-adapter-code):** Fork the Harbor repository and write Python adapter code that translates the original benchmark's tasks into the Harbor format. -3. **[Running Harbor Harness and Verify Oracle Solutions](#3-running-harbor-harness-and-verify-oracle-solutions):** Run Harbor harness on your adapter and ensure all oracle solutions pass with 100% reward. Create a WIP PR with a screenshot showing oracle success. -4. **[Discuss Parity Plans and Implement Agents](#4-discuss-parity-plans-and-implement-agents):** Reach out to the team to discuss parity experiment plans, then implement the corresponding agents on the original benchmark side or in Harbor, depending on the benchmark setting. This could happen right after you sign up for an adapter and before Step 1 as well, if the benchmark is relatively straightforward. -5. **[Run Parity Experiments](#5-run-parity-experiments):** Run parity experiments to verify your adapter's performance against the original benchmark baseline results. -6. **[Record Parity Results](#6-record-parity-results):** Formally document the performance comparison in `parity_experiment.json`. -7. **[Upload Parity Results](#7-upload-parity-results):** Upload parity and oracle results to the HuggingFace dataset repository. -8. **[Register the Dataset](#8-register-the-dataset):** Add your new tasks to the official dataset repository and registry, then verify with `--registry-path`. -9. 
**[Document and Submit](#9-document-and-submit):** Document your adapter's usage, parity results, and comprehensive adaptation details in a `README.md`, then submit your work through a pull request. - -We'll break down each step in detail below. Let's get started! +## Steps at a Glance -## The Adapter Development Workflow +| # | Step | Goal | +|---|------|------| +| 1 | [Understand the benchmark](#1-understand-the-original-benchmark) | Identify instructions, environments, tests, and solutions | +| 2 | [Write the adapter code](#2-write-the-adapter-code) | Generate Harbor-format task directories | +| 3 | [Verify oracle solutions](#3-verify-oracle-solutions) | All oracle solutions pass at 100% reward | +| 4 | [Plan parity & implement agents](#4-plan-parity--implement-agents) | Coordinate with the team; set up agents on both sides | +| 5 | [Run parity experiments](#5-run-parity-experiments) | Compare Harbor vs. original benchmark scores | +| 6 | [Record parity results](#6-record-parity-results) | Save results to `parity_experiment.json` | +| 7 | [Upload results](#7-upload-results) | Push to HuggingFace parity dataset | +| 8 | [Register the dataset](#8-register-the-dataset) | Add tasks to `harbor-datasets` and `registry.json` | +| 9 | [Document & submit](#9-document--submit) | Write README, submit PR for review | -Creating a high-quality adapter involves several key steps. Following this workflow ensures that the adapted benchmark is a faithful and reliable implementation of the original. - -### 1. Understand the Original Benchmark +--- -Before writing any adapter code, it's crucial to deeply understand the original benchmark. Your goal is to identify and understand the four key factors required by Harbor: +## 1. Understand the Original Benchmark -1. **Task Instructions:** How are tasks described? What information do agents need to solve each task? -2. **Environments:** What environment setup is required? 
(e.g., Docker containers, system dependencies, file structures) -3. **Tests:** How are solutions evaluated? What test scripts or verification mechanisms are used? Deterministic unit tests or LLM-as-a-Judge? -4. **Solutions:** What are the oracle/reference solutions? If there's no oracle solution in the original benchmark, is it possible to create them using LLM? +Before coding, study the original benchmark and identify its four key components: -Study the original benchmark's repository, documentation, and code structure to understand these components. This understanding will guide your adapter development and ensure you capture all necessary information when converting tasks to Harbor format. +1. **Instructions** — What does the agent need to know to solve each task? +2. **Environments** — What Docker/system setup is required? +3. **Tests** — How are solutions evaluated? (unit tests, LLM-as-judge, etc.) +4. **Solutions** — What are the oracle/reference solutions? -### 2. Fork Harbor Repository and Develop Adapter Code +## 2. Write the Adapter Code -With a solid understanding of the original benchmark, you can now create the adapter itself within the [harbor](https://github.com/laude-institute/harbor) repository. +### 2.0 Read the README template first -#### 2.0 Read the README template -The [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) serves as the template for the final README file that you will create for your submitted adapter. However, it is more than just a template: it includes essential instructions to help you understand the requirements that will facilitate the development and review processes. Reading it will give you a sense of what to provide and will guide your code, experiments, and documentation. +The [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) doubles as a requirements checklist. 
Read it before writing code — it tells you what you'll need to provide. -#### 2.1 Fork the Harbor repository -Fork the Harbor repository and create a new branch for your adapter (e.g., `{adapter-name}-adapter`). +### 2.1 Fork and branch ```bash -git clone https://github.com/{your-github-username}/harbor.git +git clone https://github.com/{you}/harbor.git cd harbor -git checkout -b {your-adapter-name}-adapter +git checkout -b {adapter-name}-adapter ``` -#### 2.2 Develop the adapter code -Develop the adapter under `adapters/{adapter-name}`. You may refer to the existing adapters in the `adapters/` directory and follow the patterns. The adapter's primary job is to parse the original benchmark's data and generate task directories in the standard Harbor format. Here is an example architecture of the task directory: +### 2.2 Target task directory structure + +Each generated task should look like this: - - - - - - - - - - - - - - - - -[Here](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) is an example task directory. Your code should prepare task directories locally following a similar format. - - -#### 2.3 Requirements and Tips for the Adapter Code -Your adapter code is used to generate task directories. A typical directory structure for your adapter code is as follows: - - - - - - - - - - @@ -132,569 +95,227 @@ Your adapter code is used to generate task directories. A typical directory stru + -More details (expand to view): - - - Harbor supports multiple metrics represented as rewards to seamlessly serve for RL. Reward can be float values. We will further support aggregation of metrics across dataset (e.g., average or custom ones). - - This allows you to use the same metrics of any type as the original benchmark and convert them to RL-compatible formats. - - - - - - It should support: - - Temporarily cloning the source benchmark, preparing the tasks, and cleaning up the temporary clone. 
- - Generating tasks from an existing, already-cloned benchmark repository without deleting it. - - Also, by default, your adapter should create tasks in `datasets/`, but you should also allow users to specify a custom output path via command-line arguments `--output-path`. - - - - - - The `template/` directory stores the template files required for the tasks. For your reference, all files [above](#22-develop-the-adapter-code) or in the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) are recommended to be included in the `template/` directory. Then your adapter code would use the templates to generate the actual task directories. - - - - - - A file to store the parity experiment results (i.e., comparison between the original benchmark and the Harbor adapter). More details are provided in the [Recording Parity Results](#6-record-parity-results) section. - - +See the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) for a concrete reference. - - - This is the last thing you should work on before PR submission. More details are provided in the [Document and Submit](#9-document-and-submit) section. You can follow the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). 
- - +### 2.3 Adapter code structure +Your adapter lives in `harbor/adapters/{adapter-name}/`: +| File | Purpose | +|------|---------| +| `adapter.py` | Core logic: parse benchmark data, generate task dirs | +| `run_adapter.py` | CLI entry point (supports `--output-path`) | +| `template/` | Template files copied into each task | +| `parity_experiment.json` | Parity results (filled in later) | +| `run_{name}.yaml` | Reference config for reproducibility | +| `README.md` | Final documentation (written last) | +| `adapter_metadata.json` | Structured metadata about the adapter | - - - - It is acceptable to make prompt modifications to the task description to support CLI agents. For example, if adding prompts like "directly write the files in place without asking for my approval" would be helpful, it's fine to do so. **You just need to ensure that they apply to both the forked original benchmark repository and the Harbor adapter.** - - It is acceptable to adapt only part of the original benchmark (e.g., only SWE-Bench-Verified). Excluding certain tasks for valid reasons is also understandable (e.g., extensive GPU requirements). **You just need to ensure that the relevant information is included in the README.** - - - - - - -### 3. 
Running Harbor Harness and Verify Oracle Solutions - -There are several ways to run Harbor harness on your adapter: - -**Option 1: Using individual trials (for testing single tasks)** -```bash -# Run oracle agent on a single task -uv run harbor trials start -p datasets// - -# Run with specific agent and model -uv run harbor trials start -p datasets// -a -m -``` +**Requirements for `run_adapter.py`:** +- Support cloning the source benchmark temporarily (with cleanup) +- Support using an already-cloned repo +- Default output to `datasets/{adapter-name}`, with `--output-path` override -**Option 2: Using jobs with local dataset path** -```bash -# Run on entire local dataset -uv run harbor jobs start -p datasets/ -a -m -``` +**Tips:** +- Minor prompt tweaks (e.g., "write files in place without asking") are fine, as long as they apply to both the original benchmark and Harbor sides. +- Adapting only a subset of tasks is acceptable if documented in the README. -**Option 3: Using jobs with configuration file**. Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility. -```bash -# Create a job config YAML (see harbor/examples/configs/ for examples) -uv run harbor jobs start -c adapters//.yaml -a -m -``` +## 3. Verify Oracle Solutions -**Option 4: Using local registry after your [dataset registry PR](#8-register-the-dataset) gets merged**. This step is required to check the correctness of (1) your registered dataset and (2) your updated `registry.json` in the Harbor repository. If this run successfully passes all oracle tests, then after your adapter PR gets merged, people can directly use `-d` without `--registry-path` to run evaluation (the same way as Option 5). 
-```bash -# Run from registry -uv run harbor jobs start -d --registry-path registry.json -a -m "" -``` +Run your adapter with the oracle agent and confirm **100% reward on all tasks**. -**Option 5: Using registry dataset (after registration and all PRs merged)** ```bash -# Run from registry -uv run harbor jobs start -d -a -m "" -``` +# Single task +uv run harbor trials start -p datasets// -You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). **Note that the order of these options is organized differently in the final adapter README**. This is because from the user's perspective, Option 5 is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction. +# Entire dataset +uv run harbor jobs start -p datasets/ -#### 3.1 Verify Oracle Solutions Pass 100% - -Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset: - -```bash -uv run harbor jobs start -p datasets/ +# With a config file (recommended for reproducibility) +uv run harbor jobs start -c adapters//.yaml -a -m ``` -Once you've verified that all oracle solutions pass, you can create a Work-In-Progress (WIP) pull request to the Harbor repository: - -1. **Create a WIP PR:** Push your branch and create a pull request with the title `[WIP] Adapter: {adapter_name}`. -2. **Include a screenshot:** Paste a screenshot of your terminal showing the oracle solution 100% pass results. This demonstrates that your adapter correctly generates tasks and that the oracle solutions work as expected. - -This WIP PR allows the team to review your adapter structure early and provide feedback before you proceed with parity experiments. - -### 4. 
Discuss Parity Plans and Implement Agents - -After your oracle solutions pass and you've created a WIP PR, reach out to the team (e.g., **Lin Shi**) through Discord to discuss your parity experiment plans before running them. We will help you determine which agents and models to use, how many trials are needed, and we can provide API keys for running parity experiments. Based on your benchmark's characteristics, you'll need to implement agents accordingly. There are three main scenarios: - - -If the original benchmark already supports agents that are also supported in Harbor (e.g., OpenHands, Codex, Claude-Code, Gemini-CLI), you can run parity experiments using identical agent and model settings on both sides. No additional agent implementation is needed. - - - -If the original benchmark is LLM-based but doesn't have Harbor-compatible agents implemented, you'll need to: - -1. **Fork the original benchmark repository** and create a branch for your adaptation work (e.g., `harbor-adapter`). -2. **Implement Harbor-compatible agents** (e.g., codex) in the forked repository to enable fair comparisons. -3. **Document the implementation** in a `README.md` file in your fork. - -For an example, see the [EvoEval adapter's parity experiment configuration](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json), which shows how agents were implemented in a fork of the original benchmark. - - - -If the original benchmark uses custom agents that aren't available in Harbor, you'll need to: - -1. **Implement the custom agent in Harbor** under your adapter directory (e.g., `adapters//.py`). This is adapter-specific and doesn't need to be installed as a general Harbor agent. -2. **Run parity experiments** using this custom agent to ensure equivalence with the original benchmark. -3. **Additionally run experiments** with other Harbor-supported agents (e.g., Codex, Claude-Code) to demonstrate that the adaptation works well for multiple agent types. 
In other words, show that "using other supported agents to run the adapter makes sense". - +Once oracle passes, create a **WIP PR** titled `[WIP] Adapter: {adapter_name}` with a screenshot of the 100% pass results. -Keep a link to any forked repositories, and document your agent implementation approach in your adapter's README. +## 4. Plan Parity & Implement Agents - -If the original benchmark is very large and expensive to run, you may want to run parity experiments on a fixed, representative subset of samples instead of the full dataset. Please discuss with the team to confirm sampling and parity plans! +Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invite/6xWPKhGDbA) **before** running parity experiments. They will help decide: +- Which agents and models to use +- How many runs are needed +- API key provisioning -This approach has two important implications: +Depending on your benchmark, you'll fall into one of three scenarios: -1. **README Documentation:** In your adapter's README, you must clearly: - - State how the parity subset was selected (e.g., random seed, "stratified sample across difficulty levels", etc.) - - Explicitly indicate that parity experiments were run on a subset - - Provide instructions for users on how to use the full dataset with the adapter code, typically using an argument like `--split parity` (or similar) to generate only the parity subset - ```bash - # Example of adapter code usage - # Generate only the parity subset - uv run run_adapter.py --split parity --output-dir /path/to/output +**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed. - # Generate the full dataset - uv run run_adapter.py --output-dir /path/to/output - ``` - -2. **Registry Version Naming:** When uploading the dataset to the registry, use the version name `"parity"` instead of `"1.0"` or `"2.0"` to avoid confusion. 
This allows users to run parity reproduction using `-d @parity` while keeping the full dataset available separately. +**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. See the [EvoEval example](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json). +**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/{agent}.py` and run parity experiments with the custom agents. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. + +For expensive benchmarks, you can run parity on a representative subset. Discuss sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`. -### 5. Run Parity Experiments - +## 5. Run Parity Experiments -Once you've implemented the necessary agents (if needed), run parity experiments to verify your adapter. Use the Harbor harness (see [Section 3](#3-running-harbor-harness-and-verify-oracle-solutions)) with the same set of agents and models that you used (or will use) on the original benchmark side. Ensure the config and parameter settings are identical as well (e.g., codex version). Run them multiple times on each side to compare average scores and standard deviations. +Run the **same agents, models, and settings** on both the original benchmark and your Harbor adapter, multiple times each. Compare average scores and standard deviations — they should be **comparable** to demonstrate equivalence. -The average scores across multiple trials should be **comparable to demonstrate equivalence of adaptation** (i.e., running the benchmark with Harbor is equivalent to running it with the original harness). +```bash +# Harbor side +uv run harbor jobs start -p datasets/ -a -m +``` -### 6. Record Parity Results +## 6. 
## 6. Record Parity Results

Create `parity_experiment.json` in your adapter directory:

```json
[
  {
    "adapter_name": "<adapter-name>",
    "agent": "<agent>@<version>",
    "model": "<model>",
    "date": "<date>",
    "adapted_benchmark_size": "<full-set size>",
    "parity_benchmark_size": "<same as adapted_benchmark_size if parity ran on the full set>",
    "number_of_runs": "<identical for original and Harbor runs unless noted>",
    "notes": "<special treatments, etc.>",
    "original_parity_repo": "<fork of the original benchmark with parity-reproduction instructions>",
    "adapter_pr": ["<harbor PR link(s), including later fixes>"],
    "dataset_pr": ["<harbor-datasets PR link(s)>"],
    "parity_pr": ["<HuggingFace parity dataset PR link(s)>"],
    "metrics": [
      {
        "benchmark_name": "<benchmark>",
        "metric": "<metric>",
        "original": "<average over original runs>",
        "harbor": "<average over Harbor runs>",
        "original_runs": ["<run 1 score>", "<run 2 score>", "..."],
        "harbor_runs": ["<run 1 score>", "<run 2 score>", "..."]
      }
    ]
  }
]
```
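Before committing the file, a quick self-check helps catch missing fields. This is a sketch assuming the field names in the schema above; the `missing_keys` helper is illustrative, not part of Harbor:

```python
import json

# Required per-record fields, per the schema above.
REQUIRED_KEYS = {
    "adapter_name", "agent", "model", "date",
    "adapted_benchmark_size", "parity_benchmark_size",
    "number_of_runs", "notes", "original_parity_repo",
    "adapter_pr", "dataset_pr", "parity_pr", "metrics",
}

def missing_keys(record: dict) -> set:
    """Required keys absent from one parity record."""
    return REQUIRED_KEYS - record.keys()

# A partially filled record still missing most fields.
record = json.loads('{"adapter_name": "my-adapter", "metrics": []}')
print(sorted(missing_keys(record)))
```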
Also include a summary table in your README:

```markdown
| Agent | Model | Metric | Runs | Dataset Size | Original | Harbor |
|-------|-------|--------|------|--------------|----------|--------|
| codex | gpt-5 | pass@1 | 5 | 2000 (100%) | X ± Y | X ± Y |
```

## 7. Upload Results

Upload both parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments) so the community can track adapter quality and estimate costs across agents and models. Results are organized as:
```
adapters/{adapter_name}/
├── README.md              # Results overview, interpretation, notes
├── config.yaml            # Config usable directly to rerun parity experiments in Harbor
├── original_parity/
├── harbor_parity/
├── oracle/
└── results_collection/    # valid result.json files copied from the parity runs
    ├── result_{original/harbor}_trial1.json
    └── ...
```

Follow the README instructions in the HuggingFace repository when uploading.

## 8. Register the Dataset

### 8.1 Add tasks to `harbor-datasets`

```bash
git clone https://github.com/{your-github-username}/harbor-datasets.git
cd harbor/adapters/{adapter_name}
uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/{adapter_name}
```

Create a PR to `harbor-datasets`, linking the original benchmark's GitHub repository. Request `@Slimshilin` for review; once merged, you can reference the dataset via `--registry-path` in the harness.

### 8.2 Add to `registry.json`
In the **harbor** repo (not the datasets repo), add an entry to `registry.json` at the repository root:

```json
{
  "name": "{adapter_name}",
  "version": "1.0",
  "description": "Brief description. Original benchmark: [URL]. More details at [adapter README URL].",
  "tasks": [
    {
      "name": "{task_name}",
      "git_url": "https://github.com/laude-institute/harbor-datasets.git",
      "git_commit_id": "{commit_hash}",
      "path": "datasets/{adapter_name}/{task_name}"
    }
  ]
}
```

Use `"version": "1.0"` by default; follow the original benchmark's split names (e.g., `"verified"`, `"lite"`) where they exist, and use `"version": "parity"` for a parity subset so users can run `-d {adapter_name}@parity`. Use `"git_commit_id": "head"` during development; pin to a specific commit hash for production reproducibility.
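Since the registry entry needs one task record per generated task directory, generating it beats hand-editing. A sketch under stated assumptions: `build_registry_entry` and the task names are hypothetical, and Harbor does not ship this helper:

```python
import json

HARBOR_DATASETS_GIT = "https://github.com/laude-institute/harbor-datasets.git"

def build_registry_entry(adapter_name: str, task_names: list,
                         commit_id: str = "head") -> dict:
    """One registry entry with a task record per generated task."""
    return {
        "name": adapter_name,
        "version": "1.0",
        "description": f"{adapter_name} adapter.",  # replace with a real description
        "tasks": [
            {
                "name": task,
                "git_url": HARBOR_DATASETS_GIT,
                "git_commit_id": commit_id,
                "path": f"datasets/{adapter_name}/{task}",
            }
            for task in task_names
        ],
    }

entry = build_registry_entry("my-adapter", ["task-1", "task-2"])
print(json.dumps(entry, indent=2))
```

In practice you would derive `task_names` by listing the directories your adapter wrote under `datasets/{adapter_name}/`.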
### 8.3 Verify registry works

After your dataset registry PR is merged, verify that the dataset loads correctly from the registry by running the oracle tests:

```bash
uv run harbor jobs start -d {adapter_name} --registry-path registry.json
```

Screenshot the passing oracle results, showing both the command you ran and the successful test logs, and paste it in your adapter PR.

## 9. Document & Submit

Fill out the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) in enough detail that other users can understand your adaptation choices and reproduce your work, covering:
- Benchmark bugs discovered and how they were handled
- Special treatments (prompt tweaks, environment adjustments)
- Deviations from the original benchmark and why
- Agent implementation details (if custom agents were created)
- Known limitations

Create `adapter_metadata.json` in `harbor/adapters/{adapter_name}/` ([see the format in the full adapter guide](/docs/datasets/adapters#9-document-and-submit)).
When everything is ready for review, change your PR title from `[WIP] Adapter: {adapter_name}` to `[Ready for Review] Adapter: {adapter_name}` and request review from `@Slimshilin`.

---
## Appendix: Terminal-Bench Migration

If you're converting an existing [Terminal-Bench adapter](https://github.com/laude-institute/terminal-bench/tree/main/adapters), here are the key differences:

| Aspect | Terminal-Bench | Harbor |
|--------|----------------|--------|
| Config | `task.yaml` | `task.toml` |
| Instruction | In `task.yaml` | Separate `instruction.md` |
| Dockerfile | Root level | `environment/Dockerfile` |
| Solution | `solution.sh` | `solution/solve.sh` |
| Tests | `run-tests.sh` + `tests/` | `tests/test.sh` |
| Verification | Exit code (pytest) | Reward file: `/logs/verifier/reward.txt` |
| Output dir | `tasks/` | `datasets/` |
| Registry | Dataset-level `dataset_path` | Task-level `git_url` + `path` per task |
| CLI | `tb run --dataset` | `harbor jobs start -d` / `harbor runs start -p` |
| Metrics | Binary pass/fail | Float rewards, multiple metrics |

**Important:** If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring), re-implement the adapter to support the **original** benchmark metrics — Harbor supports multiple metrics as rewards.

Migration checklist:
1. Convert `task.yaml` → `task.toml` + `instruction.md`
2. Reorganize files into `environment/`, `solution/`, `tests/` subdirectories (drop per-task `docker-compose.yaml`)
3. Update test scripts to write a numeric reward (float values allowed) to `/logs/verifier/reward.txt`
4. Change the output directory from `tasks/` to `datasets/`
5. Update the registry format to task-level entries
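The file moves in checklist steps 1–2 can be table-driven. A minimal sketch of the path mapping, assuming the layouts in the table above; the helper is illustrative, and a full migration also needs the YAML-to-TOML conversion:

```python
# Terminal-Bench → Harbor relative path mapping, per the table above.
TB_TO_HARBOR = {
    "Dockerfile": "environment/Dockerfile",
    "solution.sh": "solution/solve.sh",
    "run-tests.sh": "tests/test.sh",
}

def harbor_path(tb_relpath: str) -> str:
    """Where a Terminal-Bench task file lands in the Harbor layout."""
    # Files with no new home (e.g., tests/*.py) keep their path.
    return TB_TO_HARBOR.get(tb_relpath, tb_relpath)

print(harbor_path("solution.sh"))  # → solution/solve.sh
```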
---

## Resources
- [Harbor docs](/docs/getting-started) — Running tasks and jobs
- [Harbor repo](https://github.com/laude-institute/harbor) — Examples and configs
- [Agent tutorial](/docs/agents) — Creating custom agents
- [Discord](https://discord.com/invite/6xWPKhGDbA) — Ask questions in `#adapters-spam`

diff --git a/content/docs/datasets/meta.json b/content/docs/datasets/meta.json
index 4476aa3..d701595 100644
--- a/content/docs/datasets/meta.json
+++ b/content/docs/datasets/meta.json
@@ -1,4 +1,4 @@
 {
   "title": "Datasets",
-  "pages": ["index", "registering-datasets", "adapters", "metrics"]
+  "pages": ["index", "registering-datasets", "adapters", "adapters-ai", "metrics"]
 }