feat: add webshop benchmark and update description by Jensen246 · Pull Request #1337 · microsoft/RD-Agent

Jensen246 · 2026-03-04T07:42:19Z

Description

Motivation and Context

How Has This Been Tested?

If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

Your own tests:

Types of changes

Fix bugs
Add new feature
Update documentation

📚 Documentation preview 📚: https://RDAgent--1337.org.readthedocs.build/en/1337/

- Add chemcot dataset to DATASETS registry using new DatasetConfig structure
- Keep CoT quality guidelines from chemcot branch in prompts.yaml
- Migrate chemcot from old dict-based interface to DatasetConfig
- Remove legacy consolidation logic (datasets lib handles this)
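The move from a dict-based interface to a DatasetConfig registry could look roughly like the sketch below. The field names (`repo_id`, `description`, `prepare_fn`), the `register_dataset` helper, and the `example-org/chemcot` identifier are assumptions made for illustration; the PR page does not show the real DatasetConfig definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical typed replacement for a per-dataset dict entry."""

    repo_id: str                                        # illustrative: where the raw data lives
    description: str = ""                               # illustrative: short human-readable summary
    prepare_fn: Optional[Callable[[str], None]] = None  # illustrative: optional post-download hook


# Registry keyed by dataset name.
DATASETS: Dict[str, DatasetConfig] = {}


def register_dataset(name: str, config: DatasetConfig) -> None:
    """Add a dataset to the registry, refusing silent overwrites."""
    if name in DATASETS:
        raise ValueError(f"dataset {name!r} is already registered")
    DATASETS[name] = config


# Placeholder values; the real chemcot entry is not shown in the PR text.
register_dataset("chemcot", DatasetConfig(repo_id="example-org/chemcot"))
```

A typed config makes a missing or misspelled field fail at construction time, whereas a plain dict would only fail when the field is first read.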

- Add agents/opencode/ with config.yaml, start.sh, README.md
- Include opencode-rl pipeline code (pipeline/, runner_fsm/, benchmarks/)
- Merge opencode-rl dependencies into autorl_bench requirements.txt
- Remove separate venv requirement, share main environment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Document external repo architecture (opencode-rl as independent plugin)
- Add setup instructions for cloning and configuring opencode-rl
- Add architecture diagram showing RD-Agent ↔ opencode-rl interaction
- Document OPENCODE_RL_ROOT for custom paths

- Add smith/ module for dynamic benchmark discovery from rl-smith
- Add PerSampleEvaluator for per-sample scoring via vLLM
- Update utils.py to support script-based data download for smith benchmarks
- Update opencode agent config

- run.py: replace 2x nested 3-level try/except with shared _kill_process_group() using loop + specific exceptions
- server.py: except Exception → except (RuntimeError, ValueError, OSError)
- utils.py: except Exception → except requests.ConnectionError

Made-with: Cursor
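The "loop + specific exceptions" shape described for run.py might look like the sketch below. It is not the code from the PR: the signature, the `grace_seconds` parameter, and the assumption that the child was started with `start_new_session=True` (so its PID is also its process group ID) are all illustrative. POSIX only.

```python
import os
import signal
import subprocess


def kill_process_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """Ask the child's process group to exit, escalating from SIGTERM to SIGKILL."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            # Assumes the child leads its own group (start_new_session=True).
            os.killpg(proc.pid, sig)
        except (ProcessLookupError, PermissionError):
            return  # group already gone, or not ours to signal
        try:
            proc.wait(timeout=grace_seconds)
            return  # child exited after this signal
        except subprocess.TimeoutExpired:
            continue  # still running: try the next, stronger signal
```

Iterating over the signal tuple replaces one nested try/except per signal, and catching only the two expected OS errors lets unrelated failures surface instead of being swallowed.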

Jensen246 and others added 30 commits December 21, 2025 09:33

feat: add panorama dataset, refactor dataset interface

936a181

feat: calculate token using tiktoken, and ndarray bug

a81ffb4

fix: download subtasks of chemcotdataset separately

7763cc6

feat: customized prepare func for datasets

0a41502

feat: update new benchmarks

781d6d0

add datasets package

5b22c9c

docs: readme for llm finetune

a963bfe

feat: download raw data directly, with post-process function

4a6a4fe

feat: analyze raw dataset

b26e72c

suppress litellm debug info

3d71857

feat(ui): summary page

0d1fd17

feat: run multi-jobs

473cfe5

feat: improve ui

fe3374c

feat: add path and checkout options to LLM finetune loop entrypoint

60c3e75

feat: add FinanceIQ_ppl benchmark with auto-download and dataset desc…

37c7804

… rendering

refactor: remove unused imports and dead code, fix session folder log…

37147c4

…ging

feat: enable tablebench and tableInstruct dataset

1000aa0

refine dataset readme, and coder prompt

3d18e0a

Merge branch 'finetune' of github.com:microsoft/RD-Agent into finetune

93ecc78

refine proposal and coder prompt

5b88eac

fix: ui path (default log path)

d830351

feat: add automatic LoRA model merging for benchmarking with vLLM

a225fd5

refactor: reorganize finetune benchmark and merge modules under bench…

90c621d

…mark dir

refactor: modularize benchmark config and error extraction for finetu…

7cc2a8a

…ne scenario

fix: update benchmark import paths and disable env cache for device info

d232af0

refactor docker & conda env and fix import bugs

bc0742b

Merge branch 'finetune' of github.com:microsoft/RD-Agent into finetune

46743d0

modify init python file

18f85be

feat: add FinanceIQ dataset split utility and integrate with pipeline

97e2f4c

couragec and others added 27 commits February 28, 2026 10:59

alfworld

240a7ec

parallex

7bff58d

alfworld

dd0faa3

run

d7919d4

eval gpu

c3fa363

alfworld

6e50e82

alfworld

1135bb3

fix conda init in start.sh for non-interactive shells

d24ad8a

Fallback to common miniconda paths when conda is not in PATH. Fixes B200 pod startup failure (conda: command not found). Made-with: Cursor

simplify start.sh: read TRAINING_PYTHON from .env

6a192e0

No more conda detection logic. Just set TRAINING_PYTHON in .env. Fallback to conda only if not set. Made-with: Cursor
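The precedence described here (an explicit TRAINING_PYTHON wins, conda is only a fallback) can be sketched in Python as follows. The actual logic lives in start.sh and is not shown on this page; the function name, the `conda_env` parameter, and the use of `conda run -n <env> python` as the fallback command are assumptions.

```python
import shutil
from typing import List, Mapping, Optional


def resolve_training_command(
    env: Mapping[str, str], conda_env: str = "training"
) -> Optional[List[str]]:
    """Return the command prefix that launches the training interpreter."""
    explicit = env.get("TRAINING_PYTHON", "").strip()
    if explicit:
        return [explicit]  # explicit setting wins; no conda lookup at all

    # Fallback: look for conda on the PATH given in `env` (empty PATH means not found).
    conda = shutil.which("conda", path=env.get("PATH", ""))
    if conda is not None:
        return [conda, "run", "-n", conda_env, "python"]  # env name is hypothetical

    return None  # caller decides how to report the missing interpreter
```

Taking the environment as a mapping, rather than reading `os.environ` directly, keeps the function usable with values parsed from a .env file.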

use OPENHANDS_PYTHON from .env to run agent

866a4df

start.sh now uses OPENHANDS_PYTHON for main.py execution, since the parent process may be in a different conda env. Made-with: Cursor

Update opencode agent, benchmarks, and eval configs

f542ca2

- Sync opencode-rl runner_fsm with latest simplifications
- Add smith benchmarks integration
- Update opencompass configs and server with GPU support + error handling

enforce RL-only in instructions.md; remove embedded opencode-rl

088f4b7

- instructions.md: prohibit SFT, require RL (GRPO/PPO) for all benchmarks
- remove agents/opencode/opencode-rl/ (runtime uses external OPENCODE_RL_ROOT)

Made-with: Cursor

comment out OpenCode-only deps in requirements.txt

6bf943e

openai, httpx, python-dotenv, tenacity are for OpenCode agent's separate environment. Keep peft and pydantic as shared deps. Made-with: Cursor

move kill_process_group to core/utils for reuse

ca520db

Extract from run.py into core/utils.py so other runners can also use it. Exported via core/__init__.py. Made-with: Cursor

add comments to run.py for workspace isolation and signal handling

e2ae657

Made-with: Cursor

remove OpenCode-only deps from requirements.txt entirely

83ff188

Made-with: Cursor

allow SFT in instructions, RL as ultimate goal

02c1068

Made-with: Cursor

add workspace isolation rules to instructions.md

3ac4f8c

Use relative paths, forbid cd outside workspace, ignore symlink targets. Made-with: Cursor
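These rules are written for the agent in instructions.md. If one also wanted to check a path programmatically, a purely lexical containment test is one option. The sketch below is an assumption about how such a check could be written, not code from the PR; it reads "ignore symlink targets" as "do not resolve symlinks when judging a path".

```python
import os


def stays_in_workspace(workspace: str, candidate: str) -> bool:
    """True if `candidate` is a relative path that stays lexically inside `workspace`."""
    if os.path.isabs(candidate):
        return False  # rule: relative paths only
    root = os.path.abspath(workspace)
    # normpath collapses ".." without touching the filesystem, so symlinks are never followed.
    target = os.path.normpath(os.path.join(root, candidate))
    return os.path.commonpath([root, target]) == root
```

For example, `stays_in_workspace("/ws", "data/train.json")` is accepted, while `"../other"` and `"/etc/passwd"` are rejected.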

update opencode start.sh: use OPENCODE_PYTHON, add PATH for opencode …

278308a

…CLI, remove unsupported args Made-with: Cursor

opencode start.sh: pass --run-dir to use AutoRL-Bench workspace

0683730

Ensures OpenCode-FSM-Runner writes outputs into the workspace prepared by AutoRL-Bench instead of creating its own runs/ directory. Made-with: Cursor

opencode start.sh: prepend training env bin to PATH

5007063

Ensures LLM agent bash calls (e.g. python3 -c "from trl import ...") resolve to the correct training environment, instead of relying on parent shell conda activation. Made-with: Cursor
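The effect of prepending the training environment's bin directory can be illustrated in Python, although the PR makes this change in start.sh. The helper name and the example interpreter path are hypothetical.

```python
import os
from typing import Dict, Mapping


def env_with_training_bin(training_python: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy `base_env` with the interpreter's directory placed first on PATH."""
    env = dict(base_env)  # never mutate the caller's environment
    bin_dir = os.path.dirname(os.path.abspath(training_python))
    # Put bin_dir first so a bare `python3` resolves to the training environment.
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, env.get("PATH", "")) if p)
    return env
```

Passing the result as `env=` to `subprocess.run` would make a child shell's `python3 -c "from trl import ..."` pick up the training interpreter first, regardless of which conda environment the parent shell activated.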

opencode start.sh: restore --max-retries and --eval-timeout for openc…

56fbab3

…ode-rl Made-with: Cursor

Add webshop benchmark and update description

cdc83ff

Jensen246 requested a review from couragec March 4, 2026 07:42

adjust flask dependency

3785a3e

Base automatically changed from rl-posttraining to main March 17, 2026 07:49

Jensen246 wants to merge 504 commits into main from webshop

Jensen246 commented Mar 4, 2026 (edited by github-actions bot)

9 participants
