🤗 Models | 📊 Dataset | 📄 Paper | 🌐 Website
WebArbiter is a reasoning-first, principle-inducing Process Reward Model (PRM) for web agents. It formulates step-level reward modeling as structured text generation, producing interpretable justifications and preference verdicts that explicitly assess task progress. Trained via reasoning distillation and reinforcement learning, WebArbiter delivers robust, progress-aware supervision for long-horizon web navigation. This repository also releases WEBPRMBENCH, a comprehensive benchmark for evaluating WebPRMs across diverse real-world web environments.
- [2026/05] Search trajectories coming soon! Full reward-guided search trajectories from WebArena-Lite (72 trajectories across 5 websites) will be released on HuggingFace.
- [2026/04] Code, models, training data, and WEBPRMBENCH released! See our HuggingFace collection.
- [2026/01] Paper accepted at ICLR 2026.
- Reasoning as reward modeling. Produces structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- Principle-inducing evaluation. Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- Two-stage training. Reasoning distillation (SFT) followed by RL with Verifiable Rewards (RLVR) to correct teacher biases and align verdicts with ground-truth correctness.
- State-of-the-art performance. WebArbiter-7B outperforms GPT-5 by 9.1 points and surpasses the previous SOTA WebPRM (WebShepherd-8B) by 31 points in Avg. BoN Acc on WEBPRMBENCH.
- WEBPRMBENCH. The first comprehensive WebPRM benchmark spanning 4 web environments (AssistantBench, Mind2Web, WorkArena, WebArena) with 1,150 step-level preference instances.
- Reward-guided search. Guides Best-of-N selection or tree search at inference time, achieving up to +6.4 points in success rate on WebArena-Lite over the best prior WebPRM.
- Guided Search Explorer. Full-stack web app for trajectory visualization, candidate comparison, knockout tournament brackets, principle-guided reasoning inspection, and reward-guided search experimentation. Try the interactive demo.
- 🧩 Installation
- 🚀 Quick Start
- 🏋️ Training Workflow
- 🧪 Evaluation
- 🖥️ Guided Search Explorer
- 🗂️ Search Trajectories
- 📁 Project Structure
- 🙏 Acknowledgements
- 📖 Citation
```bash
conda create -n llamafactory python=3.11 -y
conda activate llamafactory
cd webarbiter/llamafactory
pip install -e .
```

```bash
conda create -n verl python=3.11 -y
conda activate verl
# Install veRL
cd verl
pip install -e .
cd ..
# We recommend installing vllm in a directory separate from WebArbiter
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ed6e9075d31e32c8548b480a47d1ffb77da1f54c
git cherry-pick caac5c2e597b1780c3df54a537c34e6061c32cff
export VLLM_COMMIT=ed6e9075d31e32c8548b480a47d1ffb77da1f54c
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install --editable .
cd ..
# flash-attention 2 (>2x speed-up)
pip install flash-attn==2.7.2.post1 --no-build-isolation
```
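To confirm the editable installs resolved correctly, a quick sanity check along these lines can be run inside the `verl` environment (illustrative only, not part of the repository):

```python
# Minimal sanity check for the installs above (illustrative only).
import flash_attn
import torch
import transformers
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("flash-attn:", flash_attn.__version__)
```

Once the environment checks out, the released checkpoints can be queried directly with `transformers`: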
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Use a local checkpoint path or a HuggingFace repo id.
# Available models (HuggingFace):
# WebArbiter-7B, WebArbiter-3B, WebArbiter-8B-Qwen3, WebArbiter-4B-Qwen3
model_name_or_path = "ZYao720/WebArbiter-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
# Fill in your prompt here.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for prompt format examples.
user_prompt = "" # Your evaluation prompt
messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
gen = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
)
output = tokenizer.decode(gen[0][len(input_ids[0]) :], skip_special_tokens=True)
print(output)
# The output contains structured <State>, <Criteria>, <Analysis>, and <Answer> tags
```
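If only the final verdict is needed, the <Answer> span can be pulled out of the generated text. A minimal sketch, assuming the tags appear literally as in the prompt examples linked above (illustrative helper, not part of the repository):

```python
import re

def extract_answer(generation: str) -> str | None:
    """Return the contents of the last <Answer>...</Answer> span, or None if absent (illustrative)."""
    spans = re.findall(r"<Answer>(.*?)</Answer>", generation, flags=re.DOTALL)
    return spans[-1].strip() if spans else None

# Reuses the `output` string produced by the quick-start snippet above.
print("Verdict:", extract_answer(output))
```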
All training recipes live in `./webarbiter/`. For instruct models, we use a simple two-stage pipeline:
- Distillation (SFT): `webarbiter/scripts/distill/train_*_sft.sh`
- RL with Verifiable Rewards (RLVR): `webarbiter/scripts/RLVR/train_*.sh` (a conceptual reward sketch follows below)
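Conceptually, the RLVR stage rewards a generated evaluation only when its final <Answer> verdict agrees with the ground-truth preference label. A minimal sketch of such a verifiable reward (simplified; the repository's actual reward functions live under `webarbiter/verl/utils/` and may differ):

```python
import re

def verifiable_reward(generation: str, gold_verdict: str) -> float:
    """Toy verifiable reward: 1.0 if the predicted <Answer> matches the gold preference label, else 0.0."""
    match = re.search(r"<Answer>(.*?)</Answer>", generation, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed outputs receive no reward
    return 1.0 if match.group(1).strip() == gold_verdict.strip() else 0.0
```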
```bash
# 1️⃣ Distillation (SFT)
conda activate llamafactory
cd webarbiter/llamafactory
sbatch ../scripts/distill/train_7instruct_sft.sh

# 2️⃣ RLVR fine-tuning
conda deactivate
conda activate verl
cd ../..
sbatch webarbiter/scripts/RLVR/train_rm_r1_sft_nc_b256_8e-4_r128_e5_rlvr_qwen2.5_instruct_7b_7e-6.sh
```

We evaluate on WEBPRMBENCH, our pairwise preference benchmark for web process reward models. All evaluation scripts are provided in the `eval/` directory. The benchmark data is automatically downloaded from HuggingFace on first run.
```bash
# Set the model path and run from any directory
export MODEL=path/to/your/model
bash eval/WebPRMBench/eval_one_command.sh
```

Results (Pairwise and BoN Accuracy per environment) are saved to `results/`.
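As a point of reference, pairwise accuracy is the fraction of step-level preference instances on which the PRM's verdict matches the annotated preference. A toy illustration (see the evaluation scripts for the authoritative computation):

```python
def pairwise_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of preference instances where the predicted verdict matches the annotation (illustrative)."""
    assert len(predicted) == len(gold) and gold
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```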
Guided Search Explorer is an interactive platform for reward-guided web agent trajectory analysis on WebArena-Lite. It provides a transparent environment for inspecting how WebArbiter evaluates candidate actions at each decision point and selects the best action through principle-guided knockout tournaments, making the entire guided search process from the paper fully auditable.
At every step, the Explorer surfaces the sampled candidates, dynamically induced evaluation principles, structured pairwise comparison analysis (<State>, <Criteria>, <Analysis>, <Answer>), and the knockout tournament bracket that determines the winning action.
Built with React, TypeScript, FastAPI, MongoDB, and Playwright.
The Explorer visualizes the full reward-guided search pipeline: browser state observation, candidate action sampling, knockout tournament brackets, and WebArbiter's principle-guided reasoning with structured <Criteria>, <Analysis>, and <Answer> outputs.
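The knockout tournament itself is straightforward to picture: candidates are paired off, WebArbiter's pairwise verdict decides each matchup, and winners advance until one action remains. A minimal sketch, with `judge(a, b)` standing in for a WebArbiter comparison call (the real pipeline builds a full prompt from the page state and parses the <Answer> verdict):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def knockout_select(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination selection: winners of pairwise matchups advance until one candidate remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the odd candidate out gets a bye
        pool = next_round
    return pool[0]
```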
Examples: Principle-Guided Trajectory Analysis
Example 1: WebArbiter dynamically induces evaluation principles from the task intent and page state, then produces structured pairwise comparisons to select the best candidate action.
Example 2: A second trajectory illustrating how principle-guided reasoning generalizes across different tasks and web environments.
| Feature | Description |
|---|---|
| Step-by-step trajectory visualization | Browse runs, view annotated screenshots, and step through state transitions with keyboard shortcuts |
| Candidate action comparison | Inspect all sampled candidate actions at each step with confidence scores, reasoning traces, and WebArbiter's evaluation |
| Knockout tournament bracket | Visualize how WebArbiter's pairwise preference verdicts select the best action through single-elimination tournaments, the core mechanism of reward-guided search |
| Principle-guided reasoning display | View dynamically induced evaluation criteria, structured comparative analysis, and final preference verdicts for each matchup |
| Search tree view | Explore the full trajectory tree as an interactive node-link diagram, showing branching decisions, selected paths, and alternative branches |
| Reward-guided search utilities | Generate candidates via policy model sampling, run knockout tournaments with WebArbiter, and execute the selected action, all from the UI. Optional reranking is available for additional candidate filtering before the tournament |
| GIF export | Export step-by-step trajectory animations for papers and presentations |
| Batch run management | Create, import, export, and monitor multiple runs through a task queue |
An interactive demo is available on the project homepage, where you can step through real reward-guided search trajectories across five WebArena-Lite websites (GitLab, Shopping, Reddit, Map, Shopping Admin) and see how WebArbiter selects the best action at each step. The demo features:
- A guided search pipeline visualization (Sample → Tournament → Execute) showing which phase is active
- Tournament bracket diagrams illustrating the knockout elimination process
- Principle-guided evaluation panels with induced criteria and structured comparative analysis
- Interactive search tree minimaps showing the trajectory's branching structure
- Auto-play mode for hands-free trajectory replay
No WebArena environment needed. Docker only. Import trajectory ZIPs and explore them in the browser.
```bash
cd viewer
cp env.example .env   # default settings are sufficient for viewing
docker compose up
```

Open http://localhost:3000, then use Import to load trajectory ZIP files (e.g., from the Search Trajectories dataset). Each ZIP contains a `trajectory.json` and a `screenshots/` folder.
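If you want to peek inside a trajectory ZIP before importing it, something like the following works (only the `trajectory.json` / `screenshots/` layout is documented above; the JSON schema itself is not specified here, and the file name is hypothetical):

```python
import json
import zipfile

# Hypothetical file name; per the note above, each ZIP ships trajectory.json plus a screenshots/ folder.
with zipfile.ZipFile("trajectory_example.zip") as zf:
    names = zf.namelist()
    print(len([n for n in names if n.startswith("screenshots/")]), "screenshots")
    with zf.open("trajectory.json") as f:
        trajectory = json.load(f)
    # The schema is not documented here, so just inspect the top level.
    print(list(trajectory) if isinstance(trajectory, dict) else type(trajectory))
```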
To run live reward-guided search (sample → tournament → execute), you additionally need:
- A running WebArena-Lite environment. See the setup guide.
- API keys for the policy model and reward model configured in `.env`.

```bash
cd viewer
cp env.example .env   # fill in API keys and site URLs
docker compose up
```

See `viewer/env.example` for all configurable environment variables (API keys, site URLs, browser settings, etc.).
Security note: viewer/docker-compose.yml and env.example use default MongoDB credentials (admin / password) via ${MONGO_PASSWORD:-password}. That is only appropriate for local development. For any shared or production deployment, set strong MONGO_USER / MONGO_PASSWORD (and matching MONGODB_URL) in .env and avoid exposing the database port publicly.
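As a hedged illustration (variable names are taken from the note above; the MongoDB host in `MONGODB_URL` depends on your deployment, e.g. the compose service name or `localhost`, and the values are placeholders), a hardened `.env` might look like:

```
MONGO_USER=webarbiter
MONGO_PASSWORD=<long-random-secret>
MONGODB_URL=mongodb://webarbiter:<long-random-secret>@<mongo-host>:27017/?authSource=admin
```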
Local Development (without Docker)
If you prefer to run the services individually for development:
- MongoDB: Start a local MongoDB instance on the default port (`27017`). Ensure `MONGODB_URL` in `.env` includes authentication parameters (e.g., `mongodb://admin:password@localhost:27017/?authSource=admin`).
- Backend: Install Python dependencies and start the FastAPI server:

  ```bash
  cd viewer
  pip install -r requirements.txt
  python -m src.server.main
  ```

- Frontend: Install Node dependencies and start the React dev server:

  ```bash
  cd viewer/src/client
  npm install
  npm run dev
  ```
For detailed instructions, see Explorer Quickstart.
Coming Soon. Full reward-guided search trajectories will be released at WebArbiter-Trajectories.
The trajectory dataset contains 72 trajectories across 5 WebArena-Lite websites (GitLab, Map, Reddit, Shopping, Shopping Admin), generated using GPT-4o / GPT-4o-mini as the policy model and WebArbiter as the reward model. Each trajectory includes step-by-step browser screenshots, 5 candidate actions per step with confidence scores and reasoning traces, WebArbiter's principle-guided evaluations with structured <State>, <Criteria>, <Analysis>, and <Answer> outputs, tournament bracket results, and the full search tree structure.
The trajectories are compatible with the Guided Search Explorer for interactive visualization.
```
WebArbiter/
├── webarbiter/                  # Training code
│   ├── llamafactory/            # LLaMA-Factory fork (SFT distillation)
│   ├── verl/                    # Custom veRL extensions (RLVR)
│   │   ├── trainer/             # PPO/GRPO trainer & entry point
│   │   ├── utils/               # Dataset loader & reward functions
│   │   └── workers/             # FSDP distributed workers
│   └── scripts/                 # Training launch scripts
│       ├── distill/             # SFT distillation (3B, 7B)
│       └── RLVR/                # RLVR fine-tuning (3B, 7B)
├── eval/                        # Evaluation
│   └── WebPRMBench/             # WEBPRMBENCH benchmark
│       └── script/              # vLLM-based evaluation script
├── viewer/                      # Guided Search Explorer (full-stack app)
│   ├── src/
│   │   ├── server/              # FastAPI backend (API, MongoDB, task queue)
│   │   ├── client/              # React/TypeScript frontend
│   │   ├── browser_env/         # Playwright browser environment
│   │   ├── evaluation_harness/  # Task evaluation logic
│   │   ├── llms/                # LLM provider integrations
│   │   └── prompts/             # Prompt templates
│   ├── docker-compose.yml       # One-command deployment
│   ├── env.example              # Environment variable template
│   └── docs/                    # Quickstart guides (EN / ZH)
├── verl/                        # veRL framework (distributed RL for LLMs)
└── res/                         # Figures and images
```
This project builds upon the following open-source projects:
- LLaMA-Factory: Unified framework for LLM fine-tuning
- veRL: Distributed RL training framework for LLMs
- WebArena: Realistic web environment benchmark
- VisualAgentBench: Visual agent evaluation framework
If you find this work useful, please cite our paper:
```bibtex
@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```


