🤗 Models | 📊 Dataset | 📄 Paper | 🌐 Website
WebArbiter is a reasoning-first, principle-inducing Process Reward Model (PRM) for web agents. It formulates step-level reward modeling as structured text generation, producing interpretable justifications and preference verdicts that explicitly assess task progress. Trained via reasoning distillation and reinforcement learning, WebArbiter delivers robust, progress-aware supervision for long-horizon web navigation. This repository also releases WEBPRMBENCH, a comprehensive benchmark for evaluating WebPRMs across diverse real-world web environments.
- [2026/05] Search trajectories coming soon! Full reward-guided search trajectories from WebArena-Lite (72 trajectories across 5 websites) will be released on HuggingFace.
- [2026/04] Code, models, training data, and WEBPRMBENCH released! See our HuggingFace collection.
- [2026/01] Paper accepted at ICLR 2026.
- Reasoning as reward modeling. Produces structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- Principle-inducing evaluation. Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- Two-stage training. Reasoning distillation (SFT) followed by RL with Verifiable Rewards (RLVR) to correct teacher biases and align verdicts with ground-truth correctness.
- State-of-the-art performance. WebArbiter-7B outperforms GPT-5 by 9.1 points and surpasses the previous SOTA WebPRM (WebShepherd-8B) by 31 points in Avg. BoN Acc on WEBPRMBENCH.
- WEBPRMBENCH. The first comprehensive WebPRM benchmark spanning 4 web environments (AssistantBench, Mind2Web, WorkArena, WebArena) with 1,150 step-level preference instances.
- Reward-guided search. Guides Best-of-N selection or tree search at inference time, achieving up to +6.4 points in success rate on WebArena-Lite over the best prior WebPRM.
- Guided Search Explorer. Full-stack web app for trajectory visualization, candidate comparison, knockout tournament brackets, principle-guided reasoning inspection, and reward-guided search experimentation. Try the interactive demo.
- 🧩 Installation
- 🚀 Quick Start
- 🏋️ Training Workflow
- 🧪 Evaluation
- 🖥️ Guided Search Explorer
- 🗂️ Search Trajectories
- 📁 Project Structure
- 🙏 Acknowledgements
- 📖 Citation
```bash
conda create -n llamafactory python=3.11 -y
conda activate llamafactory
cd webarbiter/llamafactory
pip install -e .
```

```bash
conda create -n verl python=3.11 -y
conda activate verl
# Install veRL
cd verl
pip install -e .
cd ..
# We recommend installing vllm in a directory separate from WebArbiter
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ed6e9075d31e32c8548b480a47d1ffb77da1f54c
git cherry-pick caac5c2e597b1780c3df54a537c34e6061c32cff
export VLLM_COMMIT=ed6e9075d31e32c8548b480a47d1ffb77da1f54c
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install --editable .
cd ..
# flash-attention 2 (>2x speed-up)
pip install flash-attn==2.7.2.post1 --no-build-isolation
```
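To confirm the editable installs resolved correctly, a quick sanity check along these lines can be run inside the `verl` environment (illustrative only, not part of the repository):

```python
# Minimal sanity check for the installs above (illustrative only).
import flash_attn
import torch
import transformers
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("flash-attn:", flash_attn.__version__)
```

Once the environment checks out, the released checkpoints can be queried directly with `transformers`: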
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Use a local checkpoint path or a HuggingFace repo id.
# Available models (HuggingFace):
# WebArbiter-7B, WebArbiter-3B, WebArbiter-8B-Qwen3, WebArbiter-4B-Qwen3
model_name_or_path = "ZYao720/WebArbiter-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
# Fill in your prompt here.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for prompt format examples.
user_prompt = "" # Your evaluation prompt
messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
gen = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
)
output = tokenizer.decode(gen[0][len(input_ids[0]) :], skip_special_tokens=True)
print(output)
# The output contains structured <State>, <Criteria>, <Analysis>, and <Answer> tags
```
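If only the final verdict is needed, the <Answer> span can be pulled out of the generated text. A minimal sketch, assuming the tags appear literally as in the prompt examples linked above (illustrative helper, not part of the repository):

```python
import re

def extract_answer(generation: str) -> str | None:
    """Return the contents of the last <Answer>...</Answer> span, or None if absent (illustrative)."""
    spans = re.findall(r"<Answer>(.*?)</Answer>", generation, flags=re.DOTALL)
    return spans[-1].strip() if spans else None

# Reuses the `output` string produced by the quick-start snippet above.
print("Verdict:", extract_answer(output))
```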
All training recipes live in `./webarbiter/`. For instruct models, we use a simple two-stage pipeline:
- Distillation (SFT): `webarbiter/scripts/distill/train_*_sft.sh`
- RL with Verifiable Rewards (RLVR): `webarbiter/scripts/RLVR/train_*.sh` (a conceptual reward sketch follows below)
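Conceptually, the RLVR stage rewards a generated evaluation only when its final <Answer> verdict agrees with the ground-truth preference label. A minimal sketch of such a verifiable reward (simplified; the repository's actual reward functions live under `webarbiter/verl/utils/` and may differ):

```python
import re

def verifiable_reward(generation: str, gold_verdict: str) -> float:
    """Toy verifiable reward: 1.0 if the predicted <Answer> matches the gold preference label, else 0.0."""
    match = re.search(r"<Answer>(.*?)</Answer>", generation, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed outputs receive no reward
    return 1.0 if match.group(1).strip() == gold_verdict.strip() else 0.0
```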
```bash
# 1️⃣ Distillation (SFT)
conda activate llamafactory
cd webarbiter/llamafactory
sbatch ../scripts/distill/train_7instruct_sft.sh

# 2️⃣ RLVR fine-tuning
conda deactivate
conda activate verl
cd ../..
sbatch webarbiter/scripts/RLVR/train_rm_r1_sft_nc_b256_8e-4_r128_e5_rlvr_qwen2.5_instruct_7b_7e-6.sh
```

We evaluate on WEBPRMBENCH, our pairwise preference benchmark for web process reward models. All evaluation scripts are provided in the `eval/` directory. The benchmark data is automatically downloaded from HuggingFace on first run.
```bash
# Set the model path and run from any directory
export MODEL=path/to/your/model
bash eval/WebPRMBench/eval_one_command.sh
```

Results (Pairwise and BoN Accuracy per environment) are saved to `results/`.
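As a point of reference, pairwise accuracy is the fraction of step-level preference instances on which the PRM's verdict matches the annotated preference. A toy illustration (see the evaluation scripts for the authoritative computation):

```python
def pairwise_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of preference instances where the predicted verdict matches the annotation (illustrative)."""
    assert len(predicted) == len(gold) and gold
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```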
Guided Search Explorer is an interactive platform for reward-guided web agent trajectory analysis on WebArena-Lite. It provides a transparent environment for inspecting how WebArbiter evaluates candidate actions at each decision point and selects the best action through principle-guided knockout tournaments, making the entire guided search process from the paper fully auditable.
At every step, the Explorer surfaces the sampled candidates, dynamically induced evaluation principles, structured pairwise comparison analysis (<State>, <Criteria>, <Analysis>, <Answer>), and the knockout tournament bracket that determines the winning action.
Built with React, TypeScript, FastAPI, MongoDB, and Playwright.
The Explorer visualizes the full reward-guided search pipeline: browser state observation, candidate action sampling, knockout tournament brackets, and WebArbiter's principle-guided reasoning with structured <Criteria>, <Analysis>, and <Answer> outputs.
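The knockout tournament itself is straightforward to picture: candidates are paired off, WebArbiter's pairwise verdict decides each matchup, and winners advance until one action remains. A minimal sketch, with `judge(a, b)` standing in for a WebArbiter comparison call (the real pipeline builds a full prompt from the page state and parses the <Answer> verdict):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def knockout_select(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination selection: winners of pairwise matchups advance until one candidate remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the odd candidate out gets a bye
        pool = next_round
    return pool[0]
```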
Examples: Principle-Guided Trajectory Analysis
Example 1: WebArbiter dynamically induces evaluation principles from the task intent and page state, then produces structured pairwise comparisons to select the best candidate action.
Example 2: A second trajectory illustrating how principle-guided reasoning generalizes across different tasks and web environments.
| Feature | Description |
|---|---|
| Step-by-step trajectory visualization | Browse runs, view annotated screenshots, and step through state transitions with keyboard shortcuts |
| Candidate action comparison | Inspect all sampled candidate actions at each step with confidence scores, reasoning traces, and WebArbiter's evaluation |
| Knockout tournament bracket | Visualize how WebArbiter's pairwise preference verdicts select the best action through single-elimination tournaments, the core mechanism of reward-guided search |
| Principle-guided reasoning display | View dynamically induced evaluation criteria, structured comparative analysis, and final preference verdicts for each matchup |
| Search tree view | Explore the full trajectory tree as an interactive node-link diagram, showing branching decisions, selected paths, and alternative branches |
| Reward-guided search utilities | Generate candidates via policy model sampling, run knockout tournaments with WebArbiter, and execute the selected action, all from the UI. Optional reranking is available for additional candidate filtering before the tournament |
| GIF export | Export step-by-step trajectory animations for papers and presentations |
| Batch run management | Create, import, export, and monitor multiple runs through a task queue |
An interactive demo is available on the project homepage, where you can step through real reward-guided search trajectories across five WebArena-Lite websites (GitLab, Shopping, Reddit, Map, Shopping Admin) and see how WebArbiter selects the best action at each step. The demo features:
- A guided search pipeline visualization (Sample → Tournament → Execute) showing which phase is active
- Tournament bracket diagrams illustrating the knockout elimination process
- Principle-guided evaluation panels with induced criteria and structured comparative analysis
- Interactive search tree minimaps showing the trajectory's branching structure
- Auto-play mode for hands-free trajectory replay
No WebArena environment needed. Docker only. Import trajectory ZIPs and explore them in the browser.
```bash
cd viewer
cp env.example .env   # default settings are sufficient for viewing
docker compose up
```

Open http://localhost:3000, then use Import to load trajectory ZIP files (e.g., from the Search Trajectories dataset). Each ZIP contains a `trajectory.json` and a `screenshots/` folder.
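If you want to peek inside a trajectory ZIP before importing it, something like the following works (only the `trajectory.json` / `screenshots/` layout is documented above; the JSON schema itself is not specified here, and the file name is hypothetical):

```python
import json
import zipfile

# Hypothetical file name; per the note above, each ZIP ships trajectory.json plus a screenshots/ folder.
with zipfile.ZipFile("trajectory_example.zip") as zf:
    names = zf.namelist()
    print(len([n for n in names if n.startswith("screenshots/")]), "screenshots")
    with zf.open("trajectory.json") as f:
        trajectory = json.load(f)
    # The schema is not documented here, so just inspect the top level.
    print(list(trajectory) if isinstance(trajectory, dict) else type(trajectory))
```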
To run live reward-guided search (sample → tournament → execute), you additionally need:
- A running WebArena-Lite environment. See the setup guide.
- API keys for the policy model and reward model configured in `.env`.

```bash
cd viewer
cp env.example .env   # fill in API keys and site URLs
docker compose up
```

See `viewer/env.example` for all configurable environment variables (API keys, site URLs, browser settings, etc.).
Security note: viewer/docker-compose.yml and env.example use default MongoDB credentials (admin / password) via ${MONGO_PASSWORD:-password}. That is only appropriate for local development. For any shared or production deployment, set strong MONGO_USER / MONGO_PASSWORD (and matching MONGODB_URL) in .env and avoid exposing the database port publicly.
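As a hedged illustration (variable names are taken from the note above; the MongoDB host in `MONGODB_URL` depends on your deployment, e.g. the compose service name or `localhost`, and the values are placeholders), a hardened `.env` might look like:

```
MONGO_USER=webarbiter
MONGO_PASSWORD=<long-random-secret>
MONGODB_URL=mongodb://webarbiter:<long-random-secret>@<mongo-host>:27017/?authSource=admin
```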
Local Development (without Docker)
If you prefer to run the services individually for development:
- MongoDB: Start a local MongoDB instance on the default port (`27017`). Ensure `MONGODB_URL` in `.env` includes authentication parameters (e.g., `mongodb://admin:password@localhost:27017/?authSource=admin`).
- Backend: Install Python dependencies and start the FastAPI server:

  ```bash
  cd viewer
  pip install -r requirements.txt
  python -m src.server.main
  ```

- Frontend: Install Node dependencies and start the React dev server:

  ```bash
  cd viewer/src/client
  npm install
  npm run dev
  ```
For detailed instructions, see Explorer Quickstart.
Coming Soon. Full reward-guided search trajectories will be released at WebArbiter-Trajectories.
The trajectory dataset contains 72 trajectories across 5 WebArena-Lite websites (GitLab, Map, Reddit, Shopping, Shopping Admin), generated using GPT-4o / GPT-4o-mini as the policy model and WebArbiter as the reward model. Each trajectory includes step-by-step browser screenshots, 5 candidate actions per step with confidence scores and reasoning traces, WebArbiter's principle-guided evaluations with structured <State>, <Criteria>, <Analysis>, and <Answer> outputs, tournament bracket results, and the full search tree structure.
The trajectories are compatible with the Guided Search Explorer for interactive visualization.
```
WebArbiter/
├── webarbiter/                  # Training code
│   ├── llamafactory/            # LLaMA-Factory fork (SFT distillation)
│   ├── verl/                    # Custom veRL extensions (RLVR)
│   │   ├── trainer/             # PPO/GRPO trainer & entry point
│   │   ├── utils/               # Dataset loader & reward functions
│   │   └── workers/             # FSDP distributed workers
│   └── scripts/                 # Training launch scripts
│       ├── distill/             # SFT distillation (3B, 7B)
│       └── RLVR/                # RLVR fine-tuning (3B, 7B)
├── eval/                        # Evaluation
│   └── WebPRMBench/             # WEBPRMBENCH benchmark
│       └── script/              # vLLM-based evaluation script
├── viewer/                      # Guided Search Explorer (full-stack app)
│   ├── src/
│   │   ├── server/              # FastAPI backend (API, MongoDB, task queue)
│   │   ├── client/              # React/TypeScript frontend
│   │   ├── browser_env/         # Playwright browser environment
│   │   ├── evaluation_harness/  # Task evaluation logic
│   │   ├── llms/                # LLM provider integrations
│   │   └── prompts/             # Prompt templates
│   ├── docker-compose.yml       # One-command deployment
│   ├── env.example              # Environment variable template
│   └── docs/                    # Quickstart guides (EN / ZH)
├── verl/                        # veRL framework (distributed RL for LLMs)
└── res/                         # Figures and images
```
This project builds upon the following open-source projects:
- LLaMA-Factory: Unified framework for LLM fine-tuning
- veRL: Distributed RL training framework for LLMs
- WebArena: Realistic web environment benchmark
- VisualAgentBench: Visual agent evaluation framework
If you find this work useful, please cite our paper:
```bibtex
@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```


