AggAgent is a framework for parallel test-time scaling of long-horizon agentic tasks. Instead of running a single agent trajectory, you sample K independent trajectories and then use AggAgent to synthesize the best final solution. AggAgent inspects tool observations across trajectories, cross-checks reasoning, and resolves conflicts, producing answers that are more reliable than any single run.
TL;DR: Sample K agent trajectories in parallel → aggregate with AggAgent → get a better solution.
The aggagent package is available on PyPI. If you already have multiple agent trajectories and want to aggregate them, this is all you need:
pip install aggagent

from aggagent import AggAgent
# Each trajectory is a list of message dicts (OpenAI message format)
traj_1 = [
{"role": "user", "content": "Who won the 1986 FIFA World Cup?"},
{"role": "assistant", "content": "...", "reasoning": "...", "tool_calls": [...]},
{"role": "tool", "tool_call_id": "...", "name": "search", "content": "..."},
{"role": "assistant", "content": "Argentina won the 1986 FIFA World Cup.", "reasoning": "..."},
]
# ... collect traj_2, traj_3, traj_4 from parallel runs
agent = AggAgent(
model="gpt-4.1", # aggregation model; use api_base for local vLLM
task="browsecomp", # task type
# optionally override litellm kwargs (messages and tools are always injected)
# llm_kwargs={"model": "gemini/gemini-2.0-flash", "api_key": "...", "temperature": 0.7},
)
result = agent.run(
question="Who won the 1986 FIFA World Cup?",
trajectories=[traj_1, traj_2, traj_3, traj_4],
)
print(result["solution"]) # self-contained answer string
print(result["reason"])    # meta-reasoning about how trajectories were evaluated

See the AggAgent Package section for the full API.
Requires uv.
git clone https://github.com/princeton-pli/AggAgent.git
cd AggAgent
uv sync --extra rollout

Or with pip:
git clone https://github.com/princeton-pli/AggAgent.git
cd AggAgent
pip install -e ".[rollout]"

Copy .env.example to .env and fill in your API keys:
cp .env.example .env

OPENAI_API_KEY=sk-... # for GPT-based judge / aggregation
GEMINI_API_KEY=... # for Gemini-based judge / aggregation
SERPER_KEY_ID=... # for web search (Google Serper)
The rollout stage runs N independent ReAct agent trajectories over a benchmark dataset. Each trajectory uses web search and page-visit tools to gather evidence and arrive at an answer.
If you only need to reproduce or build on top of the base ReAct trajectories from the paper, skip rollout generation and download them directly. We release roll_out_count = 8 parallel rollouts per benchmark instance for three backbones: GLM-4.7-Flash, MiniMax-M2.5, Qwen3.5-122B-A10B.
| Source | Coverage |
|---|---|
| 🤗 yoonsanglee/aggagent collection | DeepSearchQA, HLE, HealthBench, ResearchRubrics |
| Google Drive aggagent-browsecomp-react.tar | BrowseComp, BrowseComp-Plus |
BrowseComp / BrowseComp-Plus are distributed as a tar via Google Drive rather than Hugging Face to limit web-crawl contamination of these evals.
The published format is flat parquet (one row per rollout). The aggregation pipeline expects the on-disk layout output/rollout/<MODEL>/<DATASET>/iter{1..N}/<question>.json, so use scripts/hf_to_rollout.py to materialize that:
# From Hugging Face
python scripts/hf_to_rollout.py \
--repo yoonsanglee/deepsearchqa-react \
--model Qwen3.5-122B-A10B \
--out output/rollout/Qwen3.5-122B-A10B/deepsearchqa
# From a local parquet (e.g. extracted from the BrowseComp tar)
python scripts/hf_to_rollout.py path/to/Qwen3.5-122B-A10B.parquet \
    --out output/rollout/Qwen3.5-122B-A10B/browsecomp

After conversion you can run Aggregation directly on the resulting directory.
# BrowseComp, DeepSearchQA, HealthBench, ResearchRubrics
uv run python scripts/download_dataset.py
# BrowseComp-Plus (also downloads FAISS indexes and corpus)
uv run python scripts/download_dataset.py --browsecomp-plus

First, serve your model with vLLM:
uv run vllm serve <MODEL_PATH> \
--served-model-name <MODEL_NAME> \
--host 0.0.0.0 --port 6000 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
    --tool-call-parser glm47 # adjust per model

Then run rollouts:
# Run all datasets with the settings in scripts/rollout.sh
bash scripts/rollout.sh
# Or run a single dataset directly
uv run python rollout/run_multi_react.py \
--model GLM-4.7-Flash \
--dataset browsecomp \
--roll_out_count 8 \
--max_workers 3 \
--api_base http://localhost:6000/v1 \
    --output_dir output/rollout/GLM-4.7-Flash/browsecomp

Results are written as individual JSON files under output/rollout/<MODEL>/<DATASET>/iter{k}/.
Supported datasets: browsecomp, browsecomp-plus, hle, deepsearchqa, healthbench, researchrubrics
For detailed instructions (model-specific flags, distributed multi-worker splits, BrowseComp-Plus local retrieval setup), see rollout/README.md.
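If you want to feed these rollouts straight into the AggAgent package, the per-question JSON files can be collected across the iter{k}/ directories. A minimal sketch, assuming each <question>.json stores the trajectory's message list (adjust the `json.load` access if the files wrap it in extra keys; the filename in the usage comment is illustrative):

```python
import json
from pathlib import Path

def load_trajectories(rollout_dir, question_file):
    """Collect one trajectory per iter{k}/ subdirectory for a given question.

    Assumes each per-question JSON file holds that run's message list.
    """
    trajectories = []
    for iter_dir in sorted(Path(rollout_dir).glob("iter*")):
        path = iter_dir / question_file
        if path.exists():
            with open(path) as f:
                trajectories.append(json.load(f))
    return trajectories

# trajs = load_trajectories("output/rollout/GLM-4.7-Flash/browsecomp", "q_001.json")
# result = agent.run(question="...", trajectories=trajs)
```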
Given a directory of rollout results, aggregation/aggregate.py computes aggregation metrics across a range of strategies and k values.
uv run python aggregation/aggregate.py \
--strategy heuristic \
--task browsecomp \
    output/rollout/GLM-4.7-Flash/browsecomp

# SolAgg: integrate raw predictions from k trajectories
uv run python aggregation/aggregate.py \
--strategy solagg \
--model GLM-4.7-Flash \
--api_base http://localhost:6000/v1 \
--task browsecomp \
--k 4 \
output/rollout/GLM-4.7-Flash/browsecomp
# SummAgg: summarize each trajectory first, then integrate
uv run python aggregation/aggregate.py \
--strategy summagg \
--model GLM-4.7-Flash \
--api_base http://localhost:6000/v1 \
--task browsecomp \
--k 4 \
    output/rollout/GLM-4.7-Flash/browsecomp

# Using the script
bash scripts/aggregation.sh
# Or directly
uv run python aggregation/aggregate.py \
--strategy aggagent \
--model GLM-4.7-Flash \
--api_base http://localhost:6000/v1 \
--task browsecomp \
--k 4 \
    output/rollout/GLM-4.7-Flash/browsecomp

Logs are written to output/aggregation/<MODEL>/<DIR_TAG>/aggagent_logs_k{k}.jsonl and a summary to aggagent_stats_k{k}.json.
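The per-question log is JSONL (one JSON record per line), so it can be inspected with a few lines of Python. A small sketch; the exact record fields depend on the strategy, so it only parses the file and leaves field access to you:

```python
import json

def read_jsonl(path):
    """Parse a JSONL file (one JSON record per line) into a list of dicts,
    skipping blank lines."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# logs = read_jsonl("output/aggregation/GLM-4.7-Flash/browsecomp/aggagent_logs_k4.jsonl")
# print(len(logs), "records; fields:", sorted(logs[0]) if logs else [])
```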
| Strategy | Type | Description |
|---|---|---|
| `pass` | Heuristic | Pass@k upper bound: correct if any trajectory is correct |
| `mv` | Heuristic | Majority voting over extracted answers |
| `wmv` | Heuristic | Confidence-weighted majority voting |
| `bon` | Heuristic | Best of N: pick the trajectory with the highest confidence |
| `fewtool` | Heuristic | Pick the trajectory that used the fewest tool calls |
| `solagg` | LLM-based | Feed k raw predictions to an LLM to integrate |
| `summagg` | LLM-based | Summarize each trajectory into a report, then integrate |
| `aggagent` | LLM-based | Agentic aggregation: inspect tool evidence, cross-check, synthesize |
uv run python aggregation/aggregate.py \
--strategy all \
--model GLM-4.7-Flash \
--api_base http://localhost:6000/v1 \
--task browsecomp \
    output/rollout/GLM-4.7-Flash/browsecomp

The aggagent Python package exposes a single class: AggAgent.
pip install aggagent

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `""` | Model name. Use the model's served name for local vLLM, or `gpt-4.1` / `gemini-...` for API models. |
| `api_base` | `str \| None` | `None` | Base URL for a local vLLM server (e.g. `http://localhost:6000/v1`). Set `None` for OpenAI/Gemini. |
| `task` | `str` | `""` | Task type. Controls the output format of the `finish` tool. Use one of the supported task names or `""` for generic short-answer tasks. |
| `max_context_tokens` | `int` | `102400` | Approximate token budget. When exceeded, the agent is forced to call `finish` immediately. |
| `llm_kwargs` | `dict \| None` | `None` | If provided, passed directly to `litellm.completion` (bypassing built-in defaults). `messages` and `tools` are always injected. Useful for setting `model`, `api_key`, `api_base`, `temperature`, `top_p`, `max_tokens`, etc. |
Supported task types: browsecomp, browsecomp-plus, hle, deepsearchqa, healthbench, researchrubrics
- Short-answer tasks (`browsecomp`, `hle`, `deepsearchqa`): solution format is `<explanation>...</explanation><answer>...</answer>`
- Long-form tasks (`healthbench`, `researchrubrics`): solution format is a full synthesized report with inline citations
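For the short-answer format, the final answer string can be pulled out of the solution with a small regex helper. An illustrative sketch, not part of the package API:

```python
import re

def extract_answer(solution):
    """Return the contents of the <answer>...</answer> tag from a
    short-answer solution string, or None if no tag is present."""
    match = re.search(r"<answer>(.*?)</answer>", solution, re.DOTALL)
    return match.group(1).strip() if match else None

# extract_answer("<explanation>...</explanation><answer>Argentina</answer>")
# -> "Argentina"
```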
result = agent.run(
question="...",
trajectories=[traj_1, traj_2, ...],
)

| Parameter | Type | Description |
|---|---|---|
| `question` | `str` | The task or question being answered. |
| `trajectories` | `list[list[dict]]` | N trajectories. Each trajectory is a list of message dicts in OpenAI message format (`role`, `content`, optionally `tool_calls`, `reasoning_content`). |
Returns on success:
{"solution": str, "reason": str}

- `solution`: a self-contained answer string (does not reference trajectories or agents).
- `reason`: the agent's meta-reasoning about how it evaluated and reconciled trajectories.
Returns on failure:
{"solution": None, "reason": None, "error": str}

Each trajectory is a list of messages with standard OpenAI roles. Tool calls and tool responses are supported:
trajectory = [
{"role": "system", "content": "You are a research assistant..."},
{"role": "user", "content": "What year did ..."},
{"role": "assistant", "content": "", "tool_calls": [
{"id": "call_1", "type": "function", "function": {
"name": "search", "arguments": '{"query": "..."}'
}}
]},
{"role": "tool", "tool_call_id": "call_1", "name": "search",
"content": "Search results: ..."},
{"role": "assistant", "content": "Based on the search results, the answer is ..."},
]

Reasoning/thinking tokens can be included under the reasoning_content key in assistant messages.
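Before passing trajectories to run(), it can help to sanity-check that they follow this message format. A minimal sketch under the format assumptions above (`validate_trajectory` is an illustrative helper, not part of the package):

```python
def validate_trajectory(trajectory):
    """Sanity-check a trajectory: every message needs a known OpenAI role,
    and every tool message must answer an earlier assistant tool call."""
    seen_call_ids = set()
    for msg in trajectory:
        role = msg.get("role")
        assert role in {"system", "user", "assistant", "tool"}, f"bad role: {role!r}"
        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                seen_call_ids.add(call["id"])
        elif role == "tool":
            assert msg["tool_call_id"] in seen_call_ids, "tool reply without matching call"

# validate_trajectory(traj_1)  # raises AssertionError on a malformed trajectory
```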
AggAgent operates as a tool-calling agent with four internal tools:
| Tool | Description |
|---|---|
| `get_solution` | Retrieve the final message from one or all trajectories |
| `search_trajectory` | Search for a keyword/phrase within a trajectory (ROUGE-L ranked) |
| `get_segment` | Read a contiguous range of steps from a trajectory in full |
| `finish` | Submit the final synthesized answer |
The agent is instructed to: survey trajectory metadata → retrieve final solutions → verify key claims against raw tool observations (search_trajectory + get_segment) → cross-check reasoning → call finish.
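The ROUGE-L ranking behind search_trajectory can be illustrated with the standard LCS-based recall score. A minimal sketch of the idea (not the package's actual implementation, which may tokenize and normalize differently):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(query, step_text):
    """Fraction of query tokens covered by an LCS with the step text."""
    q, s = query.lower().split(), step_text.lower().split()
    return lcs_len(q, s) / len(q) if q else 0.0

def rank_steps(query, steps):
    """Indices of trajectory steps, best ROUGE-L match first."""
    return sorted(range(len(steps)),
                  key=lambda i: rouge_l_recall(query, steps[i]), reverse=True)
```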
AggAgent is also available as a Claude Code skill. The skill lives in .claude/skills/aggagent/.
The skill expects a single flat directory containing one JSON file per trajectory, all for the same question. Use scripts/collect_trajs.py to assemble this from a rollout output directory:
# Collect trajectories for a question matching "MMORPG"
python scripts/collect_trajs.py output/rollout/GLM-4.7-Flash/deepsearchqa "MMORPG"
# Then aggregate
/aggagent trajs_mmorpg/

The skill surveys all final solutions, verifies key claims against raw tool observations, and synthesizes a final answer: the same aggregation logic as the Python package, but interactive inside Claude Code.
If you find this work useful, please cite:
@article{lee2026agentic,
title={Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks},
author={Yoonsang Lee and Howard Yen and Xi Ye and Danqi Chen},
journal={arXiv preprint arXiv:2604.11753},
year={2026}
}

Princeton Language and Intelligence (PLI) · Apache 2.0 License