- [2026-04-20] The data and cold-start model are available in this Hugging Face collection: https://huggingface.co/collections/namezz/checklist.
- [2026-01-31] We released the code and paper for CM2.
CM2 (RL with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use) is a reinforcement learning (RL) framework for training agents in complex, open-ended tool-use scenarios.
Current RL approaches often rely on verifiable rewards (e.g., exact match), which are scarce in realistic multi-turn and multi-step interactions. CM2 introduces Checklist Rewards: we decompose an agent's intended behavior into fine-grained, binary, evidence-grounded criteria.
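To make the idea of a checklist concrete, here is a minimal sketch of how binary, evidence-grounded criteria might be represented and scored. The `ChecklistItem` class, the `checklist_score` helper, the example criteria, and the choice of the mean as the aggregate are all illustrative assumptions and are not taken from the CM2 codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """Hypothetical container for one checklist criterion."""
    criterion: str   # what the agent was expected to do
    satisfied: bool  # binary verdict (assumed to come from an LLM annotator)
    evidence: str    # part of the trajectory that supports the verdict


def checklist_score(items: list[ChecklistItem]) -> float:
    """Fraction of satisfied items; 0.0 for an empty checklist (assumed convention)."""
    if not items:
        return 0.0
    return sum(item.satisfied for item in items) / len(items)


# Made-up criteria for a flight-booking dialogue; tool names are hypothetical.
items = [
    ChecklistItem("Searches for flights before booking", True, "turn 2: search_flights(...)"),
    ChecklistItem("Confirms the price with the user", False, ""),
    ChecklistItem("Books the flight the user selected", True, "turn 5: book_flight(...)"),
    ChecklistItem("Reports the confirmation number", True, "turn 6: assistant message"),
]
print(checklist_score(items))  # 0.75
```

Because each item is a yes/no judgment tied to evidence, a low score can be traced back to the specific criteria that failed, which a single scalar reward does not allow.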
Key Features:
- Checklist Rewards: Replaces vague scalar rewards with interpretable, binary checklist items annotated by LLMs.
- Sparse Assignment, Dense Criteria: Adopts a "Sparse in assignment; Dense in criteria" strategy to balance signal informativeness with training stability.
- Scalable Tool Environment: Trains in an LLM-simulated environment capable of handling 5,000+ tools without heavy engineering overhead.
- Significant Performance: Achieves significant improvements over SFT on $\tau^2$-Bench (+8 pts), BFCL-V4 (+10 pts), and ToolSandbox (+12 pts).
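One plausible reading of "Sparse in assignment; Dense in criteria" is sketched below: many binary criteria are evaluated, but the resulting score is attached to a single position in the response rather than spread over every token. The function name `sparse_token_rewards`, the mean aggregation, and the choice of the final token as the assignment point are assumptions made for illustration; the actual assignment scheme is defined in the paper and code.

```python
def sparse_token_rewards(num_tokens: int, verdicts: list[bool]) -> list[float]:
    """Return a per-token reward vector that is zero except at the last token.

    verdicts:   dense set of binary checklist outcomes for the response
    num_tokens: length of the response in tokens
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Dense criteria: aggregate all binary verdicts into one score (assumed: mean).
    score = sum(verdicts) / len(verdicts) if verdicts else 0.0
    # Sparse assignment: only the final token carries the reward (assumed).
    rewards = [0.0] * num_tokens
    rewards[-1] = score
    return rewards


print(sparse_token_rewards(5, [True, True, False, True]))
# [0.0, 0.0, 0.0, 0.0, 0.75]
```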
To set up the environment, follow the steps in ./config_env.sh.
Dependencies: This project relies on VeRL for RL training and LLaMA-Factory for SFT.
To run the pipeline, follow ./pipeline/run.sh.
We evaluate CM2 on three major benchmarks: $\tau^2$-Bench, BFCL-V4, and ToolSandbox.
Code for evaluation will be released soon.
- VeRL: For the RL training framework.
- LLaMA-Factory: For the SFT implementation.
- Qwen: For the powerful base models.
If you use this code, please cite our paper:
@misc{zhang2026cm2reinforcementlearningchecklist,
title={CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use},
author={Zhen Zhang and Kaiqiang Song and Xun Wang and Yebowen Hu and Weixiang Yan and Chenyang Zhao and Henry Peng Zou and Haoyun Deng and Sathish Reddy Indurthi and Shujian Liu and Simin Ma and Xiaoyang Wang and Xin Eric Wang and Song Wang},
year={2026},
eprint={2602.12268},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.12268},
}