GitHub - Qwen-Applications/GD2PO

GD²PO: Group-Dynamic Reward-Decoupled Policy Optimization for Mitigating Multi-Reward RL Conflicts

Qwen Large Model Application Team, Alibaba

📖 Overview

This repository provides an implementation of GD²PO (Group-Dynamic Reward-Decoupled Policy Optimization), a conflict-aware multi-reward policy optimization method for LLM post-training. When multiple reward signals are aggregated, a single rollout may have positive advantages on some reward dimensions and negative advantages on others, so the signals partly or fully cancel out. GD²PO addresses this by filtering conflicting rollouts before aggregation and by reweighting each query's update strength according to how strongly its rewards agree (reward consensus).
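To make the cancellation problem concrete, here is a minimal sketch in plain Python. It is not the code in `verl/`, and it is not necessarily the formulation used in the paper. It rests on three assumptions made only for illustration: advantages are normalized per reward dimension within one query's rollout group, a rollout counts as "conflicting" when its advantages have mixed signs, and the consensus weight is the fraction of rollouts that are kept. All function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize each reward dimension separately within one rollout group.

    rewards: list of rollouts, each a list of K reward values.
    Returns a list of the same shape holding per-dimension advantages.
    """
    num_dims = len(rewards[0])
    columns = [[row[k] for row in rewards] for k in range(num_dims)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(row[k] - stats[k][0]) / (stats[k][1] + eps) for k in range(num_dims)]
        for row in rewards
    ]


def is_conflicting(adv_row, tol=1e-6):
    """Assumed criterion: both a positive and a negative advantage are present."""
    return any(a > tol for a in adv_row) and any(a < -tol for a in adv_row)


def filtered_advantages(rewards):
    """Return (aggregated advantage per rollout, consensus weight) for one query."""
    adv = group_advantages(rewards)
    keep = [not is_conflicting(row) for row in adv]
    # Conflicting rollouts contribute nothing; the rest sum their dimensions.
    aggregated = [sum(row) if kept else 0.0 for row, kept in zip(adv, keep)]
    # Assumed consensus weight: the share of rollouts whose rewards agree.
    consensus = sum(keep) / len(keep)
    return aggregated, consensus


# Four rollouts for one query, two rewards (say, correctness and brevity).
rewards = [
    [1.0, 1.0],  # good on both dimensions
    [1.0, 0.0],  # correct but long
    [1.0, 0.0],  # correct but long
    [0.0, 1.0],  # short but wrong
]
naive = [sum(row) for row in group_advantages(rewards)]
aggregated, consensus = filtered_advantages(rewards)
print("naive sum:", naive)
print("filtered: ", aggregated, "consensus weight:", consensus)
```

In this toy group, the naive sum gives the mixed-sign rollouts a small net advantage that reflects neither reward clearly, while the filtered version drops them and keeps only the rollout on which both rewards agree. A trainer could then scale that query's loss by the consensus weight. How GD²PO actually defines conflict and consensus is specified in the paper and in each task's `verl/` directory.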

We validate GD²PO on two multi-reward post-training tasks:

| Task | Directory | Description |
| --- | --- | --- |
| Tool Calling | `tool-calling/` | Multi-reward optimization with correctness, length, and format rewards |
| Helpfulness–Safety Alignment | `safe-alignment/` | Helpfulness–safety alignment with dual reward models |

🚀 Getting Started

Each task is self-contained with its own dependencies, data, and training scripts. Please enter the corresponding directory and follow its README:

Tool Calling

cd tool-calling
# See tool-calling/README.md for installation and usage
bash scripts/correctness_length/train_gd2po_hard.sh /path/to/model

Safe Alignment

cd safe-alignment
# See safe-alignment/README.md for installation and usage
POLICY_MODEL_PATH=/path/to/model \
RM_MODEL_PATH=/path/to/reward_model \
CM_MODEL_PATH=/path/to/cost_model \
bash scripts/run_gd2po_hard.sh

📁 Project Structure

├── tool-calling/                # Tool-calling task
│   ├── scripts/                 #   Training & evaluation scripts
│   ├── verl/                    #   Core framework (GD²PO implementation)
│   ├── API_Bank/                #   Evaluation toolkit
│   ├── dataset/                 #   Training and test data
│   └── README.md
├── safe-alignment/              # Safe alignment task
│   ├── scripts/                 #   Training & evaluation scripts
│   ├── verl/                    #   Core framework (GD²PO implementation)
│   ├── dataset/                 #   Training and validation data
│   └── README.md
└── README.md                    # This file

🙏 Acknowledgements

This codebase is built upon verl, GDPO, ToolRL, and Amo. We thank all teams for their excellent open-source contributions.

📜 Citation

If you find our work useful, please consider citing:

@misc{liu2026gd2pomitigatingmultirewardconflicts,
      title={GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization}, 
      author={Haotian Liu and Yihao Liu and Jingwei Ni and Siyuan Huang and Xinpeng Liu and Pengyu Cheng and Jiajun Song and Ruijin Ding and Junfeng Li and Zhechao Yu and Mengyu Zhou and Hongteng Xu and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2606.16771},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.16771}, 
}
