In this project, we provide an implementation of GD²PO (Group-Dynamic Reward-Decoupled Policy Optimization), a conflict-aware multi-reward policy optimization method for LLM post-training. When multiple reward signals are aggregated, the same rollout may have positive advantages on some dimensions but negative on others, causing signals to cancel out. GD²PO addresses this by filtering conflicting rollouts before aggregation and reweighting each query's update strength based on reward consensus.
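To make the cancellation problem concrete, here is a minimal Python sketch of the idea described above. It is not the implementation shipped in `verl/`; the exact filtering and reweighting rules are defined in the paper and the code. The function names, the sign-based conflict test, and the use of the kept fraction as the query-level weight are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Standardize one reward dimension across the rollouts of a single query."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def conflict_aware_advantages(rewards, eps=1e-6):
    """Illustrative sketch only (hypothetical helper, not the repository's API).

    rewards[i][j] is reward dimension j of rollout i; all rollouts belong to
    the same query. Returns (advantages, consensus).
    """
    n, k = len(rewards), len(rewards[0])

    # Decoupled advantages: each reward dimension is normalized on its own.
    cols = [group_normalize([r[j] for r in rewards], eps) for j in range(k)]
    per_dim = [[cols[j][i] for j in range(k)] for i in range(n)]

    # Assumption: a rollout "conflicts" when its per-dimension advantages
    # disagree in sign, so summing them would partly cancel.
    def conflicts(adv):
        return any(a > eps for a in adv) and any(a < -eps for a in adv)

    keep = [not conflicts(adv) for adv in per_dim]

    # Assumption: the fraction of non-conflicting rollouts serves as the
    # query-level consensus weight that scales the update strength.
    consensus = sum(keep) / n

    advantages = [
        consensus * sum(adv) if ok else 0.0
        for adv, ok in zip(per_dim, keep)
    ]
    return advantages, consensus


# Four rollouts scored on two rewards (for example correctness and format).
rewards = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
advantages, consensus = conflict_aware_advantages(rewards)
print(advantages, consensus)
```

In this toy group, the second and third rollouts score high on one reward and low on the other, so a plain sum of their per-dimension advantages would be close to zero. The sketch drops them instead and halves the update strength for the query, because only two of the four rollouts agree across rewards.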
We validate GD²PO on two multi-reward post-training tasks:

| Task | Directory | Description |
|---|---|---|
| Tool Calling | tool-calling/ | Multi-reward optimization with correctness, length, and format rewards |
| Helpfulness–Safety Alignment | safe-alignment/ | Helpfulness–safety alignment with dual reward models |

Each task is self-contained with its own dependencies, data, and training scripts. Please enter the corresponding directory and follow its README:

For tool calling:

    cd tool-calling
    # See tool-calling/README.md for installation and usage
    bash scripts/correctness_length/train_gd2po_hard.sh /path/to/model

For helpfulness–safety alignment:

    cd safe-alignment
    # See safe-alignment/README.md for installation and usage
    POLICY_MODEL_PATH=/path/to/model \
    RM_MODEL_PATH=/path/to/reward_model \
    CM_MODEL_PATH=/path/to/cost_model \
    bash scripts/run_gd2po_hard.sh

The repository is organized as follows:

    ├── tool-calling/          # Tool-calling task
    │   ├── scripts/           # Training & evaluation scripts
    │   ├── verl/              # Core framework (GD²PO implementation)
    │   ├── API_Bank/          # Evaluation toolkit
    │   ├── dataset/           # Training and test data
    │   └── README.md
    ├── safe-alignment/        # Safe alignment task
    │   ├── scripts/           # Training & evaluation scripts
    │   ├── verl/              # Core framework (GD²PO implementation)
    │   ├── dataset/           # Training and validation data
    │   └── README.md
    └── README.md              # This file

This codebase is built upon verl, GDPO, ToolRL, and Amo. We thank all teams for their excellent open-source contributions.
If you find our work useful, please consider citing:

    @misc{liu2026gd2pomitigatingmultirewardconflicts,
      title={GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization},
      author={Haotian Liu and Yihao Liu and Jingwei Ni and Siyuan Huang and Xinpeng Liu and Pengyu Cheng and Jiajun Song and Ruijin Ding and Junfeng Li and Zhechao Yu and Mengyu Zhou and Hongteng Xu and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2606.16771},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.16771},
    }
