Closed
9 changes: 9 additions & 0 deletions examples/true_on_policy/README.md
@@ -30,6 +30,15 @@ In order to support true on policy for other cases, please refer to the flags ch

After training, the wandb metric `train/train_rollout_logprob_abs_diff` should be exactly `0`. This indicates that the log probabilities computed during training and during inference are identical. Without the feature enabled, this value is nonzero.
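
To make the meaning of this metric concrete, here is a minimal sketch in plain Python. It assumes the metric is the mean absolute difference between per-token log probabilities from the two engines; the function name `logprob_abs_diff` and the sample values are hypothetical and are not taken from the slime code base.

```python
def logprob_abs_diff(train_logprobs, rollout_logprobs):
    """Mean absolute difference between two equal-length lists of per-token log probs.

    Hypothetical helper for illustration; the real metric may be reduced differently.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        return 0.0
    total = sum(abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs))
    return total / len(train_logprobs)


# Made-up per-token log probs from the inference (rollout) engine.
rollout = [-0.25, -1.5, -0.125]
# Training engine reproduces every value exactly.
train_same = [-0.25, -1.5, -0.125]
# Training engine disagrees on the last token.
train_off = [-0.25, -1.5, -0.25]

print(logprob_abs_diff(train_same, rollout))  # 0.0
print(logprob_abs_diff(train_off, rollout) > 0)  # True
```

Only an exact match on every token yields `0`; any single mismatch makes the value positive.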

## Experiments
First, apply the SGLang patch from [SGLang#13207](https://github.com/sgl-project/sglang/pull/13207), which has not been merged yet.
``` bash
curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/13207.patch -o /root/temp.patch && (cd /sgl-workspace/sglang && (patch -p1 < /root/temp.patch))
```
We fine-tune Qwen3-4B-Base on the dapo-math-17k dataset with `max_new_tokens = 2048` and evaluate on the aime-2024 dataset with `max_new_tokens = 8192`.
The global batch size is 64 × 16. The results are summarized below.
<p align="center"> <img width="360" alt="raw_rewards" src="https://github.com/user-attachments/assets/752b5923-11c2-4819-818b-c5bb19c0ac39" /> <img width="360" alt="diff" src="https://github.com/user-attachments/assets/5cd0d413-a989-4fc1-83a4-4076d96f6e94" /> <img width="360" alt="rollout_time" src="https://github.com/user-attachments/assets/d2217c8e-93aa-4d9a-96e3-97a467ac80b3" /> <img width="360" alt="eval" src="https://github.com/user-attachments/assets/3cd294a7-e54e-460e-9335-f5aadbf0428a" /> </p>

## How it is Implemented

The core idea is to make every operation produce bitwise-identical results in training and inference. The main code is implemented in [#566](https://github.com/THUDM/slime/pull/566) and [SGLang#12058](https://github.com/sgl-project/sglang/pull/12058).
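
As a general illustration of why bitwise equality is a demanding requirement (this is not code from the PRs above), the sketch below shows that floating-point addition is not associative: summing the same numbers in a different order can change the resulting bits. The helper names `float_bits` and `bitwise_equal` are hypothetical.

```python
import struct


def float_bits(x):
    """Return the raw IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_equal(a, b):
    """True only if a and b have exactly the same bit pattern."""
    return float_bits(a) == float_bits(b)


# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(bitwise_equal(left_to_right, left_to_right))  # True
print(bitwise_equal(left_to_right, right_to_left))  # False
```

By the same reasoning, if two engines reduce the same values in different orders, their outputs can differ in the last bits even though both are numerically reasonable.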