diff --git a/examples/true_on_policy/README.md b/examples/true_on_policy/README.md
index 06444b7936..663b21a150 100644
--- a/examples/true_on_policy/README.md
+++ b/examples/true_on_policy/README.md
@@ -30,6 +30,15 @@ In order to support true on policy for other cases, please refer to the flags ch
 
 After running the training, you can see in wandb that the metric `train/train_rollout_logprob_abs_diff` should be exactly `0`. This indicates that there is no difference between the log probabilities from the training and the inference. Without the feature enabled, this value should be nonzero.
 
+## Experiments
+Apply this SGLang patch (sgl-project/sglang#13207, not merged yet):
+```bash
+curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/13207.patch -o /root/temp.patch && (cd /sgl-workspace/sglang && (patch -p1 < /root/temp.patch))
+```
+We fine-tune Qwen3-4B-Base on the dapo-math-17k dataset with `max_new_tokens = 2048`, and evaluate on the aime-2024 dataset with `max_new_tokens = 8192`.
+The global batch size is 64 × 16. Results are summarized below.
+[Result plots, one panel each: raw_rewards, diff, rollout_time, eval]
+
 ## How it is Implemented
 
 The core idea is to make each and every operation in training and inference be bitwise equal. The main code is implemented in [#566](https://github.com/THUDM/slime/pull/566) and [SGLang#12058](https://github.com/sgl-project/sglang/pull/12058).
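
A note on the "bitwise equal" requirement: the sketch below is an illustration only, not code from slime or SGLang. It shows why an "exactly `0`" log-probability difference requires every operation to match. Floating-point addition is not associative, so summing the same values in a different order can change the result in the last bit. The helper names (`float_bits`, `bitwise_equal`, `mean_abs_diff`) are hypothetical, and treating the wandb metric as a mean absolute difference is an assumption based on its name.

```python
import struct


def float_bits(x: float) -> bytes:
    """Return the IEEE-754 double-precision byte pattern of x."""
    return struct.pack("<d", x)


def bitwise_equal(xs, ys) -> bool:
    """True only if both sequences hold identical bit patterns, element by element."""
    return len(xs) == len(ys) and all(
        float_bits(a) == float_bits(b) for a, b in zip(xs, ys)
    )


def mean_abs_diff(train_logprobs, rollout_logprobs) -> float:
    """Mean absolute difference between two equal-length, non-empty log-prob lists.

    Assumption: this mirrors what an 'abs_diff' metric might compute; the real
    metric definition lives in the training code and may differ.
    """
    if len(train_logprobs) != len(rollout_logprobs) or not train_logprobs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - b) for a, b in zip(train_logprobs, rollout_logprobs))
    return total / len(train_logprobs)


# The same three numbers, reduced in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Identical computation paths agree bit for bit, so the difference is exactly 0.
same_path = mean_abs_diff([left_to_right], [left_to_right])

# Different reduction orders disagree in the low bits, so the difference is nonzero.
different_path = mean_abs_diff([left_to_right], [right_to_left])

print(bitwise_equal([left_to_right], [right_to_left]))
print(same_path, different_path)
```

In this toy case the two sums differ only in the last bit, yet the difference is no longer exactly zero. This is the same kind of discrepancy that a mismatched reduction order between a training kernel and an inference kernel would introduce.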