zyzshishui · zyzshishui · Nov 13, 2025
diff --git a/examples/true_on_policy/README.md b/examples/true_on_policy/README.md
@@ -30,6 +30,15 @@ In order to support true on policy for other cases, please refer to the flags ch
 
 After running the training, you can see in wandb that the metric `train/train_rollout_logprob_abs_diff` should be exactly `0`. This indicates that there is no difference between the log probabilities from the training and the inference. Without the feature enabled, this value should be nonzero.
 
+## Experiments
+First apply this SGLang patch ([SGLang#13207](https://github.com/sgl-project/sglang/pull/13207), not merged yet).
+```bash
+curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/13207.patch -o /root/temp.patch && (cd /sgl-workspace/sglang && (patch -p1 < /root/temp.patch))
+```
+We fine-tune Qwen3-4B-Base on the dapo-math-17k dataset with `max_new_tokens = 2048` and evaluate on the aime-2024 dataset with `max_new_tokens = 8192`.
+The global batch size is 64 × 16. The results are summarized below.
+<p align="center"> <img width="360" alt="raw_rewards" src="https://github.com/user-attachments/assets/752b5923-11c2-4819-818b-c5bb19c0ac39" /> <img width="360" alt="diff" src="https://github.com/user-attachments/assets/5cd0d413-a989-4fc1-83a4-4076d96f6e94" /> <img width="360" alt="rollout_time" src="https://github.com/user-attachments/assets/d2217c8e-93aa-4d9a-96e3-97a467ac80b3" /> <img width="360" alt="eval" src="https://github.com/user-attachments/assets/3cd294a7-e54e-460e-9335-f5aadbf0428a" /> </p>
+
 ## How it is Implemented
 
 The core idea is to make each and every operation in training and inference be bitwise equal. The main code is implemented in [#566](https://github.com/THUDM/slime/pull/566) and [SGLang#12058](https://github.com/sgl-project/sglang/pull/12058).