From e54b107985cca3972e7c7c90ba5d1179fb3c8c8f Mon Sep 17 00:00:00 2001
From: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Date: Thu, 13 Nov 2025 12:29:46 -0500
Subject: [PATCH] Update README with experiments and results

Added experiments section detailing fine-tuning and evaluation results.
---
 examples/true_on_policy/README.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/examples/true_on_policy/README.md b/examples/true_on_policy/README.md
index 06444b7936..663b21a150 100644
--- a/examples/true_on_policy/README.md
+++ b/examples/true_on_policy/README.md
@@ -30,6 +30,15 @@ In order to support true on policy for other cases, please refer to the flags ch
 
 After running the training, you can see in wandb that the metric `train/train_rollout_logprob_abs_diff` should be exactly `0`. This indicates that there is no difference between the log probabilities from the training and the inference. Without the feature enabled, this value should be nonzero.
 
+## Experiments
+Apply this patch (not merged yet).
+``` bash
+curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/13207.patch -o /root/temp.patch && (cd /sgl-workspace/sglang && (patch -p1 < /root/temp.patch))
+```
+We fine-tune Qwen3-4B-Base on dapo-math-17k dataset with max_new_tokens = 2048, and evaluate on aime-2024 dataset with max_new_tokens = 8192.
+Global batch size is 64 × 16. Results are summarized below.
+