[Feature] support async rl #1360
Conversation
```python
waiting_tasks = set()
dataflow_start_time = time.perf_counter()
task_completion_times = []
with tqdm(total=self.target_batch_size, desc="rollout_controller for training samples") as pbar:
```
Use `tqdm(miniters=10)` (the minimum number of iterations between progress-display updates) and call `pbar.update(finished_samples)` inside the loop instead of manually calling `pbar.refresh()`. Minimize pbar operations inside the loop.
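A minimal sketch of the suggested pattern; `target_batch_size` and the `poll_finished()` stub stand in for the controller's real state:

```python
from tqdm import tqdm

target_batch_size = 512  # stand-in for self.target_batch_size

def poll_finished() -> int:
    """Hypothetical stub: returns the number of newly completed samples."""
    return 1

# miniters=10: the bar redraws at most once per 10 accumulated updates,
# so no manual pbar.refresh() is needed inside the loop.
with tqdm(total=target_batch_size, desc="rollout_controller for training samples",
          miniters=10) as pbar:
    while pbar.n < target_batch_size:
        pbar.update(poll_finished())
```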
```python
data_batches, pack_max_length=self._train_worker_cfg.pack_max_length, rollout_idx=rollout_idx
)
)
```
Nice hierarchical code!
```python
collator="fake_collator",
pack_level="none",
expired_threshold = (
    min(remain_size, self.config.tail_batch_trigger_size)
```
Use `cast(int, ...)` instead.
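A sketch of what that looks like, assuming `tail_batch_trigger_size` is typed `Optional[int]` in the config but is known to be set at this point:

```python
from typing import Optional, cast

remain_size = 8
tail_batch_trigger_size: Optional[int] = 4  # Optional in the config schema

# cast() is a no-op at runtime; it only narrows the type for the checker,
# avoiding an `int | None` complaint inside min().
expired_threshold = min(remain_size, cast(int, tail_batch_trigger_size))
```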
```python
self.finished_samples_count = await self.replay_buffer.get_completed_samples_count.remote()
waiting_tasks = pending_tasks

while len(waiting_tasks) + self.finished_samples_count < max(data_concurrency, self.target_batch_size):
```
The condition should be `len(waiting_tasks) + self.finished_samples_count < data_concurrency + init_finished_samples_count`.
xtuner/v1/data_proto/rl_data.py (Outdated)
```python
extra_info: Dict[str, Any] = Field(default_factory=dict)
state: RolloutState = RolloutState.INIT

def _update_by_append(self, other: "RLRolloutResponseItem") -> None:
```
Suggested change:

```diff
- def _update_by_append(self, other: "RLRolloutResponseItem") -> None:
+ def _update_by_append(self, other: Self) -> None:
```
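Note (an addition, not part of the review): `Self` requires `from typing import Self` on Python ≥ 3.11, or `from typing_extensions import Self` on earlier versions.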
xtuner/v1/data_proto/rl_data.py (Outdated)
```python
self.state = other.state
return

def update(self, other: "RLRolloutResponseItem") -> None:
```
Apply the same change as above (annotate `other` as `Self`).
```python
if other_ids_copy is not None:
    assert self.response_ids is not None, "response_ids must not be None when updating partial data."
    self.response_ids.extend(other_ids_copy.copy())
```
Why copy twice? Judging by its name, `other_ids_copy` is already a copy, so the extra `.copy()` passed to `extend()` is redundant.
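A sketch of the deduplicated version against the hunk above (assuming `other_ids_copy` is indeed already a copy):

```python
if other_ids_copy is not None:
    assert self.response_ids is not None, "response_ids must not be None when updating partial data."
    # other_ids_copy was already copied once; extend() can consume it directly
    self.response_ids.extend(other_ids_copy)
```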
```python
tail_batch_trigger_size: Annotated[
    Optional[int],
    Parameter(
        help="Number of candidate samples needed in the queue to trigger a tail batch operation. Set to 0 to disable."
```
The sentence "Set to 0 to disable." in this help text is incorrect.
There is no real "enable/disable" semantics here; the option only takes effect in combination with `tail_batch_candidate_steps`.
xtuner/v1/data_proto/rl_data.py (Outdated)
```python
response_ids: Optional[List[int]] = None
logprobs: Optional[List[float]] = None
num_return_tokens: Optional[int] = None
versioned_response: List[str] = Field(default_factory=list)
```
Think about whether this part needs to change for future multi-turn scenarios.
```python
    Parameter(help="Weights for different states in the replay buffer."),
] = {}
# async rollout related configs, assigned from dataflow cfg
enable_partial_rollout: Annotated[
```
Since these parameters are auto-assigned rather than user-settable, is there another implementation that hard-prevents users from mistakenly thinking they are configurable?
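One possible direction, sketched under the assumption that the config is a pydantic `BaseModel` (class and field names below are illustrative, not the PR's actual API):

```python
from pydantic import BaseModel, PrivateAttr


class ReplayBufferConfig(BaseModel):  # hypothetical name
    """Public, user-settable fields go here as usual."""

    # Assigned internally from the dataflow cfg; a private attribute never
    # appears in the model schema, so users cannot pass it at construction.
    _enable_partial_rollout: bool = PrivateAttr(default=False)


cfg = ReplayBufferConfig()
cfg._enable_partial_rollout = True  # internal assignment only
```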
```python
else:
    self.dataloader_cfg = DataloaderConfig(
        collator="fake_collator",
        pack_level="none",
```
The default `num_worker` could be set to 1 or 2, with multimodal scenarios in mind.
```python
    self._completed_actions[replay_meta.version].append(action_id)
    self.logger.debug(f"Add sample with root_id: {root_id}, action_id: {action_id} to finished_actions.")
else:
    assert False, f"Unsupported rollout state {state} for action_id {action_id} in ReplayBufferStorage."
```
Use `raise AssertionError(...)` instead of `assert False, ...`.
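Sketch of the suggested change; unlike a bare `assert`, the explicit raise also survives `python -O`, which strips assert statements:

```python
raise AssertionError(
    f"Unsupported rollout state {state} for action_id {action_id} in ReplayBufferStorage."
)
```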
```python
for sample in group_samples:
    assert sample.data.input_ids and sample.data.num_tokens, "input_ids or num_tokens is empty!"
    if "routed_experts" in sample.env.rollout.extra_info:
```
This can't just be deleted outright; potential memory leaks need to be considered.
```python
    data = base64.b64decode(routed_experts)
    routed_experts = ray.cloudpickle.loads(data)
else:
    routed_experts = torch.tensor(routed_experts)  # n, layer, expert
```
sglang takes this branch. If execution reaches this branch, then first doing `routed_experts = ray.put(routed_experts)` and later `await routed_experts` is too odd; it should be handled properly.
```python
cur_routed_experts = cur_routed_experts[exist_routed_experts.shape[0] :, :, :]
concat_routed_experts = np.concatenate((exist_routed_experts, cur_routed_experts), axis=0)
prompt_tokens = response["meta_info"].get("prompt_tokens", 0)
response_tokens = response["meta_info"].get("completion_tokens", 0)
```
An assert could be added here to check that the sequence length of `concat_routed_experts` equals `prompt_tokens + response_tokens - 1`.
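A sketch of the proposed check, using names from the surrounding diff; the comment on why the `-1` holds is my reading, not the reviewer's:

```python
# Presumably one routing record exists per position except the final one,
# hence the -1; the reviewer only states the expected equality.
assert concat_routed_experts.shape[0] == prompt_tokens + response_tokens - 1, (
    f"routed_experts length {concat_routed_experts.shape[0]} != "
    f"prompt_tokens + response_tokens - 1 ({prompt_tokens + response_tokens - 1})"
)
```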
```python
if not self.enable_partial_rollout:
    # clear the previous response_ids and other env data
    if "routed_experts" in sample.env.rollout.extra_info:
        del sample.env.rollout.extra_info["routed_experts"]
```
Same as above (consider memory leaks before deleting).
Has it been considered that, after an interruption, the next request could be sent to the same server so its cache can be reused?
This PR introduces asynchronous RL support to XTuner, enabling partial rollouts and version-based sample management for more efficient training-data generation.
1. Key Concepts:
2. Async logic:

- `staleness_threshold=0.0`, `enable_partial_rollout=0`, `tail_batch_candidate_steps=0`
- `staleness_threshold=0.2`, `enable_partial_rollout=0`, `tail_batch_candidate_steps=0`
  2. Responses are not retained when a rollout is paused
  3. Prioritize sampling data from the abort queue
- `staleness_threshold=0.2`, `enable_partial_rollout=0`, `tail_batch_candidate_steps=1`, `tail_batch_trigger_size=0`
  2. Responses are not retained when paused
  3. Prioritize sampling data from the abort queue
  4. A sample is put into the candidate pool when its abort count reaches `tail_batch_candidate_steps + 1`
- `staleness_threshold=0.2`, `enable_partial_rollout=1`, `tail_batch_candidate_steps=0`, `tail_batch_trigger_size=0`
  2. Responses are retained & concatenated when paused
  3. Prioritize sampling data from the abort queue
- `staleness_threshold=0.2`, `enable_partial_rollout=1`, `tail_batch_candidate_steps=1`, `tail_batch_trigger_size=0`
  2. Responses are retained & concatenated when paused
  3. Prioritize sampling data from the abort queue
  4. A sample is put into the candidate pool when its abort count reaches `tail_batch_candidate_steps + 1` (`tail_batch_candidate_steps` is the off-policy step count)
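For illustration, the fully asynchronous combination from the list above might be written down like this (a sketch only; the dict below is not the PR's actual API, and the flags really live in the dataflow/replay-buffer configs):

```python
# Hypothetical container: only the four flags and their meanings come
# from this PR's description.
async_rl_cfg = dict(
    staleness_threshold=0.2,       # tolerate samples up to 0.2 staleness (off-policy)
    enable_partial_rollout=1,      # retain & concatenate responses when a rollout is paused
    tail_batch_candidate_steps=1,  # off-policy steps before a sample joins the candidate pool
    tail_batch_trigger_size=0,
)
```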
3. BenchMark
4. Relative PR

`sample_from_expired_storage` in dataflow: when `sample_from_expired_storage` is set to True, the dataflow will not over-send data and will return data only after all tasks of the current batch are completed.