Hi, thanks for the great work on ROLL.
I am trying to run asynchronous GRPO training with Qwen3-8B on 8×A100 GPUs, but training is extremely slow. Is this expected, or have I misconfigured something?
Problem
The main issue is that a single training step takes more than 40 minutes, which seems far too slow for this setup.
From GPU utilization monitoring, I observed that during training, the rollout GPUs and training GPUs do not seem to work concurrently. Instead, it looks like only one side is active at a time:
- when rollout is busy, training is mostly idle
- when training is busy, rollout is mostly idle

So although I am using the asynchronous setup, the pipeline behaves almost serially, with little overlap between rollout and training.
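For context, this is roughly how I sampled per-GPU utilization (a sketch; `parse_util` and `poll` are just my helper names, and the GPUs 0–3 = training / GPUs 4–7 = rollout split matches the `device_mapping` entries in the config below):

```python
import subprocess
import time


def parse_util(csv_text: str) -> list[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`:
    one integer percentage per GPU, one GPU per line."""
    return [int(line.strip()) for line in csv_text.strip().splitlines()]


def poll(interval_s: float = 5.0) -> None:
    """Print the average utilization of the training GPUs (0-3) and the
    rollout GPUs (4-7) every `interval_s` seconds."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        util = parse_util(out)
        train, rollout = util[:4], util[4:8]
        print(f"train avg {sum(train) / len(train):.0f}% | "
              f"rollout avg {sum(rollout) / len(rollout):.0f}%")
        time.sleep(interval_s)
```

With this running alongside training, one side consistently sits near 0% while the other is near 100%.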
- Training type: asynchronous GRPO
- Model: Qwen3-8B
- GPUs: 8 × A100
- Dataset: Dapo-17k / AIME2024
My config for reproduction is below:
```yaml
hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "qwen3_8b_dapo_math_lr1e-6_async"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'

checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

track_with: wandb
tracker_kwargs:
  project: "verl_grpo_qwen3_8b_dapo_17k"
  name: ${exp_name}
  api_key: ""

num_gpus_per_node: 8

max_steps: 500
save_steps: 50
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

async_generation_ratio: 1

rollout_batch_size: 128  # prompt
prompt_length: 1024
response_length: 8192
num_return_sequences_in_group: 16
ppo_epochs: 1
adv_estimator: "grpo"

# GRPO KL loss
use_kl_loss: true
kl_loss_coef: 0.001

# clip
value_clip: 0.5
reward_clip: 10
advantage_clip: 2.0
dual_clip_loss: true

# normalize
norm_mean_type: ~
norm_std_type: ~

# data mask
max_len_mask: true
difficulty_mask: true
difficulty_low_threshold: 0.1
difficulty_high_threshold: 0.95
error_max_len_clip: false

# data weight
difficulty_loss_weight: false
length_loss_weight: false

# reward
add_token_level_kl: false

# advantage
whiten_advantages: true

pretrain: models/Qwen3-8B
reward_pretrain: /models/Qwen3-8B

validation:
  data_args:
    template: qwen3
    file_name:
      - /aime-2024-roll.parquet
    prompt: prompt
    response: ground_truth
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1

actor_train:
  model_args:
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    warmup_steps: 0
    num_train_epochs: 3
  data_args:
    template: qwen3
    file_name:
      - /dapo-math-17k-roll.parquet
    dataset_dir: data
    messages: messages
    response: ground_truth
    domain_interleave_probs:
      math_rule: 1.0
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 4

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 9216
  device_mapping: list(range(4,8))
  infer_batch_size: 4

reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: megatron_infer
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
  device_mapping: list(range(4,8))
  infer_batch_size: 4

rewards:
  crossthinkqa:
    worker_cls: roll.pipeline.rlvr.rewards.crossthinkqa_rule_reward_worker.CrossThinkQARuleRewardWorker
    reward_type: soft
    response_length_penalty_coef: 0.0
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [crossthinkqa]
    world_size: 4
    infer_batch_size: 4
  ifeval:
    worker_cls: roll.pipeline.rlvr.rewards.ifeval_rule_reward_worker.GeneralRuleRewardWorker
    reward_type: soft
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [ifeval]
    world_size: 4
    infer_batch_size: 4
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [deepmath_103k, aime]
    world_size: 4
    infer_batch_size: 1
  code_sandbox:
    use_local: true
    worker_cls: roll.pipeline.rlvr.rewards.code_sandbox_reward_worker.CodeSandboxRewardWorker
    tag_included: [KodCode]
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    world_size: 4
    infer_batch_size: 1
```
I would be very grateful for any pointers on what might be wrong.