Hi, thanks for the great work on ROLL.
I am trying to run asynchronous GRPO training with Qwen3-8B on 8×A100 GPUs, but training is extremely slow. Is this expected, or have I misconfigured something?
Problem
The main issue is that a single training step takes more than 40 minutes, which seems far too slow for this setup.
From GPU utilization monitoring, I observed that during training, the rollout GPUs and training GPUs do not seem to work concurrently. Instead, it looks like only one side is active at a time:
- when rollout is busy, training is mostly idle
- when training is busy, rollout is mostly idle

So although I am using the asynchronous setup, the pipeline behaves almost serially, with little overlap between rollout and training.
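For context, this is roughly how I sampled per-GPU utilization (a sketch; `parse_util` and `poll` are just my helper names, and the GPUs 0–3 = training / GPUs 4–7 = rollout split matches the `device_mapping` entries in the config below):

```python
import subprocess
import time


def parse_util(csv_text: str) -> list[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`:
    one integer percentage per GPU, one GPU per line."""
    return [int(line.strip()) for line in csv_text.strip().splitlines()]


def poll(interval_s: float = 5.0) -> None:
    """Print the average utilization of the training GPUs (0-3) and the
    rollout GPUs (4-7) every `interval_s` seconds."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        util = parse_util(out)
        train, rollout = util[:4], util[4:8]
        print(f"train avg {sum(train) / len(train):.0f}% | "
              f"rollout avg {sum(rollout) / len(rollout):.0f}%")
        time.sleep(interval_s)
```

With this running alongside training, one side consistently sits near 0% while the other is near 100%.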
- Training type: asynchronous GRPO
- Model: Qwen3-8B
- GPUs: 8 × A100
- Dataset: Dapo-17k / AIME2024
My config for reproduction is below:
```yaml
hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "qwen3_8b_dapo_math_lr1e-6_async"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'

checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

track_with: wandb
tracker_kwargs:
  project: "verl_grpo_qwen3_8b_dapo_17k"
  name: ${exp_name}
  api_key: ""

num_gpus_per_node: 8

max_steps: 500
save_steps: 50
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

async_generation_ratio: 1

rollout_batch_size: 128  # prompt
prompt_length: 1024
response_length: 8192
num_return_sequences_in_group: 16
ppo_epochs: 1
adv_estimator: "grpo"

# GRPO KL loss
use_kl_loss: true
kl_loss_coef: 0.001

# clip
value_clip: 0.5
reward_clip: 10
advantage_clip: 2.0
dual_clip_loss: true

# normalize
norm_mean_type: ~
norm_std_type: ~

# data mask
max_len_mask: true
difficulty_mask: true
difficulty_low_threshold: 0.1
difficulty_high_threshold: 0.95
error_max_len_clip: false

# data weight
difficulty_loss_weight: false
length_loss_weight: false

# reward
add_token_level_kl: false

# advantage
whiten_advantages: true

pretrain: models/Qwen3-8B
reward_pretrain: /models/Qwen3-8B

validation:
  data_args:
    template: qwen3
    file_name:
      - /aime-2024-roll.parquet
    prompt: prompt
    response: ground_truth
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1

actor_train:
  model_args:
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    warmup_steps: 0
    num_train_epochs: 3
  data_args:
    template: qwen3
    file_name:
      - /dapo-math-17k-roll.parquet
    dataset_dir: data
    messages: messages
    response: ground_truth
    domain_interleave_probs:
      math_rule: 1.0
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 4

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 9216
  device_mapping: list(range(4,8))
  infer_batch_size: 4

reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: megatron_infer
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
  device_mapping: list(range(4,8))
  infer_batch_size: 4

rewards:
  crossthinkqa:
    worker_cls: roll.pipeline.rlvr.rewards.crossthinkqa_rule_reward_worker.CrossThinkQARuleRewardWorker
    reward_type: soft
    response_length_penalty_coef: 0.0
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [crossthinkqa]
    world_size: 4
    infer_batch_size: 4
  ifeval:
    worker_cls: roll.pipeline.rlvr.rewards.ifeval_rule_reward_worker.GeneralRuleRewardWorker
    reward_type: soft
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [ifeval]
    world_size: 4
    infer_batch_size: 4
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [deepmath_103k, aime]
    world_size: 4
    infer_batch_size: 1
  code_sandbox:
    use_local: true
    worker_cls: roll.pipeline.rlvr.rewards.code_sandbox_reward_worker.CodeSandboxRewardWorker
    tag_included: [KodCode]
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    world_size: 4
    infer_batch_size: 1
```
I would be very grateful for any pointers on what might be wrong.