
Very slow asynchronous GRPO training on 8×A100 #394

@5SSjw

Description

Hi, thanks for the great work on ROLL.

I am trying to run asynchronous GRPO training with Qwen3-8B on 8×A100, but the training is extremely slow. I would like to ask whether this is expected, or whether I may have configured something incorrectly.

Problem

The main issue is that a single training step takes more than 40 minutes, which seems far too slow for this setup.

From GPU utilization monitoring, I observed that the rollout GPUs and the training GPUs do not work concurrently during training. Instead, only one side is active at a time:

- when rollout is busy, the training GPUs are mostly idle
- when training is busy, the rollout GPUs are mostly idle

So although I am using the asynchronous setup, the pipeline behaves almost serially, with little overlap between rollout and training.
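For reference, my monitoring was a simple nvidia-smi poll along these lines, splitting GPUs into the training group (0-3) and the rollout group (4-7) to match the `device_mapping` values in the config below (a minimal sketch, not the exact script I used):

```python
TRAIN_GPUS = {0, 1, 2, 3}    # actor_train device_mapping in the config below
ROLLOUT_GPUS = {4, 5, 6, 7}  # actor_infer / reference device_mapping

def split_utilization(csv_text):
    """Split per-GPU utilization into (mean train %, mean rollout %).

    csv_text is one snapshot of:
      nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
    """
    train, rollout = [], []
    for line in csv_text.strip().splitlines():
        idx, util = (f.strip() for f in line.split(","))
        (train if int(idx) in TRAIN_GPUS else rollout).append(int(util))
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(train), mean(rollout)

# Example with a snapshot taken while rollout was running:
sample = "0, 3\n1, 2\n2, 4\n3, 3\n4, 97\n5, 98\n6, 96\n7, 99\n"
print("train %.0f%% / rollout %.0f%%" % split_utilization(sample))  # train 3% / rollout 98%
```

Sampling this once per second during a step is what showed the two groups alternating rather than overlapping.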

Training type: asynchronous GRPO
Model: Qwen3-8B
GPUs: 8 × A100
Dataset: Dapo-17k / AIME2024
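For scale, here is the per-step generation budget implied by the batch settings in the config below (the throughput figure is an assumption for illustration only, not a measurement):

```python
# Per-step rollout token budget, using values from the config below.
rollout_batch_size = 128      # prompts per step
num_return_sequences = 16     # num_return_sequences_in_group
response_length = 8192        # max new tokens per response

max_tokens = rollout_batch_size * num_return_sequences * response_length
print(f"{max_tokens:,} generated tokens per step at the length cap")  # 16,777,216

# Illustrative only: at an assumed 8,000 tok/s aggregate decode throughput
# on the 4 rollout GPUs, generation alone would take roughly:
assumed_tok_per_s = 8_000
print(f"~{max_tokens / assumed_tok_per_s / 60:.0f} min")  # ~35 min
```

Actual responses are shorter than the cap, but this is why I expected rollout and training to overlap rather than run back to back.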

My config for reproduction is below:

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "qwen3_8b_dapo_math_lr1e-6_async"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'


checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

track_with: wandb
tracker_kwargs:
  project: "verl_grpo_qwen3_8b_dapo_17k"
  name: ${exp_name}
  api_key: ""

num_gpus_per_node: 8

max_steps: 500
save_steps: 50
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

async_generation_ratio: 1

rollout_batch_size: 128  # prompt
prompt_length: 1024
response_length: 8192

num_return_sequences_in_group: 16
ppo_epochs: 1
adv_estimator: "grpo"

# GRPO KL Loss
use_kl_loss: true
kl_loss_coef: 0.001

# clip
value_clip: 0.5
reward_clip: 10
advantage_clip: 2.0
dual_clip_loss: true

# normalize
norm_mean_type: ~
norm_std_type: ~

# data mask
max_len_mask: true
difficulty_mask: true
difficulty_low_threshold: 0.1
difficulty_high_threshold: 0.95
error_max_len_clip: false

# data weight
difficulty_loss_weight: false
length_loss_weight: false

# reward
add_token_level_kl: false

# advantage
whiten_advantages: true

pretrain: models/Qwen3-8B
reward_pretrain: /models/Qwen3-8B

validation:
  data_args:
    template: qwen3
    file_name:
      - /aime-2024-roll.parquet
    prompt: prompt
    response: ground_truth
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1


actor_train:
  model_args:
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    warmup_steps: 0
    num_train_epochs: 3
  data_args:
    template: qwen3
    file_name:
      - /dapo-math-17k-roll.parquet
    dataset_dir: data
    messages: messages
    response: ground_truth
    domain_interleave_probs:
      math_rule: 1.0
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 4

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 9216
  device_mapping: list(range(4,8))
  infer_batch_size: 4

reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: megatron_infer
    strategy_config:
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
  device_mapping: list(range(4,8))
  infer_batch_size: 4

rewards:
  crossthinkqa:
    worker_cls: roll.pipeline.rlvr.rewards.crossthinkqa_rule_reward_worker.CrossThinkQARuleRewardWorker
    reward_type: soft
    response_length_penalty_coef: 0.0
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [crossthinkqa]
    world_size: 4
    infer_batch_size: 4
  ifeval:
    worker_cls: roll.pipeline.rlvr.rewards.ifeval_rule_reward_worker.GeneralRuleRewardWorker
    reward_type: soft
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [ifeval]
    world_size: 4
    infer_batch_size: 4
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    tag_included: [deepmath_103k, aime]
    world_size: 4
    infer_batch_size: 1
  code_sandbox:
    use_local: true
    worker_cls: roll.pipeline.rlvr.rewards.code_sandbox_reward_worker.CodeSandboxRewardWorker
    tag_included: [KodCode]
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen3
    world_size: 4
    infer_batch_size: 1

I would be very grateful if someone could help me.
