WIP: Add MoE config support and correctness fixes for GLM MoE training#1210

Closed
tyler-griggs wants to merge 3 commits into main from tgriggs/glm-moe-support

Conversation


@tyler-griggs commented Feb 25, 2026

Summary

Lays the groundwork for training GLM MoE models (30B-350B) on SkyRL with the Megatron backend. Based on a comprehensive gap analysis between SkyRL and slime (see gaps/MASTER_GAP_ANALYSIS.md).

MoE Config Fields (TODO-1)

  • Exposes 5 MoE runtime config flags as first-class MegatronConfig dataclass fields: moe_token_dispatcher_type, moe_router_load_balancing_type, moe_grouped_gemm, moe_router_score_function, moe_router_enable_expert_bias
  • Replaces 2 hardcoded values in megatron_worker.py with config-driven values
  • Advanced flags continue to work through the existing transformer_config_kwargs pass-through
  • AutoBridge auto-detects most architecture flags (num_experts, topk, etc.) from HF config.json
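The five flags map naturally onto dataclass fields. A minimal sketch (field names are from this PR; the dataclass shape, defaults, and the GLM-style override values are illustrative, not SkyRL's actual MegatronConfig):

```python
from dataclasses import dataclass

@dataclass
class MoEMegatronConfig:
    # Previously hardcoded in megatron_worker.py, now config-driven:
    moe_token_dispatcher_type: str = "alltoall"
    moe_router_load_balancing_type: str = "none"
    # New first-class fields:
    moe_grouped_gemm: bool = False                 # fused expert GEMMs
    moe_router_score_function: str = "softmax"     # "sigmoid" for GLM/DeepSeek-V3
    moe_router_enable_expert_bias: bool = False    # expert bias for load balancing

# A GLM-4.7-Flash-style override, mirroring the example config in this PR:
glm_cfg = MoEMegatronConfig(
    moe_grouped_gemm=True,
    moe_router_score_function="sigmoid",
    moe_router_enable_expert_bias=True,
)
```

Anything beyond these five flags still flows through the existing transformer_config_kwargs pass-through.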

GLM-4.7-Flash Bridge Registration (TODO-2)

  • Registers Glm4MoeLiteForCausalLM as a trivial DeepSeekV3Bridge subclass (same approach as slime — the architecture is identical)
  • Enables AutoBridge.from_hf_pretrained() to handle GLM-4.7-Flash models
  • Even the latest Megatron-Bridge HEAD does not include this registration, so we add it ourselves

Megatron-Bridge Bump

  • Bumps megatron-bridge from 04e370ee (Jan 14) to b058b662 (HEAD, +252 commits)
  • Includes: DeepSeek-V3 H100 large-scale config fixes, num_query_groups mapping fix, MoE FlexDispatcher backend fix, memory savings for MoE param_l2_norm

Correctness Fixes (TODO-3)

  • C1: Set config.grad_scale_func = optimizer.scale_loss for proper mixed-precision loss scaling (latent bug, no-op with bf16 but needed for fp16/MoE aux losses)
  • C4: Vary random seed by PP rank (seed + 100 * pp_rank) to decorrelate dropout masks across pipeline stages
  • C5: Add pause_generation/resume_generation for non-colocated weight sync to prevent inference engines from seeing partially-updated weights
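The C1 fix wires the optimizer's loss-scaling hook into the model config. A toy sketch of why that hook matters for fp16 (the ToyLossScaler class is illustrative; in Megatron the hook is literally config.grad_scale_func = optimizer.scale_loss, called on the loss before backward):

```python
class ToyLossScaler:
    """Toy stand-in for a mixed-precision loss scaler."""

    def __init__(self, scale: float = 2.0 ** 12):
        self.scale = scale

    def scale_loss(self, loss: float) -> float:
        # Multiply the loss by a large factor before backward so small
        # fp16 gradients do not underflow to zero.
        return loss * self.scale

    def unscale_grad(self, grad: float) -> float:
        # Gradients are divided back down before the optimizer step.
        return grad / self.scale

scaler = ToyLossScaler()
tiny_loss = 1e-7                       # near the fp16 subnormal floor (~6e-8)
scaled = scaler.scale_loss(tiny_loss)  # lifted well into representable range
recovered = scaler.unscale_grad(scaled)
```

Because the scale is a power of two, scaling and unscaling round-trip exactly; with bf16 this path is a no-op, which is why the missing hook was a latent bug.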

GLM-4.7-Flash Example Config (TODO-4)

  • YAML config for GRPO training with Megatron (TP=2, EP=8, sigmoid routing, expert bias)
  • Launch script following existing conventions
  • Model: zai-org/GLM-4.7-Flash (64 routed experts, 4 active per token, ~3B active params)

Remaining Blockers for Full RL Loop

  • transformers: May need >=5.0 for Glm4MoeLiteForCausalLM class (need to test if trust_remote_code=True works on 4.x)
  • vLLM: Needs nightly (>=0.14) for GLM inference engine support (only blocks rollout generation, not Megatron training)

GPU Validation Needed

  • Smoke test: GLM-4.7-Flash initializes correctly with new MoE flags via AutoBridge + bridge registration
  • Weight sync round-trip: Megatron → HF export → vLLM load for MoE model
  • E2E training: 1-2 steps of GRPO on a small MoE model
  • Pause/resume weight sync: verify non-colocated path works end-to-end
  • Megatron-bridge bump: verify no regressions on existing models (Qwen, Llama, Moonlight)
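The pause/resume handshake being validated can be sketched as a simple gate: generation blocks while weights are mid-update, so engines never observe a partially-updated state. The Engine class and method bodies below are illustrative stand-ins, not SkyRL's actual API:

```python
import threading

class Engine:
    """Toy inference engine with a pause/resume gate around generation."""

    def __init__(self):
        self._running = threading.Event()
        self._running.set()            # generation allowed by default
        self.weights_version = 0

    def pause_generation(self):
        self._running.clear()

    def resume_generation(self):
        self._running.set()

    def generate(self):
        self._running.wait()           # blocks while a weight sync is in flight
        return self.weights_version

def sync_weights(engine, new_version):
    engine.pause_generation()
    try:
        # In the real non-colocated path, the partial update happens here,
        # parameter by parameter; no request can read it mid-flight.
        engine.weights_version = new_version
    finally:
        engine.resume_generation()

eng = Engine()
sync_weights(eng, 1)
```

The try/finally mirrors the important invariant: generation must resume even if the sync fails partway.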

Test plan

  • CPU unit tests for config dataclass fields and defaults (6 tests)
  • CPU unit tests for weight sync pause/resume lifecycle (3 tests)
  • All changes verified by independent review sub-agents
  • GPU CI: MegatronModelWrapper init with grad_scale_func
  • GPU CI: seed variation with PP>1
  • GPU E2E: MoE model training loop

Expose 5 MoE runtime config flags as first-class fields on MegatronConfig:
- moe_token_dispatcher_type (replaces hardcoded "alltoall")
- moe_router_load_balancing_type (replaces hardcoded "none")
- moe_grouped_gemm (enables fused expert computation)
- moe_router_score_function (sigmoid for GLM/DeepSeek-V3 models)
- moe_router_enable_expert_bias (expert bias for load balancing)

Correctness fixes:
- Set config.grad_scale_func = optimizer.scale_loss on the Megatron model
  config for proper mixed-precision loss scaling (latent bug, no-op with
  bf16 default but needed for fp16/MoE aux losses)
- Vary random seed by PP rank (seed + 100 * pp_rank) to decorrelate
  dropout masks across pipeline stages
- Add pause_generation/resume_generation for non-colocated weight sync
  to prevent inference engines from seeing partially-updated weights
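The seed-offset fix is small enough to sketch end to end. Everything below is a toy illustration of the seed + 100 * pp_rank scheme, not Megatron's RNG machinery:

```python
import random

def seed_for_pp_rank(base_seed: int, pp_rank: int) -> int:
    # The fix: offset the base seed by 100 per pipeline-parallel rank.
    return base_seed + 100 * pp_rank

base_seed = 42
draws = []
for pp_rank in range(2):
    rng = random.Random(seed_for_pp_rank(base_seed, pp_rank))
    # Stand-in for a dropout mask: four random draws per stage.
    draws.append([rng.random() for _ in range(4)])

# Distinct seeds -> distinct RNG streams, so the two pipeline stages
# no longer apply identical (correlated) dropout masks.
```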

Adds GLM4.7-30B-A3B MoE example config and launch script.
Adds CPU unit tests for config fields and weight sync lifecycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Register Glm4MoeLiteForCausalLM as a trivial DeepSeekV3Bridge subclass
  (same approach as slime: the architecture is identical). This allows
  AutoBridge.from_hf_pretrained() to handle GLM-4.7-Flash models.

- Bump megatron-bridge from 04e370ee (Jan 14) to b058b662 (HEAD, 252
  commits). Includes DeepSeek-V3 H100 large-scale config fixes,
  num_query_groups mapping fix, MoE FlexDispatcher backend fix, and
  memory savings for MoE param_l2_norm.

- Fix GLM config: correct model ID (zai-org/GLM-4.7-Flash), correct
  expert counts (64 routed experts, 4 active per token).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist bot left a comment


Code Review

This pull request introduces significant enhancements for training Mixture-of-Experts (MoE) models with the Megatron backend. The changes are well-structured, including new MoE configuration options, important correctness fixes for mixed-precision training and distributed seeding, and safer weight synchronization for non-colocated setups. The addition of a new example configuration and comprehensive unit tests is commendable. My main feedback is a minor maintainability suggestion for the example shell script to reduce configuration duplication.

Comment on lines +65 to +74
trainer.policy.megatron_config.moe_token_dispatcher_type=$MOE_TOKEN_DISPATCHER \
trainer.policy.megatron_config.moe_router_load_balancing_type=$MOE_ROUTER_LB \
trainer.policy.megatron_config.moe_grouped_gemm=$MOE_GROUPED_GEMM \
trainer.policy.megatron_config.moe_router_score_function=$MOE_ROUTER_SCORE_FN \
trainer.policy.megatron_config.moe_router_enable_expert_bias=$MOE_ROUTER_EXPERT_BIAS \
trainer.ref.megatron_config.moe_token_dispatcher_type=$MOE_TOKEN_DISPATCHER \
trainer.ref.megatron_config.moe_router_load_balancing_type=$MOE_ROUTER_LB \
trainer.ref.megatron_config.moe_grouped_gemm=$MOE_GROUPED_GEMM \
trainer.ref.megatron_config.moe_router_score_function=$MOE_ROUTER_SCORE_FN \
trainer.ref.megatron_config.moe_router_enable_expert_bias=$MOE_ROUTER_EXPERT_BIAS \


Severity: medium

These MoE configuration parameters are duplicated for both the policy and ref models. This is a maintainability risk, as any changes to the policy's MoE config must be manually mirrored for the reference model. A mismatch could lead to subtle bugs due to model inconsistency.

Consider leveraging Hydra's interpolation, as done in the corresponding YAML configuration (skyrl/train/config/examples/glm4_7_30b_moe_grpo.yaml), to ensure the reference model's configuration is always in sync with the policy model's. If this script is intended to override the YAML, you could remove the trainer.ref.* overrides and let them be inherited from the policy config via interpolation in the base YAML config.
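The suggestion boils down to making the ref value a reference to the policy value, so one override updates both. The pure-Python toy resolver below mimics how Hydra/OmegaConf resolves `${...}` interpolations at access time (the resolver and config dict are illustrative, not the project's actual config machinery):

```python
def resolve(cfg, dotted_key):
    """Look up a dotted key; follow "${...}" references recursively."""
    val = cfg
    for part in dotted_key.split("."):
        val = val[part]
    if isinstance(val, str) and val.startswith("${") and val.endswith("}"):
        return resolve(cfg, val[2:-1])   # follow the interpolation
    return val

cfg = {
    "trainer": {
        "policy": {"megatron_config": {"moe_grouped_gemm": True}},
        "ref": {"megatron_config": {
            # The ref side references the policy side instead of
            # duplicating the literal value:
            "moe_grouped_gemm": "${trainer.policy.megatron_config.moe_grouped_gemm}",
        }},
    },
}

# Overriding only the policy value...
cfg["trainer"]["policy"]["megatron_config"]["moe_grouped_gemm"] = False
# ...changes what the ref side resolves to, with no second override needed.
```

With interpolation in the base YAML, the launch script needs only the trainer.policy.* overrides; the trainer.ref.* duplicates can be dropped.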

@tyler-griggs

@claude review


Set provider.moe_grouped_gemm unconditionally instead of guarding with
a truthiness check. The previous `if megatron_config.moe_grouped_gemm:`
would skip the assignment when the value is False, making it impossible
to override a provider's auto-detected True default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
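The bug and fix can be illustrated in a few lines (the Provider class and apply_* helpers are stand-ins for the real provider/megatron_config plumbing):

```python
class Provider:
    moe_grouped_gemm = True   # auto-detected default for an MoE model

def apply_buggy(provider, moe_grouped_gemm):
    # Truthiness guard: silently skips the assignment when the
    # configured value is False, so True can never be overridden.
    if moe_grouped_gemm:
        provider.moe_grouped_gemm = moe_grouped_gemm

def apply_fixed(provider, moe_grouped_gemm):
    # Unconditional assignment: False is a valid, respected value.
    provider.moe_grouped_gemm = moe_grouped_gemm

p_buggy, p_fixed = Provider(), Provider()
apply_buggy(p_buggy, False)   # stays True -- the bug
apply_fixed(p_fixed, False)   # becomes False as requested
```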
@tyler-griggs

Superseded by 4 focused PRs: #1212 (correctness fixes), #1213 (MoE config fields), #1214 (GLM bridge + megatron-bridge bump), #1215 (GLM example config). See each PR for details.
