WIP: Add MoE config support and correctness fixes for GLM MoE training #1210
tyler-griggs wants to merge 3 commits into main from
Conversation
Expose 5 MoE runtime config flags as first-class fields on MegatronConfig:
- moe_token_dispatcher_type (replaces hardcoded "alltoall")
- moe_router_load_balancing_type (replaces hardcoded "none")
- moe_grouped_gemm (enables fused expert computation)
- moe_router_score_function (sigmoid for GLM/DeepSeek-V3 models)
- moe_router_enable_expert_bias (expert bias for load balancing)

Correctness fixes:
- Set config.grad_scale_func = optimizer.scale_loss on the Megatron model config for proper mixed-precision loss scaling (latent bug: a no-op with the bf16 default, but needed for fp16/MoE aux losses)
- Vary the random seed by PP rank (seed + 100 * pp_rank) to decorrelate dropout masks across pipeline stages
- Add pause_generation/resume_generation for non-colocated weight sync to prevent inference engines from seeing partially-updated weights

Adds a GLM4.7-30B-A3B MoE example config and launch script. Adds CPU unit tests for the config fields and the weight sync lifecycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
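The seed decorrelation described above can be sketched as follows (a minimal illustration of the `seed + 100 * pp_rank` scheme from the commit message; the helper name is hypothetical, not the actual SkyRL function):

```python
def seed_for_pp_rank(base_seed: int, pp_rank: int) -> int:
    """Derive a per-pipeline-stage RNG seed (hypothetical helper name).

    Offsetting the base seed by 100 per pipeline-parallel rank gives each
    stage an independent RNG stream, so dropout masks are no longer
    identical across pipeline stages.
    """
    return base_seed + 100 * pp_rank

# Four pipeline stages with base seed 42 get distinct seeds:
print([seed_for_pp_rank(42, r) for r in range(4)])  # [42, 142, 242, 342]
```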
- Register Glm4MoeLiteForCausalLM as a trivial DeepSeekV3Bridge subclass (same approach as slime: the architecture is identical). This allows AutoBridge.from_hf_pretrained() to handle GLM-4.7-Flash models.
- Bump megatron-bridge from 04e370ee (Jan 14) to b058b662 (HEAD, 252 commits). Includes DeepSeek-V3 H100 large-scale config fixes, a num_query_groups mapping fix, a MoE FlexDispatcher backend fix, and memory savings for MoE param_l2_norm.
- Fix the GLM config: correct model ID (zai-org/GLM-4.7-Flash) and correct expert counts (64 routed experts, 4 active per token).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
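As an illustration of the "trivial subclass" idea: since GLM-4.7-Flash shares DeepSeek-V3's architecture, the new bridge adds no mapping logic of its own. The real megatron-bridge registration mechanism (decorator, registry, or otherwise) is not shown here; every name below except `Glm4MoeLiteForCausalLM` and `DeepSeekV3Bridge` is hypothetical.

```python
class DeepSeekV3Bridge:
    """Stand-in for the existing DeepSeek-V3 bridge in megatron-bridge."""


class Glm4MoeLiteBridge(DeepSeekV3Bridge):
    """GLM-4.7-Flash uses DeepSeek-V3's architecture, so the subclass
    inherits all weight-mapping behavior unchanged."""


# Hypothetical architecture-name -> bridge registry, of the kind something
# like AutoBridge.from_hf_pretrained() would consult when reading
# config.architectures from the HF checkpoint.
ARCH_TO_BRIDGE = {
    "DeepseekV3ForCausalLM": DeepSeekV3Bridge,
    "Glm4MoeLiteForCausalLM": Glm4MoeLiteBridge,
}

bridge_cls = ARCH_TO_BRIDGE["Glm4MoeLiteForCausalLM"]
print(bridge_cls.__name__)  # Glm4MoeLiteBridge
```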
Code Review
This pull request introduces significant enhancements for training Mixture-of-Experts (MoE) models with the Megatron backend. The changes are well-structured, including new MoE configuration options, important correctness fixes for mixed-precision training and distributed seeding, and safer weight synchronization for non-colocated setups. The addition of a new example configuration and comprehensive unit tests is commendable. My main feedback is a minor maintainability suggestion for the example shell script to reduce configuration duplication.
trainer.policy.megatron_config.moe_token_dispatcher_type=$MOE_TOKEN_DISPATCHER \
trainer.policy.megatron_config.moe_router_load_balancing_type=$MOE_ROUTER_LB \
trainer.policy.megatron_config.moe_grouped_gemm=$MOE_GROUPED_GEMM \
trainer.policy.megatron_config.moe_router_score_function=$MOE_ROUTER_SCORE_FN \
trainer.policy.megatron_config.moe_router_enable_expert_bias=$MOE_ROUTER_EXPERT_BIAS \
trainer.ref.megatron_config.moe_token_dispatcher_type=$MOE_TOKEN_DISPATCHER \
trainer.ref.megatron_config.moe_router_load_balancing_type=$MOE_ROUTER_LB \
trainer.ref.megatron_config.moe_grouped_gemm=$MOE_GROUPED_GEMM \
trainer.ref.megatron_config.moe_router_score_function=$MOE_ROUTER_SCORE_FN \
trainer.ref.megatron_config.moe_router_enable_expert_bias=$MOE_ROUTER_EXPERT_BIAS \
These MoE configuration parameters are duplicated for both the policy and ref models. This is a maintainability risk, as any changes to the policy's MoE config must be manually mirrored for the reference model. A mismatch could lead to subtle bugs due to model inconsistency.
Consider leveraging Hydra's interpolation, as done in the corresponding YAML configuration (skyrl/train/config/examples/glm4_7_30b_moe_grpo.yaml), to ensure the reference model's configuration is always in sync with the policy model's. If this script is intended to override the YAML, you could remove the trainer.ref.* overrides and let them be inherited from the policy config via interpolation in the base YAML config.
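A sketch of that interpolation pattern (field names are taken from the script above; the actual layout of glm4_7_30b_moe_grpo.yaml may differ):

```yaml
trainer:
  policy:
    megatron_config:
      moe_token_dispatcher_type: alltoall
      moe_router_load_balancing_type: none
      moe_grouped_gemm: true
      moe_router_score_function: sigmoid
      moe_router_enable_expert_bias: true
  ref:
    megatron_config:
      # Interpolation keeps the ref model in lockstep with the policy:
      # a single CLI override of the policy value updates both models.
      moe_token_dispatcher_type: ${trainer.policy.megatron_config.moe_token_dispatcher_type}
      moe_router_load_balancing_type: ${trainer.policy.megatron_config.moe_router_load_balancing_type}
      moe_grouped_gemm: ${trainer.policy.megatron_config.moe_grouped_gemm}
      moe_router_score_function: ${trainer.policy.megatron_config.moe_router_score_function}
      moe_router_enable_expert_bias: ${trainer.policy.megatron_config.moe_router_enable_expert_bias}
```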
@claude review
Set provider.moe_grouped_gemm unconditionally instead of guarding with a truthiness check. The previous `if megatron_config.moe_grouped_gemm:` would skip the assignment when the value is False, making it impossible to override a provider's auto-detected True default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
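A minimal reproduction of the override bug described in this commit (class and field names are simplified stand-ins for the real provider/config objects):

```python
from dataclasses import dataclass


@dataclass
class Provider:
    # The provider auto-detects grouped GEMM as beneficial, defaulting to True.
    moe_grouped_gemm: bool = True


@dataclass
class MegatronConfig:
    moe_grouped_gemm: bool = False  # user explicitly disables grouped GEMM


provider, cfg = Provider(), MegatronConfig()

# Buggy: the truthiness guard skips the assignment when the override is
# False, so the provider's auto-detected True silently survives.
if cfg.moe_grouped_gemm:
    provider.moe_grouped_gemm = cfg.moe_grouped_gemm
assert provider.moe_grouped_gemm is True  # the user's False was dropped

# Fixed: assign unconditionally, so False is an honored override.
provider.moe_grouped_gemm = cfg.moe_grouped_gemm
assert provider.moe_grouped_gemm is False
```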
Summary
Lays the groundwork for training GLM MoE models (30B-350B) on SkyRL with the Megatron backend. Based on a comprehensive gap analysis between SkyRL and slime (see gaps/MASTER_GAP_ANALYSIS.md).

MoE Config Fields (TODO-1)
- New MegatronConfig dataclass fields: moe_token_dispatcher_type, moe_router_load_balancing_type, moe_grouped_gemm, moe_router_score_function, moe_router_enable_expert_bias
- Replaces hardcoded values in megatron_worker.py with config-driven values
- transformer_config_kwargs pass-through

GLM-4.7-Flash Bridge Registration (TODO-2)
- Registers Glm4MoeLiteForCausalLM as a trivial DeepSeekV3Bridge subclass (same approach as slime — the architecture is identical)
- Allows AutoBridge.from_hf_pretrained() to handle GLM-4.7-Flash models

Megatron-Bridge Bump
- Bumps megatron-bridge from 04e370ee (Jan 14) to b058b662 (HEAD, +252 commits)
- Includes num_query_groups mapping fix, MoE FlexDispatcher backend fix, memory savings for MoE param_l2_norm

Correctness Fixes (TODO-3)
- Sets config.grad_scale_func = optimizer.scale_loss for proper mixed-precision loss scaling (latent bug: a no-op with bf16 but needed for fp16/MoE aux losses)
- Varies the random seed by PP rank (seed + 100 * pp_rank) to decorrelate dropout masks across pipeline stages
- Adds pause_generation/resume_generation for non-colocated weight sync to prevent inference engines from seeing partially-updated weights

GLM-4.7-Flash Example Config (TODO-4)
- zai-org/GLM-4.7-Flash (64 routed experts, 4 active per token, ~3B active params)

Remaining Blockers for Full RL Loop
- >=5.0 for the Glm4MoeLiteForCausalLM class (need to test if trust_remote_code=True works on 4.x)
- >=0.14 for GLM inference engine support (only blocks rollout generation, not Megatron training)

GPU Validation Needed
Test plan