Description
The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This roadmap is tentative and subject to change.
🎉 This roadmap is based on the dev branch; see its README for details.
Model Support
✅ DeepSeek
✅ Qwen
✅ Qwen2-57B-A14B
✅ Qwen3-235B-A22B
✅ (🚀New!) Qwen3-Next
✅ Mixtral
Core MoE Functionality
✅ Token dropless MoE - Advanced routing without token dropping
✅ Top-K Router with flexible K selection
✅ Load balancing losses for expert load balancing optimization
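To make the three items above concrete, here is a minimal pure-Python sketch of dropless top-K routing with a Switch-style auxiliary load-balancing loss. It is an illustration under stated assumptions, not Megatron Core's implementation: the function names (`route_top_k`, `load_balancing_loss`) are hypothetical, the router is assumed to apply softmax before top-K selection, and the loss is assumed to take the common form `num_experts * sum(f_i * P_i)`, where `f_i` is the fraction of routed slots sent to expert `i` and `P_i` is the mean router probability for expert `i`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Dropless top-K routing: every token keeps exactly k experts.

    Returns one (expert_ids, weights) pair per token, with the weights
    renormalized over the selected experts.
    """
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        norm = sum(probs[e] for e in top)
        assignments.append((top, [probs[e] / norm for e in top]))
    return assignments


def load_balancing_loss(router_logits, k):
    """Switch-style auxiliary loss (assumed form): E * sum_i f_i * P_i."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(logits) for logits in router_logits]

    # f_i: fraction of the (token, slot) pairs routed to expert i.
    counts = [0] * num_experts
    for experts, _ in route_top_k(router_logits, k):
        for e in experts:
            counts[e] += 1
    frac_tokens = [c / (num_tokens * k) for c in counts]

    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]

    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))
```

Under this formulation the loss equals 1.0 when the router probabilities are uniform and approaches the number of experts as all tokens collapse onto a single expert, which is why minimizing it pushes the router toward balanced expert load.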
Advanced Parallelism
✅ Expert Parallel (EP) with 3D parallelism integration
✅ Full parallelism combination: EP + DP + TP + PP + SP support
✅ Context Parallel (CP) for long sequence MoE training
✅ Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
✅ (🚀New!) Megatron FSDP/HSDP with full expert parallel support
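The way these parallel dimensions combine can be checked with simple arithmetic. The sketch below assumes the folding scheme in which dense (attention) layers factor the world size as TP x CP x DP x PP, while MoE layers factor the same ranks as expert-TP x EP x expert-DP x PP. The helper name `derive_data_parallel_sizes` and its parameter names are hypothetical, chosen for illustration only.

```python
def derive_data_parallel_sizes(world_size, tp, cp, pp, ep, etp):
    """Derive dense and expert data-parallel sizes from the other degrees.

    Assumed factorizations of the same set of ranks:
      dense layers: world_size = tp * cp * dp * pp
      MoE layers:   world_size = etp * ep * edp * pp
    Sequence parallelism reuses the tensor-parallel group, so it adds no factor.
    """
    dense_model_size = tp * cp * pp
    if world_size % dense_model_size != 0:
        raise ValueError("world_size must be divisible by tp * cp * pp")
    dp = world_size // dense_model_size

    expert_model_size = etp * ep * pp
    if world_size % expert_model_size != 0:
        raise ValueError("world_size must be divisible by etp * ep * pp")
    edp = world_size // expert_model_size

    return dp, edp


# Example: 64 GPUs, TP=2, CP=1, PP=4 for attention; EP=8, expert-TP=1 for MoE.
dp, edp = derive_data_parallel_sizes(world_size=64, tp=2, cp=1, pp=4, ep=8, etp=1)
print(dp, edp)  # 8 2
```

In this example the attention layers see a data-parallel size of 8, while the expert layers see an expert data-parallel size of 2, because the eight expert-parallel ranks absorb part of what would otherwise be data parallelism.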
Optimizations
✅ Memory Efficient token permutation
✅ Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
✅ GroupedGEMM and Gradient Accumulation Fusion
✅ DP/PP/TP/EP Communication Overlapping
✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
✅ cuDNN fused Attention and FlashAttn integration
✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
✅ (🚀New!) Muon and Layer-wise distributed optimizer
✅ (🚀New!) Pipeline-aware fine-grained activation offloading (see "[Dev] feat(moe): Fine-grained activation offloading" #1912)
✅ (🚀New!) Production-ready CUDA Graph support for MoE
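As background for the token permutation and GroupedGEMM items above, the following pure-Python sketch shows the general idea: tokens are reordered so that each expert's inputs are contiguous, per-expert counts describe the group sizes a grouped GEMM would consume, and the inverse mapping restores the original order afterwards. This is a conceptual illustration only; the fused, memory-efficient kernels in Megatron Core are not reproduced here, and the names `permute_by_expert` and `unpermute` are hypothetical.

```python
def permute_by_expert(tokens, expert_ids, num_experts):
    """Group tokens by their assigned expert using a stable sort.

    Returns the permuted tokens, the sort order (needed to undo the
    permutation), and the number of tokens per expert.
    """
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    permuted = [tokens[i] for i in order]

    tokens_per_expert = [0] * num_experts
    for e in expert_ids:
        tokens_per_expert[e] += 1

    return permuted, order, tokens_per_expert


def unpermute(permuted, order):
    """Scatter expert outputs back to the original token positions."""
    restored = [None] * len(permuted)
    for dst, src in enumerate(order):
        restored[src] = permuted[dst]
    return restored


tokens = ["a", "b", "c", "d"]
expert_ids = [1, 0, 1, 0]
permuted, order, counts = permute_by_expert(tokens, expert_ids, num_experts=2)
print(permuted, counts)            # ['b', 'd', 'a', 'c'] [2, 2]
print(unpermute(permuted, order))  # ['a', 'b', 'c', 'd']
```

Keeping only the sort order and the per-expert counts, rather than a dense one-hot dispatch mask, is one reason an index-based permutation uses less memory than a mask-based one.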
Precision Support
✅ GroupedGEMM including FP8/MXFP8 support
✅ FP8 weights with BF16 optimizer states
✅ FP8 training full support
Optimized Expert Parallel Communication Support
✅ DeepEP support for H100 and B200
✅ (🚀New!) HybridEP for GB200
Developer Experience
✅ MoE Model Zoo with pre-training best practices
✅ MCore2HF Converter for ecosystem compatibility in megatron-bridge
✅ Distributed Checkpointing Support
✅ Runtime Upcycling Support for efficient model scaling
✅ Layer-wise logging for detailed monitoring
Next Release Roadmap (MCore v0.17)
Performance & Kernel Optimizations
Long Context & Context Parallel
Model & Architecture
Advanced Functionality
CUDA Graph Enhancements
Ongoing Long-term Features
E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
Sync-Free and Full-Iter CUDA Graph MoE Training
CPU Overhead Optimizations for Blackwell Performance
MLA Optimizations
THD and Long Context
Megatron FSDP Performance Optimization for MoE Training
Kernel fusions and optimizations for MoE models from TE (see "MoE training optimization", TransformerEngine#2438)
New Architecture Support
v0.16 Update Highlights
Performance & Memory
CUDA Graph
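Model & Parallelism
--fake-process-group for profiling (see "[DEV] Add support of fake distributed process group" #2254)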
Fine-grained Activation Offloading Enhancement
Megatron-FSDP
Communication
Optimizer
Critical Bug Fixes
Call for Community Contributions
Model implementations - Additional MoE model variants
Performance testing - Performance tests across different platforms and workloads
Documentation and tutorials - Best practices and optimization guides
Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution