Projection Accuracy, communication model core changes and more optimizations added #644
Merged
wenxie-amd merged 34 commits into main on May 11, 2026
Conversation
… scheduler comparison

Performance projection fixes:
- Fix double-counting of DeepEP A2A overlap when EP is unchanged
- Correctly reconstruct sequential compute time when EP changes with DeepEP ON
- Fix VPP handling: use interleaved_1f1b when zero-bubble is requested with VPP > 1 (see the sketch after this list)

Scheduler comparison (--pipeline-schedule-algorithm):
- Thread scheduler_algorithm from the CLI through the projection engine
- Add zbv-formatted and zbv-greedy as CLI choices
- Add _print_scheduler_comparison for a multi-scheduler results table
- 'all' mode runs all applicable schedulers plus the SeaAILab ILP and picks the best

CLI fixes:
- Re-add the --pipeline-schedule-algorithm argument with full choices
- Rename megatron-ilp to seaailab-ilp
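The VPP rule reduces to a small dispatch check. A minimal sketch, assuming hypothetical names (`pick_schedule`, `requested`, and `vpp_size` are illustrative, not the projection engine's actual API):

```python
def pick_schedule(requested: str, vpp_size: int) -> str:
    """Illustrative version of the VPP fix described above."""
    # Zero-bubble schedules are modeled for VPP == 1; once VPP > 1 the
    # interleaved 1F1B schedule is the layout that should be projected.
    if requested.startswith("zb") and vpp_size > 1:
        return "interleaved_1f1b"
    return requested
```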
…ive ratio

The ratio-based A2A scaling (measured × analytical_ratio) amplified dispatch/combine overhead that does not scale with EP. Switch to an additive correction (measured + analytical_delta), which adjusts only the communication portion; a worked example follows below.

Also fix experts_per_rank handling during sub-node benchmarking:
- When benchmark EP >= 2, preserve the original num_experts for an accurate A2A routing-sparsity measurement
- Estimate the expert MLP GEMM delta analytically instead of scaling the entire MLP compute (which is dominated by non-scaling routing overhead)
- Track the per-layer A2A adjustment to avoid double-correction in the multinode projection
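A worked example with made-up numbers shows why the ratio overshoots: the fixed dispatch/combine overhead baked into the measurement gets multiplied along with the true communication time.

```python
measured_ms = 2.0           # benchmarked A2A at EP_bench: comm + fixed overhead
analytical_bench_ms = 0.8   # analytical comm portion at EP_bench
analytical_target_ms = 1.6  # analytical comm portion at the target EP

ratio_based = measured_ms * (analytical_target_ms / analytical_bench_ms)  # 4.0 ms
additive = measured_ms + (analytical_target_ms - analytical_bench_ms)     # 2.8 ms
# The ratio doubles the 1.2 ms of overhead that does not scale with EP;
# the additive delta adjusts only the 0.8 ms communication portion.
```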
…delta

Always reduce num_experts proportionally so that experts_per_rank is preserved during benchmarking, removing the expert GEMM correction logic. The measured compute is already correct for the target because experts_per_rank is preserved; A2A scaling uses only the additive delta (analytical_target - analytical_bench).
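The proportional reduction is simple arithmetic; the numbers below are illustrative, not taken from the PR:

```python
num_experts, ep_target = 256, 64  # target config: 256 / 64 = 4 experts per rank
ep_bench = 8                      # the EP that fits on the benchmark node

bench_experts = num_experts * ep_bench // ep_target  # 32 experts at EP 8
assert bench_experts // ep_bench == num_experts // ep_target  # epr stays 4
```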
Refactor the intra-node A2A overhead to use a fixed sync cost plus a small per-peer cost instead of a large per-peer linear scaling. Generalize the mesh derate in get_effective_node_bw to accept arbitrary group sizes for A2A. Add a PP-aware SendRecv that distinguishes intra-node transfers (a single xGMI link with p2p_bw_eff) from inter-node transfers (NIC). Add num_nodes, pp, and p2p_bw_eff to CollectiveArgs.
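A hedged sketch of the overhead shape this describes; the constants and the function name are assumptions, not the calibrated values:

```python
def intra_node_a2a_us(bytes_per_peer: float, num_peers: int,
                      link_bw_gb_s: float,
                      sync_us: float = 10.0,              # fixed sync cost
                      per_peer_us: float = 0.5) -> float:  # small per-peer cost
    # 1 GB/s == 1e3 bytes/us, so the transfer time in microseconds is:
    transfer_us = (num_peers * bytes_per_peer) / (link_bw_gb_s * 1e3)
    # Overhead is a fixed sync plus a small per-peer term, rather than a
    # large cost that scales linearly per peer.
    return sync_us + num_peers * per_peer_us + transfer_us
```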
…, and NIC RDMA warmup

Add hierarchical_allreduce (intra-node reduce-scatter, inter-node all-reduce, intra-node all-gather) with a size-dependent pipeline overlap model. Refactor run_allgather and run_reduce_scatter to use the overlap model for multi-node jobs. Replace the fixed hier_inter_node_overhead with a power-law NIC RDMA warmup model (nic_rdma_setup_us, nic_warmup_bytes) whose cost decays as the per-NIC data approaches peak throughput; a sketch follows below. Add an RCCL fixed overhead (rccl_overhead_us) and pipeline parameters (ar_overlap_factor, ar_warmup_chunk_bytes) to CollectiveArgs. Calibrated against MI325X preflight measurements.

Made-with: Cursor
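One plausible reading of the power-law warmup; the functional form, exponent, and default values are assumptions, and only the parameter names come from the commit:

```python
def nic_rdma_warmup_us(bytes_per_nic: float,
                       nic_rdma_setup_us: float = 25.0,
                       nic_warmup_bytes: float = 4 * 2**20,
                       exponent: float = 0.5) -> float:
    # Full setup cost for small messages; decays as a power law once the
    # per-NIC payload grows past the warmup size toward peak throughput.
    ratio = nic_warmup_bytes / max(bytes_per_nic, 1.0)
    return nic_rdma_setup_us * min(1.0, ratio ** exponent)
```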
…r-node BW

Apply a quadratic mesh contention derate for intra-node A2A (a2a_mesh_contention) that grows with link saturation; see the sketch below. Use the P2P bandwidth efficiency (p2p_bw_eff) for inter-node A2A and SendRecv instead of the collective bw_eff, since A2A uses point-to-point NIC streams. Add a per-remote-node contention derate (a2a_remote_contention) for NIC QP multiplexing overhead. Add a fixed A2A RCCL overhead (a2a_rccl_overhead_us). Refine the NIC RDMA warmup to account for inter-node AllReduce steps. Store _raw_pod_bw for P2P efficiency calculations.
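An illustrative form of the quadratic derate; the coefficient value and the exact formula are assumptions, only the a2a_mesh_contention name is from the commit:

```python
def effective_a2a_bw(link_bw_gb_s: float, saturation: float,
                     a2a_mesh_contention: float = 0.3) -> float:
    # saturation is the fraction of xGMI link utilization in [0, 1];
    # contention grows quadratically as the mesh links saturate.
    s = min(max(saturation, 0.0), 1.0)
    return link_bw_gb_s / (1.0 + a2a_mesh_contention * s * s)
```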
Replace the constant 0.93 FSDP overlap factor with a model that computes overlap from the per-layer compute/comm ratio. AG forward, AG recompute, and RS each get independent overlap fractions with calibrated ceilings (0.95 / 0.93 / 0.92). When compute_time_ms is available, per-layer compute is split using an empirical fwd/bwd ratio (0.37 / 0.63) and compared against per-layer comm; the sketch below illustrates the shape of the model. Falls back to the legacy 0.93 constant when compute time is not provided.
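A minimal sketch, assuming hypothetical helper names; the ceilings and the 0.37/0.63 split are the numbers quoted above, while which compute bucket each collective is compared against is an assumption:

```python
FWD_FRAC, BWD_FRAC = 0.37, 0.63  # empirical per-layer fwd/bwd compute split

def overlap_fraction(compute_ms: float, comm_ms: float, ceiling: float) -> float:
    # Comm can only hide behind the compute running concurrently with it,
    # up to a calibrated ceiling for that collective.
    if comm_ms <= 0.0:
        return ceiling
    return min(ceiling, compute_ms / comm_ms)

def fsdp_overlaps(compute_time_ms, ag_fwd_ms, ag_rc_ms, rs_ms):
    if compute_time_ms is None:
        return 0.93, 0.93, 0.93  # legacy constant fallback
    fwd = FWD_FRAC * compute_time_ms
    bwd = BWD_FRAC * compute_time_ms
    return (overlap_fraction(fwd, ag_fwd_ms, 0.95),  # AG forward
            overlap_fraction(bwd, ag_rc_ms, 0.93),   # AG recompute (bwd pass)
            overlap_fraction(bwd, rs_ms, 0.92))      # RS
```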
When the benchmark EP is reduced below the original EP, the compute measured at the reduced config includes artifacts (routing differences, A2A contention on attention) that misrepresent the target config. This adds an auto-hybrid path: rank 0 spawns a bg=1 subprocess to capture clean per-layer compute, then merges it with the measured bg=N A2A at projection time; a sketch of the handoff follows below. Exposes hidden CLI flags (--save-profiling, --compute-baseline, --profile-only) used by the subprocess handoff and also available for manual workflows.

Made-with: Cursor
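A hedged sketch of the rank-0 handoff. The three flags are from the commit; the entry point, whether --save-profiling takes a path, and the function name are all assumptions:

```python
import subprocess
import sys

def capture_clean_compute_baseline(rank: int, out_path: str = "baseline_bg1.json"):
    if rank != 0:
        return  # only rank 0 spawns the bg=1 subprocess
    subprocess.run(
        [sys.executable, "-m", "projection",  # hypothetical entry point
         "--profile-only",                    # profiling pass only
         "--compute-baseline",                # clean bg=1 per-layer compute
         "--save-profiling", out_path],       # results handed back via file
        check=True,
    )
    # The saved baseline is merged with the measured bg=N A2A at
    # projection time (merge step not shown).
```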
…nable for profiling

- Add an analytical LossProfiler for cross-entropy loss timing (fused vs. unfused) and integrate it into the language-model profiling output-layer stats
- Add a --sync-free-stage CLI arg and a tiered DeepEP/SyncFree overlap efficiency model (stage 0 = 65%, 1 = 75%, 2 = 80%, 3 = 85%), replacing the hardcoded 0.65; see the lookup sketch after this list
- Auto-enable Primus Turbo kernels (PrimusMLASelfAttention, TurboColumnParallelLinear, TurboGroupedMLP) during profiling when MLA or DeepEP is configured, matching the measured training stack
- Fix _limit_layers_for_projection to profile both dense and MoE layer types (2 layers minimum) for correct extrapolation in mixed-layout models
- Add turbo_sync_free_moe_stage and cross_entropy_loss_fusion to ModelConfig
- Clean up the memory projection to use load_primus_config for CLI override support
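The tiered efficiencies reduce to a lookup; the table values are from the commit, the surrounding names are illustrative:

```python
SYNC_FREE_OVERLAP_EFF = {0: 0.65, 1: 0.75, 2: 0.80, 3: 0.85}

def deepep_overlap_efficiency(sync_free_stage: int) -> float:
    # Unknown stages fall back to the old hardcoded 0.65.
    return SYNC_FREE_OVERLAP_EFF.get(sync_free_stage, 0.65)
```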
…ories covering MoE comm, pipeline layout, memory, parallelism, precision, batch config, and network tuning
… dispatcher assert
Made-with: Cursor
…-all, resolve conflicts by taking main for backend extensions

Made-with: Cursor
Remove the unused baseline_ep variable in _merge_hybrid_profiling, and reformat projection.py, collective_model.py, and cli/projection.py with black to pass the CI pre-commit checks.

Made-with: Cursor
Made-with: Cursor
A shallow `import primus_turbo` succeeds even when `primus_turbo.pytorch` fails to initialize (e.g. when a transitive dependency like `aiter` is broken in the runtime environment). That let us enter the HAVE_TURBO branch and crash at module-load time when the deeper imports actually ran. Probe the same deep path the HAVE_TURBO branch uses, so the gate accurately reflects whether the deeper API is importable; a sketch follows below.

test(cli-runner): bump dry-run timeouts from 5s to 30s

5s was flaky on slower or loaded machines; the runner is dry-run only here, so the longer ceiling has no real cost and stops spurious test timeouts.
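A minimal sketch of the deep probe; the exact module probed is inferred from the description above rather than taken from the diff:

```python
# Probe the same deep path the HAVE_TURBO branch imports, so the gate
# reflects whether the API we actually use is importable.
try:
    import primus_turbo.pytorch  # noqa: F401  -- fails if e.g. aiter is broken
    HAVE_TURBO = True
except Exception:  # ImportError or any init-time failure in a transitive dep
    HAVE_TURBO = False
```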
…ches

Follow-up to e1aa903. The te_spec_provider and general_gemm_workspace patches both reach into the primus_turbo / primus.backends.transformer_engine chain at patch-application time. A shallow `importlib.util.find_spec` (or no guard at all) is not enough: if a transitive dependency (e.g. `aiter`/`csrc`) is broken in the runtime image, the patch body crashes mid-application and leaves Megatron in a half-patched state, which can produce silent NaNs in FP8 training rather than a clean fallback to the stock TE provider. Guard with a deep import probe instead, as sketched below.
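A sketch of how such a guard could look, under the same deep-probe idea as the HAVE_TURBO fix; the helper name and the probed modules are assumptions:

```python
def _turbo_chain_importable() -> bool:
    # Import the full chain up front; find_spec alone would miss broken
    # transitive dependencies that only fail at import time.
    try:
        import primus_turbo.pytorch                 # noqa: F401
        import primus.backends.transformer_engine   # noqa: F401
        return True
    except Exception:
        return False

def apply_te_spec_provider_patch():
    if not _turbo_chain_importable():
        return  # clean fallback to the stock TE provider, never half-patched
    ...         # patch body runs only when the whole chain is importable
```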
```python
try:
    from megatron.core.models.gpt import moe_module_specs

    moe_module_specs.GroupedMLP = DeprecatedGroupedMLP
except ImportError:
    ...
```
wenxie-amd approved these changes on May 11, 2026
Projection Accuracy, communication model core changes, and more optimizations added. A TE GEMM patch is also included.