
Projection Accuracy, communication model core changes and more optimizations added#644

Merged
wenxie-amd merged 34 commits into main from
araina/dev/projection-deepep-fix-and-scheduler-all
May 11, 2026
Conversation

Collaborator

@araina-amd araina-amd commented Apr 6, 2026

Projection accuracy fixes, communication model core changes, and more optimizations. A TE GEMM patch is also included.

root and others added 6 commits April 1, 2026 09:43
… scheduler comparison

Performance projection fixes:
- Fix double-counting of DeepEP A2A overlap when EP is unchanged
- Correctly reconstruct sequential compute time when EP changes with DeepEP ON
- Fix VPP handling: use interleaved_1f1b when zero-bubble + VPP>1

Scheduler comparison (--pipeline-schedule-algorithm):
- Thread scheduler_algorithm from CLI through projection engine
- Add zbv-formatted and zbv-greedy as CLI choices
- Add _print_scheduler_comparison for multi-scheduler results table
- 'all' mode runs every applicable scheduler plus the SeaAILab ILP and picks the best (see the sketch after this list)

CLI fixes:
- Re-add --pipeline-schedule-algorithm argument with full choices
- Rename megatron-ilp to seaailab-ilp
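A minimal sketch of what the 'all' mode amounts to; the function name `pick_best_scheduler`, the `project_step_ms` callback, and the scheduler list are illustrative, not the actual Primus API:

```python
# Hypothetical sketch: project every applicable scheduler, keep the fastest.
SCHEDULERS = ["1f1b", "interleaved_1f1b", "zero_bubble",
              "zbv-formatted", "zbv-greedy", "seaailab-ilp"]

def pick_best_scheduler(config, project_step_ms):
    """Return the fastest scheduler name plus the full results table."""
    results = {}
    for name in SCHEDULERS:
        try:
            results[name] = project_step_ms(config, scheduler=name)
        except ValueError:
            continue  # scheduler not applicable to this parallel layout
    best = min(results, key=results.get)
    return best, results  # the full dict would feed _print_scheduler_comparison
```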
…ive ratio

The ratio-based A2A scaling (measured × analytical_ratio) amplified
dispatch/combine overhead that doesn't scale with EP. Switch to an additive
correction (measured + analytical_delta) that adjusts only the
communication portion (sketched after the list below).

Also fix experts_per_rank handling during sub-node benchmarking:
- When benchmark EP >= 2, preserve original num_experts for accurate
  A2A routing sparsity measurement
- Estimate expert MLP GEMM delta analytically instead of scaling the
  entire MLP compute (which is dominated by non-scaling routing overhead)
- Track per-layer A2A adjustment to avoid double-correction in multinode
  projection
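The additive correction in one line; a sketch only, with illustrative names:

```python
# Additive correction: keep the measured dispatch/combine overhead fixed and
# adjust only the bandwidth-dependent communication portion.
def project_a2a_ms(measured_ms, analytical_bench_ms, analytical_target_ms):
    # Old ratio scaling, measured_ms * (target / bench), would also amplify
    # overhead that does not scale with EP.
    return measured_ms + (analytical_target_ms - analytical_bench_ms)
```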
…delta

Always reduce num_experts proportionally to preserve experts_per_rank during benchmarking, removing the expert GEMM correction logic. The measured compute is already correct for the target since experts_per_rank is preserved. A2A scaling uses only the additive delta (analytical_target - analytical_bench).
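The proportional reduction as a sketch (the function name is an assumption; the relationship follows directly from the commit message):

```python
# Shrink num_experts with EP so each rank still hosts the same number of
# experts during sub-node benchmarking.
def bench_num_experts(num_experts, ep_target, ep_bench):
    experts_per_rank = num_experts // ep_target  # the preserved quantity
    return experts_per_rank * ep_bench           # benchmark-time expert count
```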
Refactor intra-node A2A overhead to use fixed sync + small per-peer cost instead of large per-peer linear scaling. Generalize mesh derate in get_effective_node_bw to accept arbitrary group sizes for A2A. Add PP-aware SendRecv that distinguishes intra-node (single xGMI link with p2p_bw_eff) from inter-node (NIC). Add num_nodes, pp, and p2p_bw_eff to CollectiveArgs.
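A sketch of the revised intra-node overhead shape; the constants and parameter names (`sync_us`, `per_peer_us`) are placeholders, not calibrated values:

```python
# Old model: overhead grew linearly with a large per-peer coefficient.
# New model: one fixed synchronization cost plus a small per-peer term.
def intra_node_a2a_overhead_us(num_peers, sync_us=10.0, per_peer_us=0.5):
    return sync_us + per_peer_us * (num_peers - 1)
```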
…, and NIC RDMA warmup

Add hierarchical_allreduce (intra-RS, inter-AR, intra-AG) with size-dependent pipeline overlap model. Refactor run_allgather and run_reduce_scatter to use overlap model for multi-node. Replace fixed hier_inter_node_overhead with power-law NIC RDMA warmup model (nic_rdma_setup_us, nic_warmup_bytes) that decays as per-NIC data approaches peak throughput. Add RCCL fixed overhead (rccl_overhead_us) and pipeline parameters (ar_overlap_factor, ar_warmup_chunk_bytes) to CollectiveArgs. Calibrated against MI325X preflight measurements.

Made-with: Cursor
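A hedged sketch of the power-law NIC RDMA warmup idea: effective bandwidth ramps toward peak as the per-NIC message size grows past `nic_warmup_bytes` (a parameter named in the commit); the exponent `alpha` and the exact functional form here are illustrative, not the calibrated MI325X fit:

```python
# Small transfers pay setup cost; large ones approach peak throughput.
def nic_effective_bw(bytes_per_nic, peak_bw_gbps,
                     nic_warmup_bytes=1 << 20, alpha=0.5):
    if bytes_per_nic >= nic_warmup_bytes:
        return peak_bw_gbps
    return peak_bw_gbps * (bytes_per_nic / nic_warmup_bytes) ** alpha
```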
…r-node BW

Apply quadratic mesh contention derate for intra-node A2A (a2a_mesh_contention) that grows with link saturation. Use P2P bandwidth efficiency (p2p_bw_eff) for inter-node A2A and SendRecv instead of collective bw_eff, since A2A uses point-to-point NIC streams. Add per-remote-node contention derate (a2a_remote_contention) for NIC QP multiplexing overhead. Add fixed A2A RCCL overhead (a2a_rccl_overhead_us). Refine NIC RDMA warmup to account for inter-node AllReduce steps. Store _raw_pod_bw for P2P efficiency calculations.
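An illustrative shape for the quadratic contention derate; `a2a_mesh_contention` is the knob named in the commit, but its value and the exact formula here are placeholders:

```python
# Effective xGMI bandwidth drops faster as link saturation rises.
def mesh_derated_bw(raw_bw, saturation, a2a_mesh_contention=0.15):
    # saturation in [0, 1]; the derate grows with its square
    return raw_bw * (1.0 - a2a_mesh_contention * saturation ** 2)
```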
Replace the constant 0.93 FSDP overlap factor with a model that computes overlap from the per-layer compute/comm ratio. AG forward, AG recompute, and RS each get independent overlap fractions with calibrated ceilings (0.95/0.93/0.92). When compute_time_ms is available, per-layer compute is split using empirical fwd/bwd ratio (0.37/0.63) and compared against per-layer comm. Falls back to legacy 0.93 constant when compute time is not provided.
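A sketch of the ratio-based overlap model using the ceilings and fwd/bwd split quoted above; the function names and composition are assumptions:

```python
def _overlap_fraction(compute_ms, comm_ms, ceiling):
    # Fraction of a comm op hidden under compute, capped at a calibrated ceiling.
    return min(ceiling, compute_ms / max(comm_ms, 1e-6))

def fsdp_exposed_comm_ms(layer_compute_ms, ag_fwd_ms, ag_rc_ms, rs_ms):
    if layer_compute_ms is None:
        # Legacy path: constant 0.93 overlap when compute time is unknown.
        return (ag_fwd_ms + ag_rc_ms + rs_ms) * (1 - 0.93)
    fwd_ms = layer_compute_ms * 0.37  # empirical forward share
    bwd_ms = layer_compute_ms * 0.63  # empirical backward share
    exposed = ag_fwd_ms * (1 - _overlap_fraction(fwd_ms, ag_fwd_ms, 0.95))
    exposed += ag_rc_ms * (1 - _overlap_fraction(bwd_ms, ag_rc_ms, 0.93))
    exposed += rs_ms * (1 - _overlap_fraction(bwd_ms, rs_ms, 0.92))
    return exposed
```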
When benchmark EP is reduced below original EP, the measured compute at
the reduced config includes artifacts (routing differences, A2A
contention on attention) that misrepresent the target config. This adds
an auto-hybrid path: rank 0 spawns a bg=1 subprocess to capture clean
per-layer compute, then merges it with the measured bg=N A2A at
projection time. Exposes hidden CLI flags (--save-profiling,
--compute-baseline, --profile-only) used by the subprocess handoff and
also available for manual workflows.

Made-with: Cursor
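A hypothetical shape for the merge step: per-layer compute comes from the clean bg=1 subprocess run, A2A from the measured bg=N run. The field names below are invented for illustration:

```python
def merge_hybrid_profiles(baseline_bg1, measured_bgN):
    merged = dict(measured_bgN)
    # Trust the contention-free run for compute, the real run for A2A.
    merged["per_layer_compute_ms"] = baseline_bg1["per_layer_compute_ms"]
    merged["a2a_dispatch_ms"] = measured_bgN["a2a_dispatch_ms"]
    merged["a2a_combine_ms"] = measured_bgN["a2a_combine_ms"]
    return merged
```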
…nable for profiling

- Add analytical LossProfiler for cross-entropy loss timing (fused vs unfused)
  and integrate it into language model profiling output layer stats
- Add --sync-free-stage CLI arg and a tiered DeepEP/SyncFree overlap efficiency
  model (stage 0=65%, 1=75%, 2=80%, 3=85%) replacing the hardcoded 0.65 (see the
  sketch after this list)
- Auto-enable Primus Turbo kernels (PrimusMLASelfAttention,
  TurboColumnParallelLinear, TurboGroupedMLP) during profiling when MLA or
  DeepEP is configured, matching the measured training stack
- Fix _limit_layers_for_projection to profile both dense and MoE layer types
  (2 layers minimum) for correct extrapolation in mixed-layout models
- Add turbo_sync_free_moe_stage and cross_entropy_loss_fusion to ModelConfig
- Clean up memory projection to use load_primus_config for CLI override support
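The stage-to-efficiency tiers quoted above, as a lookup; only the numbers come from the commit message, the function name is an assumption:

```python
_SYNC_FREE_OVERLAP_EFF = {0: 0.65, 1: 0.75, 2: 0.80, 3: 0.85}

def deepep_overlap_efficiency(sync_free_stage: int) -> float:
    # Replaces the hardcoded 0.65 with one tier per --sync-free-stage value.
    return _SYNC_FREE_OVERLAP_EFF.get(sync_free_stage, 0.65)
```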
…ories covering MoE comm, pipeline layout, memory, parallelism, precision, batch config, and network tuning
…-all, resolve conflicts by taking main for backend extensions

Made-with: Cursor
Remove unused baseline_ep variable in _merge_hybrid_profiling and reformat projection.py, collective_model.py, and cli/projection.py with black to pass CI pre-commit checks.

Made-with: Cursor
@araina-amd araina-amd changed the title from "Fix projection accuracy, deepep support and add scheduler comparison" to "Projection Accuracy, communication Model core changes and more optimizations added" Apr 24, 2026
@araina-amd araina-amd changed the title from "Projection Accuracy, communication Model core changes and more optimizations added" to "Projection Accuracy, communication model core changes and more optimizations added" Apr 24, 2026
A shallow `import primus_turbo` succeeds even when `primus_turbo.pytorch`
fails to initialize (e.g. when a transitive dep like `aiter` is broken in
the runtime environment). That let us enter the HAVE_TURBO branch and crash
at module-load time when the deeper imports actually run. Probe the same
deep path the HAVE_TURBO branch uses so the gate accurately reflects
whether the deeper API is importable.
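A minimal sketch of the deep-probe gate described above — import the same module the HAVE_TURBO branch actually uses, so a broken transitive dep disables the branch instead of crashing at module-load time:

```python
try:
    import primus_turbo.pytorch  # noqa: F401  -- the deep path, not just primus_turbo
    HAVE_TURBO = True
except Exception:  # ImportError, or an init error in a transitive dep like aiter
    HAVE_TURBO = False
```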

test(cli-runner): bump dry-run timeouts from 5s to 30s

5s was flaky on slower / loaded machines; the runner is dry-run only here
so a longer ceiling has no real cost and stops spurious test timeouts.
araina-amd and others added 3 commits April 27, 2026 11:18
…ches

Follow-up to e1aa903. The te_spec_provider and general_gemm_workspace
patches both reach into the primus_turbo / primus.backends.transformer_engine
chain at patch-application time. A shallow `importlib.util.find_spec` (or no
guard at all) is not enough: if a transitive dep (e.g. `aiter`/`csrc`) is
broken in the runtime image, the patch body crashes mid-application and
leaves Megatron in a half-patched state, which can produce silent NaNs in
FP8 training rather than a clean fallback to the stock TE provider.
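A hedged sketch of the patch guard: probe the full import chain before mutating Megatron, so a broken runtime image yields a clean fallback instead of a half-patched state. The helper names `_patch_te_spec_provider` and `_patch_general_gemm_workspace` are illustrative:

```python
def apply_te_gemm_patches() -> bool:
    try:
        # Deep probe: exercises primus_turbo and its transitive deps up front.
        import primus.backends.transformer_engine  # noqa: F401
        import primus_turbo.pytorch  # noqa: F401
    except Exception:
        return False  # leave Megatron unpatched; the stock TE provider stays active
    _patch_te_spec_provider()        # hypothetical patch helpers
    _patch_general_gemm_workspace()
    return True
```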
(Inline review diff, truncated: assigns moe_module_specs.GroupedMLP = DeprecatedGroupedMLP from megatron.core.models.gpt inside a try/except ImportError guard, passes model_comm_pgs=comm_pgs, and caches _LEGACY_GROUPED_MLP_CLS = PrimusLegacyGroupedMLP.)
@wenxie-amd wenxie-amd merged commit 35f9c97 into main May 11, 2026
6 checks passed
