Projection Accuracy, communication model core changes and more optimizations added #644
Merged
wenxie-amd merged 34 commits into main on May 11, 2026
Conversation
… scheduler comparison

Performance projection fixes:
- Fix double-counting of DeepEP A2A overlap when EP is unchanged
- Correctly reconstruct sequential compute time when EP changes with DeepEP ON
- Fix VPP handling: use interleaved_1f1b when zero-bubble is requested with VPP > 1 (see the sketch after this list)

Scheduler comparison (--pipeline-schedule-algorithm):
- Thread scheduler_algorithm from the CLI through the projection engine
- Add zbv-formatted and zbv-greedy as CLI choices
- Add _print_scheduler_comparison for a multi-scheduler results table
- 'all' mode runs all applicable schedulers plus the SeaAILab ILP and picks the best

CLI fixes:
- Re-add the --pipeline-schedule-algorithm argument with full choices
- Rename megatron-ilp to seaailab-ilp
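The VPP rule reduces to a small dispatch check. A minimal sketch, assuming hypothetical names (`pick_schedule`, `requested`, and `vpp_size` are illustrative, not the projection engine's actual API):

```python
def pick_schedule(requested: str, vpp_size: int) -> str:
    """Illustrative version of the VPP fix described above."""
    # Zero-bubble schedules are modeled for VPP == 1; once VPP > 1 the
    # interleaved 1F1B schedule is the layout that should be projected.
    if requested.startswith("zb") and vpp_size > 1:
        return "interleaved_1f1b"
    return requested
```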
…ive ratio

The ratio-based A2A scaling (measured × analytical_ratio) amplified dispatch/combine overhead that does not scale with EP. Switch to an additive correction (measured + analytical_delta), which adjusts only the communication portion; a worked example follows below.

Also fix experts_per_rank handling during sub-node benchmarking:
- When benchmark EP >= 2, preserve the original num_experts for an accurate A2A routing-sparsity measurement
- Estimate the expert MLP GEMM delta analytically instead of scaling the entire MLP compute (which is dominated by non-scaling routing overhead)
- Track the per-layer A2A adjustment to avoid double-correction in the multinode projection
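A worked example with made-up numbers shows why the ratio overshoots: the fixed dispatch/combine overhead baked into the measurement gets multiplied along with the true communication time.

```python
measured_ms = 2.0           # benchmarked A2A at EP_bench: comm + fixed overhead
analytical_bench_ms = 0.8   # analytical comm portion at EP_bench
analytical_target_ms = 1.6  # analytical comm portion at the target EP

ratio_based = measured_ms * (analytical_target_ms / analytical_bench_ms)  # 4.0 ms
additive = measured_ms + (analytical_target_ms - analytical_bench_ms)     # 2.8 ms
# The ratio doubles the 1.2 ms of overhead that does not scale with EP;
# the additive delta adjusts only the 0.8 ms communication portion.
```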
…delta

Always reduce num_experts proportionally so that experts_per_rank is preserved during benchmarking, removing the expert GEMM correction logic. The measured compute is already correct for the target because experts_per_rank is preserved; A2A scaling uses only the additive delta (analytical_target - analytical_bench).
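The proportional reduction is simple arithmetic; the numbers below are illustrative, not taken from the PR:

```python
num_experts, ep_target = 256, 64  # target config: 256 / 64 = 4 experts per rank
ep_bench = 8                      # the EP that fits on the benchmark node

bench_experts = num_experts * ep_bench // ep_target  # 32 experts at EP 8
assert bench_experts // ep_bench == num_experts // ep_target  # epr stays 4
```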
Refactor the intra-node A2A overhead to use a fixed sync cost plus a small per-peer cost instead of a large per-peer linear scaling. Generalize the mesh derate in get_effective_node_bw to accept arbitrary group sizes for A2A. Add a PP-aware SendRecv that distinguishes intra-node transfers (a single xGMI link with p2p_bw_eff) from inter-node transfers (NIC). Add num_nodes, pp, and p2p_bw_eff to CollectiveArgs.
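A hedged sketch of the overhead shape this describes; the constants and the function name are assumptions, not the calibrated values:

```python
def intra_node_a2a_us(bytes_per_peer: float, num_peers: int,
                      link_bw_gb_s: float,
                      sync_us: float = 10.0,              # fixed sync cost
                      per_peer_us: float = 0.5) -> float:  # small per-peer cost
    # 1 GB/s == 1e3 bytes/us, so the transfer time in microseconds is:
    transfer_us = (num_peers * bytes_per_peer) / (link_bw_gb_s * 1e3)
    # Overhead is a fixed sync plus a small per-peer term, rather than a
    # large cost that scales linearly per peer.
    return sync_us + num_peers * per_peer_us + transfer_us
```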
…, and NIC RDMA warmup

Add hierarchical_allreduce (intra-node reduce-scatter, inter-node all-reduce, intra-node all-gather) with a size-dependent pipeline overlap model. Refactor run_allgather and run_reduce_scatter to use the overlap model for multi-node jobs. Replace the fixed hier_inter_node_overhead with a power-law NIC RDMA warmup model (nic_rdma_setup_us, nic_warmup_bytes) whose cost decays as the per-NIC data approaches peak throughput; a sketch follows below. Add an RCCL fixed overhead (rccl_overhead_us) and pipeline parameters (ar_overlap_factor, ar_warmup_chunk_bytes) to CollectiveArgs. Calibrated against MI325X preflight measurements.

Made-with: Cursor
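One plausible reading of the power-law warmup; the functional form, exponent, and default values are assumptions, and only the parameter names come from the commit:

```python
def nic_rdma_warmup_us(bytes_per_nic: float,
                       nic_rdma_setup_us: float = 25.0,
                       nic_warmup_bytes: float = 4 * 2**20,
                       exponent: float = 0.5) -> float:
    # Full setup cost for small messages; decays as a power law once the
    # per-NIC payload grows past the warmup size toward peak throughput.
    ratio = nic_warmup_bytes / max(bytes_per_nic, 1.0)
    return nic_rdma_setup_us * min(1.0, ratio ** exponent)
```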
…r-node BW

Apply a quadratic mesh contention derate for intra-node A2A (a2a_mesh_contention) that grows with link saturation; see the sketch below. Use the P2P bandwidth efficiency (p2p_bw_eff) for inter-node A2A and SendRecv instead of the collective bw_eff, since A2A uses point-to-point NIC streams. Add a per-remote-node contention derate (a2a_remote_contention) for NIC QP multiplexing overhead. Add a fixed A2A RCCL overhead (a2a_rccl_overhead_us). Refine the NIC RDMA warmup to account for inter-node AllReduce steps. Store _raw_pod_bw for P2P efficiency calculations.
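An illustrative form of the quadratic derate; the coefficient value and the exact formula are assumptions, only the a2a_mesh_contention name is from the commit:

```python
def effective_a2a_bw(link_bw_gb_s: float, saturation: float,
                     a2a_mesh_contention: float = 0.3) -> float:
    # saturation is the fraction of xGMI link utilization in [0, 1];
    # contention grows quadratically as the mesh links saturate.
    s = min(max(saturation, 0.0), 1.0)
    return link_bw_gb_s / (1.0 + a2a_mesh_contention * s * s)
```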
Replace the constant 0.93 FSDP overlap factor with a model that computes overlap from the per-layer compute/comm ratio. AG forward, AG recompute, and RS each get independent overlap fractions with calibrated ceilings (0.95 / 0.93 / 0.92). When compute_time_ms is available, per-layer compute is split using an empirical fwd/bwd ratio (0.37 / 0.63) and compared against per-layer comm; the sketch below illustrates the shape of the model. Falls back to the legacy 0.93 constant when compute time is not provided.
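A minimal sketch, assuming hypothetical helper names; the ceilings and the 0.37/0.63 split are the numbers quoted above, while which compute bucket each collective is compared against is an assumption:

```python
FWD_FRAC, BWD_FRAC = 0.37, 0.63  # empirical per-layer fwd/bwd compute split

def overlap_fraction(compute_ms: float, comm_ms: float, ceiling: float) -> float:
    # Comm can only hide behind the compute running concurrently with it,
    # up to a calibrated ceiling for that collective.
    if comm_ms <= 0.0:
        return ceiling
    return min(ceiling, compute_ms / comm_ms)

def fsdp_overlaps(compute_time_ms, ag_fwd_ms, ag_rc_ms, rs_ms):
    if compute_time_ms is None:
        return 0.93, 0.93, 0.93  # legacy constant fallback
    fwd = FWD_FRAC * compute_time_ms
    bwd = BWD_FRAC * compute_time_ms
    return (overlap_fraction(fwd, ag_fwd_ms, 0.95),  # AG forward
            overlap_fraction(bwd, ag_rc_ms, 0.93),   # AG recompute (bwd pass)
            overlap_fraction(bwd, rs_ms, 0.92))      # RS
```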
When the benchmark EP is reduced below the original EP, the compute measured at the reduced config includes artifacts (routing differences, A2A contention on attention) that misrepresent the target config. This adds an auto-hybrid path: rank 0 spawns a bg=1 subprocess to capture clean per-layer compute, then merges it with the measured bg=N A2A at projection time; a sketch of the handoff follows below. Exposes hidden CLI flags (--save-profiling, --compute-baseline, --profile-only) used by the subprocess handoff and also available for manual workflows.

Made-with: Cursor
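A hedged sketch of the rank-0 handoff. The three flags are from the commit; the entry point, whether --save-profiling takes a path, and the function name are all assumptions:

```python
import subprocess
import sys

def capture_clean_compute_baseline(rank: int, out_path: str = "baseline_bg1.json"):
    if rank != 0:
        return  # only rank 0 spawns the bg=1 subprocess
    subprocess.run(
        [sys.executable, "-m", "projection",  # hypothetical entry point
         "--profile-only",                    # profiling pass only
         "--compute-baseline",                # clean bg=1 per-layer compute
         "--save-profiling", out_path],       # results handed back via file
        check=True,
    )
    # The saved baseline is merged with the measured bg=N A2A at
    # projection time (merge step not shown).
```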
…nable for profiling

- Add an analytical LossProfiler for cross-entropy loss timing (fused vs. unfused) and integrate it into the language-model profiling output-layer stats
- Add a --sync-free-stage CLI arg and a tiered DeepEP/SyncFree overlap efficiency model (stage 0 = 65%, 1 = 75%, 2 = 80%, 3 = 85%), replacing the hardcoded 0.65; see the lookup sketch after this list
- Auto-enable Primus Turbo kernels (PrimusMLASelfAttention, TurboColumnParallelLinear, TurboGroupedMLP) during profiling when MLA or DeepEP is configured, matching the measured training stack
- Fix _limit_layers_for_projection to profile both dense and MoE layer types (2 layers minimum) for correct extrapolation in mixed-layout models
- Add turbo_sync_free_moe_stage and cross_entropy_loss_fusion to ModelConfig
- Clean up the memory projection to use load_primus_config for CLI override support
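The tiered efficiencies reduce to a lookup; the table values are from the commit, the surrounding names are illustrative:

```python
SYNC_FREE_OVERLAP_EFF = {0: 0.65, 1: 0.75, 2: 0.80, 3: 0.85}

def deepep_overlap_efficiency(sync_free_stage: int) -> float:
    # Unknown stages fall back to the old hardcoded 0.65.
    return SYNC_FREE_OVERLAP_EFF.get(sync_free_stage, 0.65)
```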
…ories covering MoE comm, pipeline layout, memory, parallelism, precision, batch config, and network tuning
… dispatcher assert
Made-with: Cursor
…-all, resolve conflicts by taking main for backend extensions

Made-with: Cursor
Remove the unused baseline_ep variable in _merge_hybrid_profiling, and reformat projection.py, collective_model.py, and cli/projection.py with black to pass the CI pre-commit checks.

Made-with: Cursor
Made-with: Cursor
A shallow `import primus_turbo` succeeds even when `primus_turbo.pytorch` fails to initialize (e.g. when a transitive dependency like `aiter` is broken in the runtime environment). That let us enter the HAVE_TURBO branch and crash at module-load time when the deeper imports actually ran. Probe the same deep path the HAVE_TURBO branch uses, so the gate accurately reflects whether the deeper API is importable; a sketch follows below.

test(cli-runner): bump dry-run timeouts from 5s to 30s

5s was flaky on slower or loaded machines; the runner is dry-run only here, so the longer ceiling has no real cost and stops spurious test timeouts.
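A minimal sketch of the deep probe; the exact module probed is inferred from the description above rather than taken from the diff:

```python
# Probe the same deep path the HAVE_TURBO branch imports, so the gate
# reflects whether the API we actually use is importable.
try:
    import primus_turbo.pytorch  # noqa: F401  -- fails if e.g. aiter is broken
    HAVE_TURBO = True
except Exception:  # ImportError or any init-time failure in a transitive dep
    HAVE_TURBO = False
```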
…ches

Follow-up to e1aa903. The te_spec_provider and general_gemm_workspace patches both reach into the primus_turbo / primus.backends.transformer_engine chain at patch-application time. A shallow `importlib.util.find_spec` (or no guard at all) is not enough: if a transitive dependency (e.g. `aiter`/`csrc`) is broken in the runtime image, the patch body crashes mid-application and leaves Megatron in a half-patched state, which can produce silent NaNs in FP8 training rather than a clean fallback to the stock TE provider. Guard with a deep import probe instead, as sketched below.
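A sketch of how such a guard could look, under the same deep-probe idea as the HAVE_TURBO fix; the helper name and the probed modules are assumptions:

```python
def _turbo_chain_importable() -> bool:
    # Import the full chain up front; find_spec alone would miss broken
    # transitive dependencies that only fail at import time.
    try:
        import primus_turbo.pytorch                 # noqa: F401
        import primus.backends.transformer_engine   # noqa: F401
        return True
    except Exception:
        return False

def apply_te_spec_provider_patch():
    if not _turbo_chain_importable():
        return  # clean fallback to the stock TE provider, never half-patched
    ...         # patch body runs only when the whole chain is importable
```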
```python
try:
    from megatron.core.models.gpt import moe_module_specs

    moe_module_specs.GroupedMLP = DeprecatedGroupedMLP
except ImportError:
    ...
```
wenxie-amd approved these changes on May 11, 2026
Projection Accuracy, communication model core changes, and more optimizations added. A TE GEMM patch is also included.