
TP-invariant Training: bitwise-identical training across TP degrees and GPU Architecture#4740

Draft
jinzex wants to merge 2 commits into NVIDIA:dev from jinzex:jinzex/tp-invariant-numerics-upstream

Conversation


@jinzex jinzex commented May 11, 2026

What does this PR do?

This PR enables bitwise-identical forward, backward, and end-to-end training on Megatron-Core TransformerBlocks across TP=1/2/4/8, gated by NVTE_TP_INVARIANT_MODE=1 (default off).
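
The gate is a plain environment variable, so downstream code can branch on it without new config plumbing. A minimal sketch of the opt-in pattern (the helper name is hypothetical, not part of this PR):

```python
import os

def tp_invariant_mode_enabled() -> bool:
    # Hypothetical helper: stock numerics unless NVTE_TP_INVARIANT_MODE=1.
    return os.environ.get("NVTE_TP_INVARIANT_MODE", "0") == "1"
```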

The fixes span TP-invariant GEMM (forward and backward, column- and row-parallel), gated deinterleave, cross-entropy all-gather, output-projection all-gather, float64 + pow2 gradient clipping, RMSNorm dgamma rank-0 reduction, and batch-invariant Triton kernels.
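
The "float64 + pow2 gradient clipping" item is worth unpacking: different TP degrees can produce gradient norms that differ by 1 ULP, which would otherwise yield different clip coefficients. One way to absorb that jitter, sketched here as an assumption about the approach rather than the PR's exact code, is to round the norm down to the nearest power of two:

```python
import math

def pow2_round_down(x: float) -> float:
    # Illustrative sketch (not necessarily the PR's exact code): snap a
    # positive float64 norm to the largest power of two <= x. A 1-ULP
    # cross-TP difference in x cannot change the result unless x sits
    # exactly on a power-of-two boundary, so the resulting clip
    # coefficient is identical across TP degrees.
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return math.ldexp(0.5, exponent)    # == 2**(exponent - 1)
```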

Companion TE PR: NVIDIA/TransformerEngine#2977.
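
Of the items listed above, the RMSNorm dgamma fix is the least self-explanatory. The generic recipe for making such a reduction order-invariant is to replace the all-reduce (whose accumulation order depends on the collective's ring/tree topology) with an all-gather followed by summation in a fixed rank order. The sketch below shows that generic recipe, not necessarily this PR's exact mechanism:

```python
import torch
import torch.distributed as dist

def fixed_order_reduce(partial: torch.Tensor, group=None) -> torch.Tensor:
    # Gather every rank's partial gradient, then accumulate left-to-right,
    # so the floating-point summation order no longer depends on how the
    # collective library schedules an all-reduce.
    world = dist.get_world_size(group)
    parts = [torch.empty_like(partial) for _ in range(world)]
    dist.all_gather(parts, partial.contiguous(), group=group)
    out = parts[0].clone()
    for p in parts[1:]:
        out += p  # same fixed order on every rank
    return out
```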

Results (loss-curve screenshots):

  • Qwen3-8B TP=4 vs TP=8 (bitwise identical)
  • Qwen3-8B TP-invariant vs baseline (TP=4)
  • Qwen3-8B B300 vs H100 (bitwise identical)

Contribution process

Tests

  • pytest: TP=1≡2≡4 bitwise (test_tp_invariant.py).
  • Qwen3-0.6B, TP=1, 100 iterations: bitwise-identical on B300 and H100.
  • Qwen3-8B, TP=4, 100 iterations: bitwise-identical on B300 and H100.
  • MoE toy model, 10-iteration smoke test.

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge conflicts are resolved and CI is passing.
Final Review may be declined if these requirements are not met.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

jinzex and others added 2 commits May 11, 2026 12:41
When set, NullTokenizer treats --vocab-size N as the total vocab
including eod, so eod_id=N-1 and tokenizer.vocab_size=N. Default
behavior is unchanged (eod_id=N, tokenizer.vocab_size=N+1). Matches
Megatron-Bridge's NullTokenizer convention for mock-data benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
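
The convention switch in this first commit amounts to a one-branch change. A sketch of the two behaviors, with a hypothetical `eod_in_vocab` flag standing in for whatever option the commit actually adds:

```python
class NullTokenizerSketch:
    """Illustration of the two eod conventions; the flag name is hypothetical."""

    def __init__(self, vocab_size: int, eod_in_vocab: bool = False):
        if eod_in_vocab:
            # New opt-in convention: --vocab-size N already includes eod.
            self.eod_id = vocab_size - 1
            self.vocab_size = vocab_size
        else:
            # Default, unchanged: eod is appended after the N real tokens.
            self.eod_id = vocab_size
            self.vocab_size = vocab_size + 1
```
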
Bitwise-identical forward, backward, and end-to-end training across TP=1,
2, 4, 8 on Megatron-Core TransformerBlocks. Gated by NVTE_TP_INVARIANT_MODE
environment variable.

Components:
  - TP-invariant GEMM (all-gather sharded weight, full-K GEMM) for both
    column- and row-parallel linear in the TE patches under
    examples/tp-numerics/patches/ (see the sketch after this list).
  - Gradient clipping pow2-rounding to absorb 1-ULP cross-TP norm jitter.
  - RMSNorm dgamma all-gather with rank-0-only reduction.
  - Batch-invariant Triton kernels (BIK) for M-invariant matmul.
  - Cross-entropy all-gather over exp_logits.
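
The first component is the heart of the change: split-K GEMMs accumulate partial sums in a TP-dependent order, so their rounding differs across TP degrees. Under the invariant mode the sharded operands are reassembled and the same full-K GEMM that TP=1 would run is executed everywhere. A minimal sketch of that idea for a row-parallel forward, assuming plain torch.distributed collectives rather than the patches' actual plumbing:

```python
import torch
import torch.distributed as dist

def row_parallel_forward_tp_invariant(x_shard, w_shard, tp_group):
    # Stock row-parallel: y = all_reduce(x_shard @ w_shard.T), where the K
    # (reduction) dimension is split across ranks, so the accumulation
    # order, and hence the rounding, changes with the TP degree.
    # Invariant sketch: gather the full-K operands and run the exact GEMM
    # a TP=1 rank would run.
    tp = dist.get_world_size(tp_group)
    x_parts = [torch.empty_like(x_shard) for _ in range(tp)]
    w_parts = [torch.empty_like(w_shard) for _ in range(tp)]
    dist.all_gather(x_parts, x_shard.contiguous(), group=tp_group)
    dist.all_gather(w_parts, w_shard.contiguous(), group=tp_group)
    x_full = torch.cat(x_parts, dim=-1)  # [..., K] activations
    w_full = torch.cat(w_parts, dim=1)   # [N, K] weight
    return x_full @ w_full.t()           # one full-K GEMM, TP-independent
```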

Validation:
  - tests/unit_tests/transformer/test_tp_invariant.py — TP=1≡2≡4 bitwise
    on a small TransformerBlock (fp32+bf16).
  - examples/tp-numerics/submit_qwen3_{0.6b,8b,moe_toy}_tp_invariant.sh —
    end-to-end raw-MLM training scripts; Qwen3-0.6B (TP=1) and Qwen3-8B
    (TP=4) bitwise across 100 iters and across B300 ≡ H100 hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinzex jinzex requested review from a team as code owners May 11, 2026 21:48

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jinzex jinzex marked this pull request as draft May 11, 2026 21:48
jinzex added a commit to jinzex/TransformerEngine that referenced this pull request May 13, 2026
Gated on NVTE_TP_INVARIANT_MODE=1 (default off; stock paths unchanged).

- module/linear.py: row-parallel FWD + BWD full GEMM matching TP=1
  K-dim accumulation.
- module/layernorm_linear.py: column-parallel BWD dgrad full GEMM with
  gated deinterleave for SwiGLU FC1 (partition_stride > 1; sketched below).
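
The "gated deinterleave" matters because a SwiGLU FC1 weight stacks gate and up projections, so each TP rank holds a [gate_i; up_i] pair and a naive all-gather does not reproduce the TP=1 layout. A plausible sketch of the reordering, assuming that interleaving (the real patch may differ in details):

```python
import torch

def deinterleave_swiglu_fc1(gathered: torch.Tensor, tp: int) -> torch.Tensor:
    # gathered: [tp * 2 * d_shard, K], the concatenation of per-rank
    # [gate_i; up_i] shards. Reorder into the TP=1 layout
    # [gate_0 .. gate_{tp-1}; up_0 .. up_{tp-1}].
    chunks = list(gathered.chunk(2 * tp, dim=0))  # [g0, u0, g1, u1, ...]
    gates = chunks[0::2]
    ups = chunks[1::2]
    return torch.cat(gates + ups, dim=0)
```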

Companion Megatron-LM PR (gates this code path via env var):
NVIDIA/Megatron-LM#4740.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>