
TP-invariant Training: bitwise-identical training across TP degrees and GPU Architecture#4740

Draft
jinzex wants to merge 2 commits into NVIDIA:dev from jinzex:jinzex/tp-invariant-numerics-upstream

Conversation


@jinzex jinzex commented May 11, 2026

What does this PR do?

This PR enables bitwise-identical forward, backward, and end-to-end training on Megatron-Core TransformerBlocks across TP=1/2/4/8, gated by NVTE_TP_INVARIANT_MODE=1 (default off).
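
The gate is a plain environment variable, so downstream code can branch on it without new config plumbing. A minimal sketch of the opt-in pattern (the helper name is hypothetical, not part of this PR):

```python
import os

def tp_invariant_mode_enabled() -> bool:
    # Hypothetical helper: stock numerics unless NVTE_TP_INVARIANT_MODE=1.
    return os.environ.get("NVTE_TP_INVARIANT_MODE", "0") == "1"
```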

The fixes span TP-invariant GEMM (forward and backward, column- and row-parallel), gated deinterleave, cross-entropy all-gather, output-projection all-gather, float64 + pow2 gradient clipping, RMSNorm dgamma rank-0 reduction, and batch-invariant Triton kernels.
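
The "float64 + pow2 gradient clipping" item is worth unpacking: different TP degrees can produce gradient norms that differ by 1 ULP, which would otherwise yield different clip coefficients. One way to absorb that jitter, sketched here as an assumption about the approach rather than the PR's exact code, is to round the norm down to the nearest power of two:

```python
import math

def pow2_round_down(x: float) -> float:
    # Illustrative sketch (not necessarily the PR's exact code): snap a
    # positive float64 norm to the largest power of two <= x. A 1-ULP
    # cross-TP difference in x cannot change the result unless x sits
    # exactly on a power-of-two boundary, so the resulting clip
    # coefficient is identical across TP degrees.
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return math.ldexp(0.5, exponent)    # == 2**(exponent - 1)
```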

Companion TE PR: NVIDIA/TransformerEngine#2977.
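
Of the items listed above, the RMSNorm dgamma fix is the least self-explanatory. The generic recipe for making such a reduction order-invariant is to replace the all-reduce (whose accumulation order depends on the collective's ring/tree topology) with an all-gather followed by summation in a fixed rank order. The sketch below shows that generic recipe, not necessarily this PR's exact mechanism:

```python
import torch
import torch.distributed as dist

def fixed_order_reduce(partial: torch.Tensor, group=None) -> torch.Tensor:
    # Gather every rank's partial gradient, then accumulate left-to-right,
    # so the floating-point summation order no longer depends on how the
    # collective library schedules an all-reduce.
    world = dist.get_world_size(group)
    parts = [torch.empty_like(partial) for _ in range(world)]
    dist.all_gather(parts, partial.contiguous(), group=group)
    out = parts[0].clone()
    for p in parts[1:]:
        out += p  # same fixed order on every rank
    return out
```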

Results (loss-curve screenshots):

  • Qwen3-8B TP=4 vs TP=8 (bitwise identical)
  • Qwen3-8B TP-invariant vs baseline (TP=4)
  • Qwen3-8B B300 vs H100 (bitwise identical)

Contribution process

Tests

  • pytest: TP=1≡2≡4 bitwise (test_tp_invariant.py).
  • Qwen3-0.6B, TP=1, 100 iterations: bitwise-identical on B300 and H100.
  • Qwen3-8B, TP=4, 100 iterations: bitwise-identical on B300 and H100.
  • MoE toy model, 10-iteration smoke test.

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge conflicts are resolved and CI is passing.
Final Review may be declined if these requirements are not met.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

jinzex and others added 2 commits May 11, 2026 12:41
When set, NullTokenizer treats --vocab-size N as the total vocab
including eod, so eod_id=N-1 and tokenizer.vocab_size=N. Default
behavior is unchanged (eod_id=N, tokenizer.vocab_size=N+1). Matches
Megatron-Bridge's NullTokenizer convention for mock-data benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
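
The convention switch in this first commit amounts to a one-branch change. A sketch of the two behaviors, with a hypothetical `eod_in_vocab` flag standing in for whatever option the commit actually adds:

```python
class NullTokenizerSketch:
    """Illustration of the two eod conventions; the flag name is hypothetical."""

    def __init__(self, vocab_size: int, eod_in_vocab: bool = False):
        if eod_in_vocab:
            # New opt-in convention: --vocab-size N already includes eod.
            self.eod_id = vocab_size - 1
            self.vocab_size = vocab_size
        else:
            # Default, unchanged: eod is appended after the N real tokens.
            self.eod_id = vocab_size
            self.vocab_size = vocab_size + 1
```
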
Bitwise-identical forward, backward, and end-to-end training across TP=1,
2, 4, 8 on Megatron-Core TransformerBlocks. Gated by NVTE_TP_INVARIANT_MODE
environment variable.

Components:
  - TP-invariant GEMM (all-gather sharded weight, full-K GEMM) for both
    column- and row-parallel linear in the TE patches under
    examples/tp-numerics/patches/ (see the sketch after this list).
  - Gradient clipping pow2-rounding to absorb 1-ULP cross-TP norm jitter.
  - RMSNorm dgamma all-gather with rank-0-only reduction.
  - Batch-invariant Triton kernels (BIK) for M-invariant matmul.
  - Cross-entropy all-gather over exp_logits.
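
The first component is the heart of the change: split-K GEMMs accumulate partial sums in a TP-dependent order, so their rounding differs across TP degrees. Under the invariant mode the sharded operands are reassembled and the same full-K GEMM that TP=1 would run is executed everywhere. A minimal sketch of that idea for a row-parallel forward, assuming plain torch.distributed collectives rather than the patches' actual plumbing:

```python
import torch
import torch.distributed as dist

def row_parallel_forward_tp_invariant(x_shard, w_shard, tp_group):
    # Stock row-parallel: y = all_reduce(x_shard @ w_shard.T), where the K
    # (reduction) dimension is split across ranks, so the accumulation
    # order, and hence the rounding, changes with the TP degree.
    # Invariant sketch: gather the full-K operands and run the exact GEMM
    # a TP=1 rank would run.
    tp = dist.get_world_size(tp_group)
    x_parts = [torch.empty_like(x_shard) for _ in range(tp)]
    w_parts = [torch.empty_like(w_shard) for _ in range(tp)]
    dist.all_gather(x_parts, x_shard.contiguous(), group=tp_group)
    dist.all_gather(w_parts, w_shard.contiguous(), group=tp_group)
    x_full = torch.cat(x_parts, dim=-1)  # [..., K] activations
    w_full = torch.cat(w_parts, dim=1)   # [N, K] weight
    return x_full @ w_full.t()           # one full-K GEMM, TP-independent
```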

Validation:
  - tests/unit_tests/transformer/test_tp_invariant.py — TP=1≡2≡4 bitwise
    on a small TransformerBlock (fp32+bf16).
  - examples/tp-numerics/submit_qwen3_{0.6b,8b,moe_toy}_tp_invariant.sh —
    end-to-end raw-MLM training scripts; Qwen3-0.6B (TP=1) and Qwen3-8B
    (TP=4) bitwise across 100 iters and across B300 ≡ H100 hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinzex jinzex requested review from a team as code owners May 11, 2026 21:48

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jinzex jinzex marked this pull request as draft May 11, 2026 21:48
jinzex added a commit to jinzex/TransformerEngine that referenced this pull request May 13, 2026
Gated on NVTE_TP_INVARIANT_MODE=1 (default off; stock paths unchanged).

- module/linear.py: row-parallel FWD + BWD full GEMM matching TP=1
  K-dim accumulation.
- module/layernorm_linear.py: column-parallel BWD dgrad full GEMM with
  gated deinterleave for SwiGLU FC1 (partition_stride > 1; sketched below).
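
The "gated deinterleave" matters because a SwiGLU FC1 weight stacks gate and up projections, so each TP rank holds a [gate_i; up_i] pair and a naive all-gather does not reproduce the TP=1 layout. A plausible sketch of the reordering, assuming that interleaving (the real patch may differ in details):

```python
import torch

def deinterleave_swiglu_fc1(gathered: torch.Tensor, tp: int) -> torch.Tensor:
    # gathered: [tp * 2 * d_shard, K], the concatenation of per-rank
    # [gate_i; up_i] shards. Reorder into the TP=1 layout
    # [gate_0 .. gate_{tp-1}; up_0 .. up_{tp-1}].
    chunks = list(gathered.chunk(2 * tp, dim=0))  # [g0, u0, g1, u1, ...]
    gates = chunks[0::2]
    ups = chunks[1::2]
    return torch.cat(gates + ups, dim=0)
```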

Companion Megatron-LM PR (gates this code path via env var):
NVIDIA/Megatron-LM#4740.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>