
feat: Optimize memory footprint of long-context training via fused kernel and chunking#4312

Open
terminator123 wants to merge 1 commit into NVIDIA:main from 021ai:chunk_fused_cross_entropy

Conversation

@terminator123

What does this PR do?

Introduces a fused CrossEntropy kernel and output chunking strategy to reduce the peak memory consumption of logits during long-context training.

Technical Details

This PR addresses the high VRAM usage bottleneck in large-scale training by targeting the logits tensor memory footprint.

  • Fused Kernel: Utilizes the Liger-Kernel's fused CrossEntropy implementation to reduce intermediate memory overhead.

  • Output Chunking: Implements an output chunking mechanism where the model's output is processed in blocks.

  • Memory-Specific Optimization: Peak logits memory is reduced by a factor proportional to the number of chunks (roughly 1/N for N chunks): the more chunks the output is divided into, the lower the peak memory.
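To illustrate the chunking idea described above, here is a minimal numpy sketch (hypothetical, not the PR's actual kernel): the output projection and cross-entropy are computed one block of the sequence at a time, so only a (chunk, vocab) slice of the logits ever exists in memory instead of the full (seq_len, vocab) tensor.

```python
import numpy as np

def chunked_cross_entropy(hidden, weight, targets, num_chunks=4):
    """Mean cross-entropy over a sequence, computed block by block so the
    full (seq_len, vocab) logits tensor is never materialized.

    hidden:  (seq_len, d_model) final hidden states
    weight:  (d_model, vocab) output projection matrix
    targets: (seq_len,) integer class labels
    """
    seq_len = hidden.shape[0]
    total = 0.0
    for chunk in np.array_split(np.arange(seq_len), num_chunks):
        # Only a (len(chunk), vocab) logits block lives at a time, so peak
        # logits memory shrinks by roughly 1/num_chunks.
        logits = hidden[chunk] @ weight
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_z = np.log(np.exp(logits).sum(axis=1))   # log partition per token
        total += (log_z - logits[np.arange(len(chunk)), targets[chunk]]).sum()
    return total / seq_len
```

Because the per-chunk losses are summed and divided by the full sequence length at the end, the result is identical to the unchunked computation; only the peak memory changes.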

@terminator123 terminator123 requested review from a team as code owners April 15, 2026 07:12
@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft April 15, 2026 07:12
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@terminator123 terminator123 marked this pull request as ready for review April 16, 2026 06:55
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 16, 2026 06:55
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 17, 2026
@Phlip79
Member

Phlip79 commented Apr 17, 2026

We are in the process of developing this same feature: #2206. @Jianbing-D can you please take a look at this?

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 17, 2026
@Jianbing-D

Jianbing-D commented Apr 20, 2026

Hi @terminator123,

We already have a similar feature merged into the dev branch: #2256.

And regarding your PR, here are some questions:

  1. Are there any measurement numbers for your feature, such as forward- and backward-pass latency as well as memory usage? Like what we did here: [Dev] Feature: linear cross entropy fusion #2256
  2. Your kernels are written with OAI Triton, but that library fails to achieve good performance on Blackwell GPUs. If you could provide any perf numbers, that would help us determine whether your kernels are good enough.
  3. Your feature does not seem to support reduction=none (please correct me if I have misread it). Without reduction=none, how would users handle token masking and padding, where invalid tokens must have zero grad while valid tokens keep their real grad values?
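For context on the reduction=none concern above: with a per-token loss vector, masking is typically handled as in this hypothetical numpy sketch (not code from the PR) — padded tokens are zeroed out before averaging over valid tokens only, so they contribute nothing to the gradient.

```python
import numpy as np

def masked_token_loss(per_token_loss, valid_mask):
    """Average a reduction='none' loss over valid tokens only.

    per_token_loss: (seq_len,) loss per token
    valid_mask:     (seq_len,) 1.0 for real tokens, 0.0 for padding
    """
    masked = per_token_loss * valid_mask          # padding contributes zero
    n_valid = np.maximum(valid_mask.sum(), 1.0)   # guard against all-padding
    return masked.sum() / n_valid
```

A fused kernel that always reduces internally cannot expose this per-token hook, which is why the reviewer raises the question.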

@chtruong814 chtruong814 added the waiting-on-customer Waiting on the original author to respond label Apr 20, 2026

Labels

community-request waiting-on-customer Waiting on the original author to respond


5 participants