
[Common, PyTorch] Improve mHC to match DeepSeek's implementation #2978

Draft
kainzhong wants to merge 12 commits into NVIDIA:main from kainzhong:feat/mhc_optimization1

Conversation

@kainzhong
Collaborator

Description

Some enhancements to mHC to better align it with DeepSeek's TileLang implementation: https://github.com/deepseek-ai/TileKernels/tree/main/tile_kernels/mhc

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add an mhc_generate_mix_and_aggregate API that performs projection, scaling, Sinkhorn normalization, and aggregation in a single call
  • Allow mhc_fused_projection to accept arguments of mixed dtypes (x.dtype=bf16, phi.dtype=fp32), matching DeepSeek's implementation
  • mhc_fused_projection now outputs fp32 regardless of the input dtype, matching DeepSeek's implementation
  • Add a fuse_grad_x_acc optimization (defaults to False) that reuses the same grad_x buffer to accumulate the gradient of the initial mHC input x across mhc_fused_expand_combine, mhc_fused_aggregate, and mhc_fused_projection
  • Support norm_weight for mhc_fused_projection, equivalent to applying RMSNorm in the unfused path with elementwise_affine=True, i.e. with learnable per-element affine parameters (a plain-PyTorch reference is sketched after this list)
  • Refactor some kernel code to avoid duplication: a @triton.jit helper passed as a tl.constexpr argument can be used like a macro inside if branches, because Triton does not compile a branch it can prove at compile time will never be taken (illustrated in the Triton sketch below)
  • Fix a bug where the launch grid could exceed CUDA's limit when M is very large and the autotune candidate uses BLOCK_SIZE_M=1; such invalid configs are now pruned (see the pruning sketch below)
  • Improve the projection op by using TMA on Hopper and newer GPUs
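
For the mixed-dtype and norm_weight items above, here is a minimal unfused PyTorch reference of the behavior being described. The helper name, shapes, and eps value are assumptions for illustration, not taken from the PR's fused kernels.

```python
import torch

def rmsnorm_then_project(x: torch.Tensor, norm_weight: torch.Tensor,
                         phi: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Unfused reference: RMSNorm with a learnable per-element weight, then projection.

    Assumed shapes: x [M, K] in bf16, norm_weight [K] in fp32, phi [K, N] in fp32.
    """
    x32 = x.float()  # compute in fp32, matching the fp32 output of the fused op
    rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    normed = x32 * rms * norm_weight  # norm_weight plays the role of RMSNorm's elementwise_affine weight
    return normed @ phi               # fp32 result regardless of x's input dtype
```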
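
The constexpr refactoring relies on Triton resolving branches on tl.constexpr values at compile time and never lowering the dead branch. The sketch below is not the PR's kernel; the helper, flag, and kernel names are made up to show the pattern.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _mul_weight(x, w):
    # Hypothetical @triton.jit helper used like a macro inside the kernel below.
    return x * w


@triton.jit
def _copy_kernel(x_ptr, w_ptr, out_ptr, n_elements,
                 APPLY_WEIGHT: tl.constexpr,  # compile-time flag
                 WEIGHT_FN: tl.constexpr,     # jit'ed helper passed as a constexpr argument
                 BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    if APPLY_WEIGHT:
        # Compiled only when APPLY_WEIGHT is True; when it is False, Triton never
        # lowers this branch, so one kernel body covers both variants.
        w = tl.load(w_ptr + offs, mask=mask)
        x = WEIGHT_FN(x, w)
    tl.store(out_ptr + offs, x, mask=mask)


def copy_maybe_weighted(x, w=None):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    _copy_kernel[grid](x, w if w is not None else x, out, n,
                       APPLY_WEIGHT=w is not None, WEIGHT_FN=_mul_weight,
                       BLOCK_SIZE=1024)
    return out
```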
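
For the grid-limit fix, the general idea is to drop autotune candidates whose program count along a grid dimension would exceed CUDA's per-dimension limits (2**31 - 1 for x, 65535 for y and z). The helper below is an assumption about how such pruning can look, not the PR's actual code.

```python
import triton

# CUDA launch-grid limits per dimension.
MAX_GRID_X = 2**31 - 1
MAX_GRID_YZ = 65535

def prune_grid_overflow(configs, M, grid_limit=MAX_GRID_YZ):
    """Drop configs whose grid size over M exceeds the given per-dimension limit.

    With BLOCK_SIZE_M=1 and a very large M, cdiv(M, 1) == M program instances,
    which can overflow the limit of whichever grid dimension M is mapped to.
    """
    return [cfg for cfg in configs
            if triton.cdiv(M, cfg.kwargs["BLOCK_SIZE_M"]) <= grid_limit]
```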

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>