
Implement 4over6 NVFP4 recipe #2972

Draft
zianglih wants to merge 55 commits into NVIDIA:main from zianglih:4over6

Conversation

@zianglih
Contributor

@zianglih zianglih commented May 9, 2026

Description

@HumansAnd

Implement 4over6 nvfp4 from:

FlashInfer PR:

Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both original per-tensor scaling and row-scaling NVFP4 introduced by #2931 are supported.

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.
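For intuition, the per-block decision reduces to the following minimal NumPy sketch, assuming 1D blocks of 16 values, float block scales, and a lookup-table FP4 rounding (the actual kernels use hardware conversion instructions and also quantize block scales to FP8); `quantize_to_fp4` and `quantize_block_4over6` are illustrative names, not this PR's API:

```python
import numpy as np

# Simplified FP4 (E2M1) magnitude grid; real kernels use hardware conversion
# instructions rather than this lookup.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_to_fp4(block, scale):
    """Round-to-nearest onto the signed FP4 grid after dividing by `scale`."""
    mag = np.abs(block) / scale
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

def quantize_block_4over6(block):
    """Per-block map-to-4 vs map-to-6 candidate selection; ties go to map-to-6."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    cand6 = quantize_to_fp4(block, amax / 6.0)  # map-to-6: standard block scale
    cand4 = quantize_to_fp4(block, amax / 4.0)  # map-to-4: 1.5x expanded scale
    err6 = ((block - cand6) ** 2).mean()
    err4 = ((block - cand4) ** 2).mean()
    return cand4 if err4 < err6 else cand6      # strict '<': a tie picks map-to-6
```

For example, `quantize_block_4over6(np.random.randn(16))` returns the reconstruction of whichever candidate has the lower MSE.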

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all (see the scope-resolution sketch after this list), with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
  • Implements 1D & 2D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
  • Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
  • Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
  • Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.
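As referenced in the first bullet, a hedged sketch of the scope-to-boolean resolution, assuming unset means off; `resolve_use_4over6` is an illustrative name, not this PR's function:

```python
import os

def resolve_use_4over6(is_weight: bool) -> bool:
    """Illustrative scope resolution for NVTE_NVFP4_4OVER6 (unset = off)."""
    scope = os.environ.get("NVTE_NVFP4_4OVER6")  # None | weights | activations | all
    if scope is None:
        return False            # unset preserves existing behavior
    if scope == "all":
        return True
    if scope == "weights":
        return is_weight
    if scope == "activations":
        return not is_weight
    raise ValueError(f"Unsupported NVTE_NVFP4_4OVER6 scope: {scope!r}")
```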

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zianglih zianglih marked this pull request as draft May 9, 2026 03:50
@zianglih zianglih changed the title Implement 4over6 nvfp4 Implement 4over6 nvfp4 recipe May 9, 2026
@zianglih zianglih changed the title Implement 4over6 nvfp4 recipe Implement 4over6 NVFP4 recipe May 9, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 9, 2026

Greptile Summary

This PR implements the NVFP4 4over6 quantization algorithm from the FourOverSix paper, enabling per-block map-to-4 vs. map-to-6 candidate selection for 1D and 2D NVFP4 quantization. The feature is gated behind a new NVTE_NVFP4_4OVER6 environment variable (values: weights, activations, all) and the nvfp4_4over6 field on NVFP4BlockScaling.

  • New CUDA kernel path (quantize_4over6_nvfp4.cuh): Evaluates map-to-4 (1.5× expanded block scale) and map-to-6 (standard block scale) FP4 candidates for each 1×16 block, selects the lower-MSE result with ties going to map-to-6, and emits tensors using a 256-based global E4M3 scale bound instead of 448.
  • Full stack propagation: The use_4over6 flag threads through Python recipe → NVFP4Quantizer → NVFP4Tensor/NVFP4TensorStorage → C++ Tensor/QuantizationConfig → all CUDA dispatch paths, with guard checks rejecting incompatible combinations (RHT, stochastic rounding, grouped tensors).
  • Reference implementation and testing: A Python reference in quantization_ref_nvfp4.py mirrors the CUDA MSE logic, and new tests cover exact quantization, GEMM, dequant, and recipe scope resolution.

Confidence Score: 4/5

Safe to merge with awareness of the pre-existing fast-math gap on the single-tensor path; all new 4over6 code is well-guarded and the core quantization logic aligns with the reference.

The 4over6 feature is implemented thoroughly across all 41 changed files. The flag threads correctly through Python recipe, tensor metadata, C++ quantizer, and all CUDA dispatch paths. Validation checks reject incompatible combinations at multiple layers. The Python reference mirrors the CUDA MSE logic with the correct 256-denominator and 1.5x scale expansion. The only new findings are minor style-level concerns.

transformer_engine/pytorch/csrc/extensions/cast.cpp - the use_fast_math env-var read is only wired into the split-quantize helper, not into the single-tensor quantize_impl, so the fast-math variant of the 4over6 MSE kernel is unreachable for ordinary single-tensor calls.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh | New file implementing 4over6 device-side quantization helpers: scale computation, MSE accumulation via cvt_fp32_to_fp4_8x_with_mse_rn, candidate selection, and shared-memory scratch for the 2D block decision. Core logic is sound and well-templated. |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | Moves row_scaled_nvfp4 and use_4over6 from runtime booleans to compile-time template parameters; updates the dequant factor denominator from 448 to 256 for 4over6 tensors using a constexpr ternary. Change is correct and clean. |
| transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu | Adds kUse4Over6 and kUseFastMath template params to the main 1D cast-transpose kernel; always allocates a static FourOverSixScratch shared-memory struct (minimized to depth-1 when 4over6 is disabled), adding minor shared-memory overhead to non-4over6 instantiations. |
| transformer_engine/common/recipe/init.py | Adds nvfp4_4over6 field with validation in __post_init__; correctly requires disable_rht and disable_stochastic_rounding for activations/all scopes but not for weights. |
| transformer_engine/pytorch/csrc/extensions/cast.cpp | Guards grouped and RHT split-quantize against use_4over6; wires use_4over6 and use_fast_math through split_quantize_nvfp4_impl_helper. The use_fast_math env-var read is only performed here, not in the single-tensor quantize_impl path. |
| transformer_engine/pytorch/csrc/quantizer.cpp | Reads use_4over6 from the Python quantizer and propagates it to tensor wrappers and the quantization config. Validation checks for incompatible RHT/SR combinations are in place in quantize_impl. |
| transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py | Adds _quantize_blockwise_4over6_reference mirroring the CUDA MSE logic with the correct 256-denominator and 1.5x scale expansion. The use_4over6 attribute is resolved with a multi-level getattr fallback, which is slightly fragile. |
| transformer_engine/pytorch/tensor/nvfp4_tensor.py | Correctly propagates use_4over6 through __new__, copy, view/reshape autograd functions, sharding metadata, and __reduce_ex__; the nvfp4_shard_metadata tuple grows from 5 to 7 elements with matching unpack in nvfp4_unshard. |
| transformer_engine/common/recipe/nvfp4.cu | Per-tensor GEMM scale kernel now accepts per-tensor fp8_max_A/fp8_max_B instead of a hardcoded constant, correctly using 256 for 4over6 tensors and 448 for standard ones. |
| transformer_engine/pytorch/quantization.py | Correctly maps the nvfp4_4over6 scope string to per-quantizer use_4over6 booleans: all True everywhere, weights True only for weight tensors, activations True for non-weight tensors. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["NVFP4BlockScaling recipe\nnvfp4_4over6: weights/activations/all/None"] --> B["RecipeState._make\nresolves use_4over6 per tensor_type"]
    B --> C["NVFP4Quantizer with use_4over6"]
    C --> D["create_tensor / quantize_impl"]
    D --> E{use_4over6?}
    E -->|No| G["Standard NVFP4 path\n448-based global scale"]
    E -->|Yes| F{valid combo?}
    F -->|RHT or SR or grouped| ERR["NVTE_CHECK error"]
    F -->|OK| H["compute_global_encode_scaling_factor 256-based"]
    H --> I["compute_4over6_decoding_scaling_factors\nmap6 and map4 candidates"]
    I --> J["cvt_fp32_to_fp4_8x_with_mse_rn x2\naccumulate MSE for both candidates"]
    J --> K{err_map4 < err_map6?}
    K -->|Yes| L["rOut_map4 selected"]
    K -->|No - ties go to map6| M["rOut_map6 selected"]
    L --> N["NVFP4Tensor _use_4over6=True\nglobal E4M3 bound = 256"]
    M --> N
    N --> O["Dequant and GEMM scale use 256 denominator"]
```

Reviews (7). Last reviewed commit: "Drop write back lifting"

Comment thread on transformer_engine/pytorch/csrc/extensions/cast.cpp (outdated)
Comment thread on transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu (outdated)
Comment thread on transformer_engine/common/recipe/__init__.py
Comment thread on tests/pytorch/test_sanity.py (outdated)
@zianglih
Contributor Author

zianglih commented May 11, 2026

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

@ptrendx ptrendx requested a review from negvet May 11, 2026 17:12
@ptrendx ptrendx added community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. fp4 labels May 11, 2026
@zianglih
Contributor Author

Need to rebase.

@zianglih zianglih marked this pull request as draft May 11, 2026 21:17
@zianglih zianglih marked this pull request as ready for review May 11, 2026 22:36
```cpp
  * its values are populated during quantization.
  */
  kNVTERowScaledNVFP4 = 8,
  kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */
```
Collaborator

@timmoon10 timmoon10 May 11, 2026


We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

Contributor Author


4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
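A toy illustration of that decode-convention difference, assuming a hypothetical per-tensor amax (values are illustrative, not from this PR):

```python
# The decode convention determines how the stored global scale is interpreted:
amax = 1234.5                            # hypothetical per-tensor amax
scale_std    = amax / (6.0 * 448.0)      # standard NVFP4: 1 / (6 * 448) convention
scale_4over6 = amax / (6.0 * 256.0)      # 4over6:         1 / (6 * 256) convention
# Dequantization and GEMM must multiply by the matching decode scale, which is
# why use_4over6 travels with the tensor rather than staying a quantizer-only flag.
```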

```diff
 using namespace detail;
-constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
 constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
+constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;
```
Collaborator


How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

Contributor Author


From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When M_FP4 × M_FP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor's largest values, because the block's scale would need to be 448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace M_FP8 with 256 in Equation 1, since 256 is the largest E4M3 value that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.
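The quoted arithmetic can be checked directly: E4M3 has a 3-bit mantissa, so representable values in [256, 512) step by 2^(8-3) = 32. A quick sanity check, not code from this PR:

```python
# E4M3 grid near the top of the range: ..., 352, 384, 416, 448 (the E4M3 maximum).
assert 256 * 6 / 4 == 384          # exact, and 384 lies on the E4M3 grid
assert 448 * 6 / 4 == 672          # 672 > 448, so it overflows E4M3
```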

Collaborator


Not sure if there are internal or external studies about the convergence, but this is required to make it work. We need the largest value that is smaller than 448/1.5 such that both the value itself and its product with 1.5 are exactly representable in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.

Collaborator


We did find the use of 256 to calculate the second level scaling factor helped convergence vs 448, but only slightly.

It's possible that the premise of the paper's argument (preventing saturation when map-to-4 scaling effectively multiplies the block decode scale by 1.5) is sound, but that a value larger than 256 can achieve this, and that perfectly representing the block containing the global amax under both scalings is not worth the extra range loss.

Contributor Author


Let me make the 256 scaling a separate env var, disabled by default.

Contributor Author


448, 320, 288, 256 are all potential candidates for map-to-6:

  • 448: effectively disable map-to-4 option above 256, preserve range
  • 320, 288: map-to-4 uses 448, no precise 1.5x
  • 256: map-to-4 uses 384, precise 1.5x

For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch to a numeric template parameter in the C++ code instead of a boolean toggle. People can add support for other values or make it more generic (like directly parsing the env var digits) in the future.

Contributor Author


NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.
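A hedged sketch of how that proposed variable could compose with NVTE_NVFP4_4OVER6 to pick the per-tensor E4M3 bound; both the variable name and the helpers below follow the proposal in this thread and are not merged behavior:

```python
import os

def scope_matches(var: str, is_weight: bool) -> bool:
    """True when the scope env var covers this tensor type (unset = off)."""
    scope = os.environ.get(var)
    return scope == "all" or scope == ("weights" if is_weight else "activations")

def resolve_fp8_max(is_weight: bool) -> float:
    """Per-tensor E4M3 bound under the proposed NVTE_NVFP4_4OVER6_E4M3_USE_256."""
    use_4over6 = scope_matches("NVTE_NVFP4_4OVER6", is_weight)
    use_256 = use_4over6 and scope_matches("NVTE_NVFP4_4OVER6_E4M3_USE_256", is_weight)
    return 256.0 if use_256 else 448.0
```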

Comment thread on tests/pytorch/utils.py (outdated)
Comment thread on transformer_engine/common/cast/dispatch/quantize.cuh (outdated)
Collaborator


This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

Contributor Author


Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

@zianglih zianglih marked this pull request as draft May 12, 2026 02:01
@zianglih zianglih marked this pull request as ready for review May 12, 2026 06:45
@zianglih zianglih requested a review from timmoon10 May 12, 2026 06:47
@zianglih zianglih marked this pull request as draft May 12, 2026 09:03
@zianglih zianglih marked this pull request as ready for review May 12, 2026 10:10
nvfp4_4over6 : {None, 'weights', 'activations', 'all'}, default = None
Select tensors that use NVFP4 4over6. In this mode NVFP4
quantization evaluates per-block map-to-4 and map-to-6 candidates
and chooses the one with lower MSE. Ties choose map-to-6. The
Collaborator


We need both MSE (better for post-training?) and MAE (better for pre-training as per our internal studies) to be supported, with MAE as the default.
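A sketch of what a pluggable selection metric could look like on top of the per-block candidate machinery, with MAE as the suggested default; purely illustrative, not this PR's implementation:

```python
import numpy as np

def block_error(block, cand, metric="mae"):
    """Error used to pick between map-to-4 and map-to-6 candidates.
    MAE is the reviewer-suggested default for pre-training; MSE is the
    metric the PR currently implements."""
    diff = block - cand
    if metric == "mse":
        return float(np.mean(diff * diff))
    if metric == "mae":
        return float(np.mean(np.abs(diff)))
    raise ValueError(f"unknown metric: {metric!r}")
```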


@Oleg-Goncharov Oleg-Goncharov self-requested a review May 12, 2026 16:37
zianglih added 28 commits May 13, 2026 00:36
All commits are signed off by Ziang Li <ziangli@umich.edu>; one of them reverts commit 69f9ccc.
