Implement 4over6 NVFP4 recipe by zianglih · Pull Request #2972 · NVIDIA/TransformerEngine

zianglih · 2026-05-09T03:50:20Z

Description

Implement 4over6 nvfp4 from:

Paper: https://arxiv.org/abs/2512.02010
Code: https://github.com/mit-han-lab/fouroversix

FlashInfer PR:

Support 4over6 nvfp4 for quantizer and fused MoE flashinfer-ai/flashinfer#3264

Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both original per-tensor scaling and row-scaling NVFP4 introduced by #2931 are supported.

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
Implements 1D & 2D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.
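The per-block selection described in the list above can be sketched as follows. This is a simplified illustration rather than the kernel in this PR: it assumes short blocks instead of 1x16, ignores the FP8 rounding of block scales and the global tensor scale, and rounds to the nearest FP4 magnitude without round-to-nearest-even tie handling. The names `FP4_GRID`, `quantize_dequantize`, and `select_candidate` are hypothetical.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_dequantize(block, scale):
    """Round each value to the nearest FP4 magnitude under `scale`, then decode."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def select_candidate(block):
    """Return (4 or 6, dequantized block) using a summed-squared-error rule."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 6, list(block)  # nothing to quantize; keep the standard mapping
    # map-to-6 maps amax onto 6; map-to-4 maps amax onto 4 (a 1.5x larger scale).
    scales = {6: amax / 6.0, 4: amax / 4.0}
    deq = {k: quantize_dequantize(block, s) for k, s in scales.items()}
    err = {k: sum((x - y) ** 2 for x, y in zip(block, d)) for k, d in deq.items()}
    # Strict comparison: ties fall back to map-to-6.
    choice = 4 if err[4] < err[6] else 6
    return choice, deq[choice]
```

For example, `select_candidate([4.0, 3.0, 2.0, 1.0])` picks map-to-4, because every value is exactly representable with a block scale of 1.0, whereas map-to-6 decodes 3.0 as roughly 2.67.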

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-09T03:55:44Z

Greptile Summary

This PR implements the 4over6 adaptive NVFP4 quantization algorithm, where each 1x16 FP4 block independently chooses between a map-to-4 (1.5x expanded scale) and map-to-6 (standard scale) candidate based on per-block MAE or MSE. The feature is exposed through a new NVTE_NVFP4_4OVER6 env var scoped to weights, activations, or both, and threads selection metadata through C++ tensors, Python tensor storage, recipe state, and the dequantization/GEMM scaling path.

Adds quantize_4over6_nvfp4.cuh: a dedicated Blackwell (SM 10.0+) kernel with a two-stage async pipeline that produces rowwise and/or columnwise FP4 output; warp-level reductions correctly handle both 1D and 2D block quantization.
Updates dequantize_nvfp4.cuh and nvfp4.cu per-tensor-scale kernel to use a per-tensor nvfp4_e4m3_max (448 or 256) rather than a hard-coded 448, enabling correct round-trip fidelity for 4over6 tensors.
Extends NVFP4Quantizer, NVFP4TensorStorage, GroupedTensorStorage, and all serialization/copy/view paths to carry nvfp4_use_4over6 and nvfp4_e4m3_max metadata.

Confidence Score: 5/5

Safe to merge. The 4over6 path is fully guarded against unsupported combinations (RHT, stochastic rounding, grouped tensors) at every entry point, the CUDA kernel targets SM 10.0+ exclusively, and nvfp4_e4m3_max metadata is consistently propagated through all tensor copy, view, serialize, and dequantization paths.

The new kernel warp reductions, pipeline staging, boundary-handling, and error-denominator selection all check out numerically. The Python reference mirrors the CUDA path with bitwise precision when fast math is off. Metadata threading across C++/Python is thorough and validated in the updated test suite. The observations flagged are narrow edge-cases in custom module dispatch and a fallback chain in the reference GEMM scaling, neither affecting correctness of the primary training or inference paths.

transformer_engine/pytorch/quantization.py - the backward tensor-type dispatch change could affect custom modules using non-standard backward tensor_type strings.

Important Files Changed

Filename	Overview
transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh	New 668-line CUDA kernel implementing 4over6 FP4 quantization; pipeline is structurally correct, warp reductions are coherent, boundary handling is safe, and error-denominator uses the correct E4M3_MAX template parameter throughout.
transformer_engine/pytorch/quantization.py	Backward dispatch in _qparams changed from mode-based to tensor-type-based to enable 4over6 scope exclusion; safe for built-in modules due to _slot_tensor_type positional fallback, but could silently misroute non-standard backward tensor_types in custom modules.
transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh	Refactored to use E4M3_MAX as a template parameter (256 or 448) instead of hardcoded 448; ROW_SCALED_NVFP4 moved from runtime to compile-time parameter; correctly dispatches both branches.
transformer_engine/common/recipe/nvfp4.cu	Per-tensor scale kernel now receives fp8_max_A/fp8_max_B from each tensor's nvfp4_e4m3_max to compute the correct normalization factor for mixed 4over6/standard tensors.
transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py	Python reference implementation extended with 4over6 candidate selection; GEMM scaling factor correctly looks up e4m3_max from result tensor rather than hardcoding 448, though the multi-attribute getattr fallback chain is fragile for future tensor types.
transformer_engine/pytorch/csrc/quantizer.cpp	NVFP4Quantizer correctly reads nvfp4_4over6_mode and nvfp4_e4m3_max from Python and propagates them through create_tensor, create_grouped_tensor, convert_and_update_tensor, and quantize_impl.
transformer_engine/common/recipe/__init__.py	Adds nvfp4_4over6, nvfp4_4over6_e4m3_use_256, and nvfp4_4over6_err_mode fields with env-var defaults and scope-validation assertions; RHT/stochastic_rounding constraints are deferred to _make() at quantizer construction time.
transformer_engine/pytorch/tensor/nvfp4_tensor.py	nvfp4_use_4over6 and nvfp4_e4m3_max fields threaded consistently through __new__, copy paths, __reduce_ex__, _from_tensors, and view/reshape autograd functions.
transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py	nvfp4_use_4over6 and nvfp4_e4m3_max added with proper property accessors; split_into_quantized_tensors propagates the values to each individual NVFP4Tensor.
transformer_engine/common/cast/dispatch/quantize.cuh	Routing updated to dispatch to quantize_4over6 when 4over6 mode is non-disabled; guards for E4M3 max and stochastic rounding added correctly before dispatch.

Sequence Diagram

sequenceDiagram
    participant Recipe as NVFP4BlockScaling recipe
    participant State as NVFP4BlockScalingRecipeState
    participant PyQ as NVFP4Quantizer (Python)
    participant CppQ as NVFP4Quantizer (C++)
    participant Dispatch as quantize_fwd_helper
    participant Kernel4o6 as quantize_4over6_kernel
    participant KernelStd as quantize_transpose_nvfp4

    Recipe->>State: nvfp4_4over6 scope + err_mode
    State->>PyQ: nvfp4_use_4over6, nvfp4_e4m3_max, nvfp4_4over6_err_mode
    PyQ->>CppQ: quantizer.attr() cast in quantizer.cpp
    CppQ->>CppQ: set nvfp4_4over6_mode, nvfp4_e4m3_max
    CppQ->>Dispatch: quant_config with nvfp4_4over6_mode set
    Dispatch->>Dispatch: check nvfp4_use_4over6
    alt 4over6 enabled
        Dispatch->>Kernel4o6: quantize_4over6 with use_2d
        Kernel4o6->>Kernel4o6: compute map4 + map6 scales
        Kernel4o6->>Kernel4o6: quantize both candidates with error
        Kernel4o6->>Kernel4o6: select min-error candidate per block
        Kernel4o6-->>Dispatch: FP4 output + selected FP8 block scales
    else standard NVFP4
        Dispatch->>KernelStd: existing quantize kernel
        KernelStd-->>Dispatch: FP4 output + FP8 block scales
    end
    Dispatch->>Dispatch: set nvfp4_e4m3_max on output tensor

Reviews (12): Last reviewed commit: "Remove gradient 4over6 quantization and ..."

zianglih · 2026-05-11T07:16:24Z

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

zianglih · 2026-05-11T21:17:24Z

Need to rebase.

timmoon10 · 2026-05-11T23:44:32Z

   *  its values are populated during quantization.
   */
  kNVTERowScaledNVFP4 = 8,
+  kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */


We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
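A small sketch of why the decode convention makes this part of the tensor contract. It assumes the per-tensor decode factor is `amax / (6 * e4m3_max)` and that the block holding the tensor's largest value gets a block scale equal to `e4m3_max` under map-to-6. The function names are hypothetical and not part of the TE API.

```python
FP4_MAX = 6.0


def global_decode_scale(amax: float, e4m3_max: float) -> float:
    """Per-tensor decode factor under the assumed convention."""
    return amax / (FP4_MAX * e4m3_max)


def dequantize(fp4_value: float, block_scale: float, amax: float, e4m3_max: float) -> float:
    """Decode one element: FP4 value times FP8 block scale times per-tensor factor."""
    return fp4_value * block_scale * global_decode_scale(amax, e4m3_max)


amax = 3.0
# Element holding the tensor amax, quantized with the 256 convention:
# FP4 value 6 and block scale 256.
right = dequantize(6.0, 256.0, amax, e4m3_max=256.0)  # decoded with matching convention
wrong = dequantize(6.0, 256.0, amax, e4m3_max=448.0)  # decoded as regular NVFP4
```

Under these assumptions, decoding with the matching convention recovers `amax`, while decoding with 448 shrinks every value by a factor of 448/256 = 1.75, so a consumer has to know which bound was used.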

timmoon10 · 2026-05-12T00:04:07Z

  using namespace detail;
-  constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
-  constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
+  constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;


How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when
quantizing to NVFP4 with 4/6. When MFP4 ×MFP8 is used to compute the tensor scale, it ensures
that all quantized values will be less than 6 ×448. However, this makes it impossible to select a scale
of 4 for the blocks that contain a tensor’s largest values, because the block’s scale would need to be
448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by
E4M3. As a result, when computing the tensor scale, we replace MFP8 to 256 in Equation 1, since
256 is the largest E4M3 that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.

I am not sure whether there are internal or external studies on convergence, but this change is required to make 4over6 work. We need the largest value that is no greater than 448/1.5 and for which both the value itself and its product with 1.5 are exactly representable in E4M3. This avoids extra quantization noise on both the map-to-4 and map-to-6 paths.
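The constraint stated above can be checked by enumerating the finite non-negative E4M3 values. The sketch assumes the usual FP8 E4M3 layout (4 exponent bits with bias 7, 3 mantissa bits, the all-ones pattern reserved for NaN); the helper name `e4m3_values` is hypothetical.

```python
def e4m3_values():
    """Finite non-negative values of FP8 E4M3 (bias 7, max 448)."""
    vals = set()
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # exponent and mantissa all ones encode NaN
            if e == 0:
                vals.add(m / 8 * 2.0 ** -6)  # subnormals
            else:
                vals.add((1 + m / 8) * 2.0 ** (e - 7))
    return vals


vals = e4m3_values()
# Largest bound whose 1.5x (map-to-4) counterpart is also exactly representable.
best = max(v for v in vals if v <= 448 / 1.5 and 1.5 * v in vals)
```

Here `best` comes out as 256: the next E4M3 value below 448/1.5 is 288, but 1.5 * 288 = 432 falls between the representable values 416 and 448.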

We did find the use of 256 to calculate the second level scaling factor helped convergence vs 448, but only slightly.

It's possible that the premise of the paper's argument (prevent saturations when 4 scaling effectively multiplies the block decode scale by 1.5) is sound, but a value larger than 256 can achieve this and the perfect representation of the block with the global amax value with both scalings is not worth the extra range loss.

Let me make 256 scaling a separate env var, disabled by default.

448, 320, 288, 256 are all potential candidates for map-to-6:

448: effectively disable map-to-4 option above 256, preserve range

320, 288: map-to-4 uses 448, no precise 1.5x

256: map-to-4 uses 384, precise 1.5x
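To make the "precise 1.5x" point concrete, the ratio between the map-to-4 and map-to-6 block-scale bounds can be tabulated. The map-to-4 values are taken from the list above (448 for 320 and 288, 384 for 256); treating the 448 case as saturating at 448 is an assumption, and the dictionary names are hypothetical.

```python
# map-to-6 bound -> block-scale bound actually used by the map-to-4 candidate
MAP4_BOUND = {448: 448, 320: 448, 288: 448, 256: 384}

# Effective expansion of the largest block scale when map-to-4 is chosen.
effective_ratio = {m6: m4 / m6 for m6, m4 in MAP4_BOUND.items()}

for m6, ratio in effective_ratio.items():
    print(f"map-to-6 bound {m6}: map-to-4 uses {MAP4_BOUND[m6]}, ratio {ratio:.4f}")
```

Only the 256 bound gives exactly 1.5; 320 and 288 give about 1.4 and 1.56, and 448 gives 1.0, which is the sense in which it disables map-to-4 for the largest blocks.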

For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch on a numeric template parameter in the C++ code instead of a boolean toggle. Others can add support for more values, or make it more generic (for example, by parsing the digits of the env var directly), in the future.

NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.

For our RL experiments we do observe 256 leads to less mismatch vs 448.

timmoon10 · 2026-05-12T00:25:11Z

This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

negvet · 2026-05-12T15:40:29Z

  using namespace detail;
-  constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
-  constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
+  constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;


I am not sure whether there are internal or external studies on convergence, but this change is required to make 4over6 work. We need the largest value that is no greater than 448/1.5 and for which both the value itself and its product with 1.5 are exactly representable in E4M3. This avoids extra quantization noise on both the map-to-4 and map-to-6 paths.

Signed-off-by: Ziang Li <ziangli@umich.edu>

negvet · 2026-05-13T09:17:07Z

What is the e2e step time increase with 4/6 on some typical workload?

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih · 2026-05-13T09:48:40Z

Major changes from last time:

Use a standalone quantization kernel instead of folding 4over6 into the existing code. 4over6 quantization is heavily FP32-compute bound (Implement 4over6 NVFP4 recipe #2972 (comment) and Implement 4over6 NVFP4 recipe #2972 (comment)), and the latency-hiding techniques in TE's original NVFP4 quantization kernels lead to higher register pressure and worse performance. There is not much we can do about the FP32 arithmetic bottleneck without changing the heuristics. Even if we want to optimize performance or heuristics further, I think we should do it in a separate PR and expose it as new error modes. cc @Oleg-Goncharov @kwyss-nvidia
Allow both 448 and 256 configurations. Users can choose by setting NVTE_NVFP4_4OVER6_E4M3_USE_256. However, all underlying implementations encode nvfp4_e4m3_max and an E4M3_MAX template parameter instead of a boolean flag, so other values can easily be added in the future. cc @timmoon10 @kwyss-nvidia @negvet
Add an MAE error mode and make it the default. cc @negvet
For the 4over6 quantize C++ test, we no longer check the map-to-4 vs map-to-6 selection; the output is accepted if it is bitwise exact against either candidate. This avoids numerics drift caused by CPU architecture differences. The Python test still has strict candidate-selection coverage. cc @Oleg-Goncharov

zianglih · 2026-05-14T04:07:00Z

Hi @Oleg-Goncharov ,
For our RL config (see env vars below) benchmark_grouped_linear.py shows a 1.28x~1.36x slowdown.

This is usable, especially since RL has very long-context attention and other communication overheads. The rollout-side end-to-end overhead is only around 1~2%. We also observe meaningful numerics improvements in the consistency between rollout and training fprop. Since RL is usually rollout-bound and very sensitive to mismatch, 4over6 delivers meaningful improvements at an acceptable training-side performance overhead.

NVTE_NVFP4_ROW_SCALED_ACTIVATION=1 \
NVTE_BACKWARD_OVERRIDE=dequantized \
NVTE_NVFP4_DISABLE_2D_QUANTIZATION=1 \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.443936
1  32768  7168  2048  nvfp4          4                 2.489801
2  65536  7168  2048  nvfp4          4                 4.548635
3  98304  7168  2048  nvfp4          4                 6.640535
0  16384  7168  2048  nvfp4          8                 1.836268
1  32768  7168  2048  nvfp4          8                 2.837006
2  65536  7168  2048  nvfp4          8                 4.977518
3  98304  7168  2048  nvfp4          8                 6.967243

NVTE_NVFP4_4OVER6=all \
NVTE_NVFP4_ROW_SCALED_ACTIVATION=1 \
NVTE_BACKWARD_OVERRIDE=dequantized \
NVTE_NVFP4_DISABLE_2D_QUANTIZATION=1 \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.908519
1  32768  7168  2048  nvfp4          4                 3.313811
2  65536  7168  2048  nvfp4          4                 6.215076
3  98304  7168  2048  nvfp4          4                 9.027176
0  16384  7168  2048  nvfp4          8                 2.361491
1  32768  7168  2048  nvfp4          8                 3.768442
2  65536  7168  2048  nvfp4          8                 6.588285
3  98304  7168  2048  nvfp4          8                 9.480253

For the pretraining config, the performance overhead is 2.16x~2.57x, which is not usable at this time. I turned off RHT and SR for a fair comparison:

NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 0.774788
1  32768  7168  2048  nvfp4          4                 1.251587
2  65536  7168  2048  nvfp4          4                 2.249276
3  98304  7168  2048  nvfp4          4                 3.259345
0  16384  7168  2048  nvfp4          8                 0.952317
1  32768  7168  2048  nvfp4          8                 1.432820
2  65536  7168  2048  nvfp4          8                 2.436908
3  98304  7168  2048  nvfp4          8                 3.412981

NVTE_NVFP4_4OVER6=all \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.753024
1  32768  7168  2048  nvfp4          4                 3.074884
2  65536  7168  2048  nvfp4          4                 5.711913
3  98304  7168  2048  nvfp4          4                 8.387917
0  16384  7168  2048  nvfp4          8                 2.060491
1  32768  7168  2048  nvfp4          8                 3.383869
2  65536  7168  2048  nvfp4          8                 6.018331
3  98304  7168  2048  nvfp4          8                 8.670583
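The slowdown ranges quoted in this comment can be re-derived from the four tables. The timings below are copied from the benchmark output in row order; the variable and function names are hypothetical.

```python
# grouped_fwd_bwd_time_ms, in table row order (num_gemms=4 rows, then num_gemms=8 rows)
rl_baseline = [1.443936, 2.489801, 4.548635, 6.640535, 1.836268, 2.837006, 4.977518, 6.967243]
rl_4over6 = [1.908519, 3.313811, 6.215076, 9.027176, 2.361491, 3.768442, 6.588285, 9.480253]
pt_baseline = [0.774788, 1.251587, 2.249276, 3.259345, 0.952317, 1.432820, 2.436908, 3.412981]
pt_4over6 = [1.753024, 3.074884, 5.711913, 8.387917, 2.060491, 3.383869, 6.018331, 8.670583]


def slowdown_range(baseline, candidate):
    """Return (min, max) of the per-row time ratio candidate / baseline."""
    ratios = [c / b for b, c in zip(baseline, candidate)]
    return min(ratios), max(ratios)


rl_range = slowdown_range(rl_baseline, rl_4over6)
pt_range = slowdown_range(pt_baseline, pt_4over6)
print(f"RL config:          {rl_range[0]:.3f}x to {rl_range[1]:.3f}x")
print(f"Pretraining config: {pt_range[0]:.3f}x to {pt_range[1]:.3f}x")
```

The computed ranges are consistent with the 1.28x~1.36x and 2.16x~2.57x figures stated in the comment.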

Oleg-Goncharov · 2026-05-14T19:08:19Z

Hi @zianglih, from my side, this looks okay now. The reported slowdown doesn't seem like a blocker for merging, especially if the current tradeoff is acceptable for the target use case, and we can revisit performance later if needed.

timmoon10 · 2026-05-19T01:53:43Z

+  /*! Whether an NVFP4 tensor is encoded with 4over6 semantics.
+   *
+   *  This records whether block scales were selected by comparing map-to-4
+   *  and map-to-6 candidates.
+   */
+  kNVTENVFP44Over6 = 9,


We are controlling 4over6 with 5 configs:

kNVTENVFP44Over6

kNVTENVFP4E4M3Max

kNVTEQuantizationConfigNVFP44Over6

kNVTEQuantizationConfigNVFP4E4M3Max

kNVTEQuantizationConfigNVFP44Over6ErrMode

We only need 2:

kNVTENVFP4E4M3Max: tensor attr, needed for both quant and dequant

kNVTEQuantizationConfigNVFP44Over6Mode: quant config, only needed for quant

Done in 2980cb1.

timmoon10 · 2026-05-19T01:56:52Z

+/*! \enum NVTENVFP44Over6ErrMode
+ * \brief Candidate-selection error mode for NVFP4 4over6 quantization.
+ */
+enum NVTENVFP44Over6ErrMode {
+  kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */
+  kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */
+};


If we add "disabled mode", this enum makes the bool configs for 4over6 redundant.

Suggested change

-/*! \enum NVTENVFP44Over6ErrMode
- * \brief Candidate-selection error mode for NVFP4 4over6 quantization.
- */
-enum NVTENVFP44Over6ErrMode {
-  kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */
-  kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */
-};
+/*! \enum NVTENVFP44Over6Mode
+ * \brief Method for NVFP4 4over6 quantization.
+ */
+enum NVTENVFP44Over6Mode {
+  kNVTENVFP44Over6Disabled = 0, /*!< 4over6 is not applied */
+  kNVTENVFP44Over6MinMAE = 1, /*!< Select the candidate with lower mean absolute error */
+  kNVTENVFP44Over6MinMSE = 2, /*!< Select the candidate with lower mean squared error */
+};

Done in 2980cb1. Also refactored the modes in the C++ tests.
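The design point of the suggestion, one mode enum in place of a boolean plus a separate error-mode enum, can be illustrated with a Python mirror of the C enum. The class and function names here are hypothetical and are not the TE Python API; the integer values follow the suggested `NVTENVFP44Over6Mode`.

```python
from enum import IntEnum


class FourOverSixMode(IntEnum):
    """Hypothetical Python mirror of the suggested NVTENVFP44Over6Mode."""
    DISABLED = 0
    MIN_MAE = 1
    MIN_MSE = 2


def uses_4over6(mode: FourOverSixMode) -> bool:
    """The old boolean flag is now implied by the mode."""
    return mode != FourOverSixMode.DISABLED


def block_error(mode: FourOverSixMode, reference, candidate) -> float:
    """Per-block error used to compare the map-to-4 and map-to-6 candidates."""
    if mode == FourOverSixMode.DISABLED:
        raise ValueError("4over6 is disabled; no candidate comparison is needed")
    diffs = [r - c for r, c in zip(reference, candidate)]
    if mode == FourOverSixMode.MIN_MAE:
        return sum(abs(d) for d in diffs)
    return sum(d * d for d in diffs)
```

Because every block has the same length, comparing summed errors and comparing mean errors select the same candidate, so the "summed" and "mean" wording in the two enum versions describes the same decision.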

timmoon10 · 2026-05-19T02:05:39Z

/te-ci

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih · 2026-05-19T07:17:26Z

A few 4over6 ci failures:

=========================== short test summary info ============================
FAILED ../../tests/pytorch/test_fusible_ops.py::TestBasicOps::test_dropout[dtype1-shape2-fp8_current_scaling-True-0.5] - AssertionError: Number of zeros is outside 99% confidence interval (prob=0.5, prob_observed=0.488525390625)
assert 2.9375 < 2.5758
 +  where 2.9375 = abs(-2.9375)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-False-True-True-False] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 384 (0.3%)
Greatest absolute difference: 0.5703666875867625 at index (172,) (up to 0.5 allowed)
Greatest relative difference: 3.475372894576796 at index (172,) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-False-True-True-True] - AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 384 (2.1%)
Greatest absolute difference: 0.6054700590012174 at index (37,) (up to 0.5 allowed)
Greatest relative difference: 67.53061971695855 at index (36,) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-True-True-True-False] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 49152 (0.0%)
Greatest absolute difference: 0.5862411404167979 at index (38, 79) (up to 0.5 allowed)
Greatest relative difference: 10.66064707830139 at index (38, 79) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-True-True-True-True] - AssertionError: Tensor-likes are not close!

Mismatched elements: 35 / 384 (9.1%)
Greatest absolute difference: 0.6996637819152378 at index (23,) (up to 0.5 allowed)
Greatest relative difference: 688.2507391509421 at index (184,) (up to 0.25 allowed)
=== 5 failed, 3945 passed, 9607 skipped, 2966 warnings in 404.46s (0:06:44) ====
Error: sub-test failed: test_fusible_ops.py

Signed-off-by: Ziang Li <ziangli@umich.edu>

negvet · 2026-05-19T12:04:05Z

+             Select 4over6 tensors that use 256 as the global E4M3 scale
+             bound. If unset, 4over6 uses the standard NVFP4 448 bound.
+    nvfp4_4over6_err_mode : {'MAE', 'MSE'}, default = 'MAE'
+             Error metric used by NVFP4 4over6 candidate selection.


disable_rht=True + disable_stochastic_rounding=True means that 4over6 has limited use for pre-training. It is ok for this PR I think. This is not an algorithmic limitation but a kernel one. What about documenting it properly ("Currently, 4over6 implementation targets RL and post-training scenarios...") and adding TODOs (e.g., for pre-training enable RHT + 4over6 + quant fused kernel).

Added TODO and refactored docs in 7a4b5c0. Also, since gradient quantizers no longer use 4over6, SR is always allowed. RHT is still not allowed when activation uses 4over6.

negvet · 2026-05-19T12:07:02Z

+            elif self.recipe.nvfp4_4over6 == "weights":
+                nvfp4_use_4over6 = tensor_type == "weight"
+            elif self.recipe.nvfp4_4over6 == "activations":
+                nvfp4_use_4over6 = tensor_type != "weight"


This means we apply 4over6 for gradients as well (along with inputs)? Why? What drives this decision?

Thank you for pointing this out. This was an unintended implementation bug. I have removed 4over6 from all gradient quantizers in 7a4b5c0.
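A minimal sketch of scope resolution with gradients excluded, as described in the reply above. Only the "weight" tensor type and the weights|activations|all scope values appear in this PR; the "input" and "grad_output" strings and the function name are assumptions made for illustration.

```python
from typing import Optional

VALID_SCOPES = ("weights", "activations", "all")


def use_4over6(scope: Optional[str], tensor_type: str) -> bool:
    """Decide whether a quantizer for `tensor_type` should use 4over6."""
    if scope is None:
        return False  # env var unset: existing behavior is preserved
    if scope not in VALID_SCOPES:
        raise ValueError(f"unsupported 4over6 scope: {scope!r}")
    if tensor_type == "weight":
        return scope in ("weights", "all")
    if tensor_type == "input":
        return scope in ("activations", "all")
    # Anything else (for example gradients) never uses 4over6.
    return False
```

Compared with the `tensor_type != "weight"` check in the reviewed hunk, matching the activation type explicitly means gradient tensors fall through to `False` instead of being treated as activations.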

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih marked this pull request as draft May 9, 2026 03:50

zianglih changed the title ~~Implement 4over6 nvfp4~~ Implement 4over6 nvfp4 recipe May 9, 2026

zianglih mentioned this pull request May 9, 2026

[Roadmap] Blackwell MXFP8 and NVFP4 RL training radixark/miles#615

Open

30 tasks

zianglih changed the title ~~Implement 4over6 nvfp4 recipe~~ Implement 4over6 NVFP4 recipe May 9, 2026

greptile-apps Bot reviewed May 9, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated

Comment thread transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu Outdated

Comment thread transformer_engine/common/recipe/__init__.py Outdated

ziang-and force-pushed the 4over6 branch from f3f4127 to 9ff4c3a Compare May 9, 2026 08:53

zianglih commented May 9, 2026

Comment thread tests/pytorch/test_sanity.py Outdated

zianglih mentioned this pull request May 9, 2026

Support 4over6 nvfp4 for quantizer and fused MoE flashinfer-ai/flashinfer#3264

Open

5 tasks

ziang-and force-pushed the 4over6 branch from a989400 to 097c7aa Compare May 10, 2026 09:11

zianglih marked this pull request as ready for review May 10, 2026 09:36

ptrendx assigned Oleg-Goncharov May 11, 2026

ptrendx requested a review from negvet May 11, 2026 17:12

ptrendx added the community-contribution and fp4 labels May 11, 2026

zianglih marked this pull request as draft May 11, 2026 21:17

ziang-and force-pushed the 4over6 branch from 53aad5e to 1dcd003 Compare May 11, 2026 21:41

zianglih marked this pull request as ready for review May 11, 2026 22:36

timmoon10 requested changes May 12, 2026

timmoon10 reviewed May 12, 2026

zianglih marked this pull request as draft May 12, 2026 02:01

zianglih marked this pull request as ready for review May 12, 2026 06:45

zianglih requested a review from timmoon10 May 12, 2026 06:47

zianglih marked this pull request as draft May 12, 2026 09:03

ziang-and force-pushed the 4over6 branch from 4f7790a to cc2f378 Compare May 12, 2026 09:17

zianglih marked this pull request as ready for review May 12, 2026 10:10

negvet requested changes May 12, 2026

Oleg-Goncharov self-requested a review May 12, 2026 16:37

zianglih added 6 commits May 13, 2026 00:36

Initial 448 vs 256 implementation

1e311ef

Signed-off-by: Ziang Li <ziangli@umich.edu>

Use e4m3 max instead of boolean, more template

38a1c4c

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add benchmark script and minor optimization

3cdd9d9

Signed-off-by: Ziang Li <ziangli@umich.edu>

Use standalone kernels

7deba75

Signed-off-by: Ziang Li <ziangli@umich.edu>

Use cp async

93dbf2b

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add benchmark script

8819d12

Signed-off-by: Ziang Li <ziangli@umich.edu>

ziang-and force-pushed the 4over6 branch from e85cdbf to 8819d12 Compare May 13, 2026 07:38

Minor fix after rebase

24e417b

Signed-off-by: Ziang Li <ziangli@umich.edu>

ziang-and force-pushed the 4over6 branch from 38fffbb to 24e417b Compare May 13, 2026 08:54

zianglih added 2 commits May 13, 2026 02:36

Naming consistency

472e5b8

Signed-off-by: Ziang Li <ziangli@umich.edu>

Remove 4over6 benchmark

83e2308

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih marked this pull request as ready for review May 13, 2026 09:48

zianglih requested review from ksivaman and ptrendx as code owners May 13, 2026 09:48

timmoon10 reviewed May 19, 2026

Refactor modes

2980cb1

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih requested review from Oleg-Goncharov, negvet and timmoon10 May 19, 2026 05:02

zianglih added 2 commits May 19, 2026 00:28

Relax tol for test_layernorm_mlp for nvfp4_4over6

967293f

Signed-off-by: Ziang Li <ziangli@umich.edu>

Minor fix recipe naming

f555bf2

Signed-off-by: Ziang Li <ziangli@umich.edu>

negvet reviewed May 19, 2026

Remove gradient 4over6 quantization and partially allow SR/RHT

7a4b5c0

Signed-off-by: Ziang Li <ziangli@umich.edu>

ziang-and force-pushed the 4over6 branch from 20925c7 to 7a4b5c0 Compare May 19, 2026 18:42

-/*! \enum NVTENVFP44Over6ErrMode
- * \brief Candidate-selection error mode for NVFP4 4over6 quantization.
- */
-enum NVTENVFP44Over6ErrMode {
-  kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */
-  kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */
-};
+/*! \enum NVTENVFP44Over6Mode
+ * \brief Method for NVFP4 4over6 quantization.
+ */
+enum NVTENVFP44Over6Mode {
+  kNVTENVFP44Over6Disabled = 0, /*!< 4over6 is not applied */
+  kNVTENVFP44Over6MinMAE = 1, /*!< Select the candidate with lower mean absolute error */
+  kNVTENVFP44Over6MinMSE = 2, /*!< Select the candidate with lower mean squared error */
+};
