Implement 4over6 NVFP4 recipe #2972
Greptile Summary
This PR implements the 4over6 adaptive NVFP4 quantization algorithm, where each 1x16 FP4 block independently chooses between a map-to-4 (1.5x expanded scale) and map-to-6 (standard scale) candidate based on per-block MAE or MSE. The feature is exposed through a new
Confidence Score: 5/5
Safe to merge. The 4over6 path is fully guarded against unsupported combinations (RHT, stochastic rounding, grouped tensors) at every entry point, the CUDA kernel targets SM 10.0+ exclusively, and nvfp4_e4m3_max metadata is consistently propagated through all tensor copy, view, serialize, and dequantization paths.
The new kernel warp reductions, pipeline staging, boundary handling, and error-denominator selection all check out numerically. The Python reference mirrors the CUDA path with bitwise precision when fast math is off. Metadata threading across C++/Python is thorough and validated in the updated test suite. The observations flagged are narrow edge cases in custom module dispatch and a fallback chain in the reference GEMM scaling, neither affecting correctness of the primary training or inference paths.
transformer_engine/pytorch/quantization.py - the backward tensor-type dispatch change could affect custom modules using non-standard backward tensor_type strings.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Recipe as NVFP4BlockScaling recipe
participant State as NVFP4BlockScalingRecipeState
participant PyQ as NVFP4Quantizer (Python)
participant CppQ as NVFP4Quantizer (C++)
participant Dispatch as quantize_fwd_helper
participant Kernel4o6 as quantize_4over6_kernel
participant KernelStd as quantize_transpose_nvfp4
Recipe->>State: nvfp4_4over6 scope + err_mode
State->>PyQ: nvfp4_use_4over6, nvfp4_e4m3_max, nvfp4_4over6_err_mode
PyQ->>CppQ: quantizer.attr() cast in quantizer.cpp
CppQ->>CppQ: set nvfp4_4over6_mode, nvfp4_e4m3_max
CppQ->>Dispatch: quant_config with nvfp4_4over6_mode set
Dispatch->>Dispatch: check nvfp4_use_4over6
alt 4over6 enabled
Dispatch->>Kernel4o6: quantize_4over6 with use_2d
Kernel4o6->>Kernel4o6: compute map4 + map6 scales
Kernel4o6->>Kernel4o6: quantize both candidates with error
Kernel4o6->>Kernel4o6: select min-error candidate per block
Kernel4o6-->>Dispatch: FP4 output + selected FP8 block scales
else standard NVFP4
Dispatch->>KernelStd: existing quantize kernel
KernelStd-->>Dispatch: FP4 output + FP8 block scales
end
Dispatch->>Dispatch: set nvfp4_e4m3_max on output tensor
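For readers following the diagram, the per-block candidate selection can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's kernel: all function names are hypothetical, block scales are kept in full precision instead of being rounded to E4M3, the FP32 tensor-level scale is omitted, FP4 rounding ties go to the smaller magnitude rather than round-to-nearest-even, and ties between candidates fall back to map-to-6.

```python
# Magnitudes representable in FP4 E2M1.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(v):
    """Nearest E2M1 value with the sign preserved; magnitudes clamp at 6.

    Simplification: ties pick the smaller magnitude, not nearest-even.
    """
    mag = min(abs(v), 6.0)
    q = min(FP4_GRID, key=lambda g: abs(g - mag))
    return -q if v < 0 else q


def quantize_block(block, target_max):
    """Scale the block so its amax lands on target_max (4 or 6), then round."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / target_max  # real kernels store this scale in E4M3
    return scale, [round_to_fp4(v / scale) for v in block]


def block_error(block, scale, codes, mode):
    """Summed absolute (MAE) or summed squared (MSE) reconstruction error."""
    diffs = [abs(v - c * scale) for v, c in zip(block, codes)]
    return sum(d * d for d in diffs) if mode == "MSE" else sum(diffs)


def quantize_4over6(block, mode="MAE"):
    """Quantize both candidates and keep the one with the lower error."""
    s6, c6 = quantize_block(block, 6.0)
    s4, c4 = quantize_block(block, 4.0)  # scale is 1.5x the map-to-6 scale
    e6 = block_error(block, s6, c6, mode)
    e4 = block_error(block, s4, c4, mode)
    if e4 < e6:
        return "map4", s4, c4
    return "map6", s6, c6  # assumption: ties prefer the standard candidate
```

With a block such as `[6.0] + [5.0] * 15`, map-to-6 has to round every 5.0 to 4 or 6, while map-to-4 reconstructs 4.5, so the map-to-4 candidate has the lower error. A block that already sits on the FP4 grid (for example 6, 3, 1, 0.5 and zeros) keeps the standard map-to-6 scale.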
Reviews (12): Last reviewed commit: "Remove gradient 4over6 quantization and ..."
Functionality has been verified by internal RL experiments.
Need to rebase.
| * its values are populated during quantization. | ||
| */ | ||
| kNVTERowScaledNVFP4 = 8, | ||
| kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */ |
There was a problem hiding this comment.
We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.
There was a problem hiding this comment.
4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
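To illustrate why the bound has to travel with the tensor, here is a small sketch of the decode arithmetic. It assumes the usual two-level NVFP4 layout (FP4 code times E4M3 block scale times FP32 tensor scale, with the tensor decode scale equal to amax / (6 * e4m3_max)); the helper names are made up for illustration and are not the library's API.

```python
FP4_MAX = 6.0


def global_decode_scale(tensor_amax, e4m3_max):
    """FP32 tensor-level decode scale: amax / (6 * e4m3_max)."""
    return tensor_amax / (FP4_MAX * e4m3_max)


def dequantize(fp4_code, block_scale_e4m3, tensor_amax, e4m3_max):
    """Reconstruct one element from its FP4 code and both scale levels."""
    return fp4_code * block_scale_e4m3 * global_decode_scale(tensor_amax, e4m3_max)


# The largest element of a tensor quantized with the 256 bound:
# FP4 code 6 and a block scale of 256.
right = dequantize(6.0, 256.0, tensor_amax=12.0, e4m3_max=256.0)
# Decoding the same bits while assuming the standard 448 bound.
wrong = dequantize(6.0, 256.0, tensor_amax=12.0, e4m3_max=448.0)
```

Here `right` recovers the original 12.0, while `wrong` is smaller by a factor of 256/448. A consumer that does not know which bound was used cannot decode correctly, which is the sense in which the bound is part of the tensor data contract.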
| using namespace detail; | ||
| constexpr float fp8_max = TypeExtrema<fp8e4m3>::max; // 448.0f; | ||
| constexpr float fp4_max = TypeExtrema<fp4e2m1>::max; // 6.0f; | ||
| constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max; // 448.0f; |
There was a problem hiding this comment.
How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.
If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.
There was a problem hiding this comment.
From the original paper:
Finally, we make one modification to the computation of the tensor scale α (Equation 1) when
quantizing to NVFP4 with 4/6. When MFP4 ×MFP8 is used to compute the tensor scale, it ensures
that all quantized values will be less than 6 ×448. However, this makes it impossible to select a scale
of 4 for the blocks that contain a tensor’s largest values, because the block’s scale would need to be
448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by
E4M3. As a result, when computing the tensor scale, we replace MFP8 to 256 in Equation 1, since
256 is the largest E4M3 that can be multiplied by 6/4 and represented without error in E4M3, as 384.
Also:
In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.
There was a problem hiding this comment.
Not sure if there are internal or external studies about the convergence, but this is required to make it work. We need the largest value that is no greater than 448/1.5 such that both the value itself and its product with 1.5 are exactly representable in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.
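The representability argument can be checked by enumerating E4M3 directly. The sketch below assumes the E4M3 variant with a maximum of 448 (exponent bias 7, only the all-ones pattern reserved for NaN); the function names are illustrative.

```python
def e4m3_values():
    """All non-negative finite E4M3 values (4 exponent bits, 3 mantissa bits)."""
    vals = set()
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # NaN encoding, not a finite value
            if e == 0:
                vals.add(m / 8 * 2.0 ** -6)  # subnormals
            else:
                vals.add((1 + m / 8) * 2.0 ** (e - 7))
    return vals


E4M3 = e4m3_values()


def largest_exact_bound(ratio=1.5, e4m3_max=448.0):
    """Largest E4M3 value v with v <= e4m3_max / ratio and v * ratio also in E4M3."""
    return max(v for v in E4M3 if v <= e4m3_max / ratio and v * ratio in E4M3)
```

Under these assumptions `largest_exact_bound()` evaluates to 256: the only larger E4M3 value below 448/1.5 is 288, and 288 * 1.5 = 432 falls between the representable neighbours 416 and 448.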
There was a problem hiding this comment.
We did find the use of 256 to calculate the second level scaling factor helped convergence vs 448, but only slightly.
It's possible that the premise of the paper's argument (prevent saturations when 4 scaling effectively multiplies the block decode scale by 1.5) is sound, but a value larger than 256 can achieve this and the perfect representation of the block with the global amax value with both scalings is not worth the extra range loss.
There was a problem hiding this comment.
let me make 256 scaling a separate env var disabled by default
There was a problem hiding this comment.
448, 320, 288, 256 are all potential candidates for map-to-6:
- 448: effectively disable map-to-4 option above 256, preserve range
- 320, 288: map-to-4 uses 448, no precise 1.5x
- 256: map-to-4 uses 384, precise 1.5x
For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch on the numeric value as a template parameter in the C++ code instead of a boolean toggle. People can add support for other values or make it more generic (like directly parsing the env var digits) in the future.
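A rough sketch of what parsing such a variable could look like on the Python side is below. It only mirrors the proposal in this comment (two allowed values, default "448"); the helper name and error message are hypothetical, and the interface that was finally merged may differ.

```python
import os

# Allowed values per the proposal above; anything else is rejected.
_ALLOWED_E4M3_MAX = {"448": 448.0, "256": 256.0}


def read_4over6_e4m3_max(environ=None):
    """Return the E4M3 bound selected by NVTE_NVFP4_4OVER6_E4M3 (default 448)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NVTE_NVFP4_4OVER6_E4M3", "448")
    if raw not in _ALLOWED_E4M3_MAX:
        raise ValueError(
            f"NVTE_NVFP4_4OVER6_E4M3 must be one of {sorted(_ALLOWED_E4M3_MAX)}, got {raw!r}"
        )
    return _ALLOWED_E4M3_MAX[raw]
```

Restricting the input to a fixed table keeps the set of values small enough to map one-to-one onto compile-time template instantiations.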
There was a problem hiding this comment.
NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.
There was a problem hiding this comment.
For our RL experiments we do observe 256 leads to less mismatch vs 448.
There was a problem hiding this comment.
This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.
There was a problem hiding this comment.
Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.
Signed-off-by: Ziang Li <ziangli@umich.edu>
What is the e2e step time increase with 4/6 on some typical workload?
Signed-off-by: Ziang Li <ziangli@umich.edu>
Major changes from last time:
Hi @Oleg-Goncharov, this is usable, especially considering that RL has very long-context attention and other communication overheads. The rollout-side end-to-end overhead is only around 1~2%. We also observe meaningful numerics improvements for rollout and training fprop consistency. Considering RL is usually rollout-bound and very sensitive to mismatch, 4over6 shows meaningful improvements under acceptable training-side performance overhead. For the pretraining config, the performance overhead is 2.16x~2.57x, which is not yet usable. I turned off RHT and SR for a fair comparison:
Hi @zianglih, from my side, this looks okay now. The reported slowdown doesn't seem like a blocker for merging, especially if the current tradeoff is acceptable for the target use case, and we can revisit performance later if needed.
| /*! Whether an NVFP4 tensor is encoded with 4over6 semantics. | ||
| * | ||
| * This records whether block scales were selected by comparing map-to-4 | ||
| * and map-to-6 candidates. | ||
| */ | ||
| kNVTENVFP44Over6 = 9, |
There was a problem hiding this comment.
We are controlling 4over6 with 5 configs:
- kNVTENVFP44Over6
- kNVTENVFP4E4M3Max
- kNVTEQuantizationConfigNVFP44Over6
- kNVTEQuantizationConfigNVFP4E4M3Max
- kNVTEQuantizationConfigNVFP44Over6ErrMode
We only need 2:
- kNVTENVFP4E4M3Max: tensor attr, needed for both quant and dequant
- kNVTEQuantizationConfigNVFP44Over6Mode: quant config, only needed for quant
| /*! \enum NVTENVFP44Over6ErrMode | ||
| * \brief Candidate-selection error mode for NVFP4 4over6 quantization. | ||
| */ | ||
| enum NVTENVFP44Over6ErrMode { | ||
| kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */ | ||
| kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */ | ||
| }; |
There was a problem hiding this comment.
If we add "disabled mode", this enum makes the bool configs for 4over6 redundant.
| /*! \enum NVTENVFP44Over6ErrMode | |
| * \brief Candidate-selection error mode for NVFP4 4over6 quantization. | |
| */ | |
| enum NVTENVFP44Over6ErrMode { | |
| kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */ | |
| kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */ | |
| }; | |
| /*! \enum NVTENVFP44Over6Mode | |
| * \brief Method for NVFP4 4over6 quantization. | |
| */ | |
| enum NVTENVFP44Over6Mode { | |
| kNVTENVFP44Over6Disabled = 0, /*!< 4over6 is not applied */ | |
| kNVTENVFP44Over6MinMAE = 1, /*!< Select the candidate with lower mean absolute error */ | |
| kNVTENVFP44Over6MinMSE = 2, /*!< Select the candidate with lower mean squared error */ | |
| }; |
There was a problem hiding this comment.
Done by 2980cb1 . Also refactored modes in cpp tests.
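As a sketch of why the three-valued mode subsumes the separate boolean, here is a Python mirror of the suggested enum. The numeric values follow the suggestion above; the Python class and helper names are illustrative only and are not part of the C API.

```python
from enum import IntEnum


class NVFP44Over6Mode(IntEnum):
    """Python mirror of the suggested NVTENVFP44Over6Mode values."""
    DISABLED = 0  # 4over6 is not applied
    MIN_MAE = 1   # pick the candidate with lower absolute error
    MIN_MSE = 2   # pick the candidate with lower squared error


def mode_from_legacy(use_4over6, err_mode):
    """Fold the old (bool, err_mode) pair into a single mode value."""
    if not use_4over6:
        return NVFP44Over6Mode.DISABLED
    return {"MAE": NVFP44Over6Mode.MIN_MAE, "MSE": NVFP44Over6Mode.MIN_MSE}[err_mode]


def uses_4over6(mode):
    """The old boolean is recoverable from the mode, so it need not be stored."""
    return mode != NVFP44Over6Mode.DISABLED
```

Because `uses_4over6` is derivable from the mode, a separate enable flag would only add a way for the two settings to disagree.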
/te-ci
Signed-off-by: Ziang Li <ziangli@umich.edu>
A few 4over6 CI failures:
Signed-off-by: Ziang Li <ziangli@umich.edu>
| Select 4over6 tensors that use 256 as the global E4M3 scale | ||
| bound. If unset, 4over6 uses the standard NVFP4 448 bound. | ||
| nvfp4_4over6_err_mode : {'MAE', 'MSE'}, default = 'MAE' | ||
| Error metric used by NVFP4 4over6 candidate selection. |
There was a problem hiding this comment.
disable_rht=True + disable_stochastic_rounding=True means that 4over6 has limited use for pre-training. It is ok for this PR I think. This is not an algorithmic limitation but a kernel one. What about documenting it properly ("Currently, 4over6 implementation targets RL and post-training scenarios...") and adding TODOs (e.g., for pre-training enable RHT + 4over6 + quant fused kernel).
There was a problem hiding this comment.
Added TODO and refactored docs in 7a4b5c0. Also, since gradient quantizers no longer use 4over6, SR is always allowed. RHT is still not allowed when activation uses 4over6.
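The remaining restriction could be expressed as a small configuration check like the one below. This is a sketch only: the function name, argument names, and error text are hypothetical, and the scope strings are taken from the NVTE_NVFP4_4OVER6=weights|activations|all values in the PR description.

```python
def check_4over6_config(scope, disable_rht):
    """Reject RHT when activations are quantized with 4over6.

    scope is None (4over6 off), "weights", "activations", or "all".
    Stochastic rounding is not checked: per the reply above it only applies
    to gradient quantizers, which no longer use 4over6.
    """
    if scope not in (None, "weights", "activations", "all"):
        raise ValueError(f"unknown 4over6 scope: {scope!r}")
    activations_use_4over6 = scope in ("activations", "all")
    if activations_use_4over6 and not disable_rht:
        raise ValueError("4over6 on activations currently requires disable_rht=True")
```

Under these assumptions a weights-only scope passes with RHT left enabled, while "activations" or "all" require RHT to be turned off.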
| elif self.recipe.nvfp4_4over6 == "weights": | ||
| nvfp4_use_4over6 = tensor_type == "weight" | ||
| elif self.recipe.nvfp4_4over6 == "activations": | ||
| nvfp4_use_4over6 = tensor_type != "weight" |
There was a problem hiding this comment.
This means we apply 4over6 for gradients as well (along with inputs)? Why? What drives this decision?
There was a problem hiding this comment.
Thank you for pointing this out. This was an unintended implementation bug. I have removed 4over6 from all gradient quantizers in 7a4b5c0.
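A sketch of the intended dispatch after that fix is below. The tensor-type strings ("weight", "input", "grad_output") and the function name are assumptions for illustration, not the strings used in the actual code; the point is only that gradient quantizers are excluded before the scope is consulted.

```python
def resolve_use_4over6(scope, tensor_type):
    """Decide whether a quantizer should use 4over6.

    scope: None, "weights", "activations", or "all".
    tensor_type: hypothetical role string, e.g. "weight", "input", "grad_output".
    """
    if tensor_type.startswith("grad"):
        return False  # gradients never use 4over6 after the fix
    if scope == "all":
        return True
    if scope == "weights":
        return tensor_type == "weight"
    if scope == "activations":
        return tensor_type != "weight"
    return False  # unset scope preserves the existing NVFP4 behavior
```

Checking the gradient case first means the `tensor_type != "weight"` branch can no longer pick up backward tensors by accident.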
Signed-off-by: Ziang Li <ziangli@umich.edu>
Description
@HumansAnd
Implement 4over6 nvfp4 from:
FlashInfer PR:
Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both original per-tensor scaling and row-scaling NVFP4 introduced by #2931 are supported.
This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.
Type of change
Changes
Please list the changes introduced in this PR:
NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
Checklist: