[Pytorch][Bug] DCP Checkpoint Loading Fixes for FSDP2 with QuantizedModelInit by vthumbe1503 · Pull Request #2974 · NVIDIA/TransformerEngine

vthumbe1503 · 2026-05-11T15:50:37Z

Description

Fixes DCP sync checkpoint loading for MXFP8/NVFP4.
Fixes DCP async checkpoint loading for all quantization recipes.
Fixes NVFP4 allgather + dequant numerical errors for FSDP2. The cause was that the FSDP group was not set as the amax reduction group in the quantizer.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

untyped_storage implementation needed for FSDP2 + DCP
- untyped_storage is now defined on the base QuantizedTensor and returns an empty storage. untyped_storage refers to the backing storage used to create a tensor's internal buffers. Since TE QuantizedTensors are created with make_wrapper_subclass, there is no backing storage associated with the tensor itself. data_ptr on our custom QuantizedTensor also returns 0.
- The main issue is that FSDP2 maintains a sharded param tensor for checkpointing. It does so by calling view(-1) on our quantized sharded model parameters, for which TE returns a dequantized 1D tensor. So the sharded tensor that FSDP2 maintains for checkpointing is BF16, while the quantized sharded param is our custom FP8 tensor. FSDP2 evaluates untyped_storage(BF16 sharded tensor reloaded from disk) == untyped_storage(quantized sharded parameter) to decide whether the two are the same tensor. With an empty storage returned for the quantized parameter, this comparison never matches the sharded tensor's untyped storage.
DCP Async/Sync Checkpoint loading
- For sync cases, we previously dequantized to BF16 before saving to disk, which happened through the to_new_empty function.
- For both sync and async, dequantizing is not ideal. So .cpu() and .to() are now implemented for QuantizedTensor; they do not go through dequantization and instead copy the inner tensors of the QuantizedTensor to CPU if needed, in a blocking or non-blocking way.
NVFP4 Allgather Correctness issues
- Allgather with FSDP2 was very far away from fp32 allgather for the same values. This was due to us not setting the amax reduction group in the quantizer.
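A toy sketch of why a shared amax matters when quantized shards are gathered. This is not the NVFP4 format: it uses a hypothetical symmetric integer quantizer with one scale per tensor, and it assumes the gathered codes are dequantized with a single scale. All names here are illustrative and not part of TransformerEngine.

```python
def quantize(values, amax, qmax=7):
    """Toy symmetric quantizer: one scale derived from amax."""
    scale = amax / qmax
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def max_abs_error(reference, approx):
    return max(abs(r - a) for r, a in zip(reference, approx))


# Two "ranks", each holding one shard with a very different value range.
shards = [[0.5, -1.0, 0.25], [4.0, -8.0, 2.0]]
full = [v for shard in shards for v in shard]

# Without an amax reduction group: every rank scales by its local amax,
# but the gathered codes are then dequantized with rank 0's scale only.
local = [quantize(s, max(abs(v) for v in s)) for s in shards]
gathered_local = [c for codes, _ in local for c in codes]
err_local = max_abs_error(full, dequantize(gathered_local, local[0][1]))

# With a max-reduction of amax across the group: one scale fits all shards.
global_amax = max(abs(v) for v in full)
shared = [quantize(s, global_amax) for s in shards]
gathered_shared = [c for codes, _ in shared for c in codes]
err_shared = max_abs_error(full, dequantize(gathered_shared, shared[0][1]))

print(f"local amax error:  {err_local:.3f}")
print(f"shared amax error: {err_shared:.3f}")
```

With the shared amax, the error stays within the quantizer's rounding step; with per-shard amax values, the second shard's values are reconstructed at the wrong magnitude.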
TE_DType Serialization issues with DCP Checkpointing
- DCP uses torch.load(weights_only=True), whose Unpickler rejects every GLOBAL reference that isn't in add_safe_globals — and getattr is intentionally not allow-listed.
- So we override the default enum reduction in pybind:

default:      (getattr, (tex.DType, "kFloat8E4M3"))   # needs getattr + tex.DType allow-listed
pybind override: (tex.DType, (int_value,))            # only needs tex.DType allow-listed
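The difference between the two reductions can be sketched with the standard library alone. This is an analogy, not torch's actual weights_only unpickler: `AllowListUnpickler`, `safe_loads`, and `_Reduced` are hypothetical helpers, and `http.HTTPStatus` stands in for tex.DType.

```python
import http
import io
import pickle


class _Reduced:
    """Hypothetical helper: pickles as whatever reduction tuple it is given."""

    def __init__(self, reduction):
        self._reduction = reduction

    def __reduce__(self):
        return self._reduction


class AllowListUnpickler(pickle.Unpickler):
    """Rejects every global that is not explicitly allow-listed."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        if (module, name) not in self._allowed:
            raise pickle.UnpicklingError(f"{module}.{name} is not allow-listed")
        return super().find_class(module, name)


def safe_loads(stream, allowed):
    return AllowListUnpickler(io.BytesIO(stream), allowed).load()


# Only the enum type itself is allow-listed; getattr is deliberately left out.
ALLOWED = {("http", "HTTPStatus")}

# Name-based reduction: needs both getattr and the enum type.
by_name = pickle.dumps(_Reduced((getattr, (http.HTTPStatus, "OK"))))
# Value-based reduction: needs only the enum type.
by_value = pickle.dumps(_Reduced((http.HTTPStatus, (200,))))

print(safe_loads(by_value, ALLOWED))
try:
    safe_loads(by_name, ALLOWED)
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

The value-based form keeps the allow-list small: the loader never has to trust a general-purpose callable such as getattr.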

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-11T15:58:35Z

Greptile Summary

This PR fixes several DCP checkpoint loading issues for FSDP2 with quantized model initialization, covering MXFP8, NVFP4, and Float8 tensor types. It also corrects NVFP4 allgather numerical errors under FSDP2 by properly setting the amax_reduction_group on the NVFP4Quantizer.

QuantizedTensor.untyped_storage() now returns a zero-byte empty storage so FSDP2's same-tensor identity check never incorrectly matches a plain BF16 staging tensor against a quantized parameter.
.cpu() / _to_copy dispatch is reimplemented to move all inner buffers (data, scales, amax) to the target device while preserving the QuantizedTensor subclass, eliminating the dequantize-on-move-to-CPU path used by DCP async staging.
__reduce_ex__ refactored for all tensor types to reference module-level reconstructor functions instead of bound classmethods, removing the need for getattr in add_safe_globals and enabling torch.load(weights_only=True) compatibility; tex.DType pickling is fixed in the pybind11 binding via custom __reduce__/__reduce_ex__ methods.
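To see why a bound classmethod reconstructor pulls getattr into the pickle stream while a module-level function does not, the stream's GLOBAL opcodes can be listed with pickletools. This is a stdlib-only sketch: `bytes.fromhex` and `binascii.unhexlify` stand in for a classmethod and a module-level reconstructor, and `_Reduced` and `referenced_globals` are hypothetical helpers, not TransformerEngine code.

```python
import binascii
import pickle
import pickletools


class _Reduced:
    """Hypothetical helper: pickles as whatever reduction tuple it is given."""

    def __init__(self, reduction):
        self._reduction = reduction

    def __reduce__(self):
        return self._reduction


def referenced_globals(obj):
    """List the 'module name' pairs named by GLOBAL opcodes in obj's pickle."""
    # Protocol 2 emits plain GLOBAL opcodes; fix_imports=False keeps Python 3 names.
    stream = pickle.dumps(obj, protocol=2, fix_imports=False)
    return [arg for op, arg, _pos in pickletools.genops(stream) if op.name == "GLOBAL"]


# Reconstructor is a bound classmethod: pickled as getattr(bytes, "fromhex").
via_classmethod = _Reduced((bytes.fromhex, ("cafe",)))
# Reconstructor is a module-level function: pickled as a single global.
via_function = _Reduced((binascii.unhexlify, ("cafe",)))

print(referenced_globals(via_classmethod))
print(referenced_globals(via_function))
```

Both objects unpickle to the same bytes under plain pickle.loads; only the set of globals an allow-listing loader must accept differs.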

Confidence Score: 4/5

The changes are targeted and well-reasoned; the DCP async/sync checkpoint flows for all quantization recipes should now work correctly with weights_only=True.

The backward-compat classmethods (_make_in_reduce_ex) are documented as supporting re-loading of old pickle streams, but they silently break under weights_only=True because getattr is not in add_safe_globals. Any operator who saved checkpoints with a previous build and tries to load them through DCP will hit an opaque WeightsUnpickler error. This is a narrow edge case for pre-existing checkpoints, not the new format, but the misleading comment could lead to wasted debugging time.

transformer_engine/pytorch/tensor/float8_tensor.py, mxfp8_tensor.py, nvfp4_tensor.py, float8_blockwise_tensor.py — the backward-compat _make_in_reduce_ex docstrings should be updated to reflect the weights_only=True limitation.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/quantized_tensor.py	Core changes: empty untyped_storage, new _to_copy dispatch handler for device moves, and explicit device propagation in make_from_tensor. The _to_copy handler silently falls through (returns None) when called with a dtype change, which is not a regression but is a potential footgun.
transformer_engine/pytorch/__init__.py	Registers all required reconstructor functions, storage types, quantizer types, and tex.DType with add_safe_globals for weights_only=True compatibility; getattr correctly omitted from the list.
transformer_engine/pytorch/tensor/float8_tensor.py	New __reduce_ex__ uses module-level _make_float8_tensor_in_reduce_ex (no quantizer saved, by design); old classmethod kept for backward compat but only works without weights_only=True since getattr is not safe-listed.
transformer_engine/pytorch/tensor/mxfp8_tensor.py	__reduce_ex__ updated to module-level function; _make_in_reduce_ex classmethod kept for backward compat; device kwarg added throughout dispatch handlers; dequantize-on-CPU path added in mxfp8_tensor_storage.py bouncing through CUDA.
transformer_engine/pytorch/tensor/nvfp4_tensor.py	__reduce_ex__ updated to module-level function; NVFP4Quantizer.__getstate__ correctly strips amax_reduction_group; amax_reduction_group is now set for FSDP2 sharding in base.py fixing allgather numerical errors.
transformer_engine/common/util/pybind_helper.h	Adds __reduce__/__reduce_ex__ to the DType pybind11 enum, serializing as (tex.DType, (int_value,)) to eliminate the getattr opcode from the pickle stream.
transformer_engine/pytorch/module/base.py	NVFP4Quantizer added to the isinstance check that sets amax_reduction_group for FSDP2 DTensor params, fixing allgather numerical errors for NVFP4.
transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py	device field removed from get_metadata(); callers updated to pass device explicitly. The change is correct but needs vigilance that all callers have been updated.
tests/pytorch/test_quantized_tensor.py	New test_cpu_dequantize test validates that moving a QuantizedTensor to CPU and dequantizing gives bit-exact results vs CUDA dequantize followed by CPU move; zero tolerance is appropriate since CPU path bounces through CUDA for MXFP8/NVFP4.

Sequence Diagram

sequenceDiagram
    participant FSDP2
    participant QT as QuantizedTensor
    participant F8 as Float8Tensor
    participant DCP

    Note over FSDP2,DCP: DCP Async Save Flow
    FSDP2->>QT: "aten._to_copy(device=cpu)"
    QT->>QT: __torch_dispatch__(_to_copy)
    Note right of QT: dtype unchanged, inner branch taken
    QT->>QT: get_metadata() move tensors to CPU
    QT-->>FSDP2: CPU QuantizedTensor subclass preserved

    FSDP2->>DCP: stage CPU tensor to disk
    DCP->>QT: __reduce_ex__(protocol)
    QT-->>DCP: _make_in_reduce_ex with inner_buffers
    Note right of DCP: quantizer.__getstate__ strips amax_reduction_group
    DCP->>DCP: "torch.save weights_only=True compatible"

    Note over FSDP2,DCP: DCP Sync Load Flow
    DCP->>DCP: "torch.load weights_only=True"
    DCP->>QT: _make_star_in_reduce_ex buffers
    Note right of DCP: module-level fn single GLOBAL opcode
    DCP-->>FSDP2: reconstructed CPU QuantizedTensor

    FSDP2->>F8: copy_ loaded_cpu_tensor
    F8->>F8: direct FP8 buffer copy dtype match
    F8-->>FSDP2: model param updated

    Note over FSDP2: NVFP4 allgather fix
    FSDP2->>QT: fsdp_pre_all_gather
    QT->>QT: "NVFP4Quantizer.amax_reduction_group = shard_group"
    Note right of QT: PR fix was previously missing for NVFP4


greptile-apps · 2026-05-11T15:58:39Z

+    def untyped_storage(self) -> torch.UntypedStorage:
+        """Return an empty UntypedStorage on the tensor's device.
+
+        ``QuantizedTensor`` is a ``_make_wrapper_subclass`` and has no real
+        backing storage of its own; the actual bytes live in the inner
+        buffers (e.g. ``_rowwise_data`` / ``_columnwise_data``) which are
+        an implementation detail of the quantization scheme. Need to define
+        this method to avoid DCP staging errors with FSDP2.
+        """
+        return torch.UntypedStorage(0, device=self.device)


Empty storage breaks shared-storage detection in existing callers

QuantizedTensor.untyped_storage() now returns a freshly allocated zero-byte storage every call. Code in module/_common.py:128 compares tensors[0].untyped_storage().nbytes() against expected size to decide between a no-op view and an out-of-place torch.cat. With 0 bytes returned, that condition is always true, silently disabling the in-place fast path for any QuantizedTensor through ConcatMerge.forward. More critically, utils.py:403-412 in SplitAlongDim.backward uses data_ptr() for noop detection — if all zero-size CUDA allocations return data_ptr() == 0, every QuantizedTensor pair incorrectly appears co-located, setting noop_ok = True and crashing on ret.set_() against a 0-byte storage.

The correct behavior for these functions is to fall back to the slow path for QuantizedTensors, unless they have a dedicated implementation to handle quantized data.

Yeah, while I don't think we use QuantizedTensors in the SplitAlongDim ever, the concat sounds plausible to be hit.

Need to resolve this comment after going thoroughly over the consequences of noop_cat on QuantizedTensors.

The behavior is unchanged by this change, and I would argue the implementation is now more correct. Before this change, the default untyped_storage() implementation inherited by QuantizedTensor(torch.Tensor) returned a storage with two properties:

storage.nbytes() returns a byte count based on the fake_dtype that we use to register our QuantizedTensor as a torch Tensor through torch's make_wrapper_subclass method.

storage.data_ptr() raises an error saying the storage is invalid and has no data_ptr().

Neither is ideal.
The first is grossly incorrect for two reasons. First, we manage the backing storage for the inner tensors of QuantizedTensor, and torch has no idea about it. Second, nbytes based on fake_dtype is misleading since it might not be the number of bytes we actually allocate.
The second now causes problems with FSDP2, since FSDP2 expects some storage for its identity check.

For QuantizedTensor, noop_cat today always falls back to an actual torch.cat, which goes through a dequantization, because this condition happens to be true. The condition remains true with this change as well, since nbytes() returns 0.

Today QuantizedTensor.data_ptr() returns 0, while QuantizedTensor.untyped_storage().data_ptr() raises an invalid-storage error, which is inconsistent. Returning an empty storage fixes this inconsistency.

As far as identity checking goes, FSDP2 runs its comparison logic only if data_ptr() is not 0. It also does not really make sense to compare two empty storages.

vthumbe1503 · 2026-05-11T16:28:20Z

/te-ci L1 pytorch

ptrendx · 2026-05-11T18:20:14Z

                msg=lambda x: f"Fresh model loaded from DCP checkpoint produces different output: {x}",
            )
+        elif recipe_name == "NVFP4BlockScaling":
+            # NVFP4 DCP load goes through a dequant + quant, so neec to relax tolerances


Why do we need dequant + quant here?

We are not doing it anymore.

timmoon10 · 2026-05-11T18:14:15Z

+                # torch DCP staging via ``x.new_empty(..., device="cpu")``), we
+                # save the high-precision values in a plain CPU dense tensor.
+                # For the DCP load path, we will re-quantize the high-precision values.
+                target_size = torch.Size(size) if len(size) > 0 else tensor.size()


An empty size is valid and it corresponds to a tensor with 1 entry (for the same reason 2^0=1).

>>> import torch
>>> x = torch.ones(123).new_empty([])
>>> print(x.numel())
1

Suggested change

target_size = torch.Size(size) if len(size) > 0 else tensor.size()

target_size = size

Changed the torch dispatch function now, so we don't have size here.

timmoon10 · 2026-05-11T18:17:00Z

+    def untyped_storage(self) -> torch.UntypedStorage:
+        """Return an empty UntypedStorage on the tensor's device.
+
+        ``QuantizedTensor`` is a ``_make_wrapper_subclass`` and has no real
+        backing storage of its own; the actual bytes live in the inner
+        buffers (e.g. ``_rowwise_data`` / ``_columnwise_data``) which are
+        an implementation detail of the quantization scheme. Need to define
+        this method to avoid DCP staging errors with FSDP2.
+        """
+        return torch.UntypedStorage(0, device=self.device)


The correct behavior for these functions is to fall back to the slow path for QuantizedTensors, unless they have a dedicated implementation to handle quantized data.