[PyTorch] Support for cuDNN-backed flex attention by vcherepanov-nv · Pull Request #2984 · NVIDIA/TransformerEngine

vcherepanov-nv · 2026-05-13T03:18:15Z

Description

This PR introduces an alternative, Python-only code path for the FusedAttention backend for PyTorch.
The user can specify score_mod and score_mod_bprop functions, which get routed to the corresponding parameters of the sdpa and sdpa_backward calls to cuDNN FE.

Fixes # (issue)

#2492

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

A new code path for FusedAttention backend, when score_mod (and the related parameters) is specified
Tests

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-05-13T03:26:45Z

Greptile Summary

This PR introduces a Python-only cuDNN frontend code path for FusedAttention that accepts user-supplied score_mod and score_mod_bprop callbacks, routed directly to cuDNN FE's sdpa/sdpa_backward APIs. It includes a graph cache with careful handling of bound methods, lambdas, and stateful callable instances, plus a full test suite covering cache-key semantics and numeric correctness.

flex_attention.py (new): implements graph construction, caching (_cudnn_score_mod_graph_cache), and FusedAttentionWithScoreModFunc autograd function; addresses the id()-based cache-key and lambda collision issues noted in earlier review rounds.
backends.py / dot_product_attention.py: plumbs the four new parameters through FusedAttention.forward and DotProductAttention.forward with guards for all incompatible configurations (FP8, context-parallel, KV cache, THD format, etc.).
utils.py: extends AttentionParams and get_attention_backend to filter on has_score_mod, gating the path to the F16/BF16 arbitrary-seqlen cuDNN backend.

Confidence Score: 4/5

The implementation is largely correct, but two of the four new integration tests will fail at runtime on any CUDA machine because score_mod tensors are created on CPU and passed into a cuDNN CUDA graph execution.

The causal and softcap test cases in test_flex_attention.py pass CPU tensors (torch.full without a device argument) as score_mod_tensors to the cuDNN graph execution path. cuDNN CUDA kernels require all variant-pack tensors to reside on the compute device; the CPU tensors will trigger a device-mismatch error, causing those test cases to fail on any real CUDA runner. The core implementation in flex_attention.py is sound.

tests/pytorch/attention/test_flex_attention.py — the causal and softcap score_mod tensor construction.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py	New file implementing cuDNN-backed flex attention with score_mod callbacks, graph caching, and autograd Function. Cache-key design and bound-method handling are solid; score_mod_tensors with requires_grad silently drop gradients.
tests/pytorch/attention/test_flex_attention.py	New test file for score_mod attention. CPU tensors used as score_mod_tensors in causal and softcap cases will cause device-mismatch errors at cuDNN graph execution time.
transformer_engine/pytorch/attention/dot_product_attention/backends.py	Adds score_mod parameters to FusedAttention.forward and routes them to FusedAttentionWithScoreModFunc; guards against incompatible configurations are present and correct.
transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py	Plumbs score_mod parameters from DotProductAttention.forward through to FusedAttention; input validation is thorough but omits a CUDA device check on score_mod_tensors values.
transformer_engine/pytorch/attention/dot_product_attention/utils.py	Adds has_score_mod/has_score_mod_bprop fields to AttentionParams and corresponding backend-selection filters; logic correctly gates score_mod on the F16_arbitrary_seqlen backend.
tests/pytorch/utils.py	Adds score_mod and score_mod_bprop boolean parameters to get_available_attention_backends; straightforward forwarding change, no issues.

Sequence Diagram

sequenceDiagram
    participant User
    participant DPA as DotProductAttention
    participant FA as FusedAttention
    participant Func as FusedAttentionWithScoreModFunc
    participant Cache as _cudnn_score_mod_graph_cache
    participant cuDNN as cuDNN Frontend

    User->>DPA: forward(q,k,v, score_mod, score_mod_tensors, ...)
    DPA->>DPA: validate score_mod inputs
    DPA->>DPA: get_attention_backend (has_score_mod filter)
    DPA->>FA: forward(score_mod, score_mod_bprop, ...)
    FA->>Func: apply(is_training, q, k, v, score_mod, ...)

    Func->>Cache: _get_cudnn_score_mod_fwd_graph(key)
    alt cache miss
        Cache->>cuDNN: _build_cudnn_score_mod_fwd_graph()
        cuDNN-->>Cache: CudnnScoreModFwdGraphEntry
        Cache->>Cache: store entry
    end
    Cache-->>Func: entry (graph + tensor handles)

    Func->>cuDNN: _execute_cudnn_graph(variant_pack)
    cuDNN-->>Func: output tensor
    Func-->>DPA: output

    User->>Func: backward(d_out)
    Func->>Cache: _get_cudnn_score_mod_bwd_graph(key)
    alt cache miss
        Cache->>cuDNN: _build_cudnn_score_mod_bwd_graph()
        cuDNN-->>Cache: CudnnScoreModBwdGraphEntry
    end
    Func->>cuDNN: _execute_cudnn_graph(variant_pack)
    cuDNN-->>Func: dq, dk, dv
    Func-->>User: dq, dk, dv

Reviews (4): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

greptile-apps · 2026-05-13T03:26:48Z

+def _score_mod_callback_cache_key(callback: Optional[Callable]) -> Optional[Tuple[Any, ...]]:
+    """Create a stable cache key for a score_mod callable."""
+    if callback is None:
+        return None
+    self_obj = getattr(callback, "__self__", None)
+    func_obj = getattr(callback, "__func__", None)
+    if self_obj is not None and func_obj is not None:
+        return ("bound_method", id(self_obj), id(func_obj))
+    return ("callable", id(callback))


id()-based cache key is unsafe for parameterized bound-method score_mods

id(self_obj) identifies a Python object by its memory address. When a bound-method instance is garbage-collected, Python may immediately reuse that memory for a new instance. If the new instance belongs to the same class (same id(func_obj)), the cache key is identical, so _get_cudnn_score_mod_fwd_graph returns the old compiled graph even though the new instance might construct a structurally different computation — e.g., a score_mod class whose forward loops self.n_layers times. The wrong graph is executed without any error, silently producing incorrect attention outputs.

For stateless module-level functions this is fine (they're never GC'd), but any stateful class-based score_mod where different instances produce different graph topologies can hit this bug in long-running programs. Consider using type(self_obj) and a per-class sequence counter, or requiring callers to provide an explicit cache key.
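One way to sidestep address reuse, sketched here under assumptions, is to hand each callback owner an integer token that is never issued twice and to key the cache on that token instead of on id(). The helper names `instance_token` and `bound_method_cache_key` are hypothetical and are not the code in this PR; the sketch also assumes the owning object supports weak references and hashes by identity.

```python
import itertools
import weakref

# Hypothetical helpers for illustration only; not the code in this PR.
_next_token = itertools.count()
_tokens = weakref.WeakKeyDictionary()  # owner object -> never-reused integer


def instance_token(obj):
    """Return an integer that identifies obj and is never handed out twice."""
    token = _tokens.get(obj)
    if token is None:
        token = next(_next_token)
        _tokens[obj] = token
    return token


def bound_method_cache_key(callback):
    """Build a cache key for a bound method, or None for anything else."""
    self_obj = getattr(callback, "__self__", None)
    func_obj = getattr(callback, "__func__", None)
    if self_obj is None or func_obj is None:
        return None
    return (
        "bound_method",
        instance_token(self_obj),
        func_obj.__module__,
        func_obj.__qualname__,
    )


class Softcap:
    """Toy stateful score_mod owner; the callback body is a placeholder."""

    def __init__(self, cap):
        self.cap = cap

    def score_mod(self, graph, score, tensors):
        return score


first = Softcap(30.0)
first_key = bound_method_cache_key(first.score_mod)
del first  # the memory address may now be reused by a later instance
second = Softcap(50.0)
second_key = bound_method_cache_key(second.score_mod)
```

Because the token comes from a counter rather than from the address, `second_key` differs from `first_key` even if `second` happens to occupy the memory that `first` released. The weak dictionary drops its entry when the owner is collected, so the token table itself does not keep instances alive.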

greptile-apps · 2026-05-13T03:26:49Z

 _flash_attn_varlen_fwd = None
 _flash_attn_varlen_bwd = None


Unbounded module-level graph cache will grow indefinitely

_cudnn_score_mod_graph_cache is a plain dict with no eviction policy. Cache keys encode tensor shapes, strides, dtype, and device, so every new (batch, seq, heads, dim) combination — extremely common in training with variable-length sequences or multi-task workloads — inserts a permanent entry. Each cached cuDNN graph holds compiled CUDA kernels and associated state, which can be several tens of MB. Over a long training run this will silently consume increasing GPU/CPU memory. Consider a bounded LRU cache (e.g., functools.lru_cache or a collections.OrderedDict with a size cap).
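The size-capped OrderedDict option mentioned above could look roughly like the following. This is a minimal sketch with hypothetical names (`BoundedGraphCache`, `get_or_build`), and the string values stand in for compiled graph entries; the default cap of 64 is an arbitrary illustration, not a recommendation from the PR.

```python
from collections import OrderedDict


class BoundedGraphCache:
    """LRU cache sketch: at most max_entries built entries are kept."""

    def __init__(self, max_entries=64):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_build(self, key, build_fn):
        """Return the cached entry for key, building and storing it on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        entry = build_fn()
        self._entries[key] = entry
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return entry

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return key in self._entries


built = []  # records which keys actually triggered a build


def make_builder(name):
    def build():
        built.append(name)
        return f"graph:{name}"

    return build


cache = BoundedGraphCache(max_entries=2)
cache.get_or_build(("bshd", 128), make_builder("a"))
cache.get_or_build(("bshd", 256), make_builder("b"))
cache.get_or_build(("bshd", 128), make_builder("a"))  # hit, no rebuild
cache.get_or_build(("bshd", 512), make_builder("c"))  # evicts ("bshd", 256)
```

Eviction here only drops the Python reference to the entry. Whether the underlying cuDNN resources are released at that point depends on how the entry holds them, which this sketch does not model.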

greptile-apps · 2026-05-13T03:26:51Z

+                fused_attention_backend = tex.get_fused_attn_backend(
+                    self.training,
+                    q_type,
+                    q_type,
+                    dpa_utils.QKVLayout["bshd_bshd_bshd"],
+                    dpa_utils.AttnBiasType["no_bias"],
+                    dpa_utils.AttnMaskType["no_mask"],
+                    dpa_utils.SoftmaxType["vanilla"],


get_fused_attn_backend availability check always uses bshd_bshd_bshd regardless of actual format

The score_mod path hard-codes dpa_utils.QKVLayout["bshd_bshd_bshd"] for the backend probe, even when the user passes qkv_format="sbhd". The result is only used to gate on NVTE_No_Backend, so in practice it likely works today because backend availability for a given dtype is layout-independent. However, if a future cuDNN version makes SBHD/BSHD support diverge, this probe would give a false-positive (accepts sbhd even though no backend supports it) or false-negative (rejects sbhd when it is actually supported). Using the real layout for the probe would make the check self-documenting and future-proof.

cyanguwa · 2026-05-14T22:26:18Z

                        )

-        if context_parallel:
+        if score_mod is not None:


I think this should be in the else branch, because it doesn't support context parallelism. Something like this:
if context_parallel:
    ...
elif score_mod is not None:
    ...
else:
    ...

cyanguwa · 2026-05-14T22:32:38Z

+        score_mod: Optional[Callable] = None,
+        score_mod_bprop: Optional[Callable] = None,
+        score_mod_tensors: Optional[Dict[str, torch.Tensor]] = None,
+        score_mod_bprop_tensors: Optional[Dict[str, torch.Tensor]] = None,


Do you think it'd be clearer if we add "fprop" to the names? i.e. score_mod_fprop, score_mod_bprop, score_mod_fprop_tensors, score_mod_bprop_tensors?

cyanguwa · 2026-05-14T22:48:21Z

+                        isinstance(k, str) and isinstance(v, torch.Tensor)
+                        for k, v in score_mod_bprop_tensors.items()
+                    ), "score_mod_bprop_tensors must map string names to torch.Tensor instances!"
+


I think all these checks can go into dpa_utils.get_attention_backend(). With the score_mod_xxx args passed in (to AttentionParams), that utility function can return use_fused_attention=False if one of the checks is violated. dpa_utils.get_attention_backend() is used in the tests as well (by get_available_attention_backends()).

cyanguwa · 2026-05-14T22:50:27Z

+                    raise ValueError(
+                        "score_mod requires a cuDNN FusedAttention backend, but no fused "
+                        "attention backend supports the provided inputs."
+                    )


For the score_mod path, I don't think we need to call tex.get_fused_attn_backend() and check whether it's supported. If anything, we should add graph.validate() -> .... graph.build_plans() to dpa_utils.get_attention_backend(attention_params), but if that's too heavy-handed, we can do only the checks you had above (the asserts). Once those checks are added to dpa_utils.get_attention_backend, whether the FusedAttention backend runs will be controlled by the following logic (just like in the non-score_mod cases):

(
    use_flash_attention,
    flash_attention_backend,
    use_fused_attention,
    fused_attention_backend,
    use_unfused_attention,
    _,
) = dpa_utils.get_attention_backend(attention_params)

cyanguwa · 2026-05-14T22:53:52Z

                else:
                    pad_between_seqs = False

+            if score_mod is None:


Please label this "experimental".

Agreed.
Also, please add a comment on top (similar to other sections in forward()).
Something like: checking compatibility for flex attn

cyanguwa · 2026-05-14T22:59:38Z

+
+def _build_cudnn_pygraph(dtype: torch.dtype, device: torch.device):
+    """Create a cuDNN frontend Python graph for F16/BF16 SDPA."""
+    import cudnn  # pylint: disable=import-outside-toplevel


Can you import the cudnn from 3rdparty/cudnn-frontend, instead of from the environment/system-wide installation? We have control over the version in 3rdparty/cudnn-frontend, but not the system one.

cyanguwa · 2026-05-14T23:14:41Z

+@pytest.mark.parametrize("dtype", param_types)
+@pytest.mark.parametrize("qkv_format", ["sbhd", "bshd"])
+@pytest.mark.parametrize("scalar_loss", [False, True])
+def test_dot_product_attention_score_mod(dtype, qkv_format, scalar_loss):


Would @pytest.mark.parametrize("score_mod", ["causal", "softcap", "post_scale_bias"]) simplify the tests a bit, so that we don't have 3 separate tests with a lot of repeated code?

cyanguwa · 2026-05-14T23:26:54Z

+    score_mod: Callable,
+    score_mod_tensors: Optional[Dict[str, torch.Tensor]],
+    output_layer: torch.Tensor,
+    stats_bhs1: Optional[torch.Tensor],


I think we can just call this stats, even though it might only support bhs1 shape right now. On the C++ side, cuDNN does support th1 (for THD format) as well. Could we leave the name generic for now in case we want to add more support to it in the future?

cyanguwa · 2026-05-14T23:29:50Z

        return output.contiguous()


+def _bhsd_dim_stride(


We have a lot of small utility functions here - is there a way to pack them up a bit or group them in some way, so the code is easier to read? I know this is Python and we probably do need more than 2 functions (fwd+bwd) but could you please have a look into this? Thanks.

cyanguwa · 2026-05-14T23:30:29Z

+    )
+
+
+def _score_mod_relative_position(score_mod_graph, score_tensor, _tensors):


We can just call this "post_scale_bias" to be consistent with our nomenclature elsewhere.

greptile-apps · 2026-05-15T01:30:43Z

+    if (
+        inspect.isfunction(callback)
+        and callback.__closure__ is None
+        and "<locals>" not in callback.__qualname__
+    ):
+        return ("function", callback.__module__, callback.__qualname__)


Module-level lambdas all share the same __qualname__ = "<lambda>", so two different lambdas defined at module scope in the same file (e.g., sm1 = lambda g, s, t: s and sm2 = lambda g, s, t: g.neg(input=s)) would produce the identical cache key ("function", module, "<lambda>"). The second lambda would silently reuse the compiled graph from the first, computing wrong attention scores with no error. Named module-level functions are safe because their qualnames are unique, but lambdas are not. Excluding <lambda> from the cacheable path makes them _SCORE_MOD_UNCACHEABLE, which builds a fresh graph every call — the same safe fallback already used for closures and nested functions.

Suggested change

-    if (
-        inspect.isfunction(callback)
-        and callback.__closure__ is None
-        and "<locals>" not in callback.__qualname__
-    ):
-        return ("function", callback.__module__, callback.__qualname__)
+    if (
+        inspect.isfunction(callback)
+        and callback.__closure__ is None
+        and "<locals>" not in callback.__qualname__
+        and "<lambda>" not in callback.__qualname__
+    ):
+        return ("function", callback.__module__, callback.__qualname__)
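The collision described in this comment is easy to reproduce in isolation. The sketch below is illustrative only: `function_cache_key` is a hypothetical stand-in that returns None for callbacks that should not be cached, whereas the PR uses its own _SCORE_MOD_UNCACHEABLE marker, and json.dumps is used merely as a convenient named module-level function.

```python
import inspect
import json


def function_cache_key(callback):
    """Return a name-based key for plain named functions, else None."""
    if (
        inspect.isfunction(callback)
        and callback.__closure__ is None
        and "<locals>" not in callback.__qualname__
        and "<lambda>" not in callback.__qualname__
    ):
        return ("function", callback.__module__, callback.__qualname__)
    return None  # assumption: the caller treats None as "do not cache"


# Two different lambdas: their qualified names are indistinguishable.
sm1 = lambda g, s, t: s
sm2 = lambda g, s, t: -s

same_name = sm1.__qualname__ == sm2.__qualname__
lambda_key = function_cache_key(sm1)
named_key = function_cache_key(json.dumps)
```

Both lambdas report the same qualified name, so a key built from module and qualname alone cannot tell them apart. With the extra `<lambda>` check they fall through to the uncached path, while a named module-level function still gets a stable key.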

greptile-apps · 2026-05-15T01:44:11Z

+        score_mod_kwargs = {
+            "score_mod": _score_mod_causal,
+            "score_mod_bprop": _score_mod_causal_bprop,
+            "score_mod_tensors": {"neg_inf": torch.full((1, 1, 1, 1), -1e9)},
+            "score_mod_bprop_tensors": {"zero": torch.full((1, 1, 1, 1), 0.0)},
+        }


The neg_inf and zero tensors are created on CPU (torch.full defaults to CPU), but the attention computation runs on CUDA. When cuDNN executes the graph it calls into CUDA kernels and expects all variant-pack tensors to reside on the compute device. Passing CPU tensors here will produce a device-mismatch error at graph execution time, causing both the "causal" test cases to fail.

Suggested change

-        score_mod_kwargs = {
-            "score_mod": _score_mod_causal,
-            "score_mod_bprop": _score_mod_causal_bprop,
-            "score_mod_tensors": {"neg_inf": torch.full((1, 1, 1, 1), -1e9)},
-            "score_mod_bprop_tensors": {"zero": torch.full((1, 1, 1, 1), 0.0)},
-        }
+        score_mod_kwargs = {
+            "score_mod": _score_mod_causal,
+            "score_mod_bprop": _score_mod_causal_bprop,
+            "score_mod_tensors": {"neg_inf": torch.full((1, 1, 1, 1), -1e9, device="cuda")},
+            "score_mod_bprop_tensors": {"zero": torch.full((1, 1, 1, 1), 0.0, device="cuda")},
+        }
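A device check on the tensor dictionaries would surface this kind of mistake before graph execution. The following is a minimal sketch, not the validation in this PR: `off_device_tensor_names` is a hypothetical helper, and it relies only on the `.device.type` attribute that torch.Tensor exposes. Stand-in objects replace real tensors so the snippet runs without PyTorch or a GPU.

```python
from types import SimpleNamespace


def off_device_tensor_names(tensors, expected_device_type="cuda"):
    """Return the sorted names of tensors that are not on the expected device type."""
    if not tensors:
        return []
    return sorted(
        name
        for name, tensor in tensors.items()
        if tensor.device.type != expected_device_type
    )


# Stand-ins that mimic only the `.device.type` attribute of a torch.Tensor.
cpu_like = SimpleNamespace(device=SimpleNamespace(type="cpu"))
cuda_like = SimpleNamespace(device=SimpleNamespace(type="cuda"))

bad = off_device_tensor_names({"neg_inf": cpu_like, "zero": cuda_like})
good = off_device_tensor_names({"neg_inf": cuda_like, "zero": cuda_like})
```

A caller could raise an error that lists the offending names when the result is non-empty, which points at the misplaced tensor more directly than a failure inside graph execution would.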

KshitijLakhani · 2026-05-19T16:41:05Z

+    score_mod : bool, default = False
+        Whether a score_mod callback was provided.
+    score_mod_bprop : bool, default = False
+        Whether a score_mod bprop callback was provided.


nit: If this is a bool, to match has_attention_mask, consider has_score_mod and has_score_mod_bprop instead ?

KshitijLakhani · 2026-05-19T16:48:43Z

            logger.debug("Disabling all backends for max_logit with FP8 attention")

+    # Filter: score_mod
+    if score_mod_bprop and not score_mod:


What happens (is expected to happen) if score_mod_bprop=False and score_mod=True ?

It's a perfectly legal case, if, for instance, score_mod is used only for masking.

KshitijLakhani · 2026-05-19T16:53:52Z

+        use_flash_attention = False
+        use_flash_attention_2 = False
+        use_flash_attention_3 = False
+        use_flash_attention_4 = False
+        use_fused_attention = False
+        use_unfused_attention = False


nit: Outside the scope of this PR but would be good to do in this or subsequent PR: having a function or something similar for when performing an action/query on all flash_attention vars

KshitijLakhani · 2026-05-19T16:59:43Z

+        if use_flash_attention_2 or use_flash_attention_3 or use_flash_attention_4:
+            logger.debug("Disabling FlashAttention for score_mod")
+        use_flash_attention = False
+        use_flash_attention_2 = False
+        use_flash_attention_3 = False
+        use_flash_attention_4 = False


Consider this maybe ?:

Suggested change

-        if use_flash_attention_2 or use_flash_attention_3 or use_flash_attention_4:
-            logger.debug("Disabling FlashAttention for score_mod")
-        use_flash_attention = False
-        use_flash_attention_2 = False
-        use_flash_attention_3 = False
-        use_flash_attention_4 = False
+        if use_flash_attention_2 or use_flash_attention_3 or use_flash_attention_4:
+            logger.debug("Disabling FlashAttention for score_mod")
+            use_flash_attention = False
+            use_flash_attention_2 = False
+            use_flash_attention_3 = False
+            use_flash_attention_4 = False

unless there's a good reason to do otherwise ?

KshitijLakhani · 2026-05-19T17:00:14Z

+        if use_unfused_attention:
+            logger.debug("Disabling UnfusedDotProductAttention for score_mod")
+        use_unfused_attention = False


Consider this maybe ?

Suggested change

-        if use_unfused_attention:
-            logger.debug("Disabling UnfusedDotProductAttention for score_mod")
-        use_unfused_attention = False
+        if use_unfused_attention:
+            logger.debug("Disabling UnfusedDotProductAttention for score_mod")
+            use_unfused_attention = False

unless there's a good reason to do otherwise ?

KshitijLakhani · 2026-05-19T18:23:26Z

+            if score_mod is None:
+                assert score_mod_bprop is None, "score_mod_bprop requires score_mod!"
+                assert score_mod_tensors is None, "score_mod_tensors requires score_mod!"
+                assert (
+                    score_mod_bprop_tensors is None
+                ), "score_mod_bprop_tensors requires score_mod!"
+            else:
+                assert callable(score_mod), "score_mod must be callable!"
+                assert score_mod_bprop is None or callable(
+                    score_mod_bprop
+                ), "score_mod_bprop must be callable when provided!"
+                assert query_layer.dtype in [
+                    torch.float16,
+                    torch.bfloat16,
+                ], "score_mod only supports FP16 and BF16 tensors!"
+                assert (
+                    key_layer.dtype == query_layer.dtype and value_layer.dtype == query_layer.dtype
+                ), "score_mod requires Q, K and V tensors to have the same dtype!"
+                assert (
+                    type(query_layer) is torch.Tensor
+                    and type(key_layer) is torch.Tensor
+                    and type(value_layer) is torch.Tensor
+                ), "score_mod only supports unquantized torch.Tensor Q, K and V inputs!"
+                assert not self.fp8, "score_mod is not supported with FP8 DotProductAttention!"
+                assert not fp8_output, "score_mod is not supported with fp8_output!"
+                assert not context_parallel, "score_mod is not supported with context parallelism!"
+                assert qkv_format != "thd", "score_mod is not supported with qkv_format='thd'!"
+                assert (
+                    not user_supplied_seqlens
+                ), "score_mod is mutually exclusive with explicit sequence length metadata!"
+                assert not pad_between_seqs, "score_mod is not supported with pad_between_seqs!"
+                assert (
+                    attention_mask is None
+                ), "score_mod is mutually exclusive with attention_mask!"
+                assert attn_mask_type == "no_mask", "score_mod requires attn_mask_type='no_mask'!"
+                assert window_size is None or window_size == (
+                    -1,
+                    -1,
+                ), "score_mod is mutually exclusive with sliding window attention!"
+                assert (
+                    core_attention_bias_type == "no_bias" and core_attention_bias is None
+                ), "score_mod is mutually exclusive with attention bias!"
+                assert alibi_slopes is None, "score_mod is mutually exclusive with ALiBi!"
+                assert (
+                    self.softmax_type == "vanilla"
+                ), "score_mod is mutually exclusive with sink attention!"
+                assert (
+                    self.attention_dropout == 0.0
+                ), "score_mod is not supported with attention dropout!"
+                assert (
+                    not self.return_max_logit
+                ), "score_mod is not supported with return_max_logit!"
+                assert (
+                    not checkpoint_core_attention
+                ), "score_mod is not supported with checkpoint_core_attention!"
+                assert (
+                    not is_graph_capturing()
+                ), "score_mod is not supported with CUDA graph capture!"
+                assert num_splits == 1, "score_mod is not supported with num_splits != 1!"
+                assert q_format in ["sbhd", "bshd"] and kv_format in [
+                    "sbhd",
+                    "bshd",
+                ], "score_mod only supports SBHD/BSHD QKV formats!"
+                if score_mod_tensors is not None:
+                    assert isinstance(score_mod_tensors, dict), "score_mod_tensors must be a dict!"
+                    assert all(
+                        isinstance(k, str) and isinstance(v, torch.Tensor)
+                        for k, v in score_mod_tensors.items()
+                    ), "score_mod_tensors must map string names to torch.Tensor instances!"
+                if score_mod_bprop_tensors is not None:


Aren't these checks duplicates of the checks in get_attention_backend()?
get_attention_backend() alone can be the source of truth, and this duplication should not be needed here. The only checks that should be added here are those that cannot or should not be added in get_attention_backend().
I'd suggest changing this accordingly.
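To make the "single source of truth" idea concrete, here is a rough sketch of a selector that returns a decision plus a reason instead of asserting at the call site. Everything in it is hypothetical: `ScoreModParams` and `fused_attention_allowed` are illustrative names, the fields are a small subset chosen for the example, and dtypes are plain strings so the snippet runs without PyTorch. The rules mirror a few of the checks quoted above, including the point from this thread that score_mod without score_mod_bprop is legal.

```python
from dataclasses import dataclass


@dataclass
class ScoreModParams:
    """Hypothetical subset of the fields a backend selector might inspect."""

    has_score_mod: bool = False
    has_score_mod_bprop: bool = False
    dtype: str = "bfloat16"
    qkv_format: str = "bshd"
    context_parallel: bool = False


def fused_attention_allowed(params):
    """Return (allowed, reason) for the fused backend under score_mod rules."""
    if params.has_score_mod_bprop and not params.has_score_mod:
        return False, "score_mod_bprop requires score_mod"
    if not params.has_score_mod:
        return True, "score_mod not requested"
    if params.dtype not in ("float16", "bfloat16"):
        return False, "score_mod only supports FP16 and BF16 tensors"
    if params.qkv_format not in ("sbhd", "bshd"):
        return False, "score_mod only supports SBHD/BSHD QKV formats"
    if params.context_parallel:
        return False, "score_mod is not supported with context parallelism"
    return True, "ok"


masking_only = fused_attention_allowed(ScoreModParams(has_score_mod=True))
orphan_bprop = fused_attention_allowed(ScoreModParams(has_score_mod_bprop=True))
thd_layout = fused_attention_allowed(
    ScoreModParams(has_score_mod=True, qkv_format="thd")
)
```

Returning a reason string lets the caller log why a backend was disabled, in the same spirit as the logger.debug calls in the existing filters, and lets tests query support without triggering assertions.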

KshitijLakhani · 2026-05-19T18:50:28Z

            )
            global _attention_backends
-            if is_in_onnx_export_mode():
+            if is_in_onnx_export_mode() and score_mod is None:


Is this necessary here if dpa_utils.get_attention_backend(attention_params) does get called in the else block below?
The flash, fused, and unfused flags would be set in there anyway, right?
Or am I missing something?
cc: @cyanguwa

KshitijLakhani · 2026-05-19T19:17:53Z

        return output.contiguous()


+def _bhsd_dim_stride(


I agree with this, and it was my first thought too.
We should group these functions into a couple of classes that can sit in this file at the very least.

However, I think this is still not the right approach. We should have a separate flex_attention.py file similar to context_parallel.py, and backends.py can import it the same way it imports the CP functions right now.
I strongly recommend this for two reasons:

1. When we refactored attention as a whole early last year, the idea was to modularize attention. That was the reason CP was moved out of attention. With Flex attention's functionality and code in here being fairly decoupled from vanilla DPA, it should be easier to move it out. Leaving this code in here would add ~1000 lines of code that are not related to vanilla DPA and would practically undo the refactoring work we did early last year. The same reason for moving CP to its own file should also apply to Flex attention.

2. A developer/user of TE PyT DPA should not have to worry about the details of flex attention. Similarly, someone modifying flex should not be bogged down by the details of vanilla fused attn. Hence, decoupling is important to aid debugging as well.

KshitijLakhani · 2026-05-19T19:35:58Z

+def _import_cudnn_frontend():
+    """Import the vendored cuDNN frontend if built, otherwise use the installed package."""
+    cudnn_frontend_path = str(_CUDNN_FRONTEND_PYTHON_PATH)
+    cudnn_frontend_package = _CUDNN_FRONTEND_PYTHON_PATH / "cudnn"
+    if (
+        any(cudnn_frontend_package.glob("_compiled_module*"))
+        and cudnn_frontend_path not in sys.path
+    ):
+        sys.path.insert(0, cudnn_frontend_path)
+    return importlib.import_module("cudnn")
+


How about this?:

def _import_cudnn_frontend():
    cudnn_frontend_path = str(_CUDNN_FRONTEND_PYTHON_PATH)
    cudnn_frontend_package = _CUDNN_FRONTEND_PYTHON_PATH / "cudnn"
    if (
        any(cudnn_frontend_package.glob("_compiled_module*"))
        and cudnn_frontend_path not in sys.path
    ):
        sys.path.insert(0, cudnn_frontend_path)
        return importlib.import_module("cudnn")
    # Fall back
    if importlib.util.find_spec("cudnn") is not None:
        return importlib.import_module("cudnn")
    # Fail with a message
    raise ImportError(
        "cuDNN Frontend Python package not found. "
        "Install it with: pip install nvidia-cudnn-frontend"
    )
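The fall-back-then-fail pattern in this suggestion can be sketched generically as below. `import_first_available` is a hypothetical helper, and the module names in the usage lines are placeholders chosen so the snippet runs anywhere. One detail worth noting: the sketch imports importlib.util explicitly, since `import importlib` on its own is not guaranteed to make the `importlib.util` submodule available.

```python
import importlib
import importlib.util


def import_first_available(module_names, install_hint):
    """Import and return the first top-level module that can be found."""
    for name in module_names:
        # find_spec returns None for a missing top-level module
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    raise ImportError(
        f"None of {list(module_names)} could be found. {install_hint}"
    )


# Placeholder names: the first one is expected to be missing, the second is stdlib.
module = import_first_available(
    ["surely_missing_module_xyz", "json"],
    "Install one of them before using this code path.",
)
```

Raising ImportError with an install hint gives users an actionable message instead of a bare ModuleNotFoundError from deep inside the attention code.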

KshitijLakhani · 2026-05-19T19:49:26Z

                return out, max_logit, (None, None, None, d_softmax_offset)


+def _score_mod_causal(score_mod_graph, score_tensor, tensors):


I would strongly recommend that, similar to the CP tests, we have a separate Flex attention test file. Firstly for modularization, and secondly because the Flex attention tests do not really end up using the test_dot_product_attention() base test like other DPA tests in the file do, so there is no code-reuse reason for keeping them here either.

These isolated ~800 lines of code can sit in their own file if they aren't really using the functions in here directly and the flex tests are written as "new" tests; otherwise, the flex tests must reuse the DPA setup in here and integrate into it.

I've also shared more details on this in my comment in the backends.py file

cc: @cyanguwa

KshitijLakhani · 2026-05-19T19:50:45Z

Thanks for creating this PR @vcherepanov-nv
This is great !

I was curious about:

Do you have benchmark numbers based on any toy test cases you might have run? It would be good to have those in here for users of the API.
1. native PyT flex vs TE PyT flex
2. traditional causal TE via cuDNN vs flex expressed causal TE via cuDNN
I've linked the GH issue in the PR description. Could you please update / close it appropriately when this PR is merged
Thanks !

vcherepanov-nv · 2026-05-19T22:11:03Z

Thanks for the thorough review!

Do you have benchmark numbers based on any toy test cases you might have run? It would be good to have those in here for users of the API.

native PyT flex vs TE PyT flex

traditional causal TE via cuDNN vs flex expressed causal TE via cuDNN

I haven't done any benchmarking. Reportedly (from a Slack thread), score_mod can lead to significant perf gains if it avoids mask materialization. For causal, I think I observed cuDNN choosing exactly the same kernel with score_mod and with the explicit causal flag.

I've linked the GH issue in the PR description. Could you please update / close it appropriately when this PR is merged

Sure, thanks for linking!
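As a rough sense of scale for the mask-materialization point above, the arithmetic below estimates the memory a dense mask or bias tensor would occupy. These are illustrative sizes computed from assumed shapes, not measurements from this PR, and `dense_mask_bytes` is a hypothetical helper.

```python
def dense_mask_bytes(batch, heads, seq_q, seq_kv, bytes_per_element):
    """Bytes needed to store a dense [batch, heads, seq_q, seq_kv] tensor."""
    return batch * heads * seq_q * seq_kv * bytes_per_element


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed shapes: one boolean mask broadcast over batch and heads,
# and a 2-byte (FP16/BF16) bias with one slice per head for 32 heads.
shared_bool_mask = dense_mask_bytes(1, 1, 8192, 8192, 1)
per_head_bias = dense_mask_bytes(1, 32, 8192, 8192, 2)

shared_bool_mask_mib = shared_bool_mask / MIB
per_head_bias_gib = per_head_bias / GIB
```

Under these assumptions the broadcast boolean mask takes 64 MiB and the per-head bias takes 4 GiB at a sequence length of 8192. A score_mod that computes the same values inside the kernel would not need that tensor at all, though actual speedups would still have to be benchmarked.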

vcherepanov-nv added 7 commits May 8, 2026 21:41

Add cuDNN score_mod attention path

11c3ed2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Avoid BHSD copies in score_mod attention

eb35191

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Test relative position score_mod attention

57ce106

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Test softcap score_mod attention

e6ba0ea

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Run score_mod graphs on current CUDA stream

dcb6b49

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Add PyTorch score_mod execution plan cache

fefcbe7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Fix score_mod cache edge cases

ac4c60d

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

vcherepanov-nv requested a review from cyanguwa as a code owner May 13, 2026 03:18

vcherepanov-nv added the 2.16.0 label May 13, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

6446825

for more information, see https://pre-commit.ci

vcherepanov-nv mentioned this pull request May 13, 2026

[Draft]Support for score_mod and score_mod_bprop in cuDNN's sdpa #2767

Closed

13 tasks

greptile-apps Bot reviewed May 13, 2026

cyanguwa reviewed May 14, 2026

vcherepanov-nv and others added 3 commits May 15, 2026 00:48

Fix score_mod callback graph cache keys

58a5fb5

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Address score_mod review feedback

c00a0b7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a8ed67e

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 15, 2026

Fix score_mod lambda cache keys

e2a69e1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps Bot reviewed May 15, 2026

KshitijLakhani requested changes May 19, 2026

vcherepanov-nv and others added 2 commits May 19, 2026 21:17

Address flex attention review feedback

96f8ab2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e11cc23

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 19, 2026
