Add examples on how to profile a pipeline #13356

Open
sayakpaul wants to merge 29 commits into main from profiling-workflow
Conversation

@sayakpaul
Member

@sayakpaul sayakpaul commented Mar 28, 2026

What does this PR do?

TL;DR: Adds a guide on how to profile a pipeline and fix issues like CPU overhead, CPU<->GPU syncs, etc.

Motivation

Since we provide first-class torch.compile support, it's important that our pipelines are set up for optimal success with it. This includes spotting any obvious issues that plague torch.compile performance -- CPU overhead, CPU<->GPU syncs, graph breaks, kernel launch delays, etc.

The best way to spot these bugs is to profile a pipeline, as it gives a granular measurement of where the GPU is spending time and if it is doing so in an expected manner. We can then uncover any unexpected issues and eventually fix them.

Workflow

The README.md added in the PR has all the descriptions, but in summary:

  • take a popular pipeline like Flux/Flux2/QwenImage/Wan/LTX2
  • run the profile with 2 inference steps
  • load the trace on Perfetto
  • spot the potential suspects
  • feed that back to Claude along with the trace
    • ask it to attempt a fix
    • review the fix
    • compare the results

With this workflow, I was able to fix some issues in the Flux2 Klein pipeline and the Wan pipeline. All changes look quite harmless to me.

Plan

Not only is it helpful to profile pipelines to get a ceiling on performance, but the community could also help us improve our pipelines should this workflow prove to be useful.

Note to reviewers

Please review the changes in `src/diffusers/*`. You can skip straight to the "Afterwards" section in the README.md document.

The tutorial is currently available here. Some inline comments.

@sayakpaul sayakpaul requested review from DN6, dg845 and stevhliu March 28, 2026 04:41
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@stevhliu stevhliu left a comment


super educational, i enjoyed reading this a lot!

  • maybe rename "Approach" to something like "How the tooling works" because it describes how it works rather than what the user should do
  • it seems like "Afterwards" may be more effective as a blog post as it tells a story about issues 1 and 2 in the "What to look for" section
  • could be useful to add a link to this doc from our torch.compile docs


To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).

### Quick checklist per pipeline
Member

very helpful!

@sayakpaul sayakpaul requested a review from stevhliu March 31, 2026 04:24
- See if `scheduler_step` is disproportionately expensive relative to `transformer_forward` (it should be negligible)
- Spot unexpected CPU work between annotated regions
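The first check above can be made concrete by comparing the aggregated times of the annotated regions. A minimal CPU-only sketch, where the two regions are toy stand-ins for the real annotated pipeline methods:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn(256, 256)

# Stand-ins for the annotated pipeline regions; the real script wraps the
# actual transformer forward pass and scheduler step the same way.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("transformer_forward"):
        for _ in range(10):
            _ = x @ x.T
    with record_function("scheduler_step"):
        _ = x * 0.5

# scheduler_step should be negligible next to transformer_forward.
times = {
    e.key: e.cpu_time_total  # microseconds
    for e in prof.key_averages()
    if e.key in ("transformer_forward", "scheduler_step")
}
print(times)
```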

**4. Eager vs compile comparison**
Contributor

maybe we can mention cuda graphs here?

Member Author

Should it be included under "Smaller CPU gaps"? If so, it's being mentioned a bit later since I wanted to keep the scope specific.

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Contributor

@jbschlosser jbschlosser left a comment


Nice work! Love to see this


# We set the index here to remove DtoH sync, helpful especially during compilation.
# Check out more details here: https://github.com/huggingface/diffusers/pull/11696
self.scheduler.set_begin_index(0)
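For context on why `set_begin_index` helps: deriving the begin index by searching the timestep schedule ends in an `.item()` call, which reads a device tensor back to the host. A minimal sketch of the two paths, using a hypothetical schedule (runs on CPU too):

```python
import torch

timesteps = torch.linspace(1000, 0, 28)  # hypothetical denoising schedule

def begin_index_via_search(t: torch.Tensor) -> int:
    # nonzero() + .item() copy data back to the host; on GPU this issues a
    # DtoH transfer and a stream sync inside the denoising loop.
    return (timesteps == t).nonzero()[0].item()

# Setting the begin index up front (scheduler.set_begin_index(0)) replaces
# the search with a known integer, so no sync is triggered per step.
begin_index = 0
print(begin_index_via_search(timesteps[0]), begin_index)
```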
Contributor

looks good, still feel like we need to broaden the scope of that fix within diffusers at some point :) I'll be out on sabbatical for the next month but I can help when I get back

Member Author

This would be awesome. Created #13375 to track. Thanks for offering to help.

Comment on lines +316 to +318
* Use of CUDA Graphs can also help mitigate CPU overhead related issues. When
using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the
use of CUDA Graphs.
Contributor

I'm glad you mentioned this here - I wonder if it's worth clarifying in the respective sections of this doc that CUDAGraph usage is the reason why we expect gap removal from using torch.compile
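For readers following along, CUDA Graph capture is opted into purely through the `torch.compile` mode. A sketch, where `denoise_step` is a toy stand-in for a transformer forward:

```python
import torch

def denoise_step(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)

# "default" compiles without CUDA Graphs; "reduce-overhead" and
# "max-autotune" additionally capture CUDA Graphs on GPU, which is what
# collapses the per-kernel launch gaps discussed in this thread.
eager_like = torch.compile(denoise_step, mode="default")
graphed = torch.compile(denoise_step, mode="reduce-overhead")
```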

Member Author

The results and the graph presented in this doc were obtained with the "default" compilation mode (along with regional compilation):

COMPILE_ARGS="--compile_regional --compile_fullgraph --compile_mode default"

So, not sure. There are also changes being shipped in this PR that helped mitigate the stalling issues 👀

@sayakpaul sayakpaul added the performance Anything related to performance improvements, profiling and benchmarking label Apr 1, 2026

Education materials to strategically profile pipelines to potentially improve their
runtime with `torch.compile`. To set these pipelines up for success with `torch.compile`,
we often have to get rid of DtoH syncs, CPU overheads, kernel launch delays, and
Collaborator

Suggested change
we often have to get rid of DtoH syncs, CPU overheads, kernel launch delays, and
we often have to get rid of device-to-host (DtoH) syncs, CPU overheads, kernel launch delays, and

I'm not sure if this terminology will be familiar to the target audience, I had to look it up the first time to find out what it meant.


## Context

We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial under `torch.compile`. The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses `torch.profiler` with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).
Collaborator

Suggested change
We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial under `torch.compile`. The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses `torch.profiler` with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).
We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial when using [`torch.compile`](https://docs.pytorch.org/docs/stable/generated/torch.compile.html). The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses [`torch.profiler`](https://docs.pytorch.org/docs/stable/profiler.html) with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).

I think adding links to the torch.compile and torch.profiler docs could be useful for following along, especially if readers aren't familiar with them.


## How the Tooling Works

Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome trace.
Collaborator

Suggested change
Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome trace.
Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome JSON trace.
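The annotate-then-profile pattern above, condensed into a minimal CPU-safe sketch (`transformer_forward` here is a toy stand-in for a real pipeline method):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def transformer_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Annotate the method so it shows up as a named slice in the trace.
    with record_function("transformer_forward"):
        return x @ w

x, w = torch.randn(64, 64), torch.randn(64, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(2):  # mirror the guide's 2-step profiling runs
        transformer_forward(x, w)

# Chrome JSON trace, viewable at ui.perfetto.dev.
prof.export_chrome_trace("trace.json")
```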


## Verification

1. Run: `python profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`
Collaborator

Suggested change
1. Run: `python profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`
1. Run: `python examples/profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`

I think running the script from the diffusers root directory might be more common?


Open the exported `.json` trace at [ui.perfetto.dev](https://ui.perfetto.dev/). The trace has two main rows: **CPU** (top) and **CUDA** (bottom). In Perfetto, the CPU row is typically labeled with the process/thread name (e.g., `python (PID)` or `MainThread`) and appears at the top. The CUDA row is labeled `GPU 0` (or similar) and appears below the CPU rows.

**Navigation:** Use `W` to zoom in, `S` to zoom out, and `A`/`D` to pan left/right. You can also scroll to zoom and click-drag to pan. Use `Shift+scroll` to scroll vertically through rows.
Collaborator

I also found `,` and `.`, which select the previous/next track event, useful for e.g. finding the next `transformer_forward` event, or finding the next GPU kernel on the GPU kernel track.

Open both traces side by side (two Perfetto tabs). Key differences to look for:
- **Fewer, wider CUDA kernels** in compile mode (fused ops) vs many small kernels in eager
- **Smaller CPU gaps** between kernels in compile mode (less Python dispatch overhead)
- **CUDA kernel count per step**: to compare, zoom into a single `transformer_forward` span on the CUDA row and count the distinct kernel slices within it. In eager mode you'll typically see many narrow slices (one per op); in compile mode these fuse into fewer, wider slices. A quick way to estimate: select a time range covering one denoising step on the CUDA row — Perfetto shows the number of slices in the selection summary at the bottom. If compile mode shows a similar kernel count to eager, fusion isn't happening effectively (likely due to graph breaks).
Collaborator

When I tried comparing an eager and compile trace side-by-side, I found that it was difficult to find corresponding events because if `--mode compile` is used, there appear to be no `transformer_forward` events in the trace, but rather one large `## Call CompiledFxGraph...` event. (I think if regional compilation is used via `--compile_regional`, the `transformer_forward` event does appear again, with the `## CompiledFxGraph...` events under the `transformer_forward` events.)
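One way to take the manual counting out of the comparison above: Chrome traces are plain JSON, so kernel slices can be counted directly. A stdlib-only sketch; the `"kernel"` category string is an assumption that may vary across PyTorch versions:

```python
import json

def count_cuda_kernels(trace: dict) -> int:
    # torch.profiler's Chrome export tags GPU kernel slices with the
    # "kernel" category; eager traces show many narrow slices, compiled
    # traces should show fewer, wider (fused) ones.
    return sum(1 for e in trace.get("traceEvents", []) if e.get("cat") == "kernel")

# Inline stand-in for json.load(open("trace.json")):
sample = {
    "traceEvents": [
        {"name": "elementwise_kernel", "cat": "kernel", "dur": 12},
        {"name": "gemm_kernel", "cat": "kernel", "dur": 480},
        {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "dur": 6},
    ]
}
print(count_cuda_kernels(sample))  # → 2
```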

- There may be implicit syncs forcing serialization
- `torch.compile` should help here by batching launches — compare eager vs compile to confirm

To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
Collaborator

Suggested change
To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it (there should be an arrow pointing from the CPU launch slice to the GPU kernel slice). The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).

Not sure about the exact wording, but I think mentioning this is helpful, especially if there is a big temporal gap between the CPU cudaLaunchKernel and corresponding GPU kernel execution.

May also be worth mentioning that when the GPU kernel is selected, the corresponding cudaLaunchKernel event should be in the "Preceding Flows" section; I believe the "Delay" column then gives the exact launch latency.

### Spotting gaps between launches

Then a reasonable next step is to spot frequent gaps between kernel executions. In the compiled
case, we don't spot any on the surface. But if we zone in, some become apparent.
Collaborator

Suggested change
case, we don't spot any on the surface. But if we zone in, some become apparent.
case, we don't spot any on the surface. But if we zoom in, some become apparent.

nit: typo

</table>

So, we provided the profile trace file (with compilation) to Claude, asked it to find the instances of
"cudaStreamSynchronize" and "cudaDeviceSynchronize", and to come up with some potential fixes.
Collaborator

Suggested change
"cudaStreamSynchronize" and "cudaDeviceSynchronize", and to come up with some potential fixes.
`cudaStreamSynchronize` and `cudaDeviceSynchronize`, and to come up with some potential fixes.

nit: formatting

</tr>
</table>

### Spotting gaps between launches
Collaborator

I find this section somewhat unsatisfying because it glosses over the reasoning behind each step in the profiling process. I think it would be more instructive if the chain of reasoning that led us (and Claude) to

  1. Find the `tqdm` and `_unpack_latents_with_ids` fixes
  2. Find out that these fixes weren't sufficient, and why they weren't sufficient
  3. Discover that `cache_context` was the bottleneck

was discussed at greater length in this section.

```

The changes looked reasonable based on our past experience. So, we asked Claude to apply these changes to [`pipeline_flux2_klein.py`](../../src/diffusers/pipelines/flux2/pipeline_flux2_klein.py). We then profiled
the updated pipeline. It still didn't eliminate the gaps as expected, so we fed that back to Claude and
Collaborator

I think it would be nice to have a demonstration using the profiling outputs that the _unpack_latents_with_ids fix indeed eliminates a DtoH sync because the current wording

It still didn't eliminate the gaps as expected

makes it unclear if the change is effective.

|------------------------|------------------------------|-----------------------------|
| `_set_context` total | 21.6ms (8 calls) | 0.0ms (8 calls) |
| `cache_context` total | 21.7ms | 0.1ms |
| CPU gaps | 5,523us / 8,007us / 5,508us | 158us / 2,777us / 136us |
Collaborator

I think adding another row with the total wall clock time (or another "overall performance" metric) before and after would be useful here because it's not obvious to me that reducing the CPU gaps here necessarily leads to better performance overall.


![GPU idle](https://huggingface.co/datasets/sayakpaul/torch-profiling-trace-diffusers/resolve/main/Wan/Screenshot%202026-03-27%20at%205.56.39%E2%80%AFPM.png)

The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a "cudaMemcpyAsync + cudaStreamSynchronize" to copy
Collaborator

Suggested change
The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a "cudaMemcpyAsync + cudaStreamSynchronize" to copy
The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a `cudaMemcpyAsync` + `cudaStreamSynchronize` to copy

nit: formatting
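To illustrate the class of fix for the UniPC sync described above (a sketch, not the exact patch in this PR): the per-step host constant can be replaced with a Python scalar, or cached once outside the loop:

```python
import torch

def step_with_sync(x: torch.Tensor) -> torch.Tensor:
    # Building a tensor from a Python list inside step() copies host memory
    # to the device on every call (cudaMemcpyAsync + cudaStreamSynchronize
    # on GPU).
    half = torch.tensor([0.5], dtype=x.dtype, device=x.device)
    return x * half

def step_without_sync(x: torch.Tensor) -> torch.Tensor:
    # Multiplying by a Python scalar needs no host->device tensor copy.
    return x * 0.5

x = torch.ones(4)
print(torch.equal(step_with_sync(x), step_without_sync(x)))  # → True
```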

Comment on lines +316 to +318
* Use of CUDA Graphs can also help mitigate CPU overhead related issues. When
using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the
use of CUDA Graphs.
Collaborator

Suggested change
* Use of CUDA Graphs can also help mitigate CPU overhead related issues. When
using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the
use of CUDA Graphs.
* Use of CUDA Graphs can also help mitigate CPU overhead related issues. CUDA Graphs can be enabled by setting the `torch.compile` mode to `"reduce-overhead"` or `"max-autotune"`.

nit: I think the wording here is awkward

return latents

@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._unpack_latents_with_ids
Collaborator

Since `_unpack_latents_with_ids` was originally marked `# Copied from Flux2Pipeline`, should the changes here also be propagated to that pipeline?

Comment on lines +906 to +907
rks.append(torch.ones((), device=device))
rks = torch.stack(rks)
Collaborator

Should this change also be explained in the docs (examples/profiling/README.md)?

Collaborator

@dg845 dg845 left a comment


Thanks for the PR! Left some questions/suggestions :).
