Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
**stevhliu** left a comment
super educational, i enjoyed reading this a lot!
- maybe rename "Approach" to something like "How the tooling works" because it describes how it works rather than what the user should do
- it seems like "Afterwards" may be more effective as a blog post as it tells a story about issues 1 and 2 in the "What to look for" section
- could be useful to add a link to this doc from our torch.compile docs
> To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
> ### Quick checklist per pipeline
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
> - See if `scheduler_step` is disproportionately expensive relative to `transformer_forward` (it should be negligible)
> - Spot unexpected CPU work between annotated regions
> **4. Eager vs compile comparison**
maybe we can mention cuda graphs here?
Should it be included under "Smaller CPU gaps"? If so, it's being mentioned a bit later since I wanted to keep the scope specific.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
**jbschlosser** left a comment
Nice work! Love to see this
```python
# We set the index here to remove DtoH sync, helpful especially during compilation.
# Check out more details here: https://github.com/huggingface/diffusers/pull/11696
self.scheduler.set_begin_index(0)
```
looks good, still feel like we need to broaden the scope of that fix within diffusers at some point :) I'll be out on sabbatical for the next month but I can help when I get back
This would be awesome. Created #13375 to track. Thanks for offering to help.
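For readers unfamiliar with the sync pattern that PR removed, here is a minimal CPU-runnable sketch (illustrative only; in a real pipeline the timestep tensor lives on the GPU, which is what makes the copy a blocking device-to-host transfer):

```python
import torch

timesteps = torch.arange(1000)  # lives on the GPU in a real pipeline

# Pattern diffusers#11696 removed (sketch): .item() copies a scalar back
# to the host, forcing the CPU to wait for the GPU (a DtoH sync).
begin_index = (timesteps == 999).nonzero().item()

# With scheduler.set_begin_index(0), the scheduler can instead track the
# step index on the host side and skip this device-side lookup entirely.
```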
> * Use of CUDA Graphs can also help mitigate CPU overhead related issues. When using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the use of CUDA Graphs.
I'm glad you mentioned this here - I wonder if it's worth clarifying in the respective sections of this doc that CUDAGraph usage is the reason why we expect gap removal from using torch.compile
The results and the graph presented in this doc were obtained with the "default" compilation mode (along with regional compilation):
So not sure? There are also changes being shipped in this PR that helped mitigate the stalling issues 👀
> Education materials to strategically profile pipelines to potentially improve their runtime with `torch.compile`. To set these pipelines up for success with `torch.compile`, we often have to get rid of DtoH syncs, CPU overheads, kernel launch delays, and
```diff
- we often have to get rid of DtoH syncs, CPU overheads, kernel launch delays, and
+ we often have to get rid of device-to-host (DtoH) syncs, CPU overheads, kernel launch delays, and
```
I'm not sure if this terminology will be familiar to the target audience, I had to look it up the first time to find out what it meant.
> ## Context
> We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial under `torch.compile`. The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses `torch.profiler` with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).
```diff
- We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial under `torch.compile`. The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses `torch.profiler` with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).
+ We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial when using [`torch.compile`](https://docs.pytorch.org/docs/stable/generated/torch.compile.html). The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses [`torch.profiler`](https://docs.pytorch.org/docs/stable/profiler.html) with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).
```
I think adding links to the torch.compile and torch.profiler docs could be useful for following along, especially if readers aren't familiar with them.
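For readers following along, a minimal sketch of the flux-fast-style annotation pattern. The function names here are hypothetical stand-ins; the real script wraps the pipeline's transformer and scheduler calls instead:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Hypothetical stand-in for an annotated pipeline method.
def transformer_forward(x, w):
    return x @ w

x, w = torch.randn(8, 64), torch.randn(64, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(2):  # two toy "denoising steps"
        with record_function("transformer_forward"):
            h = transformer_forward(x, w)
        with record_function("scheduler_step"):
            x = h * 0.99  # toy scheduler update

# Export a Chrome/Perfetto-compatible JSON trace for ui.perfetto.dev.
prof.export_chrome_trace("trace.json")
```

On a CUDA machine you would add `ProfilerActivity.CUDA` to also capture the GPU rows discussed below.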
> ## How the Tooling Works
> Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome trace.
```diff
- Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome trace.
+ Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome JSON trace.
```
> ## Verification
> 1. Run: `python profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`
```diff
- 1. Run: `python profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`
+ 1. Run: `python examples/profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 2`
```
I think running the script from the diffusers root directory might be more common?
> Open the exported `.json` trace at [ui.perfetto.dev](https://ui.perfetto.dev/). The trace has two main rows: **CPU** (top) and **CUDA** (bottom). In Perfetto, the CPU row is typically labeled with the process/thread name (e.g., `python (PID)` or `MainThread`) and appears at the top. The CUDA row is labeled `GPU 0` (or similar) and appears below the CPU rows.
> **Navigation:** Use `W` to zoom in, `S` to zoom out, and `A`/`D` to pan left/right. You can also scroll to zoom and click-drag to pan. Use `Shift+scroll` to scroll vertically through rows.
I also found `,` and `.`, which select the previous/next track event, useful for e.g. finding the next `transformer_forward` event, or finding the next GPU kernel on the GPU kernel track.
> Open both traces side by side (two Perfetto tabs). Key differences to look for:
> - **Fewer, wider CUDA kernels** in compile mode (fused ops) vs many small kernels in eager
> - **Smaller CPU gaps** between kernels in compile mode (less Python dispatch overhead)
> - **CUDA kernel count per step**: to compare, zoom into a single `transformer_forward` span on the CUDA row and count the distinct kernel slices within it. In eager mode you'll typically see many narrow slices (one per op); in compile mode these fuse into fewer, wider slices. A quick way to estimate: select a time range covering one denoising step on the CUDA row — Perfetto shows the number of slices in the selection summary at the bottom. If compile mode shows a similar kernel count to eager, fusion isn't happening effectively (likely due to graph breaks).
When I tried comparing an eager and compile trace side-by-side, I found that it was difficult to find corresponding events because if `--mode compile` is used, there appear to be no `transformer_forward` events in the trace, but rather one large `## Call CompiledFxGraph...` event. (I think if regional compilation is used via `--compile_regional`, the `transformer_forward` event does appear again, with the `## CompiledFxGraph...` events under the `transformer_forward` events.)
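That matches how regional compilation composes with profiler annotations: only the inner blocks are compiled, so spans recorded around the full forward stay visible as parents of the compiled-graph events. A toy sketch of the pattern (using the debugging `"eager"` backend so it runs anywhere; real runs would use the default inductor backend):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Hypothetical stand-in for a repeated transformer block.
    def forward(self, x):
        return torch.relu(x) + 1.0

model = nn.Sequential(Block(), Block())

# Regional compilation (sketch): compile each repeated block instead of
# the whole model, so annotations placed around the outer forward remain
# visible in the trace above the compiled-graph events.
for i, block in enumerate(model):
    model[i] = torch.compile(block, backend="eager")

y = model(torch.zeros(2, 3))
```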
> - There may be implicit syncs forcing serialization
> - `torch.compile` should help here by batching launches — compare eager vs compile to confirm
> To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
```diff
- To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
+ To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it (there should be an arrow pointing from the CPU launch slice to the GPU kernel slice). The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).
```
Not sure about the exact wording, but I think mentioning this is helpful, especially if there is a big temporal gap between the CPU cudaLaunchKernel and corresponding GPU kernel execution.
May also be worth mentioning that when the GPU kernel is selected, the corresponding cudaLaunchKernel event should be in the "Preceding Flows" section; I believe the "Delay" column then gives the exact launch latency.
> ### Spotting gaps between launches
>
> Then a reasonable next step is to spot frequent gaps between kernel executions. In the compiled case, we don't spot any on the surface. But if we zone in, some become apparent.
```diff
- case, we don't spot any on the surface. But if we zone in, some become apparent.
+ case, we don't spot any on the surface. But if we zoom in, some become apparent.
```
nit: typo
> </table>
> So, we provided the profile trace file (with compilation) to Claude, asked it to find the instances of "cudaStreamSynchronize" and "cudaDeviceSynchronize", and to come up with some potential fixes.
```diff
- "cudaStreamSynchronize" and "cudaDeviceSynchronize", and to come up with some potential fixes.
+ `cudaStreamSynchronize` and `cudaDeviceSynchronize`, and to come up with some potential fixes.
```
nit: formatting
> </tr>
> </table>
> ### Spotting gaps between launches
I find this section somewhat unsatisfying because it glosses over the reasoning behind each step in the profiling process. I think it would be more instructive if the chain of reasoning that led us (and Claude) to
- Find the `tqdm` and `_unpack_latents_with_ids` fixes
- Find out that these fixes weren't sufficient, and why they weren't sufficient
- Discover that `cache_context` was the bottleneck

was discussed at greater length in this section.
> The changes looked reasonable based on our past experience. So, we asked Claude to apply these changes to [`pipeline_flux2_klein.py`](../../src/diffusers/pipelines/flux2/pipeline_flux2_klein.py). We then profiled the updated pipeline. It still didn't eliminate the gaps as expected so, we fed that back to Claude and
I think it would be nice to have a demonstration using the profiling outputs that the `_unpack_latents_with_ids` fix indeed eliminates a DtoH sync, because the current wording

> It still didn't eliminate the gaps as expected

makes it unclear if the change is effective.
> |------------------------|------------------------------|-----------------------------|
> | `_set_context` total | 21.6ms (8 calls) | 0.0ms (8 calls) |
> | `cache_context` total | 21.7ms | 0.1ms |
> | CPU gaps | 5,523us / 8,007us / 5,508us | 158us / 2,777us / 136us |
I think adding another row with the total wall clock time (or another "overall performance" metric) before and after would be useful here because it's not obvious to me that reducing the CPU gaps here necessarily leads to better performance overall.
|  | ||
> The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a "cudaMemcpyAsync + cudaStreamSynchronize" to copy
```diff
- The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a "cudaMemcpyAsync + cudaStreamSynchronize" to copy
+ The UniPC scheduler (used in Wan) creates small constant tensors via `torch.tensor([0.5], dtype=x.dtype, device=device)` during `step()`. This triggers a `cudaMemcpyAsync` + `cudaStreamSynchronize` to copy
```
nit: formatting
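The host-copy pattern can be demonstrated in isolation. `torch.full` is one possible copy-free alternative shown here as an assumption for illustration, not necessarily the fix that was merged:

```python
import torch

x = torch.randn(4)

# Pattern flagged in the trace: building a tensor from a Python list
# materializes it on the host first, then copies it to the device
# (cudaMemcpyAsync + cudaStreamSynchronize on CUDA).
half_sync = torch.tensor([0.5], dtype=x.dtype, device=x.device)

# Possible copy-free alternative (hypothetical): torch.full writes the
# constant directly on the target device, with no host-side buffer.
half_free = torch.full((1,), 0.5, dtype=x.dtype, device=x.device)
```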
> * Use of CUDA Graphs can also help mitigate CPU overhead related issues. When using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the use of CUDA Graphs.
```diff
- * Use of CUDA Graphs can also help mitigate CPU overhead related issues. When
-   using "reduce-overhead" and "max-autotune" in `torch.compile` triggers the
-   use of CUDA Graphs.
+ * Use of CUDA Graphs can also help mitigate CPU overhead related issues. CUDA Graphs can be enabled by setting the `torch.compile` mode to `"reduce-overhead"` or `"max-autotune"`.
```
nit: I think the wording here is awkward
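To make the suggested wording concrete, enabling CUDA Graphs is just a `mode` argument. A minimal sketch, with a hypothetical stand-in function (note that `torch.compile` is lazy, so wrapping alone does not trigger compilation):

```python
import torch

def denoise_step(x):
    # Hypothetical stand-in for a pipeline's transformer forward.
    return torch.sin(x) + torch.cos(x)

# "reduce-overhead" (or "max-autotune") enables CUDA Graphs on GPU:
# captured kernel launches are replayed, cutting per-launch CPU cost.
compiled = torch.compile(denoise_step, mode="reduce-overhead")

# Compilation happens on the first call; on CUDA, later calls replay
# the captured graph. Eager reference for comparison:
y_eager = denoise_step(torch.zeros(4))
```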
```python
    return latents

@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._unpack_latents_with_ids
```
Since `_unpack_latents_with_ids` was originally `# Copied from` `Flux2Pipeline`, should the changes here also be propagated to that pipeline?
```python
rks.append(torch.ones((), device=device))
rks = torch.stack(rks)
```
Should this change also be explained in the docs (`examples/profiling/README.md`)?
**dg845** left a comment
Thanks for the PR! Left some questions/suggestions :).
What does this PR do?
TL;DR: Adds a guide on how to profile a pipeline and fix issues like CPU overhead, CPU<->GPU syncs, etc.
Motivation
Since we provide first-class `torch.compile` support, it's important that our pipelines are set up for optimal success with it. This includes spotting any obvious issues that plague `torch.compile` performance -- CPU overhead, CPU<->GPU syncs, graph breaks, kernel launch delays, etc.

The best way to spot these bugs is to profile a pipeline, as it gives a granular measurement of where the GPU is spending time and if it is doing so in an expected manner. We can then uncover any unexpected issues and eventually fix them.
Workflow
The README.md added in the PR has all the descriptions, but in summary:
With this workflow, I was able to fix some issues in the Flux2 Klein pipeline and the Wan pipeline. All changes look quite harmless to me.
Plan
Not only is it helpful to profile pipelines to get a ceiling on performance, but the community could also help us improve our pipelines should this workflow prove to be useful.
Note to reviewers
Please review the changes in `src/diffusers/*`. You can skip straight to the "Afterwards" section in the `README.md` document.

The tutorial is currently available here. Some inline comments.