Add CUDA graph kernel annotations tutorial#3915

Open
yushangdi wants to merge 12 commits into main from cudagraph_annotation

@yushangdi
Contributor

This tutorial demonstrates how to use CUDA graph kernel annotations for semantic profiling traces with custom visualization lanes.

Features:

  • End-to-end workflow from graph capture to visualization
  • Transformer block example with annotated regions
  • Post-processing to merge annotations into profiler traces
  • Custom stream assignments for semantic organization
  • Version checking for cuda-bindings compatibility
  • Clear error messages with upgrade instructions

The tutorial includes:

  • mark_kernels() context manager usage
  • Graph capture with enable_annotations=True
  • Profiling and trace post-processing
  • Before/after comparison
  • Troubleshooting guide
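
A minimal sketch of the post-processing step, assuming the profiler trace is a Chrome Trace Event JSON object and that each kernel event carries an identifier that can be matched to an annotation. The function name `merge_annotations`, the `"graph node id"` args key, and the label-to-lane mapping are illustrative assumptions, not the tutorial's actual code; the lane numbers 61 and 62 and the labels come from the screenshots described in this PR, paired arbitrarily here.

```python
import copy


def merge_annotations(trace, annotations, lanes, id_key="graph node id"):
    """Return a copy of `trace` with annotated kernel events relabelled.

    trace:       dict in Chrome Trace Event format ({"traceEvents": [...]})
    annotations: hypothetical mapping of kernel node id -> semantic label
    lanes:       mapping of semantic label -> tid used as the display lane
    id_key:      ASSUMED name of the args field that identifies a kernel
    """
    merged = copy.deepcopy(trace)  # leave the caller's trace untouched
    for event in merged.get("traceEvents", []):
        if event.get("cat") != "kernel":
            continue  # only GPU kernel events are relabelled
        node_id = event.get("args", {}).get(id_key)
        label = annotations.get(node_id)
        if label is None:
            continue  # unannotated kernels stay on their original lane
        event["name"] = f"{label}: {event.get('name', '')}"
        event["tid"] = lanes.get(label, event.get("tid"))
        event.setdefault("args", {})["annotation"] = label
    return merged


# Tiny hand-written trace standing in for a real profiler export.
trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "pid": 0,
         "tid": 7, "ts": 0, "dur": 5, "args": {"graph node id": 1}},
        {"ph": "X", "cat": "kernel", "name": "softmax_kernel", "pid": 0,
         "tid": 7, "ts": 5, "dur": 2, "args": {"graph node id": 2}},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "pid": 1,
         "tid": 1, "ts": 0, "dur": 1, "args": {}},
    ]
}
merged = merge_annotations(
    trace,
    annotations={1: "mlp", 2: "attention"},
    lanes={"attention": 61, "mlp": 62},
)
```

Writing `merged` back out with `json.dump` would give a file that chrome://tracing can load, with the two annotated kernels on separate lanes and the CPU event unchanged.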

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

@pytorch-bot

pytorch-bot Bot commented Jun 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3915

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e2ab640 with merge base cdc645a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the cla signed label Jun 2, 2026
This tutorial demonstrates how to use CUDA graph kernel annotations
for semantic profiling traces with custom visualization lanes.

Features:
- End-to-end workflow from graph capture to visualization
- Transformer block example with annotated regions
- Post-processing to merge annotations into profiler traces
- Custom stream assignments for semantic organization
- Version checking for cuda-bindings compatibility
- Clear error messages with upgrade instructions

The tutorial includes:
- mark_kernels() context manager usage
- Graph capture with enable_annotations=True
- Profiling and trace post-processing
- Before/after comparison
- Troubleshooting guide

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@yushangdi yushangdi force-pushed the cudagraph_annotation branch from 4a6f9d9 to c39bac5 Compare June 2, 2026 20:32
yushangdi and others added 8 commits June 2, 2026 20:33
This is required for the CUDA graph annotations tutorial to work
with full annotation support. The cudaGraphNodeGetToolsId API was
added in cuda-bindings 13.3.0.

- Removed check_cuda_bindings_version() function since PyTorch core
  now provides the warning via _probe_tools_id()
- Updated PyTorch requirement from 2.0+ to 2.13+ (required for
  the annotation APIs used in this tutorial)
- Simplified error messaging to reference PyTorch's built-in warnings

Changed the overview to emphasize:
- Ability to add semantic labels to kernels
- Understanding what each kernel does during profiling
- Labeling and organizing kernels by function

Rather than focusing on splitting kernels across streams,
the overview now centers on the annotation feature itself.

Updated the prerequisites card at the top to show PyTorch 2.12+
(was still showing 2.0). Also updated cuda-python to cuda-bindings
for consistency.

Added chrome://tracing screenshots showing:
- Before: All 65 kernels on single stream with auto-generated names
- After: Kernels organized into semantic lanes (streams 61, 62)
  with meaningful labels (attention, mlp)

Screenshots demonstrate the value of kernel annotations for
understanding execution structure and identifying components.

Move `if __name__ == "__main__": main()` to immediately after the
main() function definition (line ~404) so it executes during the
Sphinx Gallery build process.

Sphinx Gallery requires the execution guard to be positioned right
after the function definition, not at the end of the file, to properly
capture and execute the tutorial code during documentation generation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Comment thread .ci/docker/requirements.txt Outdated
matplotlib
librosa
torch==2.12
cuda-bindings>=13.3.0 # Required for CUDA graph annotations tutorial

You should be able to use an earlier version?

Contributor Author

changed to 13.1.0
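
For readers who want to confirm the pin locally, here is a rough sketch of a minimum-version check against the `cuda-bindings` distribution named in requirements.txt. It is not the `check_cuda_bindings_version()` helper that this PR removed, and the names `parse_version` and `cuda_bindings_ok` are hypothetical; the parser is deliberately simple and treats pre-release suffixes such as `rc1` as equal to the final release.

```python
from importlib import metadata

MIN_CUDA_BINDINGS = (13, 1, 0)  # minimum version stated in this thread


def parse_version(text):
    """Turn '13.1.0', '13.1' or '13.1.0.post1' into a 3-integer tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple((parts + [0, 0, 0])[:3])  # pad so '13.1' compares as 13.1.0


def cuda_bindings_ok(installed=None, minimum=MIN_CUDA_BINDINGS):
    """Return (ok, message); looks up the installed version if none given."""
    wanted = ".".join(str(n) for n in minimum)
    if installed is None:
        try:
            installed = metadata.version("cuda-bindings")
        except metadata.PackageNotFoundError:
            return False, f"cuda-bindings not installed; need >= {wanted}"
    if parse_version(installed) >= minimum:
        return True, f"cuda-bindings {installed} satisfies >= {wanted}"
    return False, f"cuda-bindings {installed} is older than {wanted}; upgrade it"


ok, message = cuda_bindings_ok("12.9.4")  # explicit version for illustration
```

Calling `cuda_bindings_ok()` with no arguments inspects the current environment instead of a supplied version string.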

yushangdi and others added 2 commits June 2, 2026 21:10
These files were accidentally included in the previous commit:
- traces/ directory (both root and advanced_source/)
- Screenshot PNG files
- CUDA_GRAPH_TUTORIAL_README.md

These are build artifacts and temporary files that should not be
committed to the repository.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Address review comment: cudaGraphNodeGetToolsId API was introduced in
CUDA 13.1.0, not 13.3.0. Update requirements and tutorial documentation
to reflect the correct minimum version.

Verified via CUDA documentation that the API first appeared in 13.1.0.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@yushangdi yushangdi requested a review from BoyuanFeng June 3, 2026 19:05
@yushangdi yushangdi force-pushed the cudagraph_annotation branch from 182897e to 50cac66 Compare June 3, 2026 20:25
Replace the main() execution with commented-out code and static example
output. This avoids CI environment dependencies and ensures consistent
output in the documentation preview.

The example output shows the tutorial workflow with 0 annotated nodes,
which reflects the current CI environment where CUDA graph annotations
may not be fully supported but demonstrates the graceful fallback
behavior.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@yushangdi yushangdi marked this pull request as ready for review June 3, 2026 20:40
@yushangdi yushangdi requested a review from ngimel June 3, 2026 20:40

2 participants