Benchmarks: cuda.core#2005
Conversation
Do you have a side-by-side bindings-vs-core delta table that you could post here?
Quick "Low" findings from Cursor GPT-5.4 Extra High Fast
Here is a table: On this case the
Updated the table with my last numbers, sorry about that. The question here is whether we are OK with cuda.core being faster because of object construction and, I think, because of caching some of the device creation. In general I would say yes, but I understand that one could argue the comparison is not fair.
rwgk
left a comment
Cursor GPT-5.4 Extra High Fast
Findings
- No blocking or medium findings in the current PR state. I do not see any obvious or overlooked issue that should stop merge, and I would be comfortable approving it as-is.
- Low:
`benchmarks/cuda_core/benchmarks/bench_ctx_device.py` still says `Device()` with no args returns the TLS-cached current device, but `cuda_core/cuda/core/_device.pyx` resolves that path via `cuCtxGetDevice()` when a context is active. The benchmark behavior itself is fine, and `benchmarks/cuda_core/compare.py` already marks that row as a different code path, so this is just a comment cleanup or follow-on item.
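To make the distinction behind this finding concrete, here is a toy pure-Python contrast of the two resolution strategies (entirely hypothetical names and state; the real path lives in `cuda_core/cuda/core/_device.pyx`): a thread-local cache returns whatever was stashed last, while a `cuCtxGetDevice()`-style lookup asks the active context each time, so the two can disagree after a context switch.

```python
import threading

# Simulated "driver" state: device id of the currently active context
# (hypothetical stand-in, not the CUDA driver).
_active_ctx_device = 0

def cu_ctx_get_device_mock():
    """Stand-in for asking the driver which device the current context uses."""
    return _active_ctx_device

_tls = threading.local()

def device_from_tls_cache():
    """Strategy the stale comment describes: return the TLS-cached device."""
    return getattr(_tls, "current_device", None)

def device_from_context():
    """Strategy the finding says the code actually uses: query the context."""
    return cu_ctx_get_device_mock()

_tls.current_device = 0
_active_ctx_device = 1  # context switches; the TLS cache is now stale
print(device_from_tls_cache(), device_from_context())  # → 0 1
```

The two strategies only diverge once the active context changes underneath the cache, which is exactly why the docstring wording matters even though the benchmark numbers are unaffected.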
Assumptions
- The earlier missing `BENCHMARK_PLAN.md` references are fixed in the current branch.
- I re-checked the current head `578ac995` after the `main` merge, and I did not see any new issue introduced by the merge commit.
- Validation I ran: `pre-commit` on the touched files passed, `pixi run -e source -- python -m pytest tests/test_runner.py -q` passed in `benchmarks/cuda_bindings`, and representative `cuda.core` wheel benchmarks still ran successfully.
Change Summary
- Technically, the PR still does the same two main things: it generalizes the shared pyperf runner in `benchmarks/cuda_bindings/runner/main.py` so multiple suites can reuse it, and it adds a new `benchmarks/cuda_core` suite with matching benchmark IDs, suite-local runtime and setup, and a bindings-vs-core comparison tool in `benchmarks/cuda_core/compare.py`.
- The follow-up commit that removed the plan-document references addressed the only prior concrete review issue I had, and after the fresh pass I do not see anything left that looks merge-blocking.
/ok to test 578ac99
@rwgk, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 7b63160
Description
This is for matching the benchmarks we have been doing for `cuda.bindings` to `cuda.core`. I guess it's up for discussion whether we need these and what we want to compare them against.

Right now it's basically trying to measure the extra latency of the `cuda.core` layer by comparing to the `cuda.bindings` benchmarks, matching benchmark IDs to that suite 1:1.

The main question I think is regarding the "caching" that we get from cuda.core on `Device`. `Device` instances are singletons, so after a first call `Device(0)` doesn't hit the driver. And probably other similar cases. I guess we could also introduce some sort of cleanups or process spawns, but those would come with other latencies.
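The singleton behavior described above can be illustrated with a toy stand-in (hypothetical; not cuda.core's actual `Device` implementation), showing why only the first `Device(0)` call pays the driver-trip cost while repeat calls return the cached instance:

```python
class Device:
    """Toy singleton-per-device-id cache, mimicking (by assumption) the
    pattern the description attributes to cuda.core's Device."""

    _instances = {}   # device_id -> cached instance
    driver_calls = 0  # counts simulated trips to the driver

    def __new__(cls, device_id=0):
        inst = cls._instances.get(device_id)
        if inst is None:
            cls.driver_calls += 1  # only a cache miss "hits the driver"
            inst = super().__new__(cls)
            inst.device_id = device_id
            cls._instances[device_id] = inst
        return inst

d0 = Device(0)   # first call: simulated driver hit
d0b = Device(0)  # cached: same object back, no driver hit
d1 = Device(1)   # new id: another driver hit
print(d0 is d0b, Device.driver_calls)  # → True 2
```

This is also why the comparison can look "unfair": a steady-state `Device(0)` benchmark in cuda.core measures a dict lookup, while the matching cuda.bindings benchmark measures an actual driver call every iteration.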