Benchmarks: cuda.core#2005
Conversation
Do you have a side-by-side bindings-vs-core delta table that you could post here?
Quick "Low" findings from Cursor GPT-5.4 Extra High Fast
Here is a table: On this case the
Updated the table with my last numbers, sorry about that. The question here is whether we are OK with cuda.core being faster because of object construction and, I think, because of caching some of the device creation. In general I would say yes, but I understand that one could argue the comparison is not fair.
rwgk
left a comment
Cursor GPT-5.4 Extra High Fast
Findings
- No blocking or medium findings in the current PR state. I do not see any obvious or overlooked issue that should stop merge, and I would be comfortable approving it as-is.
- Low:
`benchmarks/cuda_core/benchmarks/bench_ctx_device.py` still says `Device()` with no args returns the TLS-cached current device, but `cuda_core/cuda/core/_device.pyx` resolves that path via `cuCtxGetDevice()` when a context is active. The benchmark behavior itself is fine, and `benchmarks/cuda_core/compare.py` already marks that row as a different code path, so this is just a comment cleanup or follow-on item.
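To make the distinction behind this finding concrete, here is a toy pure-Python contrast of the two resolution strategies (entirely hypothetical names and state; the real path lives in `cuda_core/cuda/core/_device.pyx`): a thread-local cache returns whatever was stashed last, while a `cuCtxGetDevice()`-style lookup asks the active context each time, so the two can disagree after a context switch.

```python
import threading

# Simulated "driver" state: device id of the currently active context
# (hypothetical stand-in, not the CUDA driver).
_active_ctx_device = 0

def cu_ctx_get_device_mock():
    """Stand-in for asking the driver which device the current context uses."""
    return _active_ctx_device

_tls = threading.local()

def device_from_tls_cache():
    """Strategy the stale comment describes: return the TLS-cached device."""
    return getattr(_tls, "current_device", None)

def device_from_context():
    """Strategy the finding says the code actually uses: query the context."""
    return cu_ctx_get_device_mock()

_tls.current_device = 0
_active_ctx_device = 1  # context switches; the TLS cache is now stale
print(device_from_tls_cache(), device_from_context())  # → 0 1
```

The two strategies only diverge once the active context changes underneath the cache, which is exactly why the docstring wording matters even though the benchmark numbers are unaffected.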
Assumptions
- The earlier missing `BENCHMARK_PLAN.md` references are fixed in the current branch.
- I re-checked the current head `578ac995` after the `main` merge, and I did not see any new issue introduced by the merge commit.
- Validation I ran: `pre-commit` on the touched files passed, `pixi run -e source -- python -m pytest tests/test_runner.py -q` passed in `benchmarks/cuda_bindings`, and representative `cuda.core` wheel benchmarks still ran successfully.
Change Summary
- Technically, the PR still does the same two main things: it generalizes the shared pyperf runner in `benchmarks/cuda_bindings/runner/main.py` so multiple suites can reuse it, and it adds a new `benchmarks/cuda_core` suite with matching benchmark IDs, suite-local runtime and setup, and a bindings-vs-core comparison tool in `benchmarks/cuda_core/compare.py`.
- The follow-up commit that removed the plan-document references addressed the only prior concrete review issue I had, and after the fresh pass I do not see anything left that looks merge-blocking.
/ok to test 578ac99
@rwgk, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 7b63160
Description
This is for matching the benchmarks we have been doing for `cuda.bindings` to `cuda.core`. I guess it's up for discussion whether we need these and what we want to compare them against.

Right now it's basically trying to measure the extra latency of the `cuda.core` layer by comparing to the `cuda.bindings` benchmarks, matching benchmark IDs to that suite 1:1.

The main question I think is regarding the "caching" that we get from cuda.core on `Device`. `Device` instances are singletons, so after a first call `Device(0)` doesn't hit the driver. And probably other similar cases. I guess we could also introduce some sort of cleanups or process spawns, but those would come with other latencies.
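The singleton behavior described above can be illustrated with a toy stand-in (hypothetical; not cuda.core's actual `Device` implementation), showing why only the first `Device(0)` call pays the driver-trip cost while repeat calls return the cached instance:

```python
class Device:
    """Toy singleton-per-device-id cache, mimicking (by assumption) the
    pattern the description attributes to cuda.core's Device."""

    _instances = {}   # device_id -> cached instance
    driver_calls = 0  # counts simulated trips to the driver

    def __new__(cls, device_id=0):
        inst = cls._instances.get(device_id)
        if inst is None:
            cls.driver_calls += 1  # only a cache miss "hits the driver"
            inst = super().__new__(cls)
            inst.device_id = device_id
            cls._instances[device_id] = inst
        return inst

d0 = Device(0)   # first call: simulated driver hit
d0b = Device(0)  # cached: same object back, no driver hit
d1 = Device(1)   # new id: another driver hit
print(d0 is d0b, Device.driver_calls)  # → True 2
```

This is also why the comparison can look "unfair": a steady-state `Device(0)` benchmark in cuda.core measures a dict lookup, while the matching cuda.bindings benchmark measures an actual driver call every iteration.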