Conversation
0cfea1d to a3f0678
These are the results for a run on my dev machine (4090).
It's the first time that I'm looking at this code. My second question was: what's the purpose of the benchmarks? Cursor (GPT-5.4 Extra High Fast) offered some answers. I asked it to generate a "Motivation" section based on what it found, see below. I think it'd be a great addition.

Motivation

These benchmarks are intended to measure the latency overhead of calling CUDA Driver APIs through
The main goal is to help answer questions such as:
The paired C++ benchmarks are included to provide a lower-level reference point for the same operation. Comparing Python and C++ results helps estimate the additional cost introduced by the Python-to-C boundary and by binding-specific marshalling work.

These benchmarks are not intended to measure overall GPU performance, kernel throughput, or end-to-end application speed. Most of the benchmarked operations are deliberately tiny, so the reported numbers are best interpreted as binding/API-call latency measurements and regression signals, rather than as predictions of full application performance.

Because the benchmarked operations are so small, methodology matters a lot. The most useful comparisons are between Python and C++ benchmarks that perform work as nearly identical as possible and are run under similar conditions.
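To make the "binding/API-call latency" framing concrete, here is a minimal sketch of the kind of per-call measurement involved. This is not the PR's actual harness; the no-op callable is a hypothetical stand-in for a tiny driver API call, and the numbers only illustrate the methodology.

```python
import timeit

def tiny_call():
    # Hypothetical stand-in for a tiny CUDA Driver API call made
    # through the Python bindings.
    pass

# Run many iterations per timing to amortize timer overhead, and take
# the best of several repeats to reduce scheduling noise.
n = 100_000
per_call_ns = min(timeit.repeat(tiny_call, repeat=5, number=n)) / n * 1e9
print(f"~{per_call_ns:.1f} ns per call")
```

A matched C++ loop timed the same way gives the lower-level reference point; the difference between the two numbers approximates the Python-boundary cost.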
My first question (to Cursor) when reviewing this PR was:
After it gave me the response below, I started thinking about the motivation, with the result in the previous comment. In light of that, the findings below still seem relevant, but I'd need to look closer to be more certain which of the "not clean apples-to-apples" aspects it found are actually meaningful. I hope they are at least a good starting point for figuring it out together, so I'm copy-pasting them below.

Findings
What Looks Reasonably Matched
Bottom Line
Note
Yeah, the motivation is correct, it's just latency/overhead of the Python layer, not throughput. I'll add it to the readme. And yeah, on the review: I think I would agree with most of the findings marked as high (and I will try to match them closer), but for the other ones I think it's almost impossible to do a full apples-to-apples, so for those I am not sure I would change much, but I'll leave it up to you all to make that call :D
Ok, I added a couple of
About the second comment, I think it's "ok". The C++ one doesn't match pyperf fully, since pyperf does somewhat fancier stats for the warm-up and number of measurements while the C++ side uses a fixed count, but I don't think it should affect much, especially for measuring host latency?
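For reference, the fixed-count scheme mentioned above can be sketched as follows (a hypothetical illustration of the presumed C++ approach, not the PR's code; pyperf instead calibrates its warm-up and sample counts adaptively):

```python
import statistics
import time

def bench_fixed(fn, warmups=10, samples=100, inner=1000):
    # Fixed-count benchmark: a fixed number of warm-up calls, then a
    # fixed number of timed samples, each averaging `inner` calls.
    for _ in range(warmups):
        fn()
    times = []
    for _ in range(samples):
        t0 = time.perf_counter()
        for _ in range(inner):
            fn()
        times.append((time.perf_counter() - t0) / inner)
    # Report mean and stddev over the per-call sample averages.
    return statistics.mean(times), statistics.stdev(times)

mean_s, stdev_s = bench_fixed(lambda: None)
```

For host-side latency on tiny operations, the main risk of a fixed count is under-warming; the aggregation itself should be comparable to pyperf's.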
mdboom
left a comment
I don't see anything here to block on, except maybe the stddev of two of the C++ benchmarks is too high. See my lengthier comment inline.
Another puzzling result: on my machine (with a 5070, fwiw), there are some benchmarks where Python measures as slightly faster than C++, which seems unlikely:
Co-authored-by: Michael Droettboom <mdboom@gmail.com>
Thanks for the reviews! Yes, I also sometimes see some Python stuff measure as "faster", and it happened in some of the cuda.compute benchmarks. I believe it's just because of the apples-to-oranges comparison and the aggregations; I am pretty sure that if we ran them a lot more that wouldn't be the case. Since we are ok with this for now, I will merge once linting passes and will continue to iterate.
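The "run them a lot more" intuition can be checked with a toy simulation (entirely synthetic numbers, not from the benchmarks): when noise is large relative to the true latency gap, small sample counts frequently invert the Python/C++ ordering, while larger counts almost never do.

```python
import random

random.seed(0)

def sample_mean(true_ns, noise_ns, n):
    # Simulated latency samples: fixed true cost plus non-negative
    # exponential noise (a crude model of OS/scheduler jitter).
    return sum(true_ns + random.expovariate(1.0 / noise_ns)
               for _ in range(n)) / n

def inversions(n_samples, trials=200):
    # "C++" is truly faster (80 ns) than "Python" (100 ns); count how
    # often the sample means come out in the wrong order.
    count = 0
    for _ in range(trials):
        cpp = sample_mean(80.0, 60.0, n_samples)
        py = sample_mean(100.0, 60.0, n_samples)
        if cpp > py:
            count += 1
    return count

small = inversions(5)    # few samples: ordering often inverts
large = inversions(500)  # many samples: ordering stabilizes
```

Under this model, the inversion rate drops sharply with the sample count, which matches the expectation that longer runs would remove the "Python faster than C++" artifacts.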
/ok to test 49c2651
This is odd: I'll take a look.
Initially I suspected there could be a problem in pathfinder, or maybe a new CTK release, but it turns out it's almost certainly a problem with this PR. While I had the full context, I asked Cursor to try a fix, thinking it might be quick. It came up with something a lot more involved than expected. I put up the change here, for you to cherry-pick if you find it useful:

Cursor also generated this explanation: I think

The failure pattern pointed to benchmark discovery touching NVRTC too early. Before this change,

This cherry-pick fixes that by making benchmark discovery side-effect free and by narrowing when NVRTC is needed:
I like this fix because it directly targets the

Cherry-pick command:

git fetch https://github.com/rwgk/cuda-python.git cuda-bindings-bench-more-rwgk
git cherry-pick 04a89337c1
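The "side-effect-free discovery plus narrowed NVRTC use" idea boils down to a lazy-import pattern. A minimal sketch (hypothetical names throughout; `json` stands in for the NVRTC-touching module, and the benchmark names are invented):

```python
import importlib

_nvrtc = None  # stays None until a benchmark actually needs it

def _get_nvrtc():
    # Defer the NVRTC-touching import until run time, so that merely
    # discovering/listing benchmarks cannot trigger it.
    global _nvrtc
    if _nvrtc is None:
        _nvrtc = importlib.import_module("json")  # placeholder module
    return _nvrtc

def discover_benchmarks():
    # Discovery only returns names; it performs no imports and has no
    # other side effects.
    return ["kernel_launch", "module_load"]

names = discover_benchmarks()
loaded_at_discovery = _nvrtc is not None  # expected: still unloaded
mod = _get_nvrtc()                        # import happens only here
```

With this structure, environments where NVRTC is unavailable can still enumerate (and skip) the compile-dependent benchmarks instead of failing during discovery.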

Description
Closes #1580
Follow-up to #1580
Adds a couple more benchmarks and fixes a couple of issues with the pyperf JSON handling.
Checklist