Conversation
0cfea1d to a3f0678
These are the results for a run on my dev machine (4090).
It's the first time that I'm looking at this code. My second question was: what's the purpose of the benchmarks? Cursor (GPT-5.4 Extra High Fast) offered some answers. I asked it to generate a "Motivation" section based on what it found, see below. I think it'd be a great addition.

Motivation

These benchmarks are intended to measure the latency overhead of calling CUDA Driver APIs through
The main goal is to help answer questions such as:
The paired C++ benchmarks are included to provide a lower-level reference point for the same operation. Comparing Python and C++ results helps estimate the additional cost introduced by the Python-to-C boundary and by binding-specific marshalling work.

These benchmarks are not intended to measure overall GPU performance, kernel throughput, or end-to-end application speed. Most of the benchmarked operations are deliberately tiny, so the reported numbers are best interpreted as binding/API-call latency measurements and regression signals, rather than as predictions of full application performance.

Because the benchmarked operations are so small, methodology matters a lot. The most useful comparisons are between Python and C++ benchmarks that perform work as nearly identical as possible and are run under similar conditions.
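To make the "binding/API-call latency" framing concrete, here is a minimal sketch of the kind of per-call measurement involved. This is not the PR's actual harness; the no-op callable is a hypothetical stand-in for a tiny driver API call, and the numbers only illustrate the methodology.

```python
import timeit

def tiny_call():
    # Hypothetical stand-in for a tiny CUDA Driver API call made
    # through the Python bindings.
    pass

# Run many iterations per timing to amortize timer overhead, and take
# the best of several repeats to reduce scheduling noise.
n = 100_000
per_call_ns = min(timeit.repeat(tiny_call, repeat=5, number=n)) / n * 1e9
print(f"~{per_call_ns:.1f} ns per call")
```

A matched C++ loop timed the same way gives the lower-level reference point; the difference between the two numbers approximates the Python-boundary cost.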
My first question (to Cursor) when reviewing this PR was:
After it gave me the response below, I started thinking about the motivation, with the result in the previous comment. In light of that, the findings below still seem relevant, but I'd need to look closer to be more certain which of the "not clean apples-to-apples" aspects it found are actually meaningful. I hope they are at least a good starting point for figuring it out together, so I'm copy-pasting them below.

Findings
What Looks Reasonably Matched
Bottom Line
Note
Yeah, the motivation is correct, it's just latency/overhead of the Python layer, not throughput. I'll add it to the readme. And yeah, on the review: I think I would agree with most of the findings marked as high (and I will try to match them closer), but for the other ones I think it's almost impossible to do a full apples-to-apples, so for those I am not sure I would change much, but I'll leave it up to you all to make that call :D
Ok, I added a couple of
About the second comment, I think it's "ok". The C++ one doesn't match pyperf fully, since pyperf does somewhat fancier stats for the warm-up and number of measurements while the C++ side uses a fixed count, but I don't think it should affect much, especially for measuring host latency?
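For reference, the fixed-count scheme mentioned above can be sketched as follows (a hypothetical illustration of the presumed C++ approach, not the PR's code; pyperf instead calibrates its warm-up and sample counts adaptively):

```python
import statistics
import time

def bench_fixed(fn, warmups=10, samples=100, inner=1000):
    # Fixed-count benchmark: a fixed number of warm-up calls, then a
    # fixed number of timed samples, each averaging `inner` calls.
    for _ in range(warmups):
        fn()
    times = []
    for _ in range(samples):
        t0 = time.perf_counter()
        for _ in range(inner):
            fn()
        times.append((time.perf_counter() - t0) / inner)
    # Report mean and stddev over the per-call sample averages.
    return statistics.mean(times), statistics.stdev(times)

mean_s, stdev_s = bench_fixed(lambda: None)
```

For host-side latency on tiny operations, the main risk of a fixed count is under-warming; the aggregation itself should be comparable to pyperf's.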
mdboom
left a comment
I don't see anything here to block on, except maybe the stddev of two of the C++ benchmarks is too high. See my lengthier comment inline.
Another puzzling result: on my machine (with a 5070, fwiw), there are some benchmarks where Python measures as slightly faster than C++, which seems unlikely:
Co-authored-by: Michael Droettboom <mdboom@gmail.com>
Thanks for the reviews! Yes, I also sometimes see some Python stuff measure as "faster", and it happened in some of the cuda.compute benchmarks. I believe it's just because of the apples-to-oranges comparison and the aggregations; I am pretty sure that if we ran them a lot more that wouldn't be the case. Since we are ok with this for now, I will merge once linting passes and will continue to iterate.
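The "run them a lot more" intuition can be checked with a toy simulation (entirely synthetic numbers, not from the benchmarks): when noise is large relative to the true latency gap, small sample counts frequently invert the Python/C++ ordering, while larger counts almost never do.

```python
import random

random.seed(0)

def sample_mean(true_ns, noise_ns, n):
    # Simulated latency samples: fixed true cost plus non-negative
    # exponential noise (a crude model of OS/scheduler jitter).
    return sum(true_ns + random.expovariate(1.0 / noise_ns)
               for _ in range(n)) / n

def inversions(n_samples, trials=200):
    # "C++" is truly faster (80 ns) than "Python" (100 ns); count how
    # often the sample means come out in the wrong order.
    count = 0
    for _ in range(trials):
        cpp = sample_mean(80.0, 60.0, n_samples)
        py = sample_mean(100.0, 60.0, n_samples)
        if cpp > py:
            count += 1
    return count

small = inversions(5)    # few samples: ordering often inverts
large = inversions(500)  # many samples: ordering stabilizes
```

Under this model, the inversion rate drops sharply with the sample count, which matches the expectation that longer runs would remove the "Python faster than C++" artifacts.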
/ok to test 49c2651
This is odd: I'll take a look.
Initially I suspected there could be a problem in pathfinder, or maybe a new CTK release, but it turns out it's almost certainly a problem with this PR. While I had the full context, I asked Cursor to try a fix, thinking it might be quick. It came up with something a lot more involved than expected. I put up the change here, for you to cherry-pick if you find it useful:

Cursor also generated this explanation: I think

The failure pattern pointed to benchmark discovery touching NVRTC too early. Before this change,

This cherry-pick fixes that by making benchmark discovery side-effect free and by narrowing when NVRTC is needed:
I like this fix because it directly targets the

Cherry-pick command:

git fetch https://github.com/rwgk/cuda-python.git cuda-bindings-bench-more-rwgk
git cherry-pick 04a89337c1
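The "side-effect-free discovery plus narrowed NVRTC use" idea boils down to a lazy-import pattern. A minimal sketch (hypothetical names throughout; `json` stands in for the NVRTC-touching module, and the benchmark names are invented):

```python
import importlib

_nvrtc = None  # stays None until a benchmark actually needs it

def _get_nvrtc():
    # Defer the NVRTC-touching import until run time, so that merely
    # discovering/listing benchmarks cannot trigger it.
    global _nvrtc
    if _nvrtc is None:
        _nvrtc = importlib.import_module("json")  # placeholder module
    return _nvrtc

def discover_benchmarks():
    # Discovery only returns names; it performs no imports and has no
    # other side effects.
    return ["kernel_launch", "module_load"]

names = discover_benchmarks()
loaded_at_discovery = _nvrtc is not None  # expected: still unloaded
mod = _get_nvrtc()                        # import happens only here
```

With this structure, environments where NVRTC is unavailable can still enumerate (and skip) the compile-dependent benchmarks instead of failing during discovery.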

Description
Closes #1580
Follow-up to #1580
Adds a couple more benchmarks and fixes a couple of issues with the pyperf JSON handling.
Checklist