Commit bda408a
[UR][CUDA] Add opportunistic queue serialize prop, impl for cuda (#18443)
Makes short kernels launch faster when they don't need to see the same
global memory (or when the user guarantees that global memory writes are
complete). See
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization
This recovers performance for workloads that launch lots of short kernels,
such as those in cutlass. cc @FMarno, who identified this performance gap.
---------
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Jakub Chlanda <j.chlanda@gmail.com>
1 parent fd866a3 commit bda408a
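The diff body is not shown on this page, so the following is only a minimal sketch of the idea in the commit message: a portable launch property is translated by a backend adapter into a backend launch attribute. The linked CUDA guide section describes a stream-serialization launch attribute that lets a dependent kernel start launching before the previous kernel in the stream has fully completed. The sketch assumes the CUDA adapter maps the new property to that attribute. Every type and function name below is a hypothetical stand-in, not the real Unified Runtime or CUDA driver API.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the real
// Unified Runtime or CUDA driver types.
enum class LaunchPropertyId { Ignore, Cooperative, OpportunisticQueueSerialize };

struct LaunchProperty {
  LaunchPropertyId id;
  int value; // non-zero means "enabled"
};

enum class BackendAttributeId { Cooperative, ProgrammaticStreamSerialization };

struct BackendAttribute {
  BackendAttributeId id;
  int value;
};

// Translate portable launch properties into backend launch attributes.
// Assumption: the opportunistic queue serialize property corresponds to the
// backend's programmatic stream serialization attribute.
std::vector<BackendAttribute>
translateLaunchProperties(const std::vector<LaunchProperty> &props) {
  std::vector<BackendAttribute> attrs;
  for (const LaunchProperty &prop : props) {
    switch (prop.id) {
    case LaunchPropertyId::Ignore:
      // Placeholder property: produces no backend attribute.
      break;
    case LaunchPropertyId::Cooperative:
      attrs.push_back({BackendAttributeId::Cooperative, prop.value});
      break;
    case LaunchPropertyId::OpportunisticQueueSerialize:
      attrs.push_back(
          {BackendAttributeId::ProgrammaticStreamSerialization, prop.value});
      break;
    }
  }
  return attrs;
}
```

Calling `translateLaunchProperties` with a list containing an `Ignore` entry and an enabled `OpportunisticQueueSerialize` entry yields a single `ProgrammaticStreamSerialization` attribute with value 1. Keeping the property opt-in fits the commit message's caveat: the relaxed ordering is only safe when the next kernel does not depend on the previous kernel's global memory writes, or when the user guarantees those writes are complete.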
5 files changed
Lines changed: 38 additions & 0 deletions
File tree
- unified-runtime
- include
- scripts/core
- source/adapters/cuda
- test/conformance/exp_launch_properties
Lines changed: 9 additions & 0 deletions