Prefault OpenCL readback destinations #21069
|
Nice idea, this is actually what people do on hard real-time systems to ensure memory pages of all libraries are already being faulted into the local process virtual memory address space and then
|
A very nice one!
I would do this right now for sizes below
I think we should enable prefaulting for unified-memory devices only for now. I wouldn't expect any regressions there, whatever the vendor, but we could refine that later.
For me currently a no.
|
It seems we have to restrict that to non-Windows systems (or find an equivalent).
|
And possibly enable this for OpenCL fast mode only. This would allow users and us to check for performance regressions and report back.
Why not generally, if there are no other immediate drawbacks?
Very easy :-)
We have had so many issues related to OpenCL over the years, so if there are any problems or regressions, we can easily ask for proper reports and decide how to handle them.
4298def to
0dfe493
|
Pushed a new commit with the changes you recommended:
|
|
ok.
|
I found that darktable was very slow on my NVIDIA DGX Spark with the GB10 super chip.
darktable's OpenCL path was working correctly, but export performance could be
much slower than CPU for mixed CPU/GPU pipelines. Profiling showed most of the
loss in [Read Image (from device to host)].
The underlying issue was not raw host/GPU bandwidth. A standalone OpenCL test
showed that clEnqueueReadImage is fast when the destination host memory is already
committed, but extremely slow when the destination is a large cold malloc buffer.
Pre-faulting destination pages before the blocking read avoids that NVIDIA
OpenCL slow path.
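As a rough illustration of what pre-faulting means here, the sketch below touches one byte in every page of a destination buffer so the kernel commits the pages before the read. It is a minimal POSIX sketch, not the code from this patch: the name prefault_pages, the 4096-byte fallback, and the choice to store zeros are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not darktable's actual code): write one byte in every
 * page of [buf, buf + size) so the kernel commits the pages up front.
 * The buffer is about to be overwritten by the readback, so storing zeros
 * is harmless. Returns the number of page-sized steps taken. */
static size_t prefault_pages(void *buf, size_t size)
{
  if(buf == NULL || size == 0) return 0;

  const long ps = sysconf(_SC_PAGESIZE);
  const size_t page = ps > 0 ? (size_t)ps : 4096; /* 4096 fallback is an assumption */
  assert((page & (page - 1)) == 0);               /* page sizes are powers of two */

  volatile unsigned char *p = (volatile unsigned char *)buf;
  size_t steps = 0;
  for(size_t off = 0; off < size; off += page)
  {
    p[off] = 0; /* volatile store, so the compiler may not drop it */
    steps++;
  }
  p[size - 1] = 0; /* reaches the last page when buf is not page aligned */
  return steps;
}
```

The helper relies on sysconf(), so it only applies to POSIX systems; a Windows build would need an equivalent way to query the page size.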
Environment
Observed locally:
darktable repro
The integration test image is small enough to keep repro time short, but the
almost-all history stack is useful because it forces several CPU/GPU transitions.
Before the patch, representative results were:
After the patch:
CPU-only comparison after the patch:
Representative CPU-only result:
Standalone OpenCL repro
This isolates the driver behavior from darktable. A readback into warm host
memory is fast. A readback into a cold host allocation is extremely slow.
Touching the destination pages first fixes it.
Compile and run:
Representative result:
Code change
The patch pre-faults destination pages in
dt_opencl_copy_device_to_host() before calling the existing blocking image readback path.
This is intentionally done in the common full-image copy helper because that is
the path used when the pixelpipe must move an OpenCL image back to CPU memory
before a CPU-only module or cache use.
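The ordering described above (touch the destination, then run the unchanged blocking read) could look roughly like the following. This is a hedged sketch, not the patch itself: the callback type blocking_read_fn stands in for the real readback (for example a wrapper around clEnqueueReadImage with blocking_read set to CL_TRUE), and device_traits_t, copy_device_to_host_sketch, and fake_read are hypothetical names. The unified_memory flag reflects the review suggestion to limit prefaulting to unified-memory devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the existing blocking readback. The signature is hypothetical. */
typedef int (*blocking_read_fn)(void *host, size_t size, void *ctx);

/* Hypothetical device flag, following the review suggestion to restrict
 * prefaulting to unified-memory devices. */
typedef struct
{
  bool unified_memory;
} device_traits_t;

/* Store one byte per page so the pages are committed before the read. */
static void touch_pages(void *host, size_t size)
{
  const long ps = sysconf(_SC_PAGESIZE);
  const size_t page = ps > 0 ? (size_t)ps : 4096; /* fallback is an assumption */
  volatile unsigned char *p = (volatile unsigned char *)host;
  for(size_t off = 0; off < size; off += page) p[off] = 0;
  p[size - 1] = 0;
}

/* Prefault first (when the device qualifies), then call the unchanged read. */
static int copy_device_to_host_sketch(const device_traits_t *dev, void *host, size_t size,
                                      blocking_read_fn read_image, void *ctx)
{
  assert(read_image != NULL);
  if(host == NULL || size == 0) return -1;
  if(dev != NULL && dev->unified_memory) touch_pages(host, size);
  return read_image(host, size, ctx);
}

/* Test double standing in for the device: fills the destination with one byte. */
static int fake_read(void *host, size_t size, void *ctx)
{
  memset(host, *(const unsigned char *)ctx, size);
  return 0;
}
```

Keeping the page touch in the shared helper means callers do not change, and the read itself still overwrites every byte that the prefault wrote.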
Regression considerations
Likely downside:
before the OpenCL read.
adds a small CPU-side page-touch cost.
Why the risk is limited:
expected to be writable.
orders of magnitude cheaper than letting the OpenCL driver fault the pages.
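One way to check the CPU-side page-touch cost on a given machine is a small timing helper like the one below. It is only a sketch under stated assumptions: time_page_touch_ms is a hypothetical name, it assumes a POSIX system with clock_gettime(), and it assumes a large malloc() returns cold, untouched pages, which is typical but not guaranteed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical micro-benchmark: allocate a buffer, store one byte per page,
 * and return the elapsed wall-clock time in milliseconds.
 * Returns a negative value if the size is zero or the allocation fails. */
static double time_page_touch_ms(size_t bytes)
{
  if(bytes == 0) return -1.0;
  unsigned char *buf = malloc(bytes);
  if(buf == NULL) return -1.0;

  const long ps = sysconf(_SC_PAGESIZE);
  const size_t page = ps > 0 ? (size_t)ps : 4096; /* fallback is an assumption */
  assert((page & (page - 1)) == 0);               /* page sizes are powers of two */
  volatile unsigned char *p = buf;

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(size_t off = 0; off < bytes; off += page) p[off] = 0;
  clock_gettime(CLOCK_MONOTONIC, &t1);

  free(buf);
  return (double)(t1.tv_sec - t0.tv_sec) * 1e3 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling it with a size close to a full-image readback buffer gives a number that can be set against the [Read Image (from device to host)] time from darktable's own profiling output.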
Potential follow-up refinements:
show measurable regressions.
OpenCL readback time.