Prefault OpenCL readback destinations #21069

Open
dllu wants to merge 1 commit into darktable-org:master from dllu:fix/opencl-prefault-readback

@dllu dllu commented May 18, 2026

I found that darktable was very slow on my NVIDIA DGX Spark with the GB10 Superchip.

darktable's OpenCL path was working correctly, but export could be much slower
than on the CPU alone for mixed CPU/GPU pipelines. Profiling showed that most of
the time was lost in [Read Image (from device to host)].

The underlying issue was not raw host/GPU bandwidth. A standalone OpenCL test
showed clEnqueueReadImage is fast when the destination host memory is already
committed, but extremely slow when the destination is a large cold malloc buffer.
Pre-faulting destination pages before the blocking read avoids that NVIDIA
OpenCL slow path.

Environment

nvidia-smi
clinfo
/opt/darktable/bin/darktable-cltest

Observed locally:

GPU: NVIDIA GB10
Driver: 580.142
CUDA/OpenCL platform: NVIDIA CUDA, OpenCL 3.0 CUDA 13.0.97
darktable OpenCL: enabled
darktable OpenCL fast mode: enabled
darktable scheduling profile: very fast GPU

darktable repro

The integration test image is small enough to keep repro time short, but the
almost-all history stack is useful because it forces several CPU/GPU
transitions.

tmp=$(mktemp -d /tmp/dt-opencl-repro.XXXXXX)

/usr/bin/time -f "REAL %e" \
  /opt/darktable/bin/darktable-cli \
  src/tests/integration/images/mire1.cr2 \
  src/tests/integration/0103-almost-all/almost-all.xmp \
  "$tmp/out" \
  --width 0 --height 0 --hq true \
  --apply-custom-presets false \
  --out-ext jpg \
  --core \
  -d opencl -d perf \
  >"$tmp/gpu.log" 2>&1

rg 'Read Image|totally in command queue|pixel pipeline processing took|exported' "$tmp/gpu.log"

Before the patch, representative results were:

real time: ~16.8s
pixel pipeline: ~15.3s
[Read Image (from device to host)]: ~8.0s
OpenCL command queue total: ~10.7s

After the patch:

real time: ~9.7-9.9s
pixel pipeline: ~8.2-8.5s
[Read Image (from device to host)]: ~0.032s
OpenCL command queue total: ~2.7-2.8s

CPU-only comparison after the patch:

/usr/bin/time -f "REAL %e" \
  /opt/darktable/bin/darktable-cli \
  src/tests/integration/images/mire1.cr2 \
  src/tests/integration/0103-almost-all/almost-all.xmp \
  "$tmp/cpu-out" \
  --width 0 --height 0 --hq true \
  --apply-custom-presets false \
  --out-ext jpg \
  --core --disable-opencl \
  -d perf

Representative CPU-only result:

real time: ~9.8-10.2s
pixel pipeline: ~8.7-9.1s

Standalone OpenCL repro

This isolates the driver behavior from darktable. A readback into warm host
memory is fast. A readback into a cold host allocation is extremely slow.
Touching the destination pages first fixes it.

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void check(cl_int err, const char *what)
{
  if(err != CL_SUCCESS)
  {
    fprintf(stderr, "%s failed: %d\n", what, err);
    exit(1);
  }
}

int main(void)
{
  const size_t w = 3888, h = 2592;
  const size_t bytes = w * h * 4 * sizeof(float);
  cl_int err;
  cl_platform_id platform;
  cl_device_id device;
  check(clGetPlatformIDs(1, &platform, NULL), "clGetPlatformIDs");
  check(clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL), "clGetDeviceIDs");

  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  check(err, "clCreateContext");
  cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, 0, &err);
  check(err, "clCreateCommandQueueWithProperties");

  cl_image_format fmt = { CL_RGBA, CL_FLOAT };
  cl_image_desc desc;
  memset(&desc, 0, sizeof(desc));
  desc.image_type = CL_MEM_OBJECT_IMAGE2D;
  desc.image_width = w;
  desc.image_height = h;
  cl_mem img = clCreateImage(ctx, CL_MEM_READ_WRITE, &fmt, &desc, NULL, &err);
  check(err, "clCreateImage");

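  /* Three destination buffers. Only `host` is written here, so only its pages
     are committed; `coldhost` and `pretouched` stay untouched (cold) for now. */
  void *host = aligned_alloc(64, bytes);
  void *coldhost = aligned_alloc(64, bytes);
  void *pretouched = aligned_alloc(64, bytes);
  if(!host || !coldhost || !pretouched) return 2;
  memset(host, 0, bytes);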

  const size_t origin[3] = { 0, 0, 0 };
  const size_t region[3] = { w, h, 1 };
  const size_t rowpitch = w * 4 * sizeof(float);
  check(clEnqueueWriteImage(q, img, CL_TRUE, origin, region, rowpitch, 0, host, 0, NULL, NULL),
        "warm write image");

  double warm_total = 0.0;
  for(int i = 0; i < 8; i++)
  {
    double t0 = now_sec();
    check(clEnqueueReadImage(q, img, CL_TRUE, origin, region, rowpitch, 0, host, 0, NULL, NULL),
          "warm read image");
    warm_total += now_sec() - t0;
  }

  double t0 = now_sec();
  check(clEnqueueReadImage(q, img, CL_TRUE, origin, region, rowpitch, 0, coldhost, 0, NULL, NULL),
        "cold read image");
  double cold = now_sec() - t0;

  t0 = now_sec();
  memset(pretouched, 0, bytes);
  double touch = now_sec() - t0;
  t0 = now_sec();
  check(clEnqueueReadImage(q, img, CL_TRUE, origin, region, rowpitch, 0, pretouched, 0, NULL, NULL),
        "pretouched read image");
  double pretouched_read = now_sec() - t0;

  printf("size %.1f MiB\n", bytes / 1048576.0);
  printf("warm read avg %.6f sec\n", warm_total / 8.0);
  printf("cold read %.6f sec\n", cold);
  printf("pretouch %.6f sec, read %.6f sec, total %.6f sec\n",
         touch, pretouched_read, touch + pretouched_read);

  clReleaseMemObject(img);
  clReleaseCommandQueue(q);
  clReleaseContext(ctx);
  free(host);
  free(coldhost);
  free(pretouched);
  return 0;
}

Compile and run:

cc -O2 /tmp/ocl_read_image_bench.c -lOpenCL -o /tmp/ocl_read_image_bench
/tmp/ocl_read_image_bench

Representative result:

size 153.8 MiB
warm read avg 0.004426 sec
cold read 2.317996 sec
pretouch 0.053719 sec, read 0.003950 sec, total 0.057669 sec

Code change

The patch pre-faults destination pages in dt_opencl_copy_device_to_host()
before calling the existing blocking image readback path.

This is intentionally done in the common full-image copy helper because that is
the path used when the pixelpipe must move an OpenCL image back to CPU memory
before a CPU-only module or cache use.
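
The patch itself is not reproduced in this description, so the following is only a minimal sketch of the page-touch technique, assuming a POSIX host. The name `prefault_pages` is hypothetical and is not darktable's function; the real helper may differ in naming, placement, and details.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not the patch's actual code): write one byte into every
   page of [buf, buf + bytes) so the kernel commits the pages before the
   blocking OpenCL read. Returns the number of page-stride writes performed. */
static size_t prefault_pages(void *buf, size_t bytes)
{
  if(!buf || bytes == 0) return 0;

  const long ps = sysconf(_SC_PAGESIZE);
  const size_t page = ps > 0 ? (size_t)ps : 4096; /* assumed 4 KiB fallback */

  /* volatile keeps the compiler from eliding the otherwise dead stores */
  volatile unsigned char *p = (volatile unsigned char *)buf;
  size_t touched = 0;
  for(size_t off = 0; off < bytes; off += page)
  {
    p[off] = 0;
    touched++;
  }
  /* If buf is not page aligned, the last byte can sit in one further page. */
  p[bytes - 1] = 0;
  return touched;
}
```

The stores are writes rather than reads because a read of an untouched anonymous page need not allocate a private page for it. Overwriting with zero is harmless here only because the destination is about to be overwritten by the readback. Assuming 4 KiB pages, the 161,243,136-byte buffer from the standalone repro would need 39,366 single-byte writes instead of a full memset.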

Regression considerations

Likely downsides:

  • Every full-image device-to-host copy now performs one host write per page
    before the OpenCL read.
  • On platforms where the driver already handles cold pageable memory well, this
    adds a small CPU-side page-touch cost.
  • For buffers that are already resident, this is redundant work.

Why the risk is limited:

  • The touched memory is the destination of a blocking device-to-host read and is
    expected to be writable.
  • The read overwrites the destination immediately afterward on success.
  • The overhead is one byte per page, not a full memset of the image.
  • For cold buffers on this NVIDIA unified-memory system, the added pre-touch is
    orders of magnitude cheaper than letting the OpenCL driver fault the pages.
  • CPU-only processing is unaffected.

Potential follow-up refinements:

  • Apply the pre-fault only above a size threshold.
  • Make it conditional on NVIDIA/OpenCL/unified-memory devices if other vendors
    show measurable regressions.
  • Consider pre-faulting cache allocations at allocation time instead of at
    OpenCL readback time.

@da-phil
Contributor

da-phil commented May 18, 2026

Nice idea. This is what people do on hard real-time systems: they fault in the memory pages of all libraries up front, then call mlockall() so that all pages stay mapped until they are unlocked or the process exits.
The current risk is that the darktable process constantly consumes more virtual memory and that the kernel sporadically pages this memory out when there is demand, in which case we fall back to the current processing speed. So there might even be a case for using mlockall() if the constant virtual memory footprint is not too big, especially on less powerful computers with little RAM.

@jenshannoschwalm
Collaborator

A very nice one!

Potential follow-up refinements:

* Apply the pre-fault only above a size threshold.

I would do this right now for sizes below size_t step

* Make it conditional on NVIDIA/OpenCL/unified-memory devices if other vendors
  show measurable regressions.

I think we should enable the prefaulting for unified-memory devices only for now. I wouldn't expect any regressions there, whatever the vendor, but we could refine that later.

* Consider pre-faulting cache allocations at allocation time instead of at
  OpenCL readback time.

For me currently a no.

@jenshannoschwalm
Collaborator

It seems we have to restrict that to non-Windows systems (or find an equivalent).

@jenshannoschwalm
Collaborator

And possibly enable this for OpenCL fast mode only.

This would allow users and us to check for performance regressions and report back.

@da-phil
Contributor

da-phil commented May 19, 2026

And possibly enable this for OpenCL fast mode only.

Why not generally, if there are no other immediate drawbacks?

@jenshannoschwalm
Collaborator

Why not generally, if there is no other immediate drawbacks?

Very easy :-)

  1. We (and all users) can easily switch it on and off at runtime and provide meaningful logs.
  2. No additional preference switch (or per-device option) is needed.

We have had so many OpenCL-related issues over the years, so if there are any problems or regressions, we can easily ask for proper reports and decide how to handle them.

@dllu dllu force-pushed the fix/opencl-prefault-readback branch from 4298def to 0dfe493 Compare May 20, 2026 00:23
@dllu
Author

dllu commented May 20, 2026

Pushed a new commit with the changes you recommended:

  • Pre-faulting is now disabled on Windows.
  • It only runs for unified-memory OpenCL devices.
  • It only runs when OpenCL fast mode is enabled.
  • It skips readbacks smaller than one host page.
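
The four conditions above can be sketched as a single predicate. This is an illustration under stated assumptions, not the code in the commit: `readback_ctx_t`, its fields, and `should_prefault` are hypothetical names, and the page size is queried with POSIX `sysconf`.

```c
#include <stdbool.h>
#include <stddef.h>
#ifndef _WIN32
#include <unistd.h>
#endif

/* Hypothetical inputs; darktable's real structures and names may differ. */
typedef struct
{
  bool unified_memory; /* device shares physical memory with the host */
  bool fast_mode;      /* OpenCL fast mode is enabled */
} readback_ctx_t;

/* Returns true only when every gate listed above is satisfied. */
static bool should_prefault(const readback_ctx_t *c, size_t bytes)
{
#ifdef _WIN32
  (void)c;
  (void)bytes;
  return false; /* pre-faulting is disabled on Windows */
#else
  const long ps = sysconf(_SC_PAGESIZE);
  const size_t page = ps > 0 ? (size_t)ps : 4096; /* assumed 4 KiB fallback */
  return c && c->unified_memory && c->fast_mode && bytes >= page;
#endif
}
```

Keeping the gates in one place makes it straightforward to relax or tighten a single condition later, for example if other vendors report results.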

@jenshannoschwalm
Collaborator

ok.
TIA

  1. There is a pending PR, "Various maintenance for tiling and opencl" #21075, that might require rebasing this one.
  2. Are you aware of https://darktable.info/performance/benchmarks-beispiele/benchmark/ ? It is certainly not a real "benchmark", but from what I checked last time, its tested modules/blending could be strongly affected by your code. It might be worth testing; I will also do that on my unified-memory Strix Halo system (rusticl and ROCm) after the above PR has been merged. If everything is perfect and without regressions ...
