Three independent optimizations, all in indexing.py.
Stage 1: replace per-run np.lexsort with the native Cython
intra_chunk_sort_run kernel in _build_full_descriptor_ooc().
Eliminates the np.arange allocation, the O(N log N) lexsort, and
two gather passes in favour of the GIL-free stable mergesort.
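The idea can be sketched in plain NumPy. This is a hypothetical stand-in for the Cython intra_chunk_sort_run kernel (the real kernel runs without the GIL on raw buffers); the names and signature here are illustrative only:

```python
import numpy as np

def sort_run_numpy(values: np.ndarray, row_ids: np.ndarray) -> None:
    """NumPy sketch of the intra-chunk run sort (hypothetical stand-in
    for the Cython kernel).

    The old path built perm = np.lexsort((row_ids, values)), which
    needed an extra np.arange identity array plus two gather passes.
    A single stable argsort on the values gives the same ordering,
    since stability preserves the original row_id order on ties.
    """
    order = np.argsort(values, kind="stable")  # stable mergesort
    values[:] = values[order]
    row_ids[:] = row_ids[order]
```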
Stage 2: inside _merge_run_pair(), replace _merge_sorted_slices
(concat + lexsort, O(N log N)) with a GIL-free O(N) linear merge.
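Sketched in Python for clarity (the real merge is a native kernel; this function name and signature are illustrative, not the actual API), a stable linear merge of two sorted runs looks like:

```python
import numpy as np

def merge_runs(a_vals, a_ids, b_vals, b_ids):
    """Hypothetical sketch of the O(N) merge used by _merge_run_pair
    instead of concatenating both runs and re-sorting with lexsort."""
    n, m = len(a_vals), len(b_vals)
    out_vals = np.empty(n + m, dtype=a_vals.dtype)
    out_ids = np.empty(n + m, dtype=a_ids.dtype)
    i = j = k = 0
    while i < n and j < m:
        # '<=' keeps the merge stable: ties are taken from the left run
        if a_vals[i] <= b_vals[j]:
            out_vals[k], out_ids[k] = a_vals[i], a_ids[i]
            i += 1
        else:
            out_vals[k], out_ids[k] = b_vals[j], b_ids[j]
            j += 1
        k += 1
    # Copy whichever run still has a tail
    if i < n:
        out_vals[k:], out_ids[k:] = a_vals[i:], a_ids[i:]
    else:
        out_vals[k:], out_ids[k:] = b_vals[j:], b_ids[j:]
    return out_vals, out_ids
```

Each element is visited exactly once, so the pass is O(N); in the Cython version this loop can drop the GIL because it only touches raw typed buffers.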
Stage 3: rewrite _build_levels_descriptor_ooc() with a single-pass
strategy. Previously the function issued one array[start:stop]
decompression call per segment across all levels (e.g. 9008 calls for
10M rows on a Mac M4: 8 chunk + 1000 block + 8000 subblock). The new
code decompresses each array chunk exactly once and computes all level
summaries in a single vectorized pass via _fill_summaries_from_2d(),
falling back to the original per-segment path only when segment sizes
do not divide the chunk length evenly.
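The vectorized pass boils down to one reshape per chunk. A minimal sketch, assuming min/max per-segment summaries for illustration (the function name mirrors _fill_summaries_from_2d but the signature and the exact summaries computed are assumptions):

```python
import numpy as np

def summaries_from_2d(chunk: np.ndarray, seg_len: int):
    """Sketch of the single-pass summary computation: when seg_len
    divides the chunk length evenly, reshaping puts one segment per
    row, and one vectorized reduction replaces one array[start:stop]
    decompression call per segment."""
    # Caller falls back to the per-segment path when this doesn't hold
    assert chunk.size % seg_len == 0
    view = chunk.reshape(-1, seg_len)
    return view.min(axis=1), view.max(axis=1)
```

With 10M rows this is what collapses ~9000 decompression calls into 8: each chunk is decompressed once, and the block- and subblock-level summaries are both derived from the same in-memory buffer.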
Measured on 10M random float64:

Mac mini M4 (chunks=1.25M, blocks=10K):
  light:   940 ms ->  431 ms  (2.2x)
  medium: 4696 ms ->  394 ms (11.9x)
  full:   9650 ms -> 1584 ms  (6.1x)

AMD 7800X3D Linux (chunks=2M, blocks=20K):
  light:   603 ms ->  430 ms  (1.4x)
  medium: 2050 ms ->  395 ms  (5.2x)
  full:   6856 ms -> 1582 ms  (4.3x)
Query latency and index sizes are unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>