CUDA execution for external patches #7376

Open

a10y wants to merge 12 commits into develop from aduffy/patched-fuse

Conversation

Contributor

@a10y commented Apr 9, 2026

Summary

This fixes up CUDA execution to adaptively support interior or exterior patches for bit unpacking.

This does not implement dynamic dispatch.

@a10y force-pushed the aduffy/patched-fuse branch from e16d1de to 721fa3d on April 9, 2026 19:22
@a10y marked this pull request as ready for review on April 10, 2026 16:52
@a10y added the feature (A feature request) label on Apr 10, 2026
@a10y requested a review from 0ax1 on April 10, 2026 16:52
@a10y added the changelog/feature (A new feature) label and removed the feature (A feature request) label on Apr 10, 2026
@a10y changed the title from "Aduffy/patched fuse" to "CUDA execution for external patches" on Apr 10, 2026
a10y added 5 commits on April 13, 2026 11:43 (each signed off by Andrew Duffy <andrew@a10y.dev>)
@a10y force-pushed the aduffy/patched-fuse branch from bc9c13e to e13c033 on April 13, 2026 15:43
a10y added 6 commits on April 13, 2026 14:42 (each signed off by Andrew Duffy <andrew@a10y.dev>)
@a10y force-pushed the aduffy/patched-fuse branch from 3acf500 to 88567cf on April 13, 2026 20:58
Contributor

@0ax1 left a comment

LGTM - Added a couple more thoughts.

/// This kernel uses a thread-per-lane model where each thread is assigned to
/// one (chunk, lane) slot and applies all patches in that slot.
template <typename ValueT>
__device__ void patched(ValueT *const output,

Nice! Ideally we could move to a world where we drop the other patches.cu eventually and only operate on transposed patches on the GPU, maybe.
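The thread-per-lane model described in the kernel's doc comment can be pictured on the CPU roughly as below. This is a hedged sketch only: `Patch`, `LANES`, and the flat index-modulo-lane mapping are illustrative assumptions, not the PR's actual FastLanes layout.

```rust
// Illustrative lane count; a real kernel would use the layout's lane width.
const LANES: usize = 4;

#[derive(Clone, Copy)]
struct Patch {
    index: usize, // absolute position in the output
    value: u32,   // exception value that did not fit the packed bit width
}

/// Apply every patch belonging to one (chunk, lane) slot: the work a single
/// GPU thread would do under the thread-per-lane model.
fn apply_patches_lane(
    output: &mut [u32],
    patches: &[Patch],
    chunk: usize,
    lane: usize,
    chunk_len: usize,
) {
    for p in patches {
        if p.index / chunk_len == chunk && p.index % LANES == lane {
            output[p.index] = p.value;
        }
    }
}

/// Loop over (chunk, lane) slots; each iteration stands in for one thread.
fn patched(output: &mut [u32], patches: &[Patch], chunk_len: usize) {
    let n_chunks = output.len() / chunk_len;
    for chunk in 0..n_chunks {
        for lane in 0..LANES {
            apply_patches_lane(output, patches, chunk, lane, chunk_len);
        }
    }
}
```

Because each output position maps to exactly one (chunk, lane) slot, no two simulated threads ever write the same element, which is what makes the GPU version race-free without atomics.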

match_each_integer_ptype!(bitpacked.ptype(bitpacked.dtype()), |P| {
return decode_bitpacked::<P>(
bitpacked.into_owned(),
P::default(),

We could also handle the fused case here, right: FoR + BP?
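A fused FoR + BP path would add the frame-of-reference base during unpacking rather than in a second pass. A scalar sketch of that idea follows; the LSB-first bit layout, the fixed `u32` width, and the name `decode_fused` are assumptions for illustration, not the crate's API.

```rust
/// Unpack `count` values of `bit_width` bits each from `packed` (LSB-first)
/// and add the frame-of-reference `reference` in the same pass.
fn decode_fused(packed: &[u8], bit_width: usize, count: usize, reference: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        // Gather the value's bits one at a time (a real kernel would do
        // word-at-a-time unpacking).
        let mut v = 0u32;
        for b in 0..bit_width {
            let byte = packed[(bit_pos + b) / 8];
            let bit = (byte >> ((bit_pos + b) % 8)) & 1;
            v |= (bit as u32) << b;
        }
        bit_pos += bit_width;
        // Fusing FoR here avoids materializing the un-referenced deltas.
        out.push(v + reference);
    }
    out
}
```

For example, three 2-bit deltas 1, 2, 3 packed into the byte `0b0011_1001` with reference 10 decode to 11, 12, 13 under this layout.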

}

// Execute the components
let lane_offsets = array

This could go in a separate fn:

            let lane_offsets = array
                .lane_offsets()
                .clone()
                .execute_cuda(ctx)
                .await?
                .into_primitive()
                .into_data_parts()
                .buffer;

            let patch_indices = array
                .patch_indices()
                .clone()
                .execute_cuda(ctx)
                .await?
                .into_primitive()
                .into_data_parts()
                .buffer;

            let patch_values = array
                .patch_values()
                .clone()
                .execute_cuda(ctx)
                .await?
                .into_primitive()
                .into_data_parts()
                .buffer;
