
Add torch.Tensor fast path for StridedMemoryView via AOTI tensor bridge #1894

Draft
leofang wants to merge 17 commits into NVIDIA:main from leofang:tensor-bridge-749

Conversation

@leofang
Member

@leofang leofang commented Apr 10, 2026

Summary

Add a fast path for constructing StridedMemoryView from torch.Tensor objects using PyTorch's AOT Inductor (AOTI) stable C ABI, bypassing the DLPack/CAI protocol overhead.

How it works

When a torch.Tensor is passed to any from_* classmethod (from_dlpack, from_cuda_array_interface, from_array_interface, or from_any_interface), the tensor metadata (data pointer, shape, strides, dtype, device) is read directly from the underlying C struct via AOTI function pointers, instead of going through the Python-level __dlpack__() or __cuda_array_interface__ protocols.

The key technique (pyobj_to_aten_handle) extracts the AtenTensorHandle by offsetting past PyObject_HEAD in the THPVariable struct — pure C pointer arithmetic with zero Python API calls. The AOTI functions (aoti_torch_get_data_ptr, aoti_torch_get_sizes, etc.) then read tensor metadata through PyTorch's stable C ABI.

PyTorch is NOT a build-time or runtime dependency. The AOTI symbols are resolved lazily at runtime from torch._C (loaded with RTLD_GLOBAL) only when the user actually passes a torch.Tensor. The _tensor_bridge module is never imported at cuda.core load time.

Performance

Benchmarked with %timeit (Python 3.12, PyTorch 2.11, NVIDIA RTX 6000 Ada):

                          v0.7.0 (DLPack)   tensor-bridge (AOTI)   speedup
GPU tensor stream_ptr=-1:    4.41 us             565 ns              ~8x
GPU tensor stream_ptr=0:    13.20 us             624 ns             ~21x
CPU tensor stream_ptr=-1:    5.08 us             552 ns              ~9x

At the C level (no Python overhead), AOTI extracts all 7 metadata fields in ~14 ns — ~4x faster than the DLPack C exchange API (~60 ns) for the same metadata.

Stream ordering

  • AOTI fast path: When stream_ptr != -1, establishes stream ordering between PyTorch's current CUDA stream (queried via aoti_torch_get_current_cuda_stream) and the consumer stream, using the same event-based pattern as the existing CAI path.
  • CAI fallback safety: PyTorch's __cuda_array_interface__ reports version 2 with no stream field, making the standard CAI sync path a no-op. We detect torch tensors in the CAI path and apply AOTI-based stream ordering to fix this safety gap.
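The safety gap in the second bullet comes down to a missing dictionary key. A minimal illustration (the dict below is a mock with a fake pointer, not a real tensor's interface):

```python
# Mock of what PyTorch reports via __cuda_array_interface__ (per the PR
# description: version 2, no "stream" key; CAI v3 is what added "stream").
torch_style_cai = {
    "version": 2,
    "shape": (4, 4),
    "typestr": "<f4",
    "data": (0x7F00DEAD0000, False),  # (pointer, read_only) -- fake pointer
    "strides": None,
}

# The standard CAI sync path keys off "stream"; with the key absent it
# degrades to a no-op, leaving the consumer with no ordering guarantee.
producer_stream = torch_style_cai.get("stream")
needs_aoti_ordering = producer_stream is None
print(needs_aoti_ordering)
```

This is why the CAI path special-cases torch tensors: the producer stream must be recovered out of band (via AOTI) rather than from the interface dict.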

Version compatibility

  • Minimum: PyTorch 2.3 (when the AOTI functions we use were introduced)
  • Maximum: PyTorch 2.11 (latest tested; the THPVariable struct layout and AtenTensorHandle == at::Tensor* identity are undocumented internals)
  • Unknown versions gracefully fall back to DLPack/CAI paths

Files changed

  • cuda/core/_tensor_bridge.pyx (new): AOTI tensor bridge — pyobj_to_aten_handle, view_as_torch_tensor, sync_torch_stream, dtype/itemsize mapping
  • cuda/core/_include/aoti_shim.h (new): Vendored subset of PyTorch's AOTI stable C ABI declarations
  • cuda/core/_memoryview.pyx: Torch detection (_is_torch_tensor with type cache + version bounds), lazy bridge loading, fast path in all from_* classmethods, CAI stream safety fix, lazy dtype resolution
  • tests/test_utils.py: 12 new test cases (dtypes, shapes, slicing, devices, decorator)
  • docs/source/release/1.0.0-notes.rst: Release notes entry

Closes #749

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>

Test plan

  • All existing tests pass (117 passed, 38 skipped)
  • 12 new torch tensor bridge tests pass (CPU + GPU)
  • Stream ordering verified for both AOTI and CAI paths
  • Graceful fallback for torch < 2.3 (returns False, uses DLPack)
  • CI validation

leofang and others added 16 commits April 9, 2026 07:15
Provide a fast path for constructing a StridedMemoryView from a
torch.Tensor by reading tensor metadata directly through PyTorch's
AOT Inductor (AOTI) stable C ABI, avoiding DLPack/CAI protocol
overhead (~10 ns per tensor via pointer arithmetic).

Key design:
- Vendored AOTI shim header (aoti_shim.h) with extern "C" wrapping
- _tensor_bridge.pyx loaded lazily (only when a torch.Tensor is first
  passed) to avoid undefined AOTI symbols at import time
- RTLD_GLOBAL bootstrap via sys.modules["torch._C"] before loading
  _tensor_bridge.so
- torch detection via type(obj).__module__.startswith("torch")
- PyTorch is NOT a build-time or run-time dependency of cuda.core

Closes NVIDIA#749

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pty .pxd

- Remove unused aoti_torch_get_numel and aoti_torch_get_storage_offset
  declarations from aoti_shim.h and _tensor_bridge.pyx
- Fix license headers on new files to 2026 (not 2024-2026)
- Delete empty _tensor_bridge.pxd (nothing cimports from it)
- Defer numpy dtype resolution for torch tensors: store raw AOTI dtype
  code in metadata, compute itemsize from a cheap lookup table, and only
  resolve the full numpy dtype on first .dtype access via get_dtype()

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of short-circuiting in __init__ and from_any_interface,
add the AOTI fast path check to from_dlpack, from_cuda_array_interface,
and from_array_interface. This ensures torch tensors always take the
fast path regardless of which constructor the user calls.

Simplify from_any_interface and _StridedMemoryViewProxy to just
delegate to the from_* methods (which now handle torch internally).

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When stream_ptr is not -1, establish stream ordering between
PyTorch's current CUDA stream (the producer) and the consumer stream,
using the same event record + stream wait pattern as the CAI path.

Uses aoti_torch_get_current_cuda_stream to get the producer stream,
matching what PyTorch's own __dlpack__ does internally.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Factor out stream ordering into a cpdef sync_torch_stream() helper in
_tensor_bridge.pyx, callable from both C (view_as_torch_tensor) and
Python (_memoryview.pyx).

Apply the same stream ordering in view_as_cai for torch tensors:
PyTorch's __cuda_array_interface__ reports version 2 and omits the
"stream" field, so the standard CAI sync path is a no-op — leaving the
consumer with no guarantee that the producer's work is visible. We now
detect torch tensors in the CAI path and query PyTorch's current CUDA
stream via AOTI to establish proper ordering.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add check_aoti() inline helper to replace repetitive
  err/raise patterns for AOTI calls (one-liner per call)
- Change itemsize type from int to size_t
- Add test_torch_tensor_bridge_sliced_2d test case

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert itemsize back to int (size_t was unnecessary for small values)
- Memoize int(stream_ptr) to avoid redundant Python operator conversion

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Better Cython 3 performance: `except? -1` only triggers an exception
check when the sentinel value -1 is returned, avoiding the overhead of
`except *`, which checks for an exception after every call.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The AOTI stable C ABI functions we use (get_dim, get_dtype,
get_device_type, get_device_index, get_current_cuda_stream, complex
dtype constants) were all introduced in PyTorch 2.3.0. Earlier versions
are missing some or all of them.

_is_torch_tensor now returns False when torch < 2.3, causing a
graceful fallback to the standard DLPack/CAI paths. The version
check result is memoized in a module-level variable.

Also move `import ctypes, sys` from _get_tensor_bridge to module level.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the AOTI-based fast path for torch.Tensor in
StridedMemoryView with ~10-20x speedup and stream ordering support.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cdata field changed from MaybeOwned<at::Tensor> (2.3-2.9) to
at::Tensor (2.10+). Both layouts are compatible with our offset trick.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache the result of the torch tensor type check (module + hasattr +
version) keyed by type(obj). Subsequent calls for the same type are
a single dict lookup (~76 ns) instead of the full check (~186 ns).

Non-torch objects also benefit as the cache returns False immediately
after the first miss.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pyobj_to_aten_handle trick and AtenTensorHandle == at::Tensor*
identity are undocumented internals that could change. Cap at the
latest tested version so unknown future versions fall back to the
standard DLPack/CAI paths. Bump after verifying each new release.

Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot
Contributor

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@leofang leofang added P0 High priority - Must do! cuda.core Everything related to the cuda.core module performance labels Apr 10, 2026
@leofang leofang added this to the cuda.core v1.0.0 milestone Apr 10, 2026
@leofang leofang self-assigned this Apr 10, 2026
Co-Authored-By: Emilio Castillo <ecastillo@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +112 to +121
cdef inline AtenTensorHandle pyobj_to_aten_handle(object obj):
    """Extract AtenTensorHandle by offsetting past PyObject_HEAD.

    In PyTorch 2.3–2.9 the first field after PyObject_HEAD is
    ``c10::MaybeOwned<at::Tensor> cdata``; from 2.10 onward it is
    ``at::Tensor cdata``. In both cases the address of ``cdata``
    is usable as an ``AtenTensorHandle`` (``at::Tensor*``) for the
    AOTI stable C ABI functions.
    """
    return <AtenTensorHandle>(<char*><PyObject*>obj + sizeof(PyObject))
Member Author

Note: I have filed a feature request to discuss whether this API can be formalized in AOTI directly, so that we can safely relax the upper bound and be forward compatible: pytorch/pytorch#180107.


Labels

cuda.core Everything related to the cuda.core module P0 High priority - Must do! performance


Development

Successfully merging this pull request may close these issues.

Provide a fast path for constructing a StridedMemoryView from a torch.tensor

2 participants