---
title: ztensor
weight: 1
bookToc: true
---

# ztensor

GPU-accelerated tensor, compute engine, and computation graph library for Go.

```bash
go get github.com/zerfoo/ztensor
```

## Overview

ztensor is the foundational tensor and compute library in the Zerfoo ecosystem. It provides multi-type tensor storage, a unified compute engine interface across CPU and GPU backends, a computation graph compiler with operator fusion, and GPU memory management -- all without CGo.

If you are building an ML inference engine, need GPU compute from Go, or want a typed tensor library, ztensor is the package to import.

## When to Use ztensor Directly

| Use case | Import |
|----------|--------|
| Tensor math, GPU compute, custom ML operators | `github.com/zerfoo/ztensor` directly |
| Transformer inference, model serving, training | `github.com/zerfoo/zerfoo` (imports ztensor internally) |

Import ztensor directly when you need tensor operations or GPU compute without the full inference/serving stack. If you are running transformer models, use zerfoo -- it builds on ztensor for you.

## Tensor Creation

Tensors are generic over all numeric types via the `tensor.Numeric` constraint:
```go
import (
	"fmt"

	"github.com/zerfoo/ztensor/tensor"
)

// Create a 2x3 float32 tensor
a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})

fmt.Println(a.Shape()) // [2 3]
fmt.Println(a.Data())  // [1 2 3 4 5 6]
```

Supported element types include `float32`, `float64`, `float16.Float16`, `float16.BFloat16`, `float8.Float8`, and all Go integer types.
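
The same generic constructor covers every supported element type. A minimal sketch, reusing only `tensor.New`, `Shape`, and `Data` from above:

```go
// A 2x2 int32 tensor -- integer element types use the same API.
m, _ := tensor.New[int32]([]int{2, 2}, []int32{1, 2, 3, 4})
fmt.Println(m.Shape()) // [2 2]

// A rank-1 float64 tensor.
v, _ := tensor.New[float64]([]int{4}, []float64{0.5, 1.5, 2.5, 3.5})
fmt.Println(v.Data()) // [0.5 1.5 2.5 3.5]
```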

## Compute Engine

All arithmetic flows through the `compute.Engine[T]` interface. This enables transparent CPU/GPU switching and CUDA graph capture.

### CPU Engine

```go
import (
	"context"
	"fmt"

	"github.com/zerfoo/ztensor/compute"
	"github.com/zerfoo/ztensor/numeric"
	"github.com/zerfoo/ztensor/tensor"
)

ctx := context.Background()
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})

c, _ := eng.MatMul(ctx, a, b)
fmt.Println(c.Shape()) // [2 2]
fmt.Println(c.Data())  // [22 28 49 64]
```

### GPU Engine

GPU libraries are loaded at runtime via purego -- no CGo, no build tags, no linking. If the GPU runtime is not available, the constructor returns an error and you fall back to CPU.

```go
// CUDA (NVIDIA GPUs)
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})

// ROCm (AMD GPUs)
eng, err := compute.NewROCmEngine[float32](numeric.Float32Ops{})

// OpenCL (cross-vendor)
eng, err := compute.NewOpenCLEngine[float32](numeric.Float32Ops{})
```

A common pattern is to try GPU first with a CPU fallback:

```go
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})
if err != nil {
	eng = compute.NewCPUEngine[float32](numeric.Float32Ops{})
}
```

## Type-Safe Generics

Write functions that work across any numeric type:

```go
func dotProduct[T tensor.Numeric](
	eng compute.Engine[T],
	a, b *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error) {
	return eng.MatMul(context.Background(), a, b)
}
```
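
The helper runs unchanged against any engine. A sketch using the CPU engine constructor shown earlier; a 1x3 by 3x1 `MatMul` is a dot product:

```go
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

// Row vector times column vector.
a, _ := tensor.New[float32]([]int{1, 3}, []float32{1, 2, 3})
b, _ := tensor.New[float32]([]int{3, 1}, []float32{4, 5, 6})

out, _ := dotProduct(eng, a, b)
fmt.Println(out.Data()) // [32] = 1*4 + 2*5 + 3*6
```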

## Computation Graph

The `graph` package provides a computation graph compiler with operator fusion passes and CUDA graph capture for optimized inference:

| Feature | Description |
|---------|-------------|
| Operator fusion | Combines adjacent operations to reduce kernel launches |
| CUDA graph capture | Records and replays GPU execution for minimal launch overhead |
| Megakernel codegen | Generates fused GPU kernels at compile time |
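
To make the fusion row concrete: fusing adjacent elementwise operations means one pass over the data instead of two, with no intermediate buffer. The toy below illustrates the idea on plain slices; it is a conceptual sketch, not the `graph` package's API:

```go
// Unfused: two loops and a temporary -- analogous to two kernel launches.
func mulThenAdd(x, y, z []float32) []float32 {
	tmp := make([]float32, len(x))
	for i := range x {
		tmp[i] = x[i] * y[i]
	}
	out := make([]float32, len(x))
	for i := range tmp {
		out[i] = tmp[i] + z[i]
	}
	return out
}

// Fused: one loop, no temporary -- what a fusion pass produces from
// adjacent multiply and add nodes.
func fusedMulAdd(x, y, z []float32) []float32 {
	out := make([]float32, len(x))
	for i := range x {
		out[i] = x[i]*y[i] + z[i]
	}
	return out
}
```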

## Package Reference

| Package | Description |
|---------|-------------|
| `tensor/` | Multi-type tensor storage (CPU, GPU, quantized) |
| `compute/` | Engine interface with CPU, CUDA, ROCm, and OpenCL implementations |
| `graph/` | Computation graph compiler with fusion and CUDA graph capture |
| `numeric/` | Type-safe `Arithmetic[T]` interface for all numeric types |
| `device/` | Device abstraction and memory allocators |
| `internal/cuda/` | Zero-CGo CUDA runtime bindings via purego, 25+ custom kernels |
| `internal/xblas/` | ARM NEON and x86 AVX2 SIMD assembly |
| `internal/gpuapi/` | GPU Runtime Abstraction Layer (CUDA/ROCm/OpenCL) |
| `internal/codegen/` | Megakernel code generator |

## Dependencies

ztensor depends on [float16]({{< relref "numeric-types" >}}) and [float8]({{< relref "numeric-types" >}}) for half-precision and FP8 arithmetic.