
Commit cb4d0b0

docs(ecosystem): add ztensor, ztoken, and numeric types overviews
1 parent 6dff653 commit cb4d0b0

4 files changed

Lines changed: 445 additions & 0 deletions

File tree

content/docs/ecosystem/_index.md

Lines changed: 49 additions & 0 deletions
---
title: Ecosystem
weight: 8
bookToc: true
bookCollapseSection: true
---

# Ecosystem

Zerfoo is a family of Go modules that together form a complete ML inference and training stack. Each module has its own `go.mod`, versioning, and release cycle.

## Dependency Graph

```
float16 ──┐
          ├──► ztensor ──► zerfoo
float8  ──┘                  ▲
                             │
ztoken ──────────────────────┘
```

- **float16** and **float8** provide reduced-precision arithmetic
- **ztensor** builds tensors, compute engines, and computation graphs on top of them
- **ztoken** is independent (zero external dependencies) and plugs directly into zerfoo
- **zerfoo** combines ztensor and ztoken into a full inference, training, and serving framework

## Which Module to Import

| You want to... | Import |
|----------------|--------|
| Run transformer inference or serve models | `github.com/zerfoo/zerfoo` |
| Work with tensors, GPU compute, or computation graphs | `github.com/zerfoo/ztensor` |
| Tokenize text (BPE, HuggingFace, GGUF) | `github.com/zerfoo/ztoken` |
| Do Float16 or BFloat16 arithmetic | `github.com/zerfoo/float16` |
| Do FP8 E4M3FN arithmetic | `github.com/zerfoo/float8` |
| Convert ONNX models to GGUF | `github.com/zerfoo/zonnx` (CLI) |
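
Each module is a standard Go module, so installation is a plain `go get` of the import path from the table above, for example:

```bash
go get github.com/zerfoo/zerfoo    # inference + serving stack
go get github.com/zerfoo/ztensor   # tensors and compute engines only
```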

## Modules

### [ztensor]({{< relref "ztensor" >}})

GPU-accelerated tensor, compute engine, and computation graph library. Provides the `compute.Engine[T]` interface that powers all arithmetic in the ecosystem. Supports CUDA, ROCm, and OpenCL backends loaded at runtime via purego -- zero CGo.

### [ztoken]({{< relref "ztoken" >}})

BPE tokenizer with HuggingFace `tokenizer.json` and GGUF tokenizer extraction. Handles SentencePiece compatibility for Llama-family models. Zero external dependencies.

### [Numeric Types (float16 + float8)]({{< relref "numeric-types" >}})

IEEE 754 half-precision (`Float16`), Brain Floating Point (`BFloat16`), and FP8 E4M3FN (`Float8`) arithmetic libraries. Used by ztensor for quantized tensor storage and mixed-precision compute.

### zonnx

ONNX-to-GGUF converter CLI. Standalone binary with no runtime dependencies on the other modules. Converts ONNX models into GGUF format for use with zerfoo.
content/docs/ecosystem/numeric-types.md

Lines changed: 145 additions & 0 deletions

---
title: Numeric Types
weight: 3
bookToc: true
---

# Numeric Types

Zerfoo provides two libraries for reduced-precision floating-point arithmetic: **float16** (IEEE 754 half-precision and BFloat16) and **float8** (FP8 E4M3FN). These are used throughout ztensor for quantized tensor storage and mixed-precision compute.

## At a Glance

| Type | Package | Bits | Format | Max value | Precision | Best For |
|------|---------|------|--------|-----------|-----------|----------|
| `Float16` | `float16` | 16 | 1 sign + 5 exp + 10 mantissa | ~6.55 x 10^4 | ~3-4 digits | Inference weights, activations |
| `BFloat16` | `float16` | 16 | 1 sign + 8 exp + 7 mantissa | ~3.39 x 10^38 | ~2-3 digits | Training (same range as float32) |
| `Float8` | `float8` | 8 | 1 sign + 4 exp + 3 mantissa (E4M3FN) | ~448 | ~1-2 digits | Quantized inference, memory savings |

## float16

```bash
go get github.com/zerfoo/float16
```

The float16 package provides two types in a single module: `Float16` (IEEE 754 half-precision) and `BFloat16` (Brain Floating Point).

### Float16

Standard IEEE 754 half-precision with 10 bits of mantissa. Good precision for inference weights and activations, but limited range.

```go
import "github.com/zerfoo/float16"

a := float16.FromFloat32(3.14159)
b := float16.FromFloat64(2.71828)

sum := a.Add(b)
product := a.Mul(b)

fmt.Printf("Sum: %f\n", sum.ToFloat32())
fmt.Printf("Product: %f\n", product.ToFloat32())
```

### BFloat16

Same exponent range as float32 (8 exponent bits) with a reduced mantissa (7 bits). Preferred for training because it avoids the overflow and underflow issues that Float16 runs into at the edges of the float32 range.

```go
bf := float16.BFloat16FromFloat32(1.5)
f32 := bf.ToFloat32()
```
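
To make the range difference concrete, here is a small sketch that uses only the conversion and classification helpers shown on this page. A value of 10^6 is far above Float16's ~65504 maximum, but sits comfortably inside BFloat16's range.

```go
big := float32(1_000_000)

h := float16.FromFloat32(big)          // cannot be represented in Float16
bf := float16.BFloat16FromFloat32(big) // representable in BFloat16

// Float16 loses the value (saturating or overflowing to infinity,
// depending on the rounding mode), while BFloat16 keeps its magnitude.
fmt.Println(h.IsFinite(), h.ToFloat32())
fmt.Println(bf.ToFloat32()) // ~1e6, with only ~2-3 significant digits
```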

### Special Values and Classification

```go
f := float16.FromFloat32(3.14)

f.IsInf(0)      // check for infinity
f.IsNaN()       // check for NaN
f.IsFinite()    // check for finite
f.IsNormal()    // check for normalized
f.IsSubnormal() // check for subnormal
```

### Rounding Modes

```go
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundNearestEven // default
float16.Configure(config)

// Available: RoundNearestEven, RoundTowardZero,
// RoundTowardPositive, RoundTowardNegative, RoundNearestAway
```
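
For example, a minimal sketch (using only the `GetConfig`/`Configure` calls above) that switches conversions to truncation and then restores round-to-nearest-even:

```go
cfg := float16.GetConfig()
cfg.DefaultRoundingMode = float16.RoundTowardZero
float16.Configure(cfg) // conversions now truncate toward zero

x := float16.FromFloat32(0.1) // converted under the truncating mode

cfg.DefaultRoundingMode = float16.RoundNearestEven
float16.Configure(cfg) // restore the default
_ = x
```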

### Vectorized Operations

```go
a := []float16.Float16{...}
b := []float16.Float16{...}

sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)
```
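
A fuller sketch of the same calls, filling the slices from float32 data (only `FromFloat32`, `VectorAdd`, and `ToFloat32` from above are assumed):

```go
src := []float32{1, 2, 3, 4}

a := make([]float16.Float16, len(src))
b := make([]float16.Float16, len(src))
for i, v := range src {
    a[i] = float16.FromFloat32(v)
    b[i] = float16.FromFloat32(v * 10)
}

sum := float16.VectorAdd(a, b) // element-wise: [11 22 33 44]
fmt.Println(sum[0].ToFloat32())
```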

## float8

```bash
go get github.com/zerfoo/float8
```

The float8 package implements FP8 E4M3FN, an 8-bit floating-point format widely used for quantized ML inference. It has no infinity representation (the E4M3FN variant uses that encoding for additional finite values).

```go
import "github.com/zerfoo/float8"

a := float8.FromFloat32(3.14)
b := float8.FromFloat32(2.71)

sum := a.Add(b)
product := a.Mul(b)

fmt.Printf("a + b = %f\n", sum.ToFloat32())
fmt.Printf("a * b = %f\n", product.ToFloat32())
```

### Fast Mode

For performance-critical paths, enable lookup-table-based arithmetic:

```go
float8.EnableFastArithmetic()
float8.EnableFastConversion()
```

This trades memory for speed by using pre-computed tables.
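
The enable calls are package-level, so a natural pattern is to flip them once at startup, before any conversion-heavy loops. A minimal sketch (assuming `FromFloat32` returns a `float8.Float8`, as in the table at the top of this page):

```go
// Pay the table-build cost once, up front.
float8.EnableFastArithmetic()
float8.EnableFastConversion()

// Quantize a float32 weight slice to FP8 using the fast conversion path.
src := []float32{0.5, 1.25, -3.0}
out := make([]float8.Float8, len(src))
for i, v := range src {
    out[i] = float8.FromFloat32(v)
}
```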

## When to Use Each Type

| Scenario | Recommended Type |
|----------|-----------------|
| Model inference weights | Float16 or BFloat16 |
| Training (mixed precision) | BFloat16 (matches float32 range) |
| Quantized inference (Q8) | Float8 E4M3FN |
| CUDA kernel intermediate values | Float16 |
| Memory-constrained deployment | Float8 |

## Integration with ztensor

These types are first-class citizens in ztensor. Create tensors of any numeric type:

```go
import (
    "github.com/zerfoo/float16"
    "github.com/zerfoo/ztensor/compute"
    "github.com/zerfoo/ztensor/numeric"
    "github.com/zerfoo/ztensor/tensor"
)

// Float16 tensor backed by a []float16.Float16 slice
data := make([]float16.Float16, 6)
for i := range data {
    data[i] = float16.FromFloat32(float32(i + 1))
}
a, _ := tensor.New[float16.Float16]([]int{2, 3}, data)

// CPU engine specialized for Float16 arithmetic
eng := compute.NewCPUEngine[float16.Float16](numeric.Float16Ops{})
```

The compute engine handles dequantization automatically when mixing precision levels.
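
Continuing the snippet above (the tensor `a` and engine `eng` come from it), the Float16 engine exposes the same operations as the float32 engine shown on the ztensor page, so a matrix multiply is a one-liner. This is a sketch assuming `eng.MatMul` behaves as documented there:

```go
ctx := context.Background()

bdata := make([]float16.Float16, 6)
for i := range bdata {
    bdata[i] = float16.FromFloat32(float32(i + 1))
}
b, _ := tensor.New[float16.Float16]([]int{3, 2}, bdata)

c, _ := eng.MatMul(ctx, a, b) // (2x3) x (3x2) -> (2x2), all in Float16
fmt.Println(c.Shape())        // [2, 2]
```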

content/docs/ecosystem/ztensor.md

Lines changed: 135 additions & 0 deletions
---
title: ztensor
weight: 1
bookToc: true
---

# ztensor

GPU-accelerated tensor, compute engine, and computation graph library for Go.

```bash
go get github.com/zerfoo/ztensor
```

## Overview

ztensor is the foundational tensor and compute library in the Zerfoo ecosystem. It provides multi-type tensor storage, a unified compute engine interface across CPU and GPU backends, a computation graph compiler with operator fusion, and GPU memory management -- all without CGo.

If you are building an ML inference engine, need GPU compute from Go, or want a typed tensor library, ztensor is the package to import.

## When to Use ztensor Directly

| Use case | Import |
|----------|--------|
| Tensor math, GPU compute, custom ML operators | `github.com/zerfoo/ztensor` directly |
| Transformer inference, model serving, training | `github.com/zerfoo/zerfoo` (imports ztensor internally) |

Import ztensor directly when you need tensor operations or GPU compute without the full inference/serving stack. If you are running transformer models, use zerfoo -- it builds on ztensor for you.

## Tensor Creation

Tensors are generic over all numeric types via the `tensor.Numeric` constraint:

```go
import "github.com/zerfoo/ztensor/tensor"

// Create a 2x3 float32 tensor
a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})

fmt.Println(a.Shape()) // [2, 3]
fmt.Println(a.Data())  // [1 2 3 4 5 6]
```

Supported element types include `float32`, `float64`, `float16.Float16`, `float16.BFloat16`, `float8.Float8`, and all Go integer types.
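
The same constructor works for any of those element types; for instance, an integer tensor (a minimal sketch):

```go
// 2x2 tensor of int32 values
counts, _ := tensor.New[int32]([]int{2, 2}, []int32{1, 2, 3, 4})
fmt.Println(counts.Shape()) // [2, 2]
```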

## Compute Engine

All arithmetic flows through the `compute.Engine[T]` interface. This enables transparent CPU/GPU switching and CUDA graph capture.

### CPU Engine

```go
import (
    "context"

    "github.com/zerfoo/ztensor/compute"
    "github.com/zerfoo/ztensor/numeric"
    "github.com/zerfoo/ztensor/tensor"
)

ctx := context.Background()
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})

c, _ := eng.MatMul(ctx, a, b)
fmt.Println(c.Shape()) // [2, 2]
fmt.Println(c.Data())  // [22 28 49 64]
```

### GPU Engine

GPU libraries are loaded at runtime via purego -- no CGo, no build tags, no linking. If the GPU runtime is not available, the constructor returns an error and you fall back to CPU.

```go
// CUDA (NVIDIA GPUs)
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})

// ROCm (AMD GPUs)
eng, err := compute.NewROCmEngine[float32](numeric.Float32Ops{})

// OpenCL (cross-vendor)
eng, err := compute.NewOpenCLEngine[float32](numeric.Float32Ops{})
```

A common pattern is to try GPU first with a CPU fallback:

```go
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})
if err != nil {
    eng = compute.NewCPUEngine[float32](numeric.Float32Ops{})
}
```

## Type-Safe Generics

Write functions that work across any numeric type:

```go
func dotProduct[T tensor.Numeric](
    eng compute.Engine[T],
    a, b *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error) {
    return eng.MatMul(context.Background(), a, b)
}
```
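
A call site might look like the following sketch, reusing the CPU engine and tensor constructors from earlier on this page; the same helper compiles unchanged for `float16.Float16` or any other `tensor.Numeric` type:

```go
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

// A 1x3 row vector times a 3x1 column vector is a dot product.
v1, _ := tensor.New[float32]([]int{1, 3}, []float32{1, 2, 3})
v2, _ := tensor.New[float32]([]int{3, 1}, []float32{4, 5, 6})

out, _ := dotProduct(eng, v1, v2)
fmt.Println(out.Data()) // [32]  (1*4 + 2*5 + 3*6)
```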

## Computation Graph

The `graph` package provides a computation graph compiler with operator fusion passes and CUDA graph capture for optimized inference:

| Feature | Description |
|---------|-------------|
| Operator fusion | Combines adjacent operations to reduce kernel launches |
| CUDA graph capture | Records and replays GPU execution for minimal launch overhead |
| Megakernel codegen | Generates fused GPU kernels at compile time |

## Package Reference

| Package | Description |
|---------|-------------|
| `tensor/` | Multi-type tensor storage (CPU, GPU, quantized) |
| `compute/` | Engine interface with CPU, CUDA, ROCm, and OpenCL implementations |
| `graph/` | Computation graph compiler with fusion and CUDA graph capture |
| `numeric/` | Type-safe `Arithmetic[T]` interface for all numeric types |
| `device/` | Device abstraction and memory allocators |
| `internal/cuda/` | Zero-CGo CUDA runtime bindings via purego, 25+ custom kernels |
| `internal/xblas/` | ARM NEON and x86 AVX2 SIMD assembly |
| `internal/gpuapi/` | GPU Runtime Abstraction Layer (CUDA/ROCm/OpenCL) |
| `internal/codegen/` | Megakernel code generator |

## Dependencies

ztensor depends on [float16]({{< relref "numeric-types" >}}) and [float8]({{< relref "numeric-types" >}}) for half-precision and FP8 arithmetic.
