---
title: ztensor
weight: 1
bookToc: true
---

# ztensor

GPU-accelerated tensor, compute engine, and computation graph library for Go.

```bash
go get github.com/zerfoo/ztensor
```

## Overview

ztensor is the foundational tensor and compute library in the Zerfoo ecosystem. It provides multi-type tensor storage, a unified compute engine interface across CPU and GPU backends, a computation graph compiler with operator fusion, and GPU memory management -- all without CGo.

If you are building an ML inference engine, need GPU compute from Go, or want a typed tensor library, ztensor is the package to import.

## When to Use ztensor Directly

| Use case | Import |
|----------|--------|
| Tensor math, GPU compute, custom ML operators | `github.com/zerfoo/ztensor` directly |
| Transformer inference, model serving, training | `github.com/zerfoo/zerfoo` (imports ztensor internally) |

Import ztensor directly when you need tensor operations or GPU compute without the full inference/serving stack. If you are running transformer models, use zerfoo -- it builds on ztensor for you.

## Tensor Creation

Tensors are generic over all numeric types via the `tensor.Numeric` constraint:
```go
import (
	"fmt"

	"github.com/zerfoo/ztensor/tensor"
)

// Create a 2x3 float32 tensor
a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})

fmt.Println(a.Shape()) // [2 3]
fmt.Println(a.Data())  // [1 2 3 4 5 6]
```

Supported element types include `float32`, `float64`, `float16.Float16`, `float16.BFloat16`, `float8.Float8`, and all Go integer types.
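
The same generic constructor covers every supported element type. A minimal sketch, reusing only `tensor.New`, `Shape`, and `Data` from above:

```go
// A 2x2 int32 tensor -- integer element types use the same API.
m, _ := tensor.New[int32]([]int{2, 2}, []int32{1, 2, 3, 4})
fmt.Println(m.Shape()) // [2 2]

// A rank-1 float64 tensor.
v, _ := tensor.New[float64]([]int{4}, []float64{0.5, 1.5, 2.5, 3.5})
fmt.Println(v.Data()) // [0.5 1.5 2.5 3.5]
```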

## Compute Engine

All arithmetic flows through the `compute.Engine[T]` interface. This enables transparent CPU/GPU switching and CUDA graph capture.

### CPU Engine

```go
import (
	"context"
	"fmt"

	"github.com/zerfoo/ztensor/compute"
	"github.com/zerfoo/ztensor/numeric"
	"github.com/zerfoo/ztensor/tensor"
)

ctx := context.Background()
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})

c, _ := eng.MatMul(ctx, a, b)
fmt.Println(c.Shape()) // [2 2]
fmt.Println(c.Data())  // [22 28 49 64]
```

### GPU Engine

GPU libraries are loaded at runtime via purego -- no CGo, no build tags, no linking. If the GPU runtime is not available, the constructor returns an error and you fall back to CPU.

```go
// CUDA (NVIDIA GPUs)
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})

// ROCm (AMD GPUs)
eng, err := compute.NewROCmEngine[float32](numeric.Float32Ops{})

// OpenCL (cross-vendor)
eng, err := compute.NewOpenCLEngine[float32](numeric.Float32Ops{})
```

A common pattern is to try GPU first with a CPU fallback:

```go
eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})
if err != nil {
	eng = compute.NewCPUEngine[float32](numeric.Float32Ops{})
}
```

## Type-Safe Generics

Write functions that work across any numeric type:

```go
func dotProduct[T tensor.Numeric](
	eng compute.Engine[T],
	a, b *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error) {
	return eng.MatMul(context.Background(), a, b)
}
```
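
The helper runs unchanged against any engine. A sketch using the CPU engine constructor shown earlier; a 1x3 by 3x1 `MatMul` is a dot product:

```go
eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

// Row vector times column vector.
a, _ := tensor.New[float32]([]int{1, 3}, []float32{1, 2, 3})
b, _ := tensor.New[float32]([]int{3, 1}, []float32{4, 5, 6})

out, _ := dotProduct(eng, a, b)
fmt.Println(out.Data()) // [32] = 1*4 + 2*5 + 3*6
```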

## Computation Graph

The `graph` package provides a computation graph compiler with operator fusion passes and CUDA graph capture for optimized inference:

| Feature | Description |
|---------|-------------|
| Operator fusion | Combines adjacent operations to reduce kernel launches |
| CUDA graph capture | Records and replays GPU execution for minimal launch overhead |
| Megakernel codegen | Generates fused GPU kernels at compile time |
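
To make the fusion row concrete: fusing adjacent elementwise operations means one pass over the data instead of two, with no intermediate buffer. The toy below illustrates the idea on plain slices; it is a conceptual sketch, not the `graph` package's API:

```go
// Unfused: two loops and a temporary -- analogous to two kernel launches.
func mulThenAdd(x, y, z []float32) []float32 {
	tmp := make([]float32, len(x))
	for i := range x {
		tmp[i] = x[i] * y[i]
	}
	out := make([]float32, len(x))
	for i := range tmp {
		out[i] = tmp[i] + z[i]
	}
	return out
}

// Fused: one loop, no temporary -- what a fusion pass produces from
// adjacent multiply and add nodes.
func fusedMulAdd(x, y, z []float32) []float32 {
	out := make([]float32, len(x))
	for i := range x {
		out[i] = x[i]*y[i] + z[i]
	}
	return out
}
```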

## Package Reference

| Package | Description |
|---------|-------------|
| `tensor/` | Multi-type tensor storage (CPU, GPU, quantized) |
| `compute/` | Engine interface with CPU, CUDA, ROCm, and OpenCL implementations |
| `graph/` | Computation graph compiler with fusion and CUDA graph capture |
| `numeric/` | Type-safe `Arithmetic[T]` interface for all numeric types |
| `device/` | Device abstraction and memory allocators |
| `internal/cuda/` | Zero-CGo CUDA runtime bindings via purego, 25+ custom kernels |
| `internal/xblas/` | ARM NEON and x86 AVX2 SIMD assembly |
| `internal/gpuapi/` | GPU Runtime Abstraction Layer (CUDA/ROCm/OpenCL) |
| `internal/codegen/` | Megakernel code generator |

## Dependencies

ztensor depends on [float16]({{< relref "numeric-types" >}}) and [float8]({{< relref "numeric-types" >}}) for half-precision and FP8 arithmetic.