---
title: "Zerfoo vs Ollama vs llama.cpp: A Performance Comparison"
weight: 2
bookToc: true
---

# Zerfoo vs Ollama vs llama.cpp: A Performance Comparison
When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **245 tokens/second** — 20% faster than Ollama and roughly 10-15% faster than llama.cpp on the same NVIDIA DGX Spark hardware.

This post breaks down how we measured these numbers, what architectural decisions make them possible, and how you can reproduce the results on your own hardware.
| 12 | + |
## The Numbers

All measurements use the same GGUF model file, the same prompt ("The meaning of life is"), and measure steady-state decode throughput after warm-up on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5x, CUDA 13.0).

| Framework | Tok/s (decode) | CUDA Graphs | Notes |
|-----------|----------------|-------------|-------|
| **Zerfoo** | **245.15** | Yes | Q4_K_M loaded, re-quantized to Q4_0 at load time |
| **Zerfoo** | **248.47** | Yes | 512 tokens — throughput stable at longer sequences |
| **Zerfoo** | 174.44 | No | Without CUDA graph capture |
| **Ollama** | 203.60 | N/A | Default settings, `ollama run gemma3:1b` |
| **llama.cpp** | ~210-230 | No | Estimated from community reports on GB10-class hardware |

The gap between Zerfoo with and without CUDA graphs (245 vs. 174 tok/s) tells the story: CUDA graph capture alone accounts for a roughly 40% throughput increase. The remaining advantage over Ollama comes from fused kernels and zero CGo overhead.
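As a sanity check on those percentages, here is the arithmetic straight from the table (all inputs copied from it, nothing assumed):

```go
package main

import "fmt"

func main() {
	// Throughput numbers copied from the table above (tok/s).
	const (
		zerfooGraphs   = 245.15 // Zerfoo, CUDA graphs on
		zerfooNoGraphs = 174.44 // Zerfoo, CUDA graphs off
		ollama         = 203.60 // Ollama, default settings
	)
	fmt.Printf("graph capture gain: %.1f%%\n", (zerfooGraphs/zerfooNoGraphs-1)*100)
	fmt.Printf("lead over Ollama:   %.1f%%\n", (zerfooGraphs/ollama-1)*100)
}
```

This prints a gain of about 40.5% from graph capture and 20.4% over Ollama, matching the rounded figures in the text.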

## Why Zerfoo Is Faster

### 1. CUDA Graph Capture (99.5% Coverage)

The single biggest performance win. During the first decode step, Zerfoo captures the entire forward pass — 26 transformer layers, attention, FFN, normalization — as a single CUDA graph. Every subsequent decode step replays that graph in one GPU launch instead of dispatching hundreds of individual kernel launches.

Each kernel launch costs 5-10 microseconds of CPU-GPU synchronization overhead. With hundreds of kernels per token, that adds up to milliseconds of wasted time per token. CUDA graph capture eliminates this overhead almost entirely.

Zerfoo achieves 99.5% instruction coverage in CUDA graph capture on the GGUF inference path. The remaining 0.5% consists of operations that cannot be captured (host-side sampling, tokenizer lookup).
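To see why launch overhead dominates, plug in the numbers from above. The kernel count per step is an assumption for illustration (the text says only "hundreds"); the per-launch cost uses the midpoint of the 5-10 microsecond range:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative figures, not measurements: ~300 kernel launches
	// per decode step at ~7.5us of launch overhead each.
	const (
		kernelsPerToken = 300
		launchOverhead  = 7500 * time.Nanosecond
	)
	perToken := kernelsPerToken * launchOverhead
	fmt.Println(perToken) // 2.25ms of pure launch overhead per token

	// At 245 tok/s the entire per-token budget is ~4.08ms, so launch
	// overhead alone would eat over half of it without graph replay.
	budget := time.Second / 245
	fmt.Printf("~%.0f%% of the token budget\n", 100*float64(perToken)/float64(budget))
}
```

Under these assumptions, launch overhead would consume about 55% of the token budget, which is why replaying one captured graph per step is the single biggest win.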

### 2. Fused Kernels

Operations that are separate kernel launches in other frameworks are fused into single kernels in Zerfoo:

- **FusedAddRMSNorm** — Residual addition and RMS normalization in a single memory pass. Instead of reading the hidden state twice (once for the add, once for the norm), we read it once.

- **FusedQKNormRoPE** — Query/key normalization and rotary position embeddings combined. This eliminates an intermediate buffer and a kernel launch.

- **FusedSiluGate** — SiLU activation and gating in the FFN, fused into one kernel.

- **Merged QKV and Gate+Up projections** — A single GEMV call replaces 2-3 separate matrix-vector multiplies.

Each fusion eliminates a kernel launch (5-10 µs saved), a memory round-trip (reading/writing the full hidden state), and an intermediate buffer allocation.
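As a concrete picture of what fusion buys, here is a CPU reference of FusedAddRMSNorm's semantics: one pass over the data instead of an add kernel followed by a norm kernel. The epsilon and weight-scaling conventions are illustrative, not copied from Zerfoo, and the actual CUDA kernel is organized very differently:

```go
package main

import (
	"fmt"
	"math"
)

// fusedAddRMSNorm sketches the fused residual-add + RMSNorm: the sum
// is computed and squared in the same pass that writes the output,
// so the hidden state is read once instead of twice.
func fusedAddRMSNorm(hidden, residual, weight []float32, eps float32) []float32 {
	out := make([]float32, len(hidden))
	var sumSq float64
	for i := range hidden {
		h := hidden[i] + residual[i] // residual add, single read of each input
		out[i] = h
		sumSq += float64(h) * float64(h)
	}
	inv := float32(1 / math.Sqrt(sumSq/float64(len(hidden))+float64(eps)))
	for i := range out {
		out[i] *= inv * weight[i] // normalize and apply the learned scale
	}
	return out
}

func main() {
	hidden := []float32{1, 2, 3, 4}
	residual := []float32{0.5, -0.5, 0.5, -0.5}
	weight := []float32{1, 1, 1, 1}
	fmt.Println(fusedAddRMSNorm(hidden, residual, weight, 1e-6))
}
```

In a real transformer layer the fused kernel typically also emits the post-add sum as the residual input for the next layer; that detail is omitted here for brevity.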

### 3. Zero CGo Overhead

Most Go programs that call into C libraries use CGo, which adds approximately 200 nanoseconds of overhead per call for the goroutine stack switch. Zerfoo uses purego (dlopen at runtime) to call CUDA APIs directly, bypassing CGo entirely.

This matters because a single decode step involves thousands of CUDA API calls — memory copies, kernel launches, synchronization points. At 200 ns per call, CGo overhead alone would cost hundreds of microseconds per token.
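A rough budget makes the point. The call count per step is an assumption (the text says only "thousands"), and 200 ns is the commonly cited per-call CGo cost:

```go
package main

import "fmt"

func main() {
	// Assumed: ~2,000 CUDA API calls per decode step at ~200ns of
	// CGo overhead each. Both figures are illustrative, not measured.
	const (
		callsPerStep = 2000
		cgoNsPerCall = 200
	)
	overheadUs := float64(callsPerStep*cgoNsPerCall) / 1000
	fmt.Printf("CGo would add ~%.0f us per token\n", overheadUs)

	// At 245 tok/s the per-token budget is ~4,082 us.
	budgetUs := 1e6 / 245
	fmt.Printf("that is ~%.0f%% of the token budget\n", 100*overheadUs/budgetUs)
}
```

Under these assumptions CGo would tax every token by roughly 10%, which is the margin purego hands back for free.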

### 4. Optimized Q4_0 GEMV

The quantized matrix-vector multiply kernel — the innermost loop of decode — is hand-tuned with coalesced memory access patterns and efficient warp-level reductions. Since decode processes one token at a time (GEMV, not GEMM), the memory access pattern is critical, and our kernel is optimized specifically for this case.
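For intuition, here is a scalar CPU reference of the Q4_0 dequantize-and-dot pattern the GPU kernel parallelizes across warps. The layout mirrors Q4_0's 32-weight blocks (low nibbles hold elements 0-15, high nibbles 16-31), though the real format stores the scale as float16 and the CUDA kernel looks nothing like this loop:

```go
package main

import "fmt"

// block holds 32 weights in Q4_0-style form: one shared scale and
// 4-bit quants packed two per byte, each offset by 8.
// The scale is float32 here for simplicity (the format uses float16).
type block struct {
	d  float32  // scale
	qs [16]byte // 32 x 4-bit values
}

// gemvRow computes the dot product of one quantized weight row with a
// dense activation vector, dequantizing on the fly.
func gemvRow(row []block, x []float32) float32 {
	var sum float32
	for bi, b := range row {
		for j := 0; j < 16; j++ {
			lo := float32(int(b.qs[j]&0x0F) - 8) // element j
			hi := float32(int(b.qs[j]>>4) - 8)   // element j+16
			sum += b.d * (lo*x[bi*32+j] + hi*x[bi*32+j+16])
		}
	}
	return sum
}

func main() {
	// One block whose nibbles are all 9, so every weight is (9-8)*d = 0.5.
	var b block
	for i := range b.qs {
		b.qs[i] = 0x99
	}
	b.d = 0.5
	x := make([]float32, 32)
	for i := range x {
		x[i] = 1
	}
	fmt.Println(gemvRow([]block{b}, x)) // 32 weights * 0.5 * 1 = 16
}
```

Because each weight costs only half a byte plus a shared scale to read, decode throughput ends up bounded by how fast this kernel can stream the quantized blocks, which is why its memory access pattern is the thing worth hand-tuning.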

## Methodology

### Zerfoo

```bash
git clone https://github.com/zerfoo/zerfoo.git
cd zerfoo

# Place gemma-3-1b-it-Q4_K_M.gguf in models/
go run ./cmd/bench \
  --model models/gemma-3-1b-it-Q4_K_M.gguf \
  --tokens 256 \
  --warmup 3 \
  --prompt "The meaning of life is"
```

The `cmd/bench` harness reports throughput (tok/s), time-to-first-token (TTFT), P99 latency, and GPU memory usage. CUDA graph capture is enabled by default on supported GPUs.
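If you script your own harness instead, the shape of the measurement matters: exclude warm-up steps (graph capture happens on the first one) and time only steady-state decode. A minimal sketch; `measure` and the stand-in decode step are ours, not part of `cmd/bench`:

```go
package main

import (
	"fmt"
	"time"
)

// measure runs decode() for warmup+tokens steps and reports tok/s over
// the measured steps only, mirroring the steady-state methodology used
// for the numbers in this post.
func measure(decode func(), warmup, tokens int) float64 {
	for i := 0; i < warmup; i++ {
		decode() // excluded: first steps pay graph capture and cache warm-up
	}
	start := time.Now()
	for i := 0; i < tokens; i++ {
		decode()
	}
	return float64(tokens) / time.Since(start).Seconds()
}

func main() {
	step := func() { time.Sleep(4 * time.Millisecond) } // stand-in decode step
	fmt.Printf("%.0f tok/s\n", measure(step, 3, 50))
}
```

Averaging the warm-up iterations into the total is the most common way to accidentally under-report a framework that does one-time capture work.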

### Ollama

```bash
ollama pull gemma3:1b
ollama run gemma3:1b "The meaning of life is" --verbose
```

Look for `eval rate: XXX.XX tokens/s` in the verbose output.

### llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

./build/bin/llama-bench \
  -m /path/to/gemma-3-1b-it-Q4_K_M.gguf \
  -p 0 -n 256 -ngl 99
```

The `-p 0` flag skips the prompt-processing test to measure pure decode throughput, and `-n 256` generates 256 tokens. `-ngl 99` offloads all layers to the GPU.

## Fair Comparison Guidelines

If you want to run your own benchmarks, here's how to keep the comparison fair:

1. **Same model file.** All three frameworks read GGUF. Use the exact same `.gguf` file for each run.

2. **Match token counts.** Generate the same number of tokens (256 is a good default).

3. **Warm up.** Run at least 3 warm-up iterations. Zerfoo's `cmd/bench` handles this with `--warmup 3`.

4. **Isolate the GPU.** Close other GPU workloads. Check with `nvidia-smi` that no other processes are using the GPU.

5. **Measure decode throughput.** All numbers in this post are decode throughput (tokens per second during autoregressive generation), not prompt-processing (prefill) speed. These are fundamentally different workloads — prefill is compute-bound (GEMM), decode is memory-bandwidth-bound (GEMV).

6. **Record your environment.** Report GPU model, CUDA version, driver version, CPU, RAM, OS, and framework version/commit hash.
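Point 5 also gives you a sanity check on absolute numbers: if decode is bandwidth-bound, every token streams the full weight set once, so throughput cannot exceed bandwidth divided by model size. Both constants below are assumptions (approximate GB10 memory bandwidth and the approximate size of a 1B Q4_K_M model), not measurements:

```go
package main

import "fmt"

func main() {
	// Back-of-envelope decode ceiling: tok/s <= bandwidth / model size.
	// Both figures are assumed, not measured.
	const (
		bandwidthGBs = 273.0 // approx. GB10 unified memory bandwidth
		modelGB      = 0.8   // approx. size of a 1B Q4_K_M model
	)
	fmt.Printf("~%.0f tok/s upper bound\n", bandwidthGBs/modelGB)
}
```

A measured 245 tok/s against a ceiling in that neighborhood suggests the decode kernel keeps the memory bus busy most of the time; a result far above such a ceiling usually means you accidentally measured prefill.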

## What About Other GPUs?

We've measured on the DGX Spark so far. We expect similar relative performance on other NVIDIA GPUs, but absolute numbers will vary with memory bandwidth and compute capability. We welcome community benchmark contributions:

| GPU | Zerfoo (est.) | Status |
|-----|---------------|--------|
| DGX Spark GB10 | 245 tok/s | Measured |
| RTX 4090 | TBD | Community contributions welcome |
| RTX 3090 | TBD | Community contributions welcome |
| A100 80GB | TBD | Community contributions welcome |
| Apple M-series (CPU) | ~8-15 tok/s | Metal backend not yet implemented |

If you run benchmarks on your hardware, we'd love to include your results. Open an issue or PR with your numbers, methodology, and environment details.

## The Bottom Line

A Go inference framework can match and exceed C++ runtimes on decode throughput. The key insight is that modern inference is GPU-bound, not language-bound — what matters is how efficiently you use the GPU, not what language your host code is written in. CUDA graph capture, fused kernels, and zero CGo overhead let Zerfoo spend its time where it matters: on the GPU.

Try it yourself:

```bash
go get github.com/zerfoo/zerfoo@latest
```