Commit 8888789

docs(tutorials): add model loading and text generation tutorials
1 parent 481a566 commit 8888789

2 files changed

Lines changed: 355 additions & 0 deletions


Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@

---
title: Model Loading
weight: 1
bookToc: true
---

# Model Loading and Architecture Support

This tutorial covers the GGUF model format, the architectures Zerfoo supports, how to load models programmatically with various options, and what quantization levels mean for memory and quality.

## The GGUF Format

GGUF (GPT-Generated Unified Format) is a single-file model format designed for efficient inference. A GGUF file contains everything needed to run a model:

- **Metadata**: architecture name, vocabulary size, hidden dimensions, RoPE parameters, chat template, and more.
- **Tokenizer**: the full BPE vocabulary and merge rules embedded in the file's metadata section.
- **Tensors**: all model weights, stored in their quantized or full-precision representation with shape information.
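
If you are curious what `inference.LoadFile` parses first, the fixed-size prefix of a GGUF file can be read with nothing but the standard library. The sketch below is not Zerfoo's loader -- it only decodes the header fields (magic, version, tensor count, metadata key count) that precede the metadata and tensor sections.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "os"
)

// ggufHeader mirrors the fixed-size prefix of a GGUF file. Everything after
// it (metadata key/value pairs, tensor infos, tensor data) is variable-length
// and sized by the two counts below.
type ggufHeader struct {
    Magic       uint32 // "GGUF" encoded little-endian
    Version     uint32 // current files use version 3
    TensorCount uint64
    MetadataKV  uint64
}

func main() {
    f, err := os.Open("model.gguf")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var h ggufHeader
    if err := binary.Read(f, binary.LittleEndian, &h); err != nil {
        panic(err)
    }
    fmt.Printf("version=%d tensors=%d metadata keys=%d\n",
        h.Version, h.TensorCount, h.MetadataKV)
}
```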

Zerfoo uses GGUF as its sole model format. When you call `inference.LoadFile`, the framework parses the GGUF header, extracts the tokenizer, reads the architecture metadata, and builds a typed computation graph -- all without any external config files.

```go
model, err := inference.LoadFile("path/to/model.gguf")
```

GGUF files are mmap-friendly. On Unix platforms, you can enable memory-mapped loading to avoid copying weights into the Go heap:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithMmap(true),
)
```

## Supported Architectures

Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata -- you do not need to specify it.

| Architecture | Key Features | Example Models |
|-------------|-------------|----------------|
| Llama 3 | RoPE theta=500K, GQA | Llama 3.2 1B/3B, Llama 3.1 8B/70B |
| Llama 4 | Extended Llama architecture | Llama 4 Scout |
| Gemma 3 | Tied embeddings, embedding scaling, QK norms, logit softcap | Gemma 3 1B/4B/12B/27B |
| Gemma 3n | Gemma 3 nano variant | Gemma 3n |
| Mistral | Sliding window attention | Mistral 7B v0.3 |
| Mixtral | Mixture of experts (MoE) with sliding window | Mixtral 8x7B |
| Qwen 2 | Attention bias, RoPE theta=1M | Qwen 2.5 7B/14B/72B |
| Phi 3/4 | Partial rotary factor | Phi-3 Mini, Phi-4 |
| DeepSeek V3 | Multi-head Latent Attention (MLA), batched MoE | DeepSeek V3 |
| Falcon | Multi-query attention | Falcon 7B/40B |
| Command-R | Retrieval-augmented generation architecture | Command-R |
| Jamba | Hybrid Mamba-Transformer architecture | Jamba |
| Mamba/Mamba3 | State-space model (SSM), no attention | Mamba |
| LLaVA | Vision-language multimodal | LLaVA |

Each architecture has a dedicated builder in the `inference/` package (e.g., `arch_llama.go`, `arch_gemma.go`, `arch_deepseek.go`). The builder reads architecture-specific metadata fields and constructs the computation graph with the correct layer structure, attention mechanism, and normalization.
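
The builder registry itself is internal to Zerfoo, but the dispatch idea is easy to picture. The sketch below is illustrative only -- the names are invented for this tutorial, not the real API; the point is that the value of the GGUF `general.architecture` key selects which builder runs.

```go
package main

import "fmt"

// graphBuilder stands in for an architecture-specific graph constructor. A
// real builder would assemble layers, attention, and normalization from the
// metadata it receives; here the bodies are empty placeholders.
type graphBuilder func(meta map[string]any) error

var builders = map[string]graphBuilder{
    "llama":  func(meta map[string]any) error { return nil }, // Llama-style graph
    "gemma3": func(meta map[string]any) error { return nil }, // Gemma 3-style graph
}

func buildGraph(meta map[string]any) error {
    arch, _ := meta["general.architecture"].(string)
    b, ok := builders[arch]
    if !ok {
        return fmt.Errorf("unsupported architecture %q", arch)
    }
    return b(meta)
}

func main() {
    fmt.Println(buildGraph(map[string]any{"general.architecture": "llama"})) // <nil>
}
```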

## Loading Models Programmatically

The `inference.LoadFile` function accepts functional options that control device placement, precision, and sequence length.

### Device Selection

```go
// CPU inference (default).
model, err := inference.LoadFile("model.gguf")

// CUDA GPU inference.
model, err = inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
)
```

### Compute Precision

```go
// FP16 compute -- activations are converted F32->FP16 before GPU kernels.
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithDType("fp16"),
)

// FP8 quantization -- weights are quantized to FP8 E4M3 at load time.
model, err = inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithDType("fp8"),
)
```

### Sequence Length

Override the model's default maximum context length:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithMaxSeqLen(4096),
)
```

### TensorRT Backend

For maximum throughput on NVIDIA GPUs, enable the TensorRT backend:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithBackend("tensorrt"),
    inference.WithPrecision("fp16"),
)
```

### Model Aliases

Zerfoo maintains a table of short aliases for popular HuggingFace repositories. You can resolve an alias to its full repo ID or register your own:

```go
// Resolves "gemma-3-1b-q4" -> "google/gemma-3-1b-it-qat-q4_0-gguf".
repoID := inference.ResolveAlias("gemma-3-1b-q4")

// Register a custom alias.
inference.RegisterAlias("my-model", "myorg/my-model-GGUF")
```

## Understanding Quantization

Quantization reduces model weights from 16- or 32-bit floats to lower-precision integers, trading a small amount of quality for significant memory savings and faster inference.

Common GGUF quantization types:

| Type | Bits/Weight | Memory (7B model) | Quality | Use Case |
|------|------------|-------------------|---------|----------|
| F16 | 16 | ~14 GB | Baseline | Full quality, GPU with ample VRAM |
| Q8_0 | 8 | ~7 GB | Near-lossless | Best quality-to-size ratio |
| Q4_K_M | ~4.5 | ~4 GB | Good | Recommended default for most users |
| Q4_0 | 4 | ~3.5 GB | Acceptable | Minimum viable quality |

The quantization type is baked into the GGUF file at conversion time. Zerfoo reads the quantization metadata from each tensor and applies the correct dequantization during inference. You do not need to specify the quantization type at load time.

For a 1B parameter model like Gemma 3 1B with Q4_K_M quantization, expect roughly 800 MB of memory usage -- small enough to run on a laptop CPU.
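
The sizes in the table are simple arithmetic: parameter count times bits per weight, divided by eight. A quick sketch of that estimate (weights only -- the runtime adds the KV cache, activations, and framework overhead, which is why the 1B figure above is closer to 800 MB than the ~540 MB the formula gives):

```go
package main

import "fmt"

// approxWeightBytes is a back-of-envelope estimate of weight memory:
// parameter count times bits per weight, divided by 8.
func approxWeightBytes(params, bitsPerWeight float64) float64 {
    return params * bitsPerWeight / 8
}

func main() {
    const gib = 1 << 30
    const mib = 1 << 20
    fmt.Printf("7B Q8_0:   %.1f GiB\n", approxWeightBytes(7e9, 8)/gib)   // ~6.5 GiB
    fmt.Printf("7B Q4_K_M: %.1f GiB\n", approxWeightBytes(7e9, 4.5)/gib) // ~3.7 GiB
    fmt.Printf("1B Q4_K_M: %.0f MiB\n", approxWeightBytes(1e9, 4.5)/mib) // ~536 MiB
}
```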

## Inspecting Model Metadata

After loading a model, you can access its metadata:

```go
model, err := inference.LoadFile("model.gguf")
if err != nil {
    log.Fatal(err)
}
defer model.Close()

info := model.Info()
fmt.Printf("Architecture: %s\n", info.Architecture)
fmt.Printf("Parameters: %d\n", info.Parameters)
```

## Next Steps

- [Text Generation Deep Dive](/docs/tutorials/text-generation/) -- sampling strategies, streaming, and performance tuning.
- [Running the OpenAI-Compatible API Server](/docs/api/) -- serve models over HTTP.

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@

---
title: Text Generation
weight: 2
bookToc: true
---

# Text Generation Deep Dive

This tutorial explores how Zerfoo generates text: sampling strategies, streaming responses token by token, KV cache behavior, and batch generation for throughput.

## How Autoregressive Generation Works

Transformer models generate text one token at a time. At each step, the model computes a probability distribution over the vocabulary (logits), a token is selected, and it becomes part of the input for the next step. The `generate` package implements this loop with configurable sampling, stopping conditions, and KV caching.

When you call `model.Generate`, this is what happens internally:

1. The prompt is tokenized using the BPE tokenizer embedded in the GGUF file.
2. A `SamplingConfig` is built from the options you pass.
3. The prompt tokens run through the computation graph in a single forward pass (prefill).
4. The KV cache stores key/value activations so they are not recomputed on subsequent steps.
5. One token is generated per step (decode) until a stop condition is met.
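
The sketch below compresses that loop into a few lines. It is a toy, not Zerfoo's implementation: `forward` fakes a model, and real decoding applies the sampling filters described in the next section instead of taking one fixed next token.

```go
package main

import "fmt"

// forward stands in for a model. A real forward pass returns logits over the
// vocabulary and a sampler picks the next token; here we just emit last+1.
func forward(tokens []int) int { return tokens[len(tokens)-1] + 1 }

// generate runs the autoregressive loop: the prompt is processed once
// (prefill), then tokens are produced one per step (decode) until an EOS
// token appears or the new-token budget runs out.
func generate(prompt []int, eos, maxNewTokens int) []int {
    tokens := append([]int(nil), prompt...)
    for i := 0; i < maxNewTokens; i++ {
        next := forward(tokens) // in a real model, the KV cache makes this step cheap
        if next == eos {
            break
        }
        tokens = append(tokens, next)
    }
    return tokens
}

func main() {
    fmt.Println(generate([]int{5, 6, 7}, 12, 16)) // [5 6 7 8 9 10 11]: stops at the toy EOS (12)
}
```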

## Sampling Strategies

Sampling controls how the next token is chosen from the logit distribution. Zerfoo supports several strategies that can be combined.

### Temperature

Temperature scales the logits before converting them to probabilities. Lower values make the distribution sharper (more deterministic), higher values make it flatter (more creative).

```go
// Deterministic output (greedy decoding).
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0),
)

// Creative output.
result, _ = model.Generate(ctx, prompt,
    inference.WithTemperature(1.2),
)
```

A temperature of 0 selects the highest-probability token every time (greedy). A temperature of 1.0 samples proportionally to the probabilities. Values above 1.0 increase randomness.

### Top-K Sampling

Top-K restricts the candidate set to the K most probable tokens before sampling. This prevents the model from selecting very unlikely tokens.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopK(40),
)
```

When `TopK` is 0 (the default), all tokens are candidates.

### Top-P (Nucleus) Sampling

Top-P keeps the smallest set of tokens whose cumulative probability exceeds P. This adapts the candidate set size dynamically -- confident predictions use fewer candidates, uncertain predictions use more.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopP(0.9),
)
```

When `TopP` is 1.0 (the default), no filtering is applied. Top-K and Top-P can be combined: Top-K filters first, then Top-P filters the remainder.
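
To make the combination concrete, here is a conceptual sketch of the two filters composed in that order. It runs on a tiny hand-written distribution and is not Zerfoo's sampler; the framework applies the same idea to the full logit vector at every decode step.

```go
package main

import (
    "fmt"
    "sort"
)

type candidate struct {
    token string
    prob  float64
}

// filterTopKTopP keeps the k most probable tokens, then trims that set to the
// smallest prefix whose cumulative probability reaches p, and renormalizes.
func filterTopKTopP(cands []candidate, k int, p float64) []candidate {
    sort.Slice(cands, func(i, j int) bool { return cands[i].prob > cands[j].prob })

    // Top-K first.
    if k > 0 && k < len(cands) {
        cands = cands[:k]
    }

    // Then Top-P over the survivors.
    cum, cut := 0.0, len(cands)
    for i, c := range cands {
        cum += c.prob
        if cum >= p {
            cut = i + 1
            break
        }
    }
    cands = cands[:cut]

    // Renormalize so the kept probabilities sum to 1 before sampling.
    total := 0.0
    for _, c := range cands {
        total += c.prob
    }
    out := make([]candidate, len(cands))
    for i, c := range cands {
        out[i] = candidate{c.token, c.prob / total}
    }
    return out
}

func main() {
    dist := []candidate{{"the", 0.50}, {"a", 0.30}, {"dog", 0.15}, {"xylo", 0.05}}
    // Top-K(3) drops "xylo"; Top-P(0.9) keeps the remaining three (cumulative 0.95).
    fmt.Println(filterTopKTopP(dist, 3, 0.9))
}
```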

### Repetition Penalty

Repetition penalty reduces the probability of tokens that have already appeared in the output. A value of 1.0 disables the penalty; values above 1.0 penalize repetition.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithRepetitionPenalty(1.1),
)
```

### Recommended Defaults

For most use cases, a good starting point is:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.7),
    inference.WithTopP(0.9),
    inference.WithMaxTokens(256),
)
```

## Streaming Responses

For interactive applications, you often want to display tokens as they are generated rather than waiting for the full response. The `GenerateStream` method accepts a callback that receives each token:

```go
err := model.GenerateStream(ctx, "Tell me a story.",
    func(token string) bool {
        fmt.Print(token)
        // Return true to continue, false to stop early.
        return true
    },
    inference.WithTemperature(0.8),
    inference.WithMaxTokens(512),
)
```

The callback function implements the `generate.TokenStream` type. It receives each decoded token string and returns a boolean: `true` to continue generation, `false` to stop immediately.

## Stop Conditions

Generation stops when any of these conditions is met:

1. The end-of-sequence (EOS) token is generated.
2. `MaxNewTokens` is reached.
3. A stop string is found in the output.
4. The streaming callback returns `false`.
5. The context is cancelled.

You can set custom stop strings:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithMaxTokens(512),
    inference.WithStopStrings("\n\n", "END"),
)
```

## Constrained Decoding with Grammars

Zerfoo supports grammar-constrained generation using the `grammar` package. At each sampling step, a token mask restricts output to tokens valid according to the grammar:

```go
import "github.com/zerfoo/zerfoo/generate/grammar"

g, err := grammar.Parse(`root ::= "{" ws "\"name\"" ws ":" ws string "}" ...`)
result, _ := model.Generate(ctx, "Generate a JSON object with a name field.",
    inference.WithGrammar(g),
    inference.WithMaxTokens(128),
)
```

This is useful for generating structured output like JSON, SQL, or code that must conform to a specific syntax.

## KV Cache and Performance

The KV (Key-Value) cache is the single most important optimization for autoregressive generation. Without it, every decode step would reprocess the entire sequence from scratch.

### How It Works

During the prefill phase, the model computes attention keys and values for all prompt tokens and stores them in the KV cache. During decode, only the new token is processed -- its keys and values are appended to the cache, and attention is computed against all cached entries.

### Memory Considerations

KV cache memory grows linearly with sequence length and model size. For a 7B model with 32 layers and 4096 context length, the KV cache can use 1-2 GB of memory in FP32. You can halve this with FP16 KV storage:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithKVDtype("fp16"),
)
```
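
As a rough rule of thumb, KV cache size is 2 (keys and values) x layers x context length x KV heads x head dimension x bytes per element. Exact layouts differ by architecture (GQA, MLA, sliding windows), so the sketch below is an estimate rather than Zerfoo's accounting; the shapes are typical for a 7B-class GQA model.

```go
package main

import "fmt"

// kvCacheBytes estimates KV cache size: keys and values for every layer,
// every position, and every KV head.
func kvCacheBytes(layers, seqLen, kvHeads, headDim, bytesPerElem int) int {
    return 2 * layers * seqLen * kvHeads * headDim * bytesPerElem
}

func main() {
    const gib = 1 << 30
    // 32 layers, 4096 context, 8 KV heads (GQA), head dim 128.
    fmt.Printf("FP32 KV cache: %.2f GiB\n", float64(kvCacheBytes(32, 4096, 8, 128, 4))/gib) // 1.00 GiB
    fmt.Printf("FP16 KV cache: %.2f GiB\n", float64(kvCacheBytes(32, 4096, 8, 128, 2))/gib) // 0.50 GiB
}
```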

### Paged KV Cache

For serving multiple concurrent requests, Zerfoo supports paged KV caching at the generator level. Paged KV allocates memory in blocks from a shared pool rather than pre-allocating the full sequence length per request. This significantly improves memory utilization when serving requests of varying lengths.
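
A minimal sketch of the block-pool idea follows. It is purely illustrative (the block size and types are invented for this tutorial): sequences claim fixed-size blocks from a shared free list as they grow, so a short request never pins memory for context it will not use.

```go
package main

import "fmt"

const blockTokens = 16 // tokens covered by one KV block (illustrative)

// blockPool hands out block IDs from a shared free list.
type blockPool struct{ free []int }

func newBlockPool(n int) *blockPool {
    p := &blockPool{free: make([]int, n)}
    for i := range p.free {
        p.free[i] = i
    }
    return p
}

func (p *blockPool) alloc() (int, bool) {
    if len(p.free) == 0 {
        return 0, false // pool exhausted: the request must wait
    }
    id := p.free[len(p.free)-1]
    p.free = p.free[:len(p.free)-1]
    return id, true
}

// sequence tracks which blocks hold its KV entries and grows lazily.
type sequence struct{ blocks []int }

func (s *sequence) ensure(p *blockPool, tokens int) bool {
    need := (tokens + blockTokens - 1) / blockTokens
    for len(s.blocks) < need {
        id, ok := p.alloc()
        if !ok {
            return false
        }
        s.blocks = append(s.blocks, id)
    }
    return true
}

func main() {
    pool := newBlockPool(64)
    var short, long sequence
    short.ensure(pool, 10)  // 1 block
    long.ensure(pool, 1000) // 63 blocks
    fmt.Printf("short=%d blocks, long=%d blocks, %d blocks free\n",
        len(short.blocks), len(long.blocks), len(pool.free))
}
```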

### CUDA Graph Capture

On CUDA devices, Zerfoo captures the decode step as a CUDA graph after the first execution. Subsequent decode steps replay the captured graph, eliminating CPU-side kernel launch overhead. This is why sessions are pooled in `inference.Model` -- reusing sessions preserves GPU memory addresses required for graph replay.

## Batch Generation

When you have multiple prompts to process, batch generation is more efficient than sequential calls:

```go
prompts := []string{
    "Summarize quantum computing in one sentence.",
    "What is the capital of Japan?",
    "Explain REST APIs briefly.",
}

results, err := model.GenerateBatch(ctx, prompts,
    inference.WithTemperature(0.5),
    inference.WithMaxTokens(64),
)
for i, r := range results {
    fmt.Printf("Prompt %d: %s\n", i+1, r)
}
```

`GenerateBatch` processes prompts concurrently using the session pool, taking advantage of GPU parallelism when available.

## Next Steps

- [Running the OpenAI-Compatible API Server](/docs/api/) -- serve your model over HTTP.
