Commit 8888789

docs(tutorials): add model loading and text generation tutorials
1 parent 481a566 commit 8888789

2 files changed

Lines changed: 355 additions & 0 deletions


Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@

---
title: Model Loading
weight: 1
bookToc: true
---

# Model Loading and Architecture Support

This tutorial covers the GGUF model format, the architectures Zerfoo supports, how to load models programmatically with various options, and what quantization levels mean for memory and quality.

## The GGUF Format

GGUF (GPT-Generated Unified Format) is a single-file model format designed for efficient inference. A GGUF file contains everything needed to run a model:

- **Metadata**: architecture name, vocabulary size, hidden dimensions, RoPE parameters, chat template, and more.
- **Tokenizer**: the full BPE vocabulary and merge rules embedded in the file's metadata section.
- **Tensors**: all model weights, stored in their quantized or full-precision representation with shape information.
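
If you are curious what `inference.LoadFile` parses first, the fixed-size prefix of a GGUF file can be read with nothing but the standard library. The sketch below is not Zerfoo's loader -- it only decodes the header fields (magic, version, tensor count, metadata key count) that precede the metadata and tensor sections.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "os"
)

// ggufHeader mirrors the fixed-size prefix of a GGUF file. Everything after
// it (metadata key/value pairs, tensor infos, tensor data) is variable-length
// and sized by the two counts below.
type ggufHeader struct {
    Magic       uint32 // "GGUF" encoded little-endian
    Version     uint32 // current files use version 3
    TensorCount uint64
    MetadataKV  uint64
}

func main() {
    f, err := os.Open("model.gguf")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var h ggufHeader
    if err := binary.Read(f, binary.LittleEndian, &h); err != nil {
        panic(err)
    }
    fmt.Printf("version=%d tensors=%d metadata keys=%d\n",
        h.Version, h.TensorCount, h.MetadataKV)
}
```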

Zerfoo uses GGUF as its sole model format. When you call `inference.LoadFile`, the framework parses the GGUF header, extracts the tokenizer, reads the architecture metadata, and builds a typed computation graph -- all without any external config files.

```go
model, err := inference.LoadFile("path/to/model.gguf")
```

GGUF files are mmap-friendly. On Unix platforms, you can enable memory-mapped loading to avoid copying weights into the Go heap:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithMmap(true),
)
```

## Supported Architectures

Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata -- you do not need to specify it.

| Architecture | Key Features | Example Models |
|-------------|-------------|----------------|
| Llama 3 | RoPE theta=500K, GQA | Llama 3.2 1B/3B, Llama 3.1 8B/70B |
| Llama 4 | Extended Llama architecture | Llama 4 Scout |
| Gemma 3 | Tied embeddings, embedding scaling, QK norms, logit softcap | Gemma 3 1B/4B/12B/27B |
| Gemma 3n | Gemma 3 nano variant | Gemma 3n |
| Mistral | Sliding window attention | Mistral 7B v0.3 |
| Mixtral | Mixture of experts (MoE) with sliding window | Mixtral 8x7B |
| Qwen 2 | Attention bias, RoPE theta=1M | Qwen 2.5 7B/14B/72B |
| Phi 3/4 | Partial rotary factor | Phi-3 Mini, Phi-4 |
| DeepSeek V3 | Multi-head Latent Attention (MLA), batched MoE | DeepSeek V3 |
| Falcon | Multi-query attention | Falcon 7B/40B |
| Command-R | Retrieval-augmented generation architecture | Command-R |
| Jamba | Hybrid Mamba-Transformer architecture | Jamba |
| Mamba/Mamba3 | State-space model (SSM), no attention | Mamba |
| LLaVA | Vision-language multimodal | LLaVA |

Each architecture has a dedicated builder in the `inference/` package (e.g., `arch_llama.go`, `arch_gemma.go`, `arch_deepseek.go`). The builder reads architecture-specific metadata fields and constructs the computation graph with the correct layer structure, attention mechanism, and normalization.
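
The builder registry itself is internal to Zerfoo, but the dispatch idea is easy to picture. The sketch below is illustrative only -- the names are invented for this tutorial, not the real API; the point is that the value of the GGUF `general.architecture` key selects which builder runs.

```go
package main

import "fmt"

// graphBuilder stands in for an architecture-specific graph constructor. A
// real builder would assemble layers, attention, and normalization from the
// metadata it receives; here the bodies are empty placeholders.
type graphBuilder func(meta map[string]any) error

var builders = map[string]graphBuilder{
    "llama":  func(meta map[string]any) error { return nil }, // Llama-style graph
    "gemma3": func(meta map[string]any) error { return nil }, // Gemma 3-style graph
}

func buildGraph(meta map[string]any) error {
    arch, _ := meta["general.architecture"].(string)
    b, ok := builders[arch]
    if !ok {
        return fmt.Errorf("unsupported architecture %q", arch)
    }
    return b(meta)
}

func main() {
    fmt.Println(buildGraph(map[string]any{"general.architecture": "llama"})) // <nil>
}
```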

## Loading Models Programmatically

The `inference.LoadFile` function accepts functional options that control device placement, precision, and sequence length.

### Device Selection

```go
// CPU inference (default).
model, err := inference.LoadFile("model.gguf")

// CUDA GPU inference.
model, err = inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
)
```

### Compute Precision

```go
// FP16 compute -- activations are converted F32->FP16 before GPU kernels.
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithDType("fp16"),
)

// FP8 quantization -- weights are quantized to FP8 E4M3 at load time.
model, err = inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithDType("fp8"),
)
```

### Sequence Length

Override the model's default maximum context length:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithMaxSeqLen(4096),
)
```

### TensorRT Backend

For maximum throughput on NVIDIA GPUs, enable the TensorRT backend:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithBackend("tensorrt"),
    inference.WithPrecision("fp16"),
)
```

### Model Aliases

Zerfoo maintains a table of short aliases for popular HuggingFace repositories. You can resolve an alias to its full repo ID or register your own:

```go
// Resolves "gemma-3-1b-q4" -> "google/gemma-3-1b-it-qat-q4_0-gguf".
repoID := inference.ResolveAlias("gemma-3-1b-q4")

// Register a custom alias.
inference.RegisterAlias("my-model", "myorg/my-model-GGUF")
```

## Understanding Quantization

Quantization reduces model weights from 16- or 32-bit floats to lower-precision integers, trading a small amount of quality for significant memory savings and faster inference.

Common GGUF quantization types:

| Type | Bits/Weight | Memory (7B model) | Quality | Use Case |
|------|------------|-------------------|---------|----------|
| F16 | 16 | ~14 GB | Baseline | Full quality, GPU with ample VRAM |
| Q8_0 | 8 | ~7 GB | Near-lossless | Best quality-to-size ratio |
| Q4_K_M | ~4.5 | ~4 GB | Good | Recommended default for most users |
| Q4_0 | 4 | ~3.5 GB | Acceptable | Minimum viable quality |

The quantization type is baked into the GGUF file at conversion time. Zerfoo reads the quantization metadata from each tensor and applies the correct dequantization during inference. You do not need to specify the quantization type at load time.

For a 1B parameter model like Gemma 3 1B with Q4_K_M quantization, expect roughly 800 MB of memory usage -- small enough to run on a laptop CPU.
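
The sizes in the table are simple arithmetic: parameter count times bits per weight, divided by eight. A quick sketch of that estimate (weights only -- the runtime adds the KV cache, activations, and framework overhead, which is why the 1B figure above is closer to 800 MB than the ~540 MB the formula gives):

```go
package main

import "fmt"

// approxWeightBytes is a back-of-envelope estimate of weight memory:
// parameter count times bits per weight, divided by 8.
func approxWeightBytes(params, bitsPerWeight float64) float64 {
    return params * bitsPerWeight / 8
}

func main() {
    const gib = 1 << 30
    const mib = 1 << 20
    fmt.Printf("7B Q8_0:   %.1f GiB\n", approxWeightBytes(7e9, 8)/gib)   // ~6.5 GiB
    fmt.Printf("7B Q4_K_M: %.1f GiB\n", approxWeightBytes(7e9, 4.5)/gib) // ~3.7 GiB
    fmt.Printf("1B Q4_K_M: %.0f MiB\n", approxWeightBytes(1e9, 4.5)/mib) // ~536 MiB
}
```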

## Inspecting Model Metadata

After loading a model, you can access its metadata:

```go
model, err := inference.LoadFile("model.gguf")
if err != nil {
    log.Fatal(err)
}
defer model.Close()

info := model.Info()
fmt.Printf("Architecture: %s\n", info.Architecture)
fmt.Printf("Parameters: %d\n", info.Parameters)
```

## Next Steps

- [Text Generation Deep Dive](/docs/tutorials/text-generation/) -- sampling strategies, streaming, and performance tuning.
- [Running the OpenAI-Compatible API Server](/docs/api/) -- serve models over HTTP.

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@

---
title: Text Generation
weight: 2
bookToc: true
---

# Text Generation Deep Dive

This tutorial explores how Zerfoo generates text: sampling strategies, streaming responses token by token, KV cache behavior, and batch generation for throughput.

## How Autoregressive Generation Works

Transformer models generate text one token at a time. At each step, the model computes a probability distribution over the vocabulary (logits), a token is selected, and it becomes part of the input for the next step. The `generate` package implements this loop with configurable sampling, stopping conditions, and KV caching.

When you call `model.Generate`, this is what happens internally:

1. The prompt is tokenized using the BPE tokenizer embedded in the GGUF file.
2. A `SamplingConfig` is built from the options you pass.
3. The prompt tokens run through the computation graph in a single forward pass (prefill).
4. The KV cache stores key/value activations so they are not recomputed on subsequent steps.
5. One token is generated per step (decode) until a stop condition is met.
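
The sketch below compresses that loop into a few lines. It is a toy, not Zerfoo's implementation: `forward` fakes a model, and real decoding applies the sampling filters described in the next section instead of taking one fixed next token.

```go
package main

import "fmt"

// forward stands in for a model. A real forward pass returns logits over the
// vocabulary and a sampler picks the next token; here we just emit last+1.
func forward(tokens []int) int { return tokens[len(tokens)-1] + 1 }

// generate runs the autoregressive loop: the prompt is processed once
// (prefill), then tokens are produced one per step (decode) until an EOS
// token appears or the new-token budget runs out.
func generate(prompt []int, eos, maxNewTokens int) []int {
    tokens := append([]int(nil), prompt...)
    for i := 0; i < maxNewTokens; i++ {
        next := forward(tokens) // in a real model, the KV cache makes this step cheap
        if next == eos {
            break
        }
        tokens = append(tokens, next)
    }
    return tokens
}

func main() {
    fmt.Println(generate([]int{5, 6, 7}, 12, 16)) // [5 6 7 8 9 10 11]: stops at the toy EOS (12)
}
```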

## Sampling Strategies

Sampling controls how the next token is chosen from the logit distribution. Zerfoo supports several strategies that can be combined.

### Temperature

Temperature scales the logits before converting them to probabilities. Lower values make the distribution sharper (more deterministic), higher values make it flatter (more creative).

```go
// Deterministic output (greedy decoding).
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0),
)

// Creative output.
result, _ = model.Generate(ctx, prompt,
    inference.WithTemperature(1.2),
)
```

A temperature of 0 selects the highest-probability token every time (greedy). A temperature of 1.0 samples proportionally to the probabilities. Values above 1.0 increase randomness.

### Top-K Sampling

Top-K restricts the candidate set to the K most probable tokens before sampling. This prevents the model from selecting very unlikely tokens.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopK(40),
)
```

When `TopK` is 0 (the default), all tokens are candidates.

### Top-P (Nucleus) Sampling

Top-P keeps the smallest set of tokens whose cumulative probability exceeds P. This adapts the candidate set size dynamically -- confident predictions use fewer candidates, uncertain predictions use more.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopP(0.9),
)
```

When `TopP` is 1.0 (the default), no filtering is applied. Top-K and Top-P can be combined: Top-K filters first, then Top-P filters the remainder.
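
To make the combination concrete, here is a conceptual sketch of the two filters composed in that order. It runs on a tiny hand-written distribution and is not Zerfoo's sampler; the framework applies the same idea to the full logit vector at every decode step.

```go
package main

import (
    "fmt"
    "sort"
)

type candidate struct {
    token string
    prob  float64
}

// filterTopKTopP keeps the k most probable tokens, then trims that set to the
// smallest prefix whose cumulative probability reaches p, and renormalizes.
func filterTopKTopP(cands []candidate, k int, p float64) []candidate {
    sort.Slice(cands, func(i, j int) bool { return cands[i].prob > cands[j].prob })

    // Top-K first.
    if k > 0 && k < len(cands) {
        cands = cands[:k]
    }

    // Then Top-P over the survivors.
    cum, cut := 0.0, len(cands)
    for i, c := range cands {
        cum += c.prob
        if cum >= p {
            cut = i + 1
            break
        }
    }
    cands = cands[:cut]

    // Renormalize so the kept probabilities sum to 1 before sampling.
    total := 0.0
    for _, c := range cands {
        total += c.prob
    }
    out := make([]candidate, len(cands))
    for i, c := range cands {
        out[i] = candidate{c.token, c.prob / total}
    }
    return out
}

func main() {
    dist := []candidate{{"the", 0.50}, {"a", 0.30}, {"dog", 0.15}, {"xylo", 0.05}}
    // Top-K(3) drops "xylo"; Top-P(0.9) keeps the remaining three (cumulative 0.95).
    fmt.Println(filterTopKTopP(dist, 3, 0.9))
}
```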

### Repetition Penalty

Repetition penalty reduces the probability of tokens that have already appeared in the output. A value of 1.0 disables the penalty; values above 1.0 penalize repetition.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithRepetitionPenalty(1.1),
)
```

### Recommended Defaults

For most use cases, a good starting point is:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.7),
    inference.WithTopP(0.9),
    inference.WithMaxTokens(256),
)
```

## Streaming Responses

For interactive applications, you often want to display tokens as they are generated rather than waiting for the full response. The `GenerateStream` method accepts a callback that receives each token:

```go
err := model.GenerateStream(ctx, "Tell me a story.",
    func(token string) bool {
        fmt.Print(token)
        // Return true to continue, false to stop early.
        return true
    },
    inference.WithTemperature(0.8),
    inference.WithMaxTokens(512),
)
```

The callback function implements the `generate.TokenStream` type. It receives each decoded token string and returns a boolean: `true` to continue generation, `false` to stop immediately.

## Stop Conditions

Generation stops when any of these conditions is met:

1. The end-of-sequence (EOS) token is generated.
2. `MaxNewTokens` is reached.
3. A stop string is found in the output.
4. The streaming callback returns `false`.
5. The context is cancelled.

You can set custom stop strings:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithMaxTokens(512),
    inference.WithStopStrings("\n\n", "END"),
)
```

## Constrained Decoding with Grammars

Zerfoo supports grammar-constrained generation using the `grammar` package. At each sampling step, a token mask restricts output to tokens valid according to the grammar:

```go
import "github.com/zerfoo/zerfoo/generate/grammar"

g, err := grammar.Parse(`root ::= "{" ws "\"name\"" ws ":" ws string "}" ...`)
result, _ := model.Generate(ctx, "Generate a JSON object with a name field.",
    inference.WithGrammar(g),
    inference.WithMaxTokens(128),
)
```

This is useful for generating structured output like JSON, SQL, or code that must conform to a specific syntax.

## KV Cache and Performance

The KV (Key-Value) cache is the single most important optimization for autoregressive generation. Without it, every decode step would reprocess the entire sequence from scratch.

### How It Works

During the prefill phase, the model computes attention keys and values for all prompt tokens and stores them in the KV cache. During decode, only the new token is processed -- its keys and values are appended to the cache, and attention is computed against all cached entries.

### Memory Considerations

KV cache memory grows linearly with sequence length and model size. For a 7B model with 32 layers and 4096 context length, the KV cache can use 1-2 GB of memory in FP32. You can halve this with FP16 KV storage:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithKVDtype("fp16"),
)
```
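
As a rough rule of thumb, KV cache size is 2 (keys and values) x layers x context length x KV heads x head dimension x bytes per element. Exact layouts differ by architecture (GQA, MLA, sliding windows), so the sketch below is an estimate rather than Zerfoo's accounting; the shapes are typical for a 7B-class GQA model.

```go
package main

import "fmt"

// kvCacheBytes estimates KV cache size: keys and values for every layer,
// every position, and every KV head.
func kvCacheBytes(layers, seqLen, kvHeads, headDim, bytesPerElem int) int {
    return 2 * layers * seqLen * kvHeads * headDim * bytesPerElem
}

func main() {
    const gib = 1 << 30
    // 32 layers, 4096 context, 8 KV heads (GQA), head dim 128.
    fmt.Printf("FP32 KV cache: %.2f GiB\n", float64(kvCacheBytes(32, 4096, 8, 128, 4))/gib) // 1.00 GiB
    fmt.Printf("FP16 KV cache: %.2f GiB\n", float64(kvCacheBytes(32, 4096, 8, 128, 2))/gib) // 0.50 GiB
}
```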

### Paged KV Cache

For serving multiple concurrent requests, Zerfoo supports paged KV caching at the generator level. Paged KV allocates memory in blocks from a shared pool rather than pre-allocating the full sequence length per request. This significantly improves memory utilization when serving requests of varying lengths.
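
A minimal sketch of the block-pool idea follows. It is purely illustrative (the block size and types are invented for this tutorial): sequences claim fixed-size blocks from a shared free list as they grow, so a short request never pins memory for context it will not use.

```go
package main

import "fmt"

const blockTokens = 16 // tokens covered by one KV block (illustrative)

// blockPool hands out block IDs from a shared free list.
type blockPool struct{ free []int }

func newBlockPool(n int) *blockPool {
    p := &blockPool{free: make([]int, n)}
    for i := range p.free {
        p.free[i] = i
    }
    return p
}

func (p *blockPool) alloc() (int, bool) {
    if len(p.free) == 0 {
        return 0, false // pool exhausted: the request must wait
    }
    id := p.free[len(p.free)-1]
    p.free = p.free[:len(p.free)-1]
    return id, true
}

// sequence tracks which blocks hold its KV entries and grows lazily.
type sequence struct{ blocks []int }

func (s *sequence) ensure(p *blockPool, tokens int) bool {
    need := (tokens + blockTokens - 1) / blockTokens
    for len(s.blocks) < need {
        id, ok := p.alloc()
        if !ok {
            return false
        }
        s.blocks = append(s.blocks, id)
    }
    return true
}

func main() {
    pool := newBlockPool(64)
    var short, long sequence
    short.ensure(pool, 10)  // 1 block
    long.ensure(pool, 1000) // 63 blocks
    fmt.Printf("short=%d blocks, long=%d blocks, %d blocks free\n",
        len(short.blocks), len(long.blocks), len(pool.free))
}
```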

### CUDA Graph Capture

On CUDA devices, Zerfoo captures the decode step as a CUDA graph after the first execution. Subsequent decode steps replay the captured graph, eliminating CPU-side kernel launch overhead. This is why sessions are pooled in `inference.Model` -- reusing sessions preserves GPU memory addresses required for graph replay.

## Batch Generation

When you have multiple prompts to process, batch generation is more efficient than sequential calls:

```go
prompts := []string{
    "Summarize quantum computing in one sentence.",
    "What is the capital of Japan?",
    "Explain REST APIs briefly.",
}

results, err := model.GenerateBatch(ctx, prompts,
    inference.WithTemperature(0.5),
    inference.WithMaxTokens(64),
)
for i, r := range results {
    fmt.Printf("Prompt %d: %s\n", i+1, r)
}
```

`GenerateBatch` processes prompts concurrently using the session pool, taking advantage of GPU parallelism when available.

## Next Steps

- [Running the OpenAI-Compatible API Server](/docs/api/) -- serve your model over HTTP.
