---
title: Text Generation
weight: 2
bookToc: true
---

# Text Generation Deep Dive

This tutorial explores how Zerfoo generates text: sampling strategies, streaming responses token by token, KV cache behavior, and batch generation for throughput.

## How Autoregressive Generation Works

Transformer models generate text one token at a time. At each step, the model computes a probability distribution over the vocabulary (logits), a token is selected, and it becomes part of the input for the next step. The `generate` package implements this loop with configurable sampling, stopping conditions, and KV caching.

When you call `model.Generate`, this is what happens internally (a simplified sketch of the loop follows the list):

1. The prompt is tokenized using the BPE tokenizer embedded in the GGUF file.
2. A `SamplingConfig` is built from the options you pass.
3. The prompt tokens run through the computation graph in a single forward pass (prefill).
4. The KV cache stores key/value activations so they are not recomputed on subsequent steps.
5. One token is generated per step (decode) until a stop condition is met.
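
A stripped-down version of that loop looks like this. This is a conceptual sketch: `forward`, `sample`, `cache`, and `eosToken` are illustrative placeholders, not Zerfoo's internal names.

```go
// Conceptual decode loop (illustrative names, not Zerfoo internals).
tokens := tokenizer.Encode(prompt)
logits := forward(tokens, cache) // prefill: one pass over the whole prompt

for len(tokens) < maxTokens {
    next := sample(logits, config) // temperature, top-k, top-p, penalties
    if next == eosToken {
        break // stop condition: end-of-sequence token
    }
    tokens = append(tokens, next)
    logits = forward([]int{next}, cache) // decode: one new token, cached attention
}
```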

## Sampling Strategies

Sampling controls how the next token is chosen from the logit distribution. Zerfoo supports several strategies that can be combined.

### Temperature

Temperature scales the logits before converting them to probabilities. Lower values make the distribution sharper (more deterministic), higher values make it flatter (more creative).

```go
// Deterministic output (greedy decoding).
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0),
)

// Creative output.
result, _ = model.Generate(ctx, prompt,
    inference.WithTemperature(1.2),
)
```

A temperature of 0 selects the highest-probability token every time (greedy). A temperature of 1.0 samples proportionally to the probabilities. Values above 1.0 increase randomness.
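
To make the math concrete, here is how temperature scaling is typically applied before softmax. This is a minimal illustration, not Zerfoo's internal implementation; a temperature of exactly 0 is special-cased to a greedy argmax rather than passed through a function like this.

```go
import "math"

// softmaxWithTemperature divides each logit by the temperature and
// then applies a numerically stable softmax. temp < 1 sharpens the
// distribution; temp > 1 flattens it.
func softmaxWithTemperature(logits []float64, temp float64) []float64 {
    maxLogit := math.Inf(-1)
    for _, l := range logits {
        if l > maxLogit {
            maxLogit = l
        }
    }
    probs := make([]float64, len(logits))
    sum := 0.0
    for i, l := range logits {
        probs[i] = math.Exp((l - maxLogit) / temp) // subtract max for stability
        sum += probs[i]
    }
    for i := range probs {
        probs[i] /= sum
    }
    return probs
}
```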

### Top-K Sampling

Top-K restricts the candidate set to the K most probable tokens before sampling. This prevents the model from selecting very unlikely tokens.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopK(40),
)
```

When `TopK` is 0 (the default), all tokens are candidates.
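
The filtering step itself is simple. A minimal sketch (not Zerfoo's internal code) masks everything below the K-th highest logit so softmax assigns it zero probability:

```go
import (
    "math"
    "sort"
)

// topKFilter keeps the k highest logits and sets the rest to -Inf.
// Ties at the threshold may keep slightly more than k tokens.
func topKFilter(logits []float64, k int) {
    if k <= 0 || k >= len(logits) {
        return // k == 0 means "no filtering"
    }
    sorted := append([]float64(nil), logits...)
    sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))
    threshold := sorted[k-1]
    for i, l := range logits {
        if l < threshold {
            logits[i] = math.Inf(-1)
        }
    }
}
```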

### Top-P (Nucleus) Sampling

Top-P keeps the smallest set of tokens whose cumulative probability exceeds P. This adapts the candidate set size dynamically -- confident predictions use fewer candidates, uncertain predictions use more.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.8),
    inference.WithTopP(0.9),
)
```

When `TopP` is 1.0 (the default), no filtering is applied. Top-K and Top-P can be combined: Top-K filters first, then Top-P filters the remainder.
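
A sketch of the nucleus filter itself (again illustrative, not Zerfoo's internal code): rank tokens by probability, keep them until the running total exceeds P, and drop the rest.

```go
import "sort"

// topPFilter zeroes out the low-probability tail. It keeps the
// smallest prefix of the ranked tokens whose cumulative probability
// exceeds p; the caller renormalizes before sampling.
func topPFilter(probs []float64, p float64) {
    idx := make([]int, len(probs))
    for i := range idx {
        idx[i] = i
    }
    sort.Slice(idx, func(a, b int) bool { return probs[idx[a]] > probs[idx[b]] })
    cum := 0.0
    for rank, i := range idx {
        cum += probs[i]
        if cum > p {
            for _, j := range idx[rank+1:] {
                probs[j] = 0 // outside the nucleus
            }
            break
        }
    }
}
```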

### Repetition Penalty

Repetition penalty reduces the probability of tokens that have already appeared in the output. A value of 1.0 disables the penalty; values above 1.0 penalize repetition.

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithRepetitionPenalty(1.1),
)
```
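
One common formulation of the penalty (the CTRL-style rule; Zerfoo's exact implementation may differ) divides positive logits by the penalty and multiplies negative ones, so already-seen tokens always move toward "less likely":

```go
// applyRepetitionPenalty dampens the logits of already-generated
// token IDs. penalty > 1 discourages repetition; 1.0 is a no-op.
func applyRepetitionPenalty(logits []float64, seen []int, penalty float64) {
    for _, id := range seen {
        if logits[id] > 0 {
            logits[id] /= penalty
        } else {
            logits[id] *= penalty
        }
    }
}
```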

### Recommended Defaults

For most use cases, a good starting point is:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithTemperature(0.7),
    inference.WithTopP(0.9),
    inference.WithMaxTokens(256),
)
```

## Streaming Responses

For interactive applications, you often want to display tokens as they are generated rather than waiting for the full response. The `GenerateStream` method accepts a callback that receives each token:

```go
err := model.GenerateStream(ctx, "Tell me a story.",
    func(token string) bool {
        fmt.Print(token)
        // Return true to continue, false to stop early.
        return true
    },
    inference.WithTemperature(0.8),
    inference.WithMaxTokens(512),
)
```

The callback function implements the `generate.TokenStream` type. It receives each decoded token string and returns a boolean: `true` to continue generation, `false` to stop immediately.
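
The callback also gives you early termination for free. For example, you can accumulate the streamed text while cutting off after a rough word budget (the budget logic here is illustrative):

```go
import (
    "fmt"
    "strings"
)

var sb strings.Builder
words := 0
err := model.GenerateStream(ctx, "Explain goroutines briefly.",
    func(token string) bool {
        sb.WriteString(token)
        words += strings.Count(token, " ")
        return words < 100 // stop early once roughly 100 words have streamed
    },
    inference.WithMaxTokens(512),
)
if err != nil {
    fmt.Println("generation failed:", err)
}
fmt.Println(sb.String())
```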

## Stop Conditions

Generation stops when any of these conditions is met:

1. The end-of-sequence (EOS) token is generated.
2. `MaxNewTokens` is reached.
3. A stop string is found in the output.
4. The streaming callback returns `false`.
5. The context is cancelled.

You can set custom stop strings:

```go
result, _ := model.Generate(ctx, prompt,
    inference.WithMaxTokens(512),
    inference.WithStopStrings("\n\n", "END"),
)
```
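
Because generation honors context cancellation (condition 5), a wall-clock budget needs no special option; a standard `context.WithTimeout` does it:

```go
import (
    "context"
    "fmt"
    "time"
)

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

// The 10-second deadline acts as one more stop condition,
// alongside MaxTokens.
result, err := model.Generate(ctx, prompt,
    inference.WithMaxTokens(1024),
)
if err != nil {
    fmt.Println("generation ended with error:", err)
}
fmt.Println(result)
```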

## Constrained Decoding with Grammars

Zerfoo supports grammar-constrained generation using the `grammar` package. At each sampling step, a token mask restricts output to tokens valid according to the grammar:

```go
import (
    "log"

    "github.com/zerfoo/zerfoo/generate/grammar"
)

g, err := grammar.Parse(`root ::= "{" ws "\"name\"" ws ":" ws string "}" ...`)
if err != nil {
    log.Fatalf("invalid grammar: %v", err)
}
result, _ := model.Generate(ctx, "Generate a JSON object with a name field.",
    inference.WithGrammar(g),
    inference.WithMaxTokens(128),
)
```

This is useful for generating structured output like JSON, SQL, or code that must conform to a specific syntax.

## KV Cache and Performance

The KV (Key-Value) cache is the single most important optimization for autoregressive generation. Without it, every decode step would reprocess the entire sequence from scratch.

### How It Works

During the prefill phase, the model computes attention keys and values for all prompt tokens and stores them in the KV cache. During decode, only the new token is processed -- its keys and values are appended to the cache, and attention is computed against all cached entries.
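
Conceptually, the cache is a pair of per-layer tensors that grow by one position per decode step. A simplified sketch, not Zerfoo's actual data structure:

```go
// layerKV caches attention keys and values for one transformer layer.
// Prefill appends one entry per prompt token; each decode step appends
// exactly one more, and attention reads the whole cache.
type layerKV struct {
    keys   [][]float32 // one key vector per cached position
    values [][]float32 // one value vector per cached position
}

func (c *layerKV) append(k, v []float32) {
    c.keys = append(c.keys, k)
    c.values = append(c.values, v)
}
```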

### Memory Considerations

KV cache memory grows linearly with sequence length and model size. For a 7B model with 32 layers and a context length of 4096, the KV cache can use 1-2 GB of memory in FP32. You can halve this with FP16 KV storage:

```go
model, err := inference.LoadFile("model.gguf",
    inference.WithDevice("cuda"),
    inference.WithKVDtype("fp16"),
)
```
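
You can estimate the cache size from the model shape: two tensors (K and V) per layer, each holding one `kvHeads × headDim` vector per cached position. The head counts below are illustrative and vary per model:

```go
import "fmt"

layers, seqLen := 32, 4096
kvHeads, headDim := 8, 128 // grouped-query attention, illustrative values
bytesPerElem := 4          // FP32; use 2 for FP16

kvBytes := 2 * layers * seqLen * kvHeads * headDim * bytesPerElem
fmt.Printf("KV cache: %.2f GiB\n", float64(kvBytes)/(1<<30))
// 2 * 32 * 4096 * 8 * 128 * 4 bytes = 1.00 GiB in FP32, half that in FP16.
```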

### Paged KV Cache

For serving multiple concurrent requests, Zerfoo supports paged KV caching at the generator level. Paged KV allocates memory in blocks from a shared pool rather than pre-allocating the full sequence length per request. This significantly improves memory utilization when serving requests of varying lengths.
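
The approach mirrors paged attention in other serving stacks: a shared arena is divided into fixed-size blocks, and requests borrow blocks as their sequences grow. A conceptual sketch of the allocator, not Zerfoo's API:

```go
// blockPool hands out fixed-size KV blocks from a shared arena, so a
// request holds memory proportional to what it has actually generated,
// not to its maximum possible sequence length.
type blockPool struct {
    free []int // indices of unused blocks in the arena
}

func (p *blockPool) alloc() (int, bool) {
    if len(p.free) == 0 {
        return 0, false // pool exhausted; caller waits or evicts
    }
    id := p.free[len(p.free)-1]
    p.free = p.free[:len(p.free)-1]
    return id, true
}

func (p *blockPool) release(blocks []int) {
    p.free = append(p.free, blocks...) // finished requests return their blocks
}
```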

### CUDA Graph Capture

On CUDA devices, Zerfoo captures the decode step as a CUDA graph after the first execution. Subsequent decode steps replay the captured graph, eliminating CPU-side kernel launch overhead. This is why sessions are pooled in `inference.Model` -- reusing sessions preserves the GPU memory addresses required for graph replay.

## Batch Generation

When you have multiple prompts to process, batch generation is more efficient than sequential calls:

```go
prompts := []string{
    "Summarize quantum computing in one sentence.",
    "What is the capital of Japan?",
    "Explain REST APIs briefly.",
}

results, err := model.GenerateBatch(ctx, prompts,
    inference.WithTemperature(0.5),
    inference.WithMaxTokens(64),
)
if err != nil {
    log.Fatal(err)
}
for i, r := range results {
    fmt.Printf("Prompt %d: %s\n", i+1, r)
}
```

`GenerateBatch` processes prompts concurrently using the session pool, taking advantage of GPU parallelism when available.

## Next Steps

- [Running the OpenAI-Compatible API Server](/docs/api/) -- serve your model over HTTP.