---
title: Your First Inference
weight: 3
bookToc: true
---

# Your First Inference

Go from zero to working LLM inference in under 5 minutes.

## Prerequisites

- **Go 1.25 or later** -- [download Go](https://go.dev/dl/)
- A machine with at least 4 GB of RAM (8 GB recommended for 7B models)
- Optional: an NVIDIA GPU with CUDA drivers for hardware-accelerated inference

Verify your Go installation:

```bash
go version
# go version go1.25.0 linux/amd64
```

## Install the CLI

```bash
go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest
```

Verify:

```bash
zerfoo version
```
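
If the `zerfoo` command is not found, make sure Go's install directory is on your `PATH` (a standard `go install` detail, not specific to Zerfoo):

```bash
export PATH="$(go env GOPATH)/bin:$PATH"
```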

Zerfoo builds with zero CGo by default. GPU acceleration is loaded dynamically at runtime, so you do not need CUDA headers or build tags to compile.
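
Because the build has no CGo dependency, cross-compiling an app that embeds Zerfoo is a plain Go build; a minimal sketch (the target platform is chosen arbitrarily):

```bash
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o my-llm-app .
```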

## Download a Model

Zerfoo uses the GGUF model format -- the same format used by llama.cpp. Pull a small quantized model to get started:

```bash
zerfoo pull gemma-3-1b-q4
```

This downloads the GGUF file to `~/.cache/zerfoo`. You can also pull by full HuggingFace repo ID:

```bash
zerfoo pull meta-llama/Llama-3.2-1B-Instruct-GGUF
```

Manage cached models:

```bash
zerfoo list              # show cached models
zerfoo rm gemma-3-1b-q4  # remove a model
```

### Model aliases

Zerfoo ships with built-in aliases for popular models:

| Alias | HuggingFace Repo |
|-------|------------------|
| `gemma-3-1b-q4` | `google/gemma-3-1b-it-qat-q4_0-gguf` |
| `llama-3-1b-q4` | `meta-llama/Llama-3.2-1B-Instruct-GGUF` |
| `llama-3-8b-q4` | `meta-llama/Llama-3.1-8B-Instruct-GGUF` |
| `mistral-7b-q4` | `mistralai/Mistral-7B-Instruct-v0.3-GGUF` |
| `qwen-2.5-7b-q4` | `Qwen/Qwen2.5-7B-Instruct-GGUF` |

You can also pass any HuggingFace repo ID directly, or a local file path, as shown below.
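
For example (the path below is hypothetical):

```bash
zerfoo run ./models/example.gguf
```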

## CLI Usage

### Interactive chat

Start a chat session with `zerfoo run`:

```bash
zerfoo run gemma-3-1b-q4
```

```
Model loaded. Type your message (Ctrl-D to quit).

> What is the capital of France?
The capital of France is Paris.
>
```

### Single prompt

Run a one-off prompt with `predict`:

```bash
zerfoo predict --model gemma-3-1b-q4 --prompt "Explain what a tensor is in one paragraph."
```

### Sampling parameters

Both `run` and `predict` accept these flags:

| Flag | Description | Default |
|------|-------------|---------|
| `--temperature` | Sampling temperature | 1.0 |
| `--top-k` | Top-K sampling | disabled |
| `--top-p` | Nucleus sampling | 1.0 |
| `--repetition-penalty` | Penalize repeated tokens | 1.0 |
| `--max-tokens` | Maximum tokens to generate | 256 |
| `--system` | System prompt | none |
| `--device` | Device (`cpu`, `cuda`) | `cpu` |

Example:

```bash
zerfoo predict \
  --model gemma-3-1b-q4 \
  --prompt "Write a haiku about Go." \
  --temperature 0.7 \
  --max-tokens 64
```

## Inference from Go Code

Zerfoo is designed to be embedded as a library. Create a new Go project:

```bash
mkdir my-llm-app && cd my-llm-app
go mod init my-llm-app
go get github.com/zerfoo/zerfoo@latest
```

Write `main.go`:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/zerfoo/zerfoo/inference"
)

func main() {
	// Load a quantized Gemma 3 1B model.
	// On first run, Zerfoo pulls the GGUF file from HuggingFace and caches it.
	mdl, err := inference.Load("gemma-3-1b-q4")
	if err != nil {
		log.Fatal(err)
	}
	defer mdl.Close()

	// Generate text from a prompt.
	result, err := mdl.Generate(
		context.Background(),
		"Explain what a tensor is in one paragraph.",
		inference.WithMaxTokens(128),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}
```

Run it:

```bash
go run main.go
```

### Chat completion

For multi-turn conversations, use the `Chat` method:

```go
resp, err := mdl.Chat(context.Background(),
	[]inference.Message{
		{Role: "system", Content: "You are a helpful assistant."},
		{Role: "user", Content: "What is the capital of France?"},
	},
	inference.WithTemperature(0.5),
	inference.WithMaxTokens(64),
)
if err != nil {
	log.Fatal(err)
}
fmt.Println(resp.Content)
fmt.Printf("Tokens used: %d (prompt: %d, completion: %d)\n",
	resp.TokensUsed, resp.PromptTokens, resp.CompletionTokens)
```
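
To carry context across turns, keep appending to the message slice; a minimal sketch using only the `Chat` API shown above (the `"assistant"` role name is an assumption):

```go
history := []inference.Message{
	{Role: "system", Content: "You are a helpful assistant."},
}
for _, q := range []string{"What is the capital of France?", "And of Japan?"} {
	history = append(history, inference.Message{Role: "user", Content: q})
	resp, err := mdl.Chat(context.Background(), history, inference.WithMaxTokens(64))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Content)
	// Feed the reply back so the next turn sees the full conversation.
	history = append(history, inference.Message{Role: "assistant", Content: resp.Content})
}
```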

### GPU acceleration

Pass `WithDevice` to run on a CUDA GPU:

```go
mdl, err := inference.LoadFile("model.gguf",
	inference.WithDevice("cuda"),
)
```

Or from the CLI:

```bash
zerfoo run gemma-3-1b-q4 --device cuda
```

No build tags are needed. Zerfoo discovers CUDA libraries at runtime. If CUDA is not available, the call returns an error so you can fall back to CPU gracefully, as sketched below.
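
A minimal sketch of that fallback, using only the calls shown above:

```go
mdl, err := inference.LoadFile("model.gguf", inference.WithDevice("cuda"))
if err != nil {
	// CUDA missing or the GPU load failed: retry on the CPU backend.
	log.Printf("cuda unavailable, falling back to cpu: %v", err)
	mdl, err = inference.LoadFile("model.gguf", inference.WithDevice("cpu"))
	if err != nil {
		log.Fatal(err)
	}
}
defer mdl.Close()
```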

## Serve an OpenAI-Compatible API

Start a server with `zerfoo serve`:

```bash
zerfoo serve gemma-3-1b-q4 --port 8080
```

Send a request with `curl`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

Enable streaming with SSE:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Any OpenAI-compatible client library works -- just point it at `localhost:8080`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
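
Streaming works through the same client; a minimal sketch, assuming the server's SSE chunks follow the standard OpenAI delta format:

```python
stream = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Each SSE chunk carries an incremental delta of the reply.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```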

### Available endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/chat/completions` | Chat completion |
| POST | `/v1/completions` | Text completion |
| POST | `/v1/embeddings` | Text embeddings |
| GET | `/v1/models` | List loaded models |
| GET | `/metrics` | Prometheus metrics |
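
Quick checks with `curl` (the embeddings request body assumes the standard OpenAI schema):

```bash
# List the models the server has loaded
curl http://localhost:8080/v1/models

# Scrape Prometheus metrics
curl http://localhost:8080/metrics

# Embed a piece of text
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3-1b-q4", "input": "Hello!"}'
```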

## Next Steps

- [Installation]({{< relref "/docs/getting-started/installation" >}}) -- detailed installation and platform support
