
Commit 8d6055d

docs(tutorials): add API server and tabular/timeseries tutorials
1 parent 8888789 commit 8d6055d

2 files changed

Lines changed: 492 additions & 0 deletions


Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
---
title: API Server
weight: 3
bookToc: true
---

# Running the OpenAI-Compatible API Server

This tutorial shows how to serve a model over HTTP using the Zerfoo API server, which implements the OpenAI API specification. Any client library or tool that works with the OpenAI API can connect to Zerfoo with a one-line base URL change.

## Starting the Server

The simplest way to start serving is with the `serve` CLI command:

```bash
zerfoo serve gemma-3-1b-q4
```

This downloads the model (if not already cached), loads it, and starts an HTTP server on `localhost:8080`. You can customize the host and port:

```bash
zerfoo serve gemma-3-1b-q4 --host 0.0.0.0 --port 3000
```

For GPU inference:

```bash
zerfoo serve gemma-3-1b-q4 --device cuda
```

## Available Endpoints

The server exposes these OpenAI-compatible endpoints:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/chat/completions` | Chat completion (multi-turn conversation) |
| POST | `/v1/completions` | Text completion (single prompt) |
| POST | `/v1/embeddings` | Text embeddings |
| POST | `/v1/audio/transcriptions` | Audio transcription (when a transcriber is configured) |
| GET | `/v1/models` | List loaded models |
| GET | `/v1/models/{id}` | Get model information |
| DELETE | `/v1/models/{id}` | Unload a model |
| GET | `/metrics` | Prometheus metrics |
| GET | `/openapi.yaml` | OpenAPI specification |

## Making Requests with curl

### Chat Completion

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 64
  }'
```

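
The response is a standard chat completion object. Abbreviated, it looks roughly like this (ids and token counts are illustrative, not actual server output):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gemma-3-1b-q4",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 8, "total_tokens": 29}
}
```
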
### Streaming

Add `"stream": true` to receive server-sent events (SSE) as tokens are generated:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "user", "content": "Write a poem about Go."}
    ],
    "stream": true,
    "max_tokens": 128
  }'
```

Each SSE event contains a JSON chunk with the delta token. The stream ends with `data: [DONE]`.
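
For illustration, one event might look like this (field names follow the OpenAI streaming schema; ids and any additional fields will vary):

```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gemma-3-1b-q4","choices":[{"index":0,"delta":{"content":" Go"},"finish_reason":null}]}

data: [DONE]
```
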

### Text Completion

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "prompt": "The Go programming language is",
    "max_tokens": 64,
    "temperature": 0.5
  }'
```
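
### Embeddings

The `/v1/embeddings` endpoint from the table above can be called the same way. A minimal sketch, assuming the request body follows the OpenAI embeddings schema (`model` plus an `input` string or array of strings):

```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "input": "The Go programming language"
  }'
```
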

### List Models

```bash
curl http://localhost:8080/v1/models
```
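
### Model Info and Unloading

The per-model endpoints work the same way, assuming the `{id}` path segment is the model name used at load time:

```bash
# Fetch metadata for one model
curl http://localhost:8080/v1/models/gemma-3-1b-q4

# Unload the model; subsequent inference requests will return an error
curl -X DELETE http://localhost:8080/v1/models/gemma-3-1b-q4
```
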

## Using with the OpenAI Python Client

Any OpenAI-compatible client library works. Here is an example with the official Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # Zerfoo does not require an API key by default
)

response = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in ML."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

For streaming:

```python
stream = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

## Starting the Server from Go Code

You can embed the server directly in your Go application:

```go
package main

import (
	"log"
	"net/http"

	"github.com/zerfoo/zerfoo/inference"
	"github.com/zerfoo/zerfoo/serve"
)

func main() {
	model, err := inference.LoadFile("gemma-3-1b-it-q4_0.gguf",
		inference.WithDevice("cuda"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer model.Close()

	srv := serve.NewServer(model)

	log.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", srv.Handler()))
}
```

### Server Options

The `serve.NewServer` function accepts options for logging, metrics, batch scheduling, speculative decoding, and multi-GPU distribution:

```go
srv := serve.NewServer(model,
	serve.WithLogger(logger),
	serve.WithMetrics(metricsCollector),
	serve.WithBatchScheduler(batchScheduler),
	serve.WithDraftModel(draftModel),
	serve.WithGPUs([]int{0, 1}),
)
```

**Speculative decoding**: When a draft model is set with `WithDraftModel`, the server uses speculative decoding for all completion requests. A smaller, faster model proposes tokens and the target model verifies them in a single batched forward pass, improving decode throughput.
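
A minimal wiring sketch, using only the calls shown in this tutorial; the draft model file name is a hypothetical placeholder for any smaller compatible model:

```go
// Load the target model as in the embedding example above.
target, err := inference.LoadFile("gemma-3-1b-it-q4_0.gguf")
if err != nil {
	log.Fatal(err)
}
defer target.Close()

// Hypothetical smaller draft model; substitute any compatible checkpoint.
draft, err := inference.LoadFile("gemma-3-270m-q4_0.gguf")
if err != nil {
	log.Fatal(err)
}
defer draft.Close()

// All completion requests now use speculative decoding.
srv := serve.NewServer(target, serve.WithDraftModel(draft))
```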

**Batch scheduling**: When a `BatchScheduler` is attached with `WithBatchScheduler`, incoming non-streaming requests are grouped into batches for higher throughput under load.

## Prometheus Metrics

The server exposes a `/metrics` endpoint in Prometheus format. Key metrics include:

- Request count and latency per endpoint
- Token generation rate (tokens per second)
- Speculative decoding acceptance rate (when enabled)

Point your Prometheus scrape config at `http://localhost:8080/metrics` to collect these metrics.
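
A minimal scrape job for a standard Prometheus setup might look like this (job name and interval are illustrative; `/metrics` is Prometheus's default path, so no `metrics_path` override is needed):

```yaml
scrape_configs:
  - job_name: "zerfoo"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
```
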

## Monitoring and Health

The `/v1/models` endpoint serves as a lightweight health check. If the model is loaded and ready, it returns model metadata. After a `DELETE /v1/models/{id}` call, the model is unloaded and subsequent inference requests return an error.
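
For scripted checks, curl's `-f` flag turns HTTP error statuses into a non-zero exit code; a sketch:

```bash
# Exits non-zero if the server is down or returns an error status
curl -fsS http://localhost:8080/v1/models > /dev/null && echo healthy || echo unhealthy
```
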

The server includes built-in recovery middleware that catches panics during request handling and returns a 500 response instead of crashing the process.
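
Because `srv.Handler()` returns a standard `http.Handler` (see the embedding example above), you can stack your own middleware on top of the built-in recovery using plain `net/http` composition; nothing in this sketch is Zerfoo-specific:

```go
// logRequests wraps any http.Handler with simple request logging.
func logRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("%s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

// Usage:
//   log.Fatal(http.ListenAndServe(":8080", logRequests(srv.Handler())))
```
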

## Using with the OpenAI Go Client

You can also use any Go HTTP client. Here is an example using the standard library:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	body := map[string]interface{}{
		"model": "gemma-3-1b-q4",
		"messages": []map[string]string{
			{"role": "user", "content": "What is Go?"},
		},
		"max_tokens": 64,
	}
	data, _ := json.Marshal(body)

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```

This works because the server speaks the same JSON schema as the OpenAI API. Any HTTP client in any language can send requests to Zerfoo without a dedicated SDK.

## What's Next

- [Tabular and Time-Series ML](/docs/tutorials/tabular-timeseries/) -- use Zerfoo for structured data prediction and forecasting.
