---
title: API Server
weight: 3
bookToc: true
---

# Running the OpenAI-Compatible API Server

This tutorial shows how to serve a model over HTTP using the Zerfoo API server, which implements the OpenAI API specification. Any client library or tool that works with the OpenAI API can connect to Zerfoo with a one-line base URL change.

## Starting the Server

The simplest way to start serving is with the `serve` CLI command:

```bash
zerfoo serve gemma-3-1b-q4
```

This downloads the model (if not already cached), loads it, and starts an HTTP server on `localhost:8080`. You can customize the host and port:

```bash
zerfoo serve gemma-3-1b-q4 --host 0.0.0.0 --port 3000
```

For GPU inference:

```bash
zerfoo serve gemma-3-1b-q4 --device cuda
```

## Available Endpoints

The server exposes these OpenAI-compatible endpoints:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/chat/completions` | Chat completion (multi-turn conversation) |
| POST | `/v1/completions` | Text completion (single prompt) |
| POST | `/v1/embeddings` | Text embeddings |
| POST | `/v1/audio/transcriptions` | Audio transcription (when a transcriber is configured) |
| GET | `/v1/models` | List loaded models |
| GET | `/v1/models/{id}` | Get model information |
| DELETE | `/v1/models/{id}` | Unload a model |
| GET | `/metrics` | Prometheus metrics |
| GET | `/openapi.yaml` | OpenAPI specification |

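The OpenAPI document is served by the server itself, so fetching it is a quick way to explore the exact request and response schemas:

```bash
curl http://localhost:8080/openapi.yaml
```
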
## Making Requests with curl

### Chat Completion

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 64
  }'
```

### Streaming

Add `"stream": true` to receive server-sent events (SSE) as tokens are generated:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "messages": [
      {"role": "user", "content": "Write a poem about Go."}
    ],
    "stream": true,
    "max_tokens": 128
  }'
```

Each SSE event is a `data:` line containing a JSON chunk; the newly generated text is in `choices[0].delta.content`. The stream ends with `data: [DONE]`. If curl buffers the output, add the `-N` flag to print tokens as they arrive.

### Text Completion

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "prompt": "The Go programming language is",
    "max_tokens": 64,
    "temperature": 0.5
  }'
```

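### Embeddings

The `/v1/embeddings` endpoint takes the standard OpenAI request shape. A minimal sketch, assuming the loaded model can produce embeddings and that the `input` field follows the OpenAI schema:

```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-q4",
    "input": "The Go programming language"
  }'
```
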
### List Models

```bash
curl http://localhost:8080/v1/models
```

## Using with the OpenAI Python Client

Any OpenAI-compatible client library works. Here is an example with the official Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # Zerfoo does not require an API key by default
)

response = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in ML."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

For streaming:

```python
stream = client.chat.completions.create(
    model="gemma-3-1b-q4",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

## Starting the Server from Go Code

You can embed the server directly in your Go application:

```go
package main

import (
    "log"
    "net/http"

    "github.com/zerfoo/zerfoo/inference"
    "github.com/zerfoo/zerfoo/serve"
)

func main() {
    model, err := inference.LoadFile("gemma-3-1b-it-q4_0.gguf",
        inference.WithDevice("cuda"),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    srv := serve.NewServer(model)

    log.Println("Listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", srv.Handler()))
}
```

### Server Options

The `serve.NewServer` function accepts options for logging, metrics, batch scheduling, speculative decoding, and multi-GPU distribution:

```go
srv := serve.NewServer(model,
    serve.WithLogger(logger),
    serve.WithMetrics(metricsCollector),
    serve.WithBatchScheduler(batchScheduler),
    serve.WithDraftModel(draftModel),
    serve.WithGPUs([]int{0, 1}),
)
```

**Speculative decoding**: When a draft model is set with `WithDraftModel`, the server uses speculative decoding for all completion requests. A smaller, faster model proposes tokens and the target model verifies them in a single batched forward pass, improving decode throughput.

**Batch scheduling**: When a `BatchScheduler` is attached with `WithBatchScheduler`, incoming non-streaming requests are grouped into batches for higher throughput under load.

## Prometheus Metrics

The server exposes a `/metrics` endpoint in Prometheus format. Key metrics include:

- Request count and latency per endpoint
- Token generation rate (tokens per second)
- Speculative decoding acceptance rate (when enabled)

Point your Prometheus scrape config at `http://localhost:8080/metrics` to collect these metrics.

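To spot-check the endpoint before wiring up Prometheus, you can request it directly:

```bash
curl http://localhost:8080/metrics
```
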
## Monitoring and Health

The `/v1/models` endpoint serves as a lightweight health check. If the model is loaded and ready, it returns model metadata. After a `DELETE /v1/models/{id}` call, the model is unloaded and subsequent inference requests return an error.

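For example, assuming the model id matches the name used when serving:

```bash
# Readiness check: returns metadata while the model is loaded
curl http://localhost:8080/v1/models/gemma-3-1b-q4

# Unload the model; subsequent inference requests will fail until it is reloaded
curl -X DELETE http://localhost:8080/v1/models/gemma-3-1b-q4
```
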
The server includes built-in recovery middleware that catches panics during request handling and returns a 500 response instead of crashing the process.

## Making Requests from Go

You can also call the server from Go with any HTTP client. Here is an example using only the standard library:

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    body := map[string]interface{}{
        "model": "gemma-3-1b-q4",
        "messages": []map[string]string{
            {"role": "user", "content": "What is Go?"},
        },
        "max_tokens": 64,
    }
    data, _ := json.Marshal(body)

    resp, err := http.Post("http://localhost:8080/v1/chat/completions",
        "application/json", bytes.NewReader(data))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}
```

This works because the server speaks the same JSON schema as the OpenAI API. Any HTTP client in any language can send requests to Zerfoo without a dedicated SDK.

## What's Next

- [Tabular and Time-Series ML](/docs/tutorials/tabular-timeseries/) -- use Zerfoo for structured data prediction and forecasting.