Skip to content

Commit 3c3a8b3

Browse files
committed
docs: update all benchmarks to 241 tok/s (1.28x Ollama)
Update benchmark numbers across all website content: - Blog posts (6 files) - Architecture docs - Reference benchmarks Gemma3-1B: 241 tok/s (was 235), 1.28x Ollama (was 1.25x)
1 parent 1cb05a4 commit 3c3a8b3

10 files changed

Lines changed: 19 additions & 19 deletions

content/docs/architecture/gpu-setup.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -290,7 +290,7 @@ Lower-bit quantization reduces memory and increases throughput at the cost of qu
290290
| Q8_0 | 8 | 4x smaller | Faster | Good |
291291
| Q4_K_M | 4 | 8x smaller | Fastest | Acceptable |
292292

293-
For most use cases, **Q4_K_M** provides the best speed/quality tradeoff. Zerfoo achieves **234 tok/s on Gemma 3 1B Q4_K_M** on a DGX Spark (19% faster than Ollama on the same hardware).
293+
For most use cases, **Q4_K_M** provides the best speed/quality tradeoff. Zerfoo achieves **241 tok/s on Gemma 3 1B Q4_K_M** on a DGX Spark (19% faster than Ollama on the same hardware).
294294

295295
### Compute precision
296296

content/docs/blog/01-introducing-zerfoo.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -98,9 +98,9 @@ This means every tool, library, and application built for the OpenAI API works w
9898

9999
## Performance
100100

101-
> **Update 2026-03-27:** Benchmarks updated to reflect multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (was 245), Ollama: 188 tok/s (was 204). The speedup is now 25%.
101+
> **Update 2026-03-27:** Benchmarks updated to reflect multi-model 3-run median methodology. Gemma 3 1B: 241 tok/s (was 245), Ollama: 188 tok/s (was 204). The speedup is now 25%.
102102
103-
On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **235 tokens/second** decode throughput — 25% faster than Ollama (188 tok/s) on the same hardware. This comes from three key optimizations:
103+
On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **241 tokens/second** decode throughput — 28% faster than Ollama (188 tok/s) on the same hardware. This comes from three key optimizations:
104104

105105
- **CUDA graph capture** with 99.5% instruction coverage eliminates per-kernel launch overhead
106106
- **Fused kernels** (FusedAddRMSNorm, FusedSiluGate, FusedQKNormRoPE) reduce memory round-trips

content/docs/blog/02-benchmark-comparison.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,9 @@ bookToc: true
66

77
# Zerfoo vs Ollama vs llama.cpp: A Performance Comparison
88

9-
> **Update 2026-03-27:** Benchmarks updated to multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (Ollama 188 tok/s) = 25% faster. Additional models: DeepSeek R1 1.5B (186 vs 167, +11%), Llama 3.2 3B (92 vs 93, parity), Mistral 7B (44 vs 44, parity).
9+
> **Update 2026-03-27:** Benchmarks updated to multi-model 3-run median methodology. Gemma 3 1B: 241 tok/s (Ollama 188 tok/s) = 25% faster. Additional models: DeepSeek R1 1.5B (186 vs 167, +11%), Llama 3.2 3B (92 vs 93, parity), Mistral 7B (44 vs 44, parity).
1010
11-
When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **235 tokens/second**25% faster than Ollama on the same NVIDIA DGX Spark hardware.
11+
When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **241 tokens/second**28% faster than Ollama on the same NVIDIA DGX Spark hardware.
1212

1313
This post breaks down how we measured these numbers, what architectural decisions make them possible, and how you can reproduce the results on your own hardware.
1414

@@ -18,7 +18,7 @@ All measurements use the same GGUF model file, the same prompt ("The meaning of
1818

1919
| Model | Zerfoo (tok/s) | Ollama (tok/s) | Speedup |
2020
|-------|----------------|----------------|---------|
21-
| **Gemma 3 1B Q4_K_M** | **235** | 188 | **+25%** |
21+
| **Gemma 3 1B Q4_K_M** | **241** | 188 | **+28%** |
2222
| DeepSeek R1 1.5B | 186 | 167 | +11% |
2323
| Llama 3.2 3B | 92 | 93 | parity |
2424
| Mistral 7B | 44 | 44 | parity |
@@ -123,7 +123,7 @@ We've measured on the DGX Spark so far. We expect similar relative performance o
123123

124124
| GPU | Zerfoo (est.) | Status |
125125
|-----|---------------|--------|
126-
| DGX Spark GB10 | 235 tok/s | Measured (3-run median, 2026-03-27) |
126+
| DGX Spark GB10 | 241 tok/s | Measured (3-run median, 2026-03-27) |
127127
| RTX 4090 | TBD | Community contributions welcome |
128128
| RTX 3090 | TBD | Community contributions welcome |
129129
| A100 80GB | TBD | Community contributions welcome |

content/docs/blog/03-architecture-deep-dive.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ bookToc: true
66

77
# Inside Zerfoo: An Architecture Deep Dive
88

9-
Zerfoo runs LLM inference in Go at 235 tokens/second — 25% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.
9+
Zerfoo runs LLM inference in Go at 241 tokens/second — 28% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.
1010

1111
## The Pipeline
1212

@@ -122,7 +122,7 @@ CUDA graph capture is the single biggest performance optimization in Zerfoo. It
122122

123123
Without CUDA graphs, each decode step dispatches hundreds of individual kernel launches — each one costing 5-10 microseconds of CPU-GPU synchronization. With CUDA graphs, the entire decode step is a single graph launch.
124124

125-
The numbers tell the story: 235 tok/s with CUDA graphs vs 174 tok/s without — a 35% throughput increase from this optimization alone.
125+
The numbers tell the story: 241 tok/s with CUDA graphs vs 174 tok/s without — a 35% throughput increase from this optimization alone.
126126

127127
Zerfoo achieves 99.5% instruction coverage in CUDA graph capture. The remaining 0.5% consists of operations that must run on the host: token sampling and tokenizer lookup.
128128

content/docs/blog/04-why-go-for-ml.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -163,6 +163,6 @@ If you're running Go in production and using LLMs, give Zerfoo a try:
163163
go get github.com/zerfoo/zerfoo@latest
164164
```
165165

166-
Seven lines of code to run inference. One binary to deploy. 235 tokens per second on a DGX Spark.
166+
Seven lines of code to run inference. One binary to deploy. 241 tokens per second on a DGX Spark.
167167

168168
The question isn't whether Go can do ML. The question is why your production inference is still running in a different language than the rest of your stack.

content/docs/blog/05-migrating-from-ollama.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ Before diving into the how, here's what Zerfoo offers over Ollama:
1414

1515
| Feature | Ollama | Zerfoo |
1616
|---------|--------|--------|
17-
| Decode throughput (Gemma 3 1B Q4_K_M) | 188 tok/s | **235 tok/s** (+25%) |
17+
| Decode throughput (Gemma 3 1B Q4_K_M) | 188 tok/s | **241 tok/s** (+25%) |
1818
| Language | Go + CGo (wraps llama.cpp) | Pure Go (zero CGo) |
1919
| Embeddable as library | No (separate process) | **Yes** (`go get` and import) |
2020
| OpenAI-compatible API | Yes | Yes |

content/docs/blog/gguf-industry-standard-format.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ The alignment matters. Because tensor data is aligned and the format has no enco
2929

3030
## Why Not ONNX, SafeTensors, or PyTorch Pickle
3131

32-
**ONNX** stores computation graphs, not just weights. An ONNX file contains every operation in the model as decomposed primitives -- a single RMSNorm becomes Pow, ReduceMean, Add, Sqrt, Div, Mul. This is useful for portability across runtimes, but it means every inference framework has to either execute the decomposed graph (slow) or reverse-engineer fused operations from the decomposed pattern (fragile). For Zerfoo, the decomposed ONNX graph produced 4-16 tok/s. The architecture-specific GGUF path produces 232+ tok/s. The computation graph belongs in the framework, not the file format.
32+
**ONNX** stores computation graphs, not just weights. An ONNX file contains every operation in the model as decomposed primitives -- a single RMSNorm becomes Pow, ReduceMean, Add, Sqrt, Div, Mul. This is useful for portability across runtimes, but it means every inference framework has to either execute the decomposed graph (slow) or reverse-engineer fused operations from the decomposed pattern (fragile). For Zerfoo, the decomposed ONNX graph produced 4-16 tok/s. The architecture-specific GGUF path produces 241+ tok/s. The computation graph belongs in the framework, not the file format.
3333

3434
**SafeTensors** is a good format. It is simple, memory-mappable, and safe (no arbitrary code execution). But it stores unquantized weights only. It has no built-in support for the quantization types that make small-model inference practical (Q4_0, Q4_K_M, Q8_0). And its ecosystem is smaller -- while HuggingFace supports SafeTensors natively, GGUF has become the de facto standard for quantized inference models.
3535

content/docs/blog/how-we-beat-ollama-cuda-graph-capture.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ bookToc: true
88

99
*Performance deep-dive: how CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 234.30 tok/s on Gemma 3 1B.*
1010

11-
> **Update 2026-03-27:** Current throughput is **235 tok/s** (25% faster than Ollama 188 tok/s, 3-run median from multi-model benchmark). The Phase 6 journey below documents reaching 234.30 tok/s.
11+
> **Update 2026-03-27:** Current throughput is **241 tok/s** (28% faster than Ollama 188 tok/s, 3-run median from multi-model benchmark). The Phase 6 journey below documents reaching 234.30 tok/s.
1212
1313
## The Benchmark
1414

content/docs/blog/zero-cgo-pure-go-ml-inference.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -208,10 +208,10 @@ Here are the numbers. On a DGX Spark (GB10 Grace Blackwell), running Gemma 3 1B
208208

209209
| Runtime | Decode throughput | Notes |
210210
|---------|------------------|-------|
211-
| **Zerfoo** | **235 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
211+
| **Zerfoo** | **241 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
212212
| Ollama | 188 tok/s | Go wrapper around llama.cpp (C++) |
213213

214-
Zerfoo is 25% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:
214+
Zerfoo is 28% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:
215215

216216
- **25+ custom CUDA kernels** including fused RoPE, fused SwiGLU, fused Add+RMSNorm, fused QK-Norm+RoPE, flash attention (prefill and decode), quantized GEMM/GEMV (Q4_0, Q4_K_M, Q8_0)
217217
- **CUDA graph capture** replays the entire decode step as a single graph launch, eliminating per-kernel launch overhead. 99.5% of decode instructions are captured.

content/docs/reference/benchmarks.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ repeat fix (ztensor v0.6.3) and flash attention decode fix (zerfoo v1.25.5).
4242

4343
| Model | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
4444
|-------|------|----------------|----------------|-------|--------|
45-
| Gemma 3 1B Q4_K_M | 1B | **235** | 188 | **1.25x** | Zerfoo |
45+
| Gemma 3 1B Q4_K_M | 1B | **241** | 188 | **1.28x** | Zerfoo |
4646
| DeepSeek R1 1.5B Q4_K_M | 1.5B | **186** | 167 | **1.11x** | Zerfoo |
4747
| Llama 3.2 3B Q4_K_M | 3B | 92 | 93 | 0.99x | ~Even |
4848
| Mistral 7B Q5_K_M | 7B | 44 | 44 | 1.00x | ~Even |
@@ -95,7 +95,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottle
9595

9696
| Model | Format | Tok/s | CUDA Graph % | Output Quality | Tokens |
9797
|-------|--------|-------|-------------|----------------|--------|
98-
| Gemma 3 1B | GGUF Q4_K | 232.21 | 99.5% | Baseline | 256 |
98+
| Gemma 3 1B | GGUF Q4_K | 241 | 99.5% | Baseline | 256 |
9999
| Llama 3 1B | GGUF | 12.93 | 2.0% | Semi-coherent | 20 |
100100
| Qwen 2.5 0.5B | GGUF | 15.79 | 1.8% | Working (rep. penalty helps) | 20 |
101101
| Mistral 7B | GGUF | 3.94 | 1.2% | Working (spaces fixed) | 20 |
@@ -115,7 +115,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottle
115115

116116
**Summary:**
117117

118-
- Zerfoo with CUDA graphs: **235 tok/s** (+25% vs Ollama, ~5-15% vs llama.cpp)
118+
- Zerfoo with CUDA graphs: **241 tok/s** (+25% vs Ollama, ~5-15% vs llama.cpp)
119119
- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +35%)
120120
- Ollama: **188 tok/s** (uses llama.cpp under the hood with its own overhead)
121121

@@ -151,7 +151,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottle
151151

152152
| GPU | Zerfoo (est.) | Notes |
153153
|-----|---------------|-------|
154-
| DGX Spark GB10 | 235 tok/s | Measured (Gemma 3 1B, 2026-03-27) |
154+
| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-27) |
155155
| RTX 4090 | TBD | Community contributions welcome |
156156
| RTX 3090 | TBD | Community contributions welcome |
157157
| A100 80GB | TBD | Community contributions welcome |

0 commit comments

Comments
 (0)