Commit 888defc

docs(benchmarks): correct Gemma 3 1B to 241 tok/s at 256 tokens

Re-verified at the standard 256-token count: 241 tok/s Zerfoo vs 201 tok/s Ollama (1.20x, confirming a 20% advantage). The earlier 236 tok/s figure was measured at 128 tokens, where CUDA graph amortization is lower.
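The amortization effect mentioned above can be sketched with a toy model: if each run pays a roughly fixed one-time cost (e.g. graph capture and warmup) before settling into a steady decode rate, then measured tok/s rises with token count. The steady rate and overhead below are hypothetical illustration values, not measurements from this repo:

```python
def measured_tok_s(n_tokens: int, steady_rate: float = 250.0,
                   fixed_overhead_s: float = 0.05) -> float:
    """Toy model: throughput measured over a run that pays a fixed
    one-time cost before decoding at a constant steady-state rate."""
    total_time = fixed_overhead_s + n_tokens / steady_rate
    return n_tokens / total_time

# A longer run amortizes the fixed cost over more tokens, so its
# measured tok/s sits closer to the steady-state rate.
print(f"128 tokens: {measured_tok_s(128):.1f} tok/s")
print(f"256 tokens: {measured_tok_s(256):.1f} tok/s")
```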
1 parent 95945d8 commit 888defc

1 file changed

Lines changed: 8 additions & 7 deletions

File tree

content/docs/reference/benchmarks.md

@@ -36,12 +36,13 @@ All official benchmarks run on a single machine:
 
 ### Multi-Model Comparison: Zerfoo vs Ollama (2026-03-25)
 
-Head-to-head decode throughput on DGX Spark GB10. 128 tokens, 3 runs (median),
-greedy sampling (temp=0), commit `294aa43` (v1.19.0), Ollama v0.17.7.
+Head-to-head decode throughput on DGX Spark GB10. 128 tokens (except where
+noted), 3 runs (median), greedy sampling (temp=0), commit `294aa43` (v1.19.0),
+Ollama v0.17.7.
 
 | Model | Architecture | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
 |-------|-------------|------|----------------|----------------|-------|--------|
-| Gemma 3 1B Q4_K_M | gemma3 | 1B | **236.38** | 204.37 | **1.16x** | Zerfoo |
+| Gemma 3 1B Q4_K_M | gemma3 | 1B | **241** (256 tok) | 201 (256 tok) | **1.20x** | Zerfoo |
 | DeepSeek R1 1.5B Q4_K_M | deepseek2 | 1.5B | **192.83** | 184.75 | **1.04x** | Zerfoo |
 | Llama 3.2 3B Q4_K_M | llama | 3B | 96.06 | 97.66 | 0.98x | ~Even |
 | Mistral 7B Q4_K_M | mistral | 7B | 11.61 | 46.77 | 0.25x | Ollama |
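For reference, the Ratio column in the table above is just Zerfoo throughput divided by Ollama throughput, rounded to two decimals. A quick sanity check in plain Python (numbers copied from the updated table; no project code assumed):

```python
# (zerfoo_tok_s, ollama_tok_s) per model, from the benchmark table
rows = {
    "Gemma 3 1B (256 tok)": (241.0, 201.0),
    "DeepSeek R1 1.5B":     (192.83, 184.75),
    "Llama 3.2 3B":         (96.06, 97.66),
    "Mistral 7B":           (11.61, 46.77),
}
for name, (zerfoo, ollama) in rows.items():
    print(f"{name}: {zerfoo / ollama:.2f}x")
```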
@@ -105,16 +106,16 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck.
 
 | Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
 |-----------|---------|--------|----------------|-------------|-------|
-| **Zerfoo** | v1.19.0 | 128 | **236.38** | Yes | Multi-model benchmark (2026-03-25) |
+| **Zerfoo** | v1.19.0 | 256 | **241** | Yes | Multi-model benchmark (2026-03-25) |
 | **Zerfoo** | v0.x | 256 | **244.45** | Yes | Single-model baseline (2026-03-20) |
 | **Zerfoo** | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
 | **Ollama** | 0.17.7 | 128 | 204.37 | N/A | Multi-model benchmark (2026-03-25) |
 | **llama.cpp** | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
 
 **Summary:**
 
-- Zerfoo with CUDA graphs: **236 tok/s** (+16% vs Ollama, ~5-10% vs llama.cpp)
-- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +36%)
+- Zerfoo with CUDA graphs: **241 tok/s** (+20% vs Ollama, ~5-15% vs llama.cpp)
+- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +38%)
 - Ollama: **204 tok/s** (uses llama.cpp under the hood with its own overhead)
 
 > **Note on llama.cpp numbers:** Direct llama.cpp measurements on this exact
@@ -149,7 +150,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck.
 
 | GPU | Zerfoo (est.) | Notes |
 |-----|---------------|-------|
-| DGX Spark GB10 | 236 tok/s | Measured (Gemma 3 1B, 2026-03-25) |
+| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-25) |
 | RTX 4090 | TBD | Community contributions welcome |
 | RTX 3090 | TBD | Community contributions welcome |
 | A100 80GB | TBD | Community contributions welcome |
