Commit d14867e (parent 4aee7aa)

docs(bench): update multi-model comparison table with 2026-03-27 results

Replace old table with new 3-run median results: Gemma3-1B 235 tok/s
(1.25x Ollama), DeepSeek-R1 186 (1.11x), Llama3.2 92 (0.99x), Mistral-7B 44
(1.00x). All models now produce coherent output after GQA fix. Remove
Mistral output quality caveat.

1 file changed: content/docs/reference/benchmarks.md (22 additions, 28 deletions)
@@ -34,28 +34,22 @@ All official benchmarks run on a single machine:
 
 ## Results
 
-### Multi-Model Comparison: Zerfoo vs Ollama (2026-03-25)
-
-Head-to-head decode throughput on DGX Spark GB10. 128 tokens (except where
-noted), 3 runs (median), greedy sampling (temp=0), commit `294aa43` (v1.19.0),
-Ollama v0.17.7.
-
-| Model | Architecture | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
-|-------|-------------|------|----------------|----------------|-------|--------|
-| Gemma 3 1B Q4_K_M | gemma3 | 1B | **241** (256 tok) | 201 (256 tok) | **1.20x** | Zerfoo |
-| DeepSeek R1 1.5B Q4_K_M | deepseek2 | 1.5B | **192.83** | 184.75 | **1.04x** | Zerfoo |
-| Llama 3.2 3B Q4_K_M | llama | 3B | 96.06 | 97.66 | 0.98x | ~Even |
-| Mistral 7B Q4_K_M | mistral | 7B | **44** | 46.77 | **0.94x** | ~Even |
-
-Zerfoo wins on small models (1B-1.5B). Llama 3.2 3B is at parity. Mistral 7B
-was previously at 11 tok/s due to a performance regression; after the shared
-memory fix it runs at 44 tok/s (0.94x Ollama -- near parity).
-
-> **Note on Mistral output quality:** Mistral 7B throughput is correct at 44
-> tok/s, but output quality is pending a tokenizer fix. The Mistral tokenizer
-> requires SentencePiece byte-fallback handling that is not yet fully
-> implemented. Throughput numbers are valid; text coherence will improve once
-> the tokenizer fix lands.
+### Multi-Model Comparison: Zerfoo vs Ollama (2026-03-27)
+
+Head-to-head decode throughput on DGX Spark GB10. 128 tokens, 3 runs (median),
+greedy sampling (temp=0). All models produce coherent output after the GQA
+repeat fix (ztensor v0.6.3) and flash attention decode fix (zerfoo v1.25.5).
+
+| Model | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
+|-------|------|----------------|----------------|-------|--------|
+| Gemma 3 1B Q4_K_M | 1B | **235** | 188 | **1.25x** | Zerfoo |
+| DeepSeek R1 1.5B Q4_K_M | 1.5B | **186** | 167 | **1.11x** | Zerfoo |
+| Llama 3.2 3B Q4_K_M | 3B | 92 | 93 | 0.99x | ~Even |
+| Mistral 7B Q5_K_M | 7B | 44 | 44 | 1.00x | ~Even |
+
+**Summary:** 25% faster on small models, parity at 7B. All four models now
+produce coherent output with CUDA graph capture enabled.
+
 Additional architectures (Qwen, Phi, Mixtral, Command-R, Falcon, Mamba, RWKV)
 will be added as GGUF files are acquired and parser compatibility is resolved.
 

@@ -113,17 +107,17 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck
 
 | Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
 |-----------|---------|--------|----------------|-------------|-------|
-| **Zerfoo** | v1.19.0 | 256 | **241** | Yes | Multi-model benchmark (2026-03-25) |
+| **Zerfoo** | latest | 128 | **235** | Yes | Multi-model benchmark (2026-03-27) |
 | **Zerfoo** | v0.x | 256 | **244.45** | Yes | Single-model baseline (2026-03-20) |
 | **Zerfoo** | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
-| **Ollama** | 0.17.7 | 128 | 204.37 | N/A | Multi-model benchmark (2026-03-25) |
+| **Ollama** | 0.17.7 | 128 | 188 | N/A | Multi-model benchmark (2026-03-27) |
 | **llama.cpp** | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
 
 **Summary:**
 
-- Zerfoo with CUDA graphs: **241 tok/s** (+20% vs Ollama, ~5-15% vs llama.cpp)
-- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +38%)
-- Ollama: **204 tok/s** (uses llama.cpp under the hood with its own overhead)
+- Zerfoo with CUDA graphs: **235 tok/s** (+25% vs Ollama, ~5-15% vs llama.cpp)
+- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +35%)
+- Ollama: **188 tok/s** (uses llama.cpp under the hood with its own overhead)
 
 > **Note on llama.cpp numbers:** Direct llama.cpp measurements on this exact
 > DGX Spark unit are pending. The estimate above is based on published community
@@ -157,7 +151,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck
 
 | GPU | Zerfoo (est.) | Notes |
 |-----|---------------|-------|
-| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-25) |
+| DGX Spark GB10 | 235 tok/s | Measured (Gemma 3 1B, 2026-03-27) |
 | RTX 4090 | TBD | Community contributions welcome |
 | RTX 3090 | TBD | Community contributions welcome |
 | A100 80GB | TBD | Community contributions welcome |
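The tables above report the median of 3 runs and a Zerfoo/Ollama speedup ratio. As a minimal sketch of that arithmetic (the per-run throughput values below are hypothetical, invented for illustration; only the median-of-3 methodology and the 235/188 medians come from the diff):

```python
from statistics import median

# Hypothetical per-run decode throughputs (tok/s) for Gemma 3 1B.
# The benchmark doc reports the median of 3 runs per framework.
zerfoo_runs = [233.0, 235.0, 238.0]
ollama_runs = [186.0, 188.0, 190.0]

def speedup_ratio(engine_runs, baseline_runs):
    """Ratio as shown in the table: median throughput vs median throughput."""
    return median(engine_runs) / median(baseline_runs)

r = speedup_ratio(zerfoo_runs, ollama_runs)
print(f"{median(zerfoo_runs):.0f} tok/s vs {median(ollama_runs):.0f} tok/s -> {r:.2f}x")
# With these sample runs: 235 tok/s vs 188 tok/s -> 1.25x
```

Using the median rather than the mean keeps a single slow warm-up or thermally throttled run from skewing the reported number.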
