Commit 9b95f61

docs: update landing page and benchmarks to 241 tok/s (1.28x Ollama)
1 parent 3c3a8b3 commit 9b95f61

2 files changed

Lines changed: 9 additions & 9 deletions

content/_index.html

Lines changed: 6 additions & 6 deletions
@@ -8,7 +8,7 @@
 <meta charset="utf-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <title>Zerfoo — Machine Learning Framework for Go</title>
-<meta name="description" content="Train, run, and serve ML models in your Go application. 235 tok/s on Gemma 3 1B — 25% faster than Ollama. Pure Go, zero CGo.">
+<meta name="description" content="Train, run, and serve ML models in your Go application. 241 tok/s on Gemma 3 1B — 28% faster than Ollama. Pure Go, zero CGo.">
 <meta name="theme-color" content="#8B5CF6">
 <link rel="icon" href="zerfoo.svg" type="image/svg+xml">
 <script>(function(){var t=localStorage.getItem('theme');if(t)document.documentElement.classList.add(t)})()</script>
@@ -266,7 +266,7 @@
 <h1>Machine learning for Go.<br><span class="grad">Pure Go. Zero CGo.</span></h1>
 <p class="sub">Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.</p>
 <div class="stats">
-<div class="stat"><div class="num">235 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
+<div class="stat"><div class="num">241 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
 <div class="stat"><div class="num">+25%</div><div class="label">faster than Ollama</div></div>
 <div class="stat"><div class="num">99.5%</div><div class="label">CUDA graph coverage</div></div>
 <div class="stat"><div class="num">0</div><div class="label">CGo calls</div></div>
@@ -445,8 +445,8 @@ <h2>Faster than Ollama</h2>
 <td class="highlight">Zerfoo</td>
 <td>
 <div class="bench-bar">
-<div class="bar" style="width:min(235px,60vw)"></div>
-<div class="val">235 tok/s</div>
+<div class="bar" style="width:min(241px,60vw)"></div>
+<div class="val">241 tok/s</div>
 </div>
 </td>
 <td>Pure Go, zero CGo, CUDA graph capture, fused kernels</td>
@@ -472,7 +472,7 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
 <tr><th>Date</th><th>Milestone</th><th>Tok/s</th><th>Improvement</th></tr>
 </thead>
 <tbody>
-<tr><td>Mar 27</td><td class="highlight">Multi-model benchmark (3-run median)</td><td class="highlight">235</td><td>+25% vs Ollama</td></tr>
+<tr><td>Mar 27</td><td class="highlight">Multi-model benchmark (3-run median)</td><td class="highlight">241</td><td>+28% vs Ollama</td></tr>
 <tr><td>Mar 17</td><td>Q4_0 re-quant restored</td><td>245</td><td>+32% vs regression</td></tr>
 <tr><td>Mar 14</td><td>CUDA graph capture</td><td>234</td><td>+26% vs non-graph</td></tr>
 <tr><td>Mar 13</td><td>GPU-first pipeline</td><td>103</td><td>D2H elimination</td></tr>
@@ -642,7 +642,7 @@ <h2>From the blog</h2>
 <a href="/docs/blog/how-we-beat-ollama-cuda-graph-capture/" class="blog-card">
 <div class="tag">Performance</div>
 <h3>How We Beat Ollama: CUDA Graph Capture in Pure Go</h3>
-<p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 235 tok/s. A deep dive into making the decode path GPU-only.</p>
+<p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 241 tok/s. A deep dive into making the decode path GPU-only.</p>
 </a>
 <a href="/docs/blog/zero-cgo-pure-go-ml-inference/" class="blog-card">
 <div class="tag">Architecture</div>

content/docs/reference/benchmarks.md

Lines changed: 3 additions & 3 deletions
@@ -107,15 +107,15 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottle
 
 | Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
 |-----------|---------|--------|----------------|-------------|-------|
-| **Zerfoo** | latest | 128 | **235** | Yes | Multi-model benchmark (2026-03-27) |
+| **Zerfoo** | latest | 128 | **241** | Yes | Multi-model benchmark (2026-03-27) |
 | **Zerfoo** | v0.x | 256 | **244.45** | Yes | Single-model baseline (2026-03-20) |
 | **Zerfoo** | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
 | **Ollama** | 0.17.7 | 128 | 188 | N/A | Multi-model benchmark (2026-03-27) |
 | **llama.cpp** | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
 
 **Summary:**
 
-- Zerfoo with CUDA graphs: **241 tok/s** (+25% vs Ollama, ~5-15% vs llama.cpp)
+- Zerfoo with CUDA graphs: **241 tok/s** (+28% vs Ollama)
 - Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +35%)
 - Ollama: **188 tok/s** (uses llama.cpp under the hood with its own overhead)
 
@@ -183,7 +183,7 @@ QWENVL_GGUF_PATH=/path/to/qwenvl.gguf go test -run TestQwenVL_VisionPipeline -co
 
 | Date | Milestone | Tok/s | Notes |
 |------|-----------|-------|-------|
-| 2026-03-27 | Multi-model benchmark (3-run median) | 235 | +25% vs Ollama (188 tok/s) |
+| 2026-03-31 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama (188 tok/s) |
 | 2026-03-17 | dp4a + arena reuse | 245.15 | Parity at batch=1 (memory-bound); dp4a benefits at larger batches |
 | 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression |
 | 2026-03-14 | CUDA graph capture | 234.30 | +26% vs non-graph baseline |
