docs: update landing page and benchmarks to 241 tok/s (1.28x Ollama)

dndungu · dndungu · commit 9b95f61d3eda · 2026-03-31T22:59:52.000-07:00
diff --git a/content/_index.html b/content/_index.html
@@ -8,7 +8,7 @@
 <meta charset="utf-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <title>Zerfoo — Machine Learning Framework for Go</title>
-<meta name="description" content="Train, run, and serve ML models in your Go application. 235 tok/s on Gemma 3 1B — 25% faster than Ollama. Pure Go, zero CGo.">
+<meta name="description" content="Train, run, and serve ML models in your Go application. 241 tok/s on Gemma 3 1B — 28% faster than Ollama. Pure Go, zero CGo.">
 <meta name="theme-color" content="#8B5CF6">
 <link rel="icon" href="zerfoo.svg" type="image/svg+xml">
 <script>(function(){var t=localStorage.getItem('theme');if(t)document.documentElement.classList.add(t)})()</script>
@@ -266,7 +266,7 @@
     <h1>Machine learning for Go.<br><span class="grad">Pure Go. Zero CGo.</span></h1>
     <p class="sub">Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.</p>
     <div class="stats">
-      <div class="stat"><div class="num">235 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
+      <div class="stat"><div class="num">241 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
-      <div class="stat"><div class="num">+25%</div><div class="label">faster than Ollama</div></div>
+      <div class="stat"><div class="num">+28%</div><div class="label">faster than Ollama</div></div>
       <div class="stat"><div class="num">99.5%</div><div class="label">CUDA graph coverage</div></div>
       <div class="stat"><div class="num">0</div><div class="label">CGo calls</div></div>
@@ -445,8 +445,8 @@ <h2>Faster than Ollama</h2>
             <td class="highlight">Zerfoo</td>
             <td>
               <div class="bench-bar">
-                <div class="bar" style="width:min(235px,60vw)"></div>
-                <div class="val">235 tok/s</div>
+                <div class="bar" style="width:min(241px,60vw)"></div>
+                <div class="val">241 tok/s</div>
               </div>
             </td>
             <td>Pure Go, zero CGo, CUDA graph capture, fused kernels</td>
@@ -472,7 +472,7 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
           <tr><th>Date</th><th>Milestone</th><th>Tok/s</th><th>Improvement</th></tr>
         </thead>
         <tbody>
-          <tr><td>Mar 27</td><td class="highlight">Multi-model benchmark (3-run median)</td><td class="highlight">235</td><td>+25% vs Ollama</td></tr>
+          <tr><td>Mar 27</td><td class="highlight">Multi-model benchmark (3-run median)</td><td class="highlight">241</td><td>+28% vs Ollama</td></tr>
           <tr><td>Mar 17</td><td>Q4_0 re-quant restored</td><td>245</td><td>+32% vs regression</td></tr>
           <tr><td>Mar 14</td><td>CUDA graph capture</td><td>234</td><td>+26% vs non-graph</td></tr>
           <tr><td>Mar 13</td><td>GPU-first pipeline</td><td>103</td><td>D2H elimination</td></tr>
@@ -642,7 +642,7 @@ <h2>From the blog</h2>
       <a href="/docs/blog/how-we-beat-ollama-cuda-graph-capture/" class="blog-card">
         <div class="tag">Performance</div>
         <h3>How We Beat Ollama: CUDA Graph Capture in Pure Go</h3>
-        <p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 235 tok/s. A deep dive into making the decode path GPU-only.</p>
+        <p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 241 tok/s. A deep dive into making the decode path GPU-only.</p>
       </a>
       <a href="/docs/blog/zero-cgo-pure-go-ml-inference/" class="blog-card">
         <div class="tag">Architecture</div>
diff --git a/content/docs/reference/benchmarks.md b/content/docs/reference/benchmarks.md
@@ -107,15 +107,15 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottle
 
 | Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
 |-----------|---------|--------|----------------|-------------|-------|
-| **Zerfoo** | latest | 128 | **235** | Yes | Multi-model benchmark (2026-03-27) |
+| **Zerfoo** | latest | 128 | **241** | Yes | Multi-model benchmark (2026-03-27) |
 | **Zerfoo** | v0.x | 256 | **244.45** | Yes | Single-model baseline (2026-03-20) |
 | **Zerfoo** | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
 | **Ollama** | 0.17.7 | 128 | 188 | N/A | Multi-model benchmark (2026-03-27) |
 | **llama.cpp** | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
 
 **Summary:**
 
-- Zerfoo with CUDA graphs: **241 tok/s** (+25% vs Ollama, ~5-15% vs llama.cpp)
+- Zerfoo with CUDA graphs: **241 tok/s** (+28% vs Ollama)
 - Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +35%)
 - Ollama: **188 tok/s** (uses llama.cpp under the hood with its own overhead)
 
@@ -183,7 +183,7 @@ QWENVL_GGUF_PATH=/path/to/qwenvl.gguf go test -run TestQwenVL_VisionPipeline -co
 
 | Date | Milestone | Tok/s | Notes |
 |------|-----------|-------|-------|
-| 2026-03-27 | Multi-model benchmark (3-run median) | 235 | +25% vs Ollama (188 tok/s) |
+| 2026-03-31 | Multi-model benchmark (3-run median) | 241 | +28% vs Ollama (188 tok/s) |
 | 2026-03-17 | dp4a + arena reuse | 245.15 | Parity at batch=1 (memory-bound); dp4a benefits at larger batches |
 | 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression |
 | 2026-03-14 | CUDA graph capture | 234.30 | +26% vs non-graph baseline |