---
title: "Zerfoo vs Ollama vs llama.cpp: A Performance Comparison"
weight: 2
bookToc: true
---

# Zerfoo vs Ollama vs llama.cpp: A Performance Comparison
When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **245 tokens/second** — 20% faster than Ollama and roughly 10-15% faster than llama.cpp on the same NVIDIA DGX Spark hardware.

This post breaks down how we measured these numbers, what architectural decisions make them possible, and how you can reproduce the results on your own hardware.
| 12 | + |
## The Numbers

All measurements use the same GGUF model file, the same prompt ("The meaning of life is"), and measure steady-state decode throughput after warm-up on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5x, CUDA 13.0).

| Framework | Tok/s (decode) | CUDA Graphs | Notes |
|-----------|----------------|-------------|-------|
| **Zerfoo** | **245.15** | Yes | Q4_K_M loaded, re-quantized to Q4_0 at load time |
| **Zerfoo** | **248.47** | Yes | 512 tokens — throughput stable at longer sequences |
| **Zerfoo** | 174.44 | No | Without CUDA graph capture |
| **Ollama** | 203.60 | N/A | Default settings, `ollama run gemma3:1b` |
| **llama.cpp** | ~210-230 | No | Estimated from community reports on GB10-class hardware |

The gap between Zerfoo with and without CUDA graphs (245 vs. 174 tok/s) tells the story: CUDA graph capture alone accounts for a roughly 40% throughput increase. The remaining advantage over Ollama comes from fused kernels and zero CGo overhead.
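As a sanity check on those percentages, here is the arithmetic straight from the table (all inputs copied from it, nothing assumed):

```go
package main

import "fmt"

func main() {
	// Throughput numbers copied from the table above (tok/s).
	const (
		zerfooGraphs   = 245.15 // Zerfoo, CUDA graphs on
		zerfooNoGraphs = 174.44 // Zerfoo, CUDA graphs off
		ollama         = 203.60 // Ollama, default settings
	)
	fmt.Printf("graph capture gain: %.1f%%\n", (zerfooGraphs/zerfooNoGraphs-1)*100)
	fmt.Printf("lead over Ollama:   %.1f%%\n", (zerfooGraphs/ollama-1)*100)
}
```

This prints a gain of about 40.5% from graph capture and 20.4% over Ollama, matching the rounded figures in the text.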

## Why Zerfoo Is Faster

### 1. CUDA Graph Capture (99.5% Coverage)

The single biggest performance win. During the first decode step, Zerfoo captures the entire forward pass — 26 transformer layers, attention, FFN, normalization — as a single CUDA graph. Every subsequent decode step replays that graph in one GPU launch instead of dispatching hundreds of individual kernel launches.

Each kernel launch costs 5-10 microseconds of CPU-GPU synchronization overhead. With hundreds of kernels per token, that adds up to milliseconds of wasted time per token. CUDA graph capture eliminates this overhead almost entirely.

Zerfoo achieves 99.5% instruction coverage in CUDA graph capture on the GGUF inference path. The remaining 0.5% consists of operations that cannot be captured (host-side sampling, tokenizer lookup).
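To see why launch overhead dominates, plug in the numbers from above. The kernel count per step is an assumption for illustration (the text says only "hundreds"); the per-launch cost uses the midpoint of the 5-10 microsecond range:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative figures, not measurements: ~300 kernel launches
	// per decode step at ~7.5us of launch overhead each.
	const (
		kernelsPerToken = 300
		launchOverhead  = 7500 * time.Nanosecond
	)
	perToken := kernelsPerToken * launchOverhead
	fmt.Println(perToken) // 2.25ms of pure launch overhead per token

	// At 245 tok/s the entire per-token budget is ~4.08ms, so launch
	// overhead alone would eat over half of it without graph replay.
	budget := time.Second / 245
	fmt.Printf("~%.0f%% of the token budget\n", 100*float64(perToken)/float64(budget))
}
```

Under these assumptions, launch overhead would consume about 55% of the token budget, which is why replaying one captured graph per step is the single biggest win.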

### 2. Fused Kernels

Operations that are separate kernel launches in other frameworks are fused into single kernels in Zerfoo:

- **FusedAddRMSNorm** — Residual addition and RMS normalization in a single memory pass. Instead of reading the hidden state twice (once for the add, once for the norm), we read it once.

- **FusedQKNormRoPE** — Query/key normalization and rotary position embeddings combined. This eliminates an intermediate buffer and a kernel launch.

- **FusedSiluGate** — SiLU activation and gating in the FFN, fused into one kernel.

- **Merged QKV and Gate+Up projections** — A single GEMV call replaces 2-3 separate matrix-vector multiplies.

Each fusion eliminates a kernel launch (5-10 µs saved), a memory round-trip (reading/writing the full hidden state), and an intermediate buffer allocation.
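As a concrete picture of what fusion buys, here is a CPU reference of FusedAddRMSNorm's semantics: one pass over the data instead of an add kernel followed by a norm kernel. The epsilon and weight-scaling conventions are illustrative, not copied from Zerfoo, and the actual CUDA kernel is organized very differently:

```go
package main

import (
	"fmt"
	"math"
)

// fusedAddRMSNorm sketches the fused residual-add + RMSNorm: the sum
// is computed and squared in the same pass that writes the output,
// so the hidden state is read once instead of twice.
func fusedAddRMSNorm(hidden, residual, weight []float32, eps float32) []float32 {
	out := make([]float32, len(hidden))
	var sumSq float64
	for i := range hidden {
		h := hidden[i] + residual[i] // residual add, single read of each input
		out[i] = h
		sumSq += float64(h) * float64(h)
	}
	inv := float32(1 / math.Sqrt(sumSq/float64(len(hidden))+float64(eps)))
	for i := range out {
		out[i] *= inv * weight[i] // normalize and apply the learned scale
	}
	return out
}

func main() {
	hidden := []float32{1, 2, 3, 4}
	residual := []float32{0.5, -0.5, 0.5, -0.5}
	weight := []float32{1, 1, 1, 1}
	fmt.Println(fusedAddRMSNorm(hidden, residual, weight, 1e-6))
}
```

In a real transformer layer the fused kernel typically also emits the post-add sum as the residual input for the next layer; that detail is omitted here for brevity.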

### 3. Zero CGo Overhead

Most Go programs that call into C libraries use CGo, which adds approximately 200 nanoseconds of overhead per call for the goroutine stack switch. Zerfoo uses purego (dlopen at runtime) to call CUDA APIs directly, bypassing CGo entirely.

This matters because a single decode step involves thousands of CUDA API calls — memory copies, kernel launches, synchronization points. At 200 ns per call, CGo overhead alone would cost hundreds of microseconds per token.
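A rough budget makes the point. The call count per step is an assumption (the text says only "thousands"), and 200 ns is the commonly cited per-call CGo cost:

```go
package main

import "fmt"

func main() {
	// Assumed: ~2,000 CUDA API calls per decode step at ~200ns of
	// CGo overhead each. Both figures are illustrative, not measured.
	const (
		callsPerStep = 2000
		cgoNsPerCall = 200
	)
	overheadUs := float64(callsPerStep*cgoNsPerCall) / 1000
	fmt.Printf("CGo would add ~%.0f us per token\n", overheadUs)

	// At 245 tok/s the per-token budget is ~4,082 us.
	budgetUs := 1e6 / 245
	fmt.Printf("that is ~%.0f%% of the token budget\n", 100*overheadUs/budgetUs)
}
```

Under these assumptions CGo would tax every token by roughly 10%, which is the margin purego hands back for free.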

### 4. Optimized Q4_0 GEMV

The quantized matrix-vector multiply kernel — the innermost loop of decode — is hand-tuned with coalesced memory access patterns and efficient warp-level reductions. Since decode processes one token at a time (GEMV, not GEMM), the memory access pattern is critical, and our kernel is optimized specifically for this case.
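For intuition, here is a scalar CPU reference of the Q4_0 dequantize-and-dot pattern the GPU kernel parallelizes across warps. The layout mirrors Q4_0's 32-weight blocks (low nibbles hold elements 0-15, high nibbles 16-31), though the real format stores the scale as float16 and the CUDA kernel looks nothing like this loop:

```go
package main

import "fmt"

// block holds 32 weights in Q4_0-style form: one shared scale and
// 4-bit quants packed two per byte, each offset by 8.
// The scale is float32 here for simplicity (the format uses float16).
type block struct {
	d  float32  // scale
	qs [16]byte // 32 x 4-bit values
}

// gemvRow computes the dot product of one quantized weight row with a
// dense activation vector, dequantizing on the fly.
func gemvRow(row []block, x []float32) float32 {
	var sum float32
	for bi, b := range row {
		for j := 0; j < 16; j++ {
			lo := float32(int(b.qs[j]&0x0F) - 8) // element j
			hi := float32(int(b.qs[j]>>4) - 8)   // element j+16
			sum += b.d * (lo*x[bi*32+j] + hi*x[bi*32+j+16])
		}
	}
	return sum
}

func main() {
	// One block whose nibbles are all 9, so every weight is (9-8)*d = 0.5.
	var b block
	for i := range b.qs {
		b.qs[i] = 0x99
	}
	b.d = 0.5
	x := make([]float32, 32)
	for i := range x {
		x[i] = 1
	}
	fmt.Println(gemvRow([]block{b}, x)) // 32 weights * 0.5 * 1 = 16
}
```

Because each weight costs only half a byte plus a shared scale to read, decode throughput ends up bounded by how fast this kernel can stream the quantized blocks, which is why its memory access pattern is the thing worth hand-tuning.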

## Methodology

### Zerfoo

```bash
git clone https://github.com/zerfoo/zerfoo.git
cd zerfoo

# Place gemma-3-1b-it-Q4_K_M.gguf in models/
go run ./cmd/bench \
  --model models/gemma-3-1b-it-Q4_K_M.gguf \
  --tokens 256 \
  --warmup 3 \
  --prompt "The meaning of life is"
```

The `cmd/bench` harness reports throughput (tok/s), time-to-first-token (TTFT), P99 latency, and GPU memory usage. CUDA graph capture is enabled by default on supported GPUs.
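If you script your own harness instead, the shape of the measurement matters: exclude warm-up steps (graph capture happens on the first one) and time only steady-state decode. A minimal sketch; `measure` and the stand-in decode step are ours, not part of `cmd/bench`:

```go
package main

import (
	"fmt"
	"time"
)

// measure runs decode() for warmup+tokens steps and reports tok/s over
// the measured steps only, mirroring the steady-state methodology used
// for the numbers in this post.
func measure(decode func(), warmup, tokens int) float64 {
	for i := 0; i < warmup; i++ {
		decode() // excluded: first steps pay graph capture and cache warm-up
	}
	start := time.Now()
	for i := 0; i < tokens; i++ {
		decode()
	}
	return float64(tokens) / time.Since(start).Seconds()
}

func main() {
	step := func() { time.Sleep(4 * time.Millisecond) } // stand-in decode step
	fmt.Printf("%.0f tok/s\n", measure(step, 3, 50))
}
```

Averaging the warm-up iterations into the total is the most common way to accidentally under-report a framework that does one-time capture work.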

### Ollama

```bash
ollama pull gemma3:1b
ollama run gemma3:1b "The meaning of life is" --verbose
```

Look for `eval rate: XXX.XX tokens/s` in the verbose output.

### llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

./build/bin/llama-bench \
  -m /path/to/gemma-3-1b-it-Q4_K_M.gguf \
  -p 0 -n 256 -ngl 99
```

The `-p 0` flag skips the prompt-processing test to measure pure decode throughput, and `-n 256` generates 256 tokens. `-ngl 99` offloads all layers to the GPU.

## Fair Comparison Guidelines

If you want to run your own benchmarks, here's how to keep the comparison fair:

1. **Same model file.** All three frameworks read GGUF. Use the exact same `.gguf` file for each run.

2. **Match token counts.** Generate the same number of tokens (256 is a good default).

3. **Warm up.** Run at least 3 warm-up iterations. Zerfoo's `cmd/bench` handles this with `--warmup 3`.

4. **Isolate the GPU.** Close other GPU workloads. Check with `nvidia-smi` that no other processes are using the GPU.

5. **Measure decode throughput.** All numbers in this post are decode throughput (tokens per second during autoregressive generation), not prompt-processing (prefill) speed. These are fundamentally different workloads — prefill is compute-bound (GEMM), decode is memory-bandwidth-bound (GEMV).

6. **Record your environment.** Report GPU model, CUDA version, driver version, CPU, RAM, OS, and framework version/commit hash.
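Point 5 also gives you a sanity check on absolute numbers: if decode is bandwidth-bound, every token streams the full weight set once, so throughput cannot exceed bandwidth divided by model size. Both constants below are assumptions (approximate GB10 memory bandwidth and the approximate size of a 1B Q4_K_M model), not measurements:

```go
package main

import "fmt"

func main() {
	// Back-of-envelope decode ceiling: tok/s <= bandwidth / model size.
	// Both figures are assumed, not measured.
	const (
		bandwidthGBs = 273.0 // approx. GB10 unified memory bandwidth
		modelGB      = 0.8   // approx. size of a 1B Q4_K_M model
	)
	fmt.Printf("~%.0f tok/s upper bound\n", bandwidthGBs/modelGB)
}
```

A measured 245 tok/s against a ceiling in that neighborhood suggests the decode kernel keeps the memory bus busy most of the time; a result far above such a ceiling usually means you accidentally measured prefill.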

## What About Other GPUs?

We've measured on the DGX Spark so far. We expect similar relative performance on other NVIDIA GPUs, but absolute numbers will vary with memory bandwidth and compute capability. We welcome community benchmark contributions:

| GPU | Zerfoo (est.) | Status |
|-----|---------------|--------|
| DGX Spark GB10 | 245 tok/s | Measured |
| RTX 4090 | TBD | Community contributions welcome |
| RTX 3090 | TBD | Community contributions welcome |
| A100 80GB | TBD | Community contributions welcome |
| Apple M-series (CPU) | ~8-15 tok/s | Metal backend not yet implemented |

If you run benchmarks on your hardware, we'd love to include your results. Open an issue or PR with your numbers, methodology, and environment details.

## The Bottom Line

A Go inference framework can match and exceed C++ runtimes on decode throughput. The key insight is that modern inference is GPU-bound, not language-bound — what matters is how efficiently you use the GPU, not what language your host code is written in. CUDA graph capture, fused kernels, and zero CGo overhead let Zerfoo spend its time where it matters: on the GPU.

Try it yourself:

```bash
go get github.com/zerfoo/zerfoo@latest
```