
Commit 5a938cf

feat: highlight mmap default loading and over-RAM model support
- Add "Models Larger Than RAM" feature card to homepage features grid
- Add "229B on 128 GB via mmap" stat to hero section
- Add MiniMax-M2 138 GB over-RAM table to benchmarks section
- Update model-loading tutorial: mmap is default, split-GGUF auto-detected
- Update inference API docs: WithMmap is now opt-out, not opt-in
1 parent c9fde40 commit 5a938cf

3 files changed

Lines changed: 39 additions & 3 deletions


content/_index.html

Lines changed: 25 additions & 0 deletions
@@ -270,6 +270,7 @@ <h1>Machine learning for Go.<br><span class="grad">Pure Go. Zero CGo.</span></h1
 <div class="stat"><div class="num">+25%</div><div class="label">faster than Ollama</div></div>
 <div class="stat"><div class="num">99.5%</div><div class="label">CUDA graph coverage</div></div>
 <div class="stat"><div class="num">0</div><div class="label">CGo calls</div></div>
+<div class="stat"><div class="num">229B</div><div class="label">on 128 GB via mmap</div></div>
 </div>
 <div class="actions">
 <a href="#quickstart" class="btn btn-primary">Get Started</a>
@@ -418,6 +419,11 @@ <h3>Q4_K Fused GEMV</h3>
 <h3>Advanced Serving</h3>
 <p>Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.</p>
 </div>
+<div class="feat">
+<div class="icon">&#128190;</div>
+<h3>Models Larger Than RAM</h3>
+<p>GGUF files are memory-mapped by default. Tensor data is paged from NVMe on demand — the OS handles it. A 229B MiniMax-M2 (138 GB, 3 shards) runs on 128 GB with no flags and no configuration.</p>
+</div>
 </div>
 </div>
 </section>
@@ -476,6 +482,25 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
 </table>
 </div>
 
+<div style="margin-top:40px">
+<h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Models larger than RAM</h3>
+<p style="color:var(--text2);margin-bottom:16px;font-size:0.9rem">GGUF files are memory-mapped by default. Tensor data pages from NVMe on demand — no flags needed. Split GGUF shards are detected and mapped automatically.</p>
+<table class="bench-table">
+<thead>
+<tr><th>Model</th><th>Params</th><th>File size</th><th>RAM available</th><th>Status</th></tr>
+</thead>
+<tbody>
+<tr>
+<td class="highlight">MiniMax-M2 Q4_K_M</td>
+<td>229B MoE</td>
+<td>138 GB (3 shards)</td>
+<td>128 GB</td>
+<td>Loads and generates text</td>
+</tr>
+</tbody>
+</table>
+</div>
+
 <div class="bench-note">
 Hardware: NVIDIA DGX Spark GB10, sm_121, 128GB LPDDR5x. Methodology: 3 runs, 32-token warmup, median reported. Full details in <a href="/docs/reference/benchmarks/">benchmarks</a>.
 </div>

content/docs/api/inference.md

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ Sets the KV cache storage dtype. Supported: `"fp32"` (default), `"fp16"`. FP16 h
 func WithMmap(enabled bool) Option
 ```
 
-Enables memory-mapped model loading. When true, the file is mapped into memory using `syscall.Mmap` instead of `os.ReadFile`, avoiding heap allocation for model weights. Only supported on unix platforms.
+Controls memory-mapped model loading. **mmap is enabled by default.** When enabled, the GGUF file is mapped into virtual address space using `syscall.Mmap`; tensor data is paged from disk on demand by the OS, avoiding heap allocation and enabling models larger than physical RAM. Pass `false` to use heap loading, which is required for CUDA graph capture. Only supported on Unix platforms.
 
 ---
 

content/docs/tutorials/model-loading.md

Lines changed: 13 additions & 2 deletions
@@ -22,14 +22,25 @@ Zerfoo uses GGUF as its sole model format. When you call `inference.LoadFile`, t
 model, err := inference.LoadFile("path/to/model.gguf")
 ```
 
-GGUF files are mmap-friendly. On Unix platforms, you can enable memory-mapped loading to avoid copying weights into the Go heap:
+GGUF files are memory-mapped by default. Zerfoo maps the file into virtual address space and lets the OS page tensor data from disk on demand — no weights are copied into heap memory at startup. This gives near-instant load times regardless of model size and allows loading models larger than physical RAM.
 
 ```go
+// mmap is the default — no options needed
+model, err := inference.LoadFile("model.gguf")
+
+// Opt out for heap loading (required for CUDA graph capture)
 model, err := inference.LoadFile("model.gguf",
-	inference.WithMmap(true),
+	inference.WithMmap(false),
 )
 ```
 
+Split GGUF files (the `-NNNNN-of-NNNNN.gguf` naming convention used for 70B+ models from HuggingFace) are detected and loaded automatically. Pass the path to the first shard — Zerfoo finds the rest.
+
+```go
+// Load a 138 GB model (3 shards) on a 128 GB machine
+model, err := inference.LoadFile("MiniMax-M2-Q4_K_M-00001-of-00003.gguf")
+```
+
 ## Supported Architectures
 
 Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata -- you do not need to specify it.
