
Commit 5a938cf

feat: highlight mmap default loading and over-RAM model support
- Add "Models Larger Than RAM" feature card to homepage features grid
- Add "229B on 128 GB via mmap" stat to hero section
- Add MiniMax-M2 138 GB over-RAM table to benchmarks section
- Update model-loading tutorial: mmap is default, split-GGUF auto-detected
- Update inference API docs: WithMmap is now opt-out, not opt-in
1 parent c9fde40 commit 5a938cf

3 files changed

Lines changed: 39 additions & 3 deletions


content/_index.html

Lines changed: 25 additions & 0 deletions
@@ -270,6 +270,7 @@ <h1>Machine learning for Go.<br><span class="grad">Pure Go. Zero CGo.</span></h1
 <div class="stat"><div class="num">+25%</div><div class="label">faster than Ollama</div></div>
 <div class="stat"><div class="num">99.5%</div><div class="label">CUDA graph coverage</div></div>
 <div class="stat"><div class="num">0</div><div class="label">CGo calls</div></div>
+<div class="stat"><div class="num">229B</div><div class="label">on 128 GB via mmap</div></div>
 </div>
 <div class="actions">
 <a href="#quickstart" class="btn btn-primary">Get Started</a>
@@ -418,6 +419,11 @@ <h3>Q4_K Fused GEMV</h3>
 <h3>Advanced Serving</h3>
 <p>Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.</p>
 </div>
+<div class="feat">
+<div class="icon">&#128190;</div>
+<h3>Models Larger Than RAM</h3>
+<p>GGUF files are memory-mapped by default. Tensor data is paged from NVMe on demand — the OS handles it. A 229B MiniMax-M2 (138 GB, 3 shards) runs on 128 GB with no flags and no configuration.</p>
+</div>
 </div>
 </div>
 </section>
@@ -476,6 +482,25 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
 </table>
 </div>
 
+<div style="margin-top:40px">
+<h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Models larger than RAM</h3>
+<p style="color:var(--text2);margin-bottom:16px;font-size:0.9rem">GGUF files are memory-mapped by default. Tensor data pages from NVMe on demand — no flags needed. Split GGUF shards are detected and mapped automatically.</p>
+<table class="bench-table">
+<thead>
+<tr><th>Model</th><th>Params</th><th>File size</th><th>RAM available</th><th>Status</th></tr>
+</thead>
+<tbody>
+<tr>
+<td class="highlight">MiniMax-M2 Q4_K_M</td>
+<td>229B MoE</td>
+<td>138 GB (3 shards)</td>
+<td>128 GB</td>
+<td>Loads and generates text</td>
+</tr>
+</tbody>
+</table>
+</div>
+
 <div class="bench-note">
 Hardware: NVIDIA DGX Spark GB10, sm_121, 128GB LPDDR5x. Methodology: 3 runs, 32-token warmup, median reported. Full details in <a href="/docs/reference/benchmarks/">benchmarks</a>.
 </div>

content/docs/api/inference.md

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ Sets the KV cache storage dtype. Supported: `"fp32"` (default), `"fp16"`. FP16 h
 func WithMmap(enabled bool) Option
 ```
 
-Enables memory-mapped model loading. When true, the file is mapped into memory using `syscall.Mmap` instead of `os.ReadFile`, avoiding heap allocation for model weights. Only supported on unix platforms.
+Controls memory-mapped model loading. **mmap is enabled by default.** When enabled, the GGUF file is mapped into virtual address space using `syscall.Mmap`; tensor data is paged from disk on demand by the OS, avoiding heap allocation and enabling models larger than physical RAM. Pass `false` to use heap loading, which is required for CUDA graph capture. Only supported on Unix platforms.
 
 ---
 

content/docs/tutorials/model-loading.md

Lines changed: 13 additions & 2 deletions
@@ -22,14 +22,25 @@ Zerfoo uses GGUF as its sole model format. When you call `inference.LoadFile`, t
 model, err := inference.LoadFile("path/to/model.gguf")
 ```
 
-GGUF files are mmap-friendly. On Unix platforms, you can enable memory-mapped loading to avoid copying weights into the Go heap:
+GGUF files are memory-mapped by default. Zerfoo maps the file into virtual address space and lets the OS page tensor data from disk on demand — no weights are copied into heap memory at startup. This gives near-instant load times regardless of model size and allows loading models larger than physical RAM.
 
 ```go
+// mmap is the default — no options needed
+model, err := inference.LoadFile("model.gguf")
+
+// Opt out for heap loading (required for CUDA graph capture)
 model, err := inference.LoadFile("model.gguf",
-	inference.WithMmap(true),
+	inference.WithMmap(false),
 )
 ```
 
+Split GGUF files (the `-NNNNN-of-NNNNN.gguf` naming convention used for 70B+ models from HuggingFace) are detected and loaded automatically. Pass the path to the first shard — Zerfoo finds the rest.
+
+```go
+// Load a 138 GB model (3 shards) on a 128 GB machine
+model, err := inference.LoadFile("MiniMax-M2-Q4_K_M-00001-of-00003.gguf")
+```
+
 ## Supported Architectures
 
 Zerfoo includes architecture-specific graph builders for each model family. The architecture is detected automatically from GGUF metadata -- you do not need to specify it.
