Commit d76b81f

docs(zonnx): add zonnx overview, ONNX, and SafeTensors conversion guides
1 parent cb4d0b0 commit d76b81f

3 files changed

Lines changed: 388 additions & 0 deletions

File tree

content/docs/zonnx/onnx-to-gguf.md

Lines changed: 132 additions & 0 deletions
---
title: ONNX to GGUF
weight: 2
bookToc: true
---

# ONNX to GGUF Conversion

This guide walks through converting an ONNX model to GGUF format using zonnx. The resulting GGUF file can be loaded by zerfoo or llama.cpp.

## Prerequisites

- zonnx installed (`go install github.com/zerfoo/zonnx/cmd/zonnx@latest`)
- An ONNX model file, either local or on HuggingFace

## Step 1: Download a Model from HuggingFace

Use the `download` command to fetch an ONNX model and its tokenizer files:

```bash
zonnx download --model google/gemma-2-2b-it --output ./models
```

For gated models that require authentication:

```bash
# Via flag
zonnx download --model meta-llama/Llama-3-8B --output ./models --api-key YOUR_HF_TOKEN

# Via environment variable
export HF_API_KEY=YOUR_HF_TOKEN
zonnx download --model meta-llama/Llama-3-8B --output ./models
```

The `--api-key` flag takes precedence over the `HF_API_KEY` environment variable.
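
That precedence rule can be sketched in a few lines (the function name is hypothetical; zonnx resolves the token internally):

```python
import os

def resolve_hf_token(api_key_flag):
    """Return the HuggingFace token: the --api-key flag wins over HF_API_KEY."""
    if api_key_flag:
        return api_key_flag
    return os.environ.get("HF_API_KEY")

# Flag beats the environment variable when both are set.
os.environ["HF_API_KEY"] = "env-token"
print(resolve_hf_token("flag-token"))  # flag-token
print(resolve_hf_token(None))          # env-token
```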

After downloading, you should have at minimum:

```
models/
  model.onnx
  config.json       # optional but recommended for metadata
  tokenizer.json    # downloaded automatically if available
```

## Step 2: Convert to GGUF

Run the `convert` command with the appropriate `--arch` flag:

```bash
zonnx convert --arch gemma --output ./models/gemma-2b.gguf ./models/model.onnx
```

### The `--arch` Flag

The `--arch` flag selects the tensor name mapping and metadata mapping for the target architecture. If a `config.json` file exists alongside the ONNX file, zonnx reads it automatically and maps HuggingFace config fields to GGUF metadata keys.

If `--arch` is omitted, it defaults to `llama`.

### Convert Command Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--output` | `<input-dir>/<input-base>.gguf` | Output GGUF file path |
| `--arch` | `llama` | Model architecture for metadata and tensor mapping |
| `--format` | `onnx` | Input format: `onnx` or `safetensors` |
| `--quantize` | (none) | Quantize weights: `q4_0` or `q8_0` |

## Step 3: Quantize During Conversion (Optional)

To reduce model size, quantize weights during conversion:

```bash
# 4-bit quantization (smallest, some quality loss)
zonnx convert --arch gemma --quantize q4_0 --output ./models/gemma-2b-q4.gguf ./models/model.onnx

# 8-bit quantization (good balance of size and quality)
zonnx convert --arch gemma --quantize q8_0 --output ./models/gemma-2b-q8.gguf ./models/model.onnx
```

| Quantization | Bits per Weight | Use Case |
|-------------|-----------------|----------|
| (none) | 32 | Full precision, largest file |
| `q8_0` | 8 | Good quality, ~4x smaller than F32 |
| `q4_0` | 4 | Smallest, ~8x smaller than F32 |
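
The size savings come from block quantization: weights are grouped into blocks that share one float scale, and each weight is stored as a small integer. The sketch below illustrates the general idea behind an 8-bit blockwise scheme and its round-trip error; it shows the format family, not zonnx's exact implementation:

```python
def quantize_q8_block(weights):
    """Quantize a block of float weights to int8 with one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]  # each value fits in int8
    return scale, q

def dequantize_q8_block(scale, q):
    """Reconstruct approximate float weights from the scale and int8 values."""
    return [scale * v for v in q]

block = [0.5, -1.0, 0.25, 0.75] * 8          # 32 weights, one block
scale, q = quantize_q8_block(block)
restored = dequantize_q8_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(max_err)  # small reconstruction error, at most about scale / 2
```

Storing one byte per weight instead of four gives the ~4x reduction in the table; a 4-bit scheme halves that again at the cost of coarser rounding.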

## Step 4: Verify the Output

Inspect the generated GGUF file to confirm metadata and tensors:

```bash
zonnx inspect --pretty ./models/gemma-2b.gguf
```

## Supported Architectures

| Architecture | `--arch` value | Tensor Mapping | Notes |
|-------------|----------------|----------------|-------|
| Llama | `llama` (default) | Decoder layers (`model.layers.N.*`) | Llama 3, Code Llama |
| Gemma | `gemma` | Decoder layers (`model.layers.N.*`) | Gemma, Gemma 2, Gemma 3 |
| BERT | `bert` | Encoder layers (`bert.encoder.layer.N.*`) | Classification, embeddings |
| RoBERTa | `roberta` | Encoder layers (`roberta.encoder.layer.N.*`) | Same structure as BERT |
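
Both mapping styles key off the layer index embedded in the source tensor name. A minimal sketch of that extraction, using the prefix patterns from the table (the regex and helper are illustrative, not zonnx's actual code):

```python
import re

# Decoder (model.layers.N.*) and encoder ({bert,roberta}.encoder.layer.N.*) patterns.
LAYER_RE = re.compile(
    r"^(?:model\.layers|(?:bert|roberta)\.encoder\.layer)\.(\d+)\.(.+)$"
)

def layer_index(tensor_name):
    """Return (block_index, suffix) for layer tensors, or None otherwise."""
    m = LAYER_RE.match(tensor_name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

print(layer_index("model.layers.17.self_attn.q_proj.weight"))
print(layer_index("bert.encoder.layer.3.attention.self.query.weight"))
print(layer_index("embeddings.word_embeddings.weight"))  # None
```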

## Metadata Mapping

When a `config.json` file is present alongside the ONNX model, zonnx maps these HuggingFace fields to GGUF metadata:

| config.json field | GGUF key |
|-------------------|----------|
| `hidden_size` | `{arch}.embedding_length` |
| `num_hidden_layers` | `{arch}.block_count` |
| `num_attention_heads` | `{arch}.attention.head_count` |
| `num_key_value_heads` | `{arch}.attention.head_count_kv` |
| `intermediate_size` | `{arch}.feed_forward_length` |
| `vocab_size` | `{arch}.vocab_size` |
| `max_position_embeddings` | `{arch}.context_length` |
| `rms_norm_eps` | `{arch}.attention.layer_norm_rms_epsilon` |
| `rope_theta` | `{arch}.rope.freq_base` |
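
The table amounts to a key-rename pass over `config.json`, with `{arch}` substituted by the `--arch` value. A sketch of that mapping (illustrative only; zonnx's internal representation may differ):

```python
import json

# HuggingFace config field -> GGUF key template, as listed in the table above.
FIELD_MAP = {
    "hidden_size": "{arch}.embedding_length",
    "num_hidden_layers": "{arch}.block_count",
    "num_attention_heads": "{arch}.attention.head_count",
    "num_key_value_heads": "{arch}.attention.head_count_kv",
    "intermediate_size": "{arch}.feed_forward_length",
    "vocab_size": "{arch}.vocab_size",
    "max_position_embeddings": "{arch}.context_length",
    "rms_norm_eps": "{arch}.attention.layer_norm_rms_epsilon",
    "rope_theta": "{arch}.rope.freq_base",
}

def config_to_gguf_metadata(config, arch):
    """Map the config.json fields that are present to GGUF metadata keys."""
    return {
        FIELD_MAP[k].format(arch=arch): v
        for k, v in config.items()
        if k in FIELD_MAP
    }

config = json.loads('{"hidden_size": 2304, "num_hidden_layers": 26, "rope_theta": 10000.0}')
print(config_to_gguf_metadata(config, "gemma"))
# {'gemma.embedding_length': 2304, 'gemma.block_count': 26, 'gemma.rope.freq_base': 10000.0}
```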

## Using the GGUF File with Zerfoo

Once converted, load the model with zerfoo:

```bash
zerfoo run ./models/gemma-2b.gguf --prompt "Hello, world!"
```

Or serve it as an OpenAI-compatible API:

```bash
zerfoo serve ./models/gemma-2b.gguf
```

content/docs/zonnx/overview.md

Lines changed: 81 additions & 0 deletions
---
title: zonnx Overview
weight: 1
bookToc: true
---

# zonnx Overview

zonnx is a standalone command-line tool that converts machine learning models to [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format. It accepts ONNX and SafeTensors inputs and produces portable GGUF files compatible with both the zerfoo runtime and llama.cpp.

zonnx ships as a single static binary with no CGo dependency.

## Features

- **ONNX to GGUF conversion** -- convert decoder models (Llama, Gemma) from ONNX format
- **SafeTensors to GGUF conversion** -- convert encoder models (BERT, RoBERTa) from SafeTensors format
- **Conversion-time quantization** -- quantize weights to Q4_0 or Q8_0 during conversion
- **HuggingFace integration** -- download ONNX models and tokenizer files directly from the Hub
- **Model inspection** -- inspect ONNX and GGUF files for metadata, tensors, and structure
- **Architecture-aware mappings** -- tensor name and metadata mappings tuned per model family

## Installation

Requires Go 1.26 or later. Install with:

```bash
go install github.com/zerfoo/zonnx/cmd/zonnx@latest
```

Or build from source:

```bash
git clone https://github.com/zerfoo/zonnx.git
cd zonnx
go build -o zonnx ./cmd/zonnx
```

CGo is not required -- `CGO_ENABLED=0` works.

## Supported Architectures

| Architecture | `--arch` value | Input Formats | Notes |
|-------------|----------------|---------------|-------|
| Llama | `llama` (default) | ONNX | Llama 3, Code Llama |
| Gemma | `gemma` | ONNX | Gemma, Gemma 2, Gemma 3 |
| BERT | `bert` | ONNX, SafeTensors | Classification, embeddings |
| RoBERTa | `roberta` | ONNX, SafeTensors | Same layer structure as BERT |

Any architecture string can be passed via `--arch`. Generic metadata mapping applies to all architectures. Tensor name mapping currently covers Llama-style decoder models and BERT/RoBERTa encoder models.

## Basic Usage

```bash
# Download an ONNX model from HuggingFace
zonnx download --model google/gemma-2-2b-it --output ./models

# Convert ONNX to GGUF
zonnx convert --arch gemma --output ./models/model.gguf ./models/model.onnx

# Convert SafeTensors to GGUF
zonnx convert --format safetensors --arch bert --output ./models/model.gguf ./models/bert-dir/

# Convert with quantization
zonnx convert --quantize q4_0 --output ./models/model-q4.gguf ./models/model.onnx

# Inspect a model file
zonnx inspect --pretty ./models/model.onnx
```
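
A converted file can also be sanity-checked programmatically. Per the GGUF specification, every file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian uint32 format version. A minimal sketch of that check (independent of the `zonnx inspect` command):

```python
import struct

def read_gguf_header(data):
    """Parse the magic and version from the first 8 bytes of a GGUF file."""
    magic, version = struct.unpack("<4sI", data[:8])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version

# Simulated header bytes: magic "GGUF", version 3.
header = b"GGUF" + struct.pack("<I", 3)
print(read_gguf_header(header))  # 3
```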

## Commands

| Command | Description |
|---------|-------------|
| `convert` | Convert ONNX or SafeTensors models to GGUF |
| `download` | Download ONNX models and tokenizer files from HuggingFace Hub |
| `inspect` | Inspect ONNX or GGUF model files |

## Next Steps

- [ONNX to GGUF]({{< relref "onnx-to-gguf" >}}) -- step-by-step guide for converting ONNX models
- [SafeTensors to GGUF]({{< relref "safetensors-to-gguf" >}}) -- guide for converting SafeTensors models (BERT, RoBERTa)
content/docs/zonnx/safetensors-to-gguf.md

Lines changed: 175 additions & 0 deletions
---
title: SafeTensors to GGUF
weight: 3
bookToc: true
---

# SafeTensors to GGUF Conversion

This guide covers converting SafeTensors models (typically BERT and RoBERTa) to GGUF format using zonnx. SafeTensors is HuggingFace's preferred serialization format for model weights.

## Prerequisites

- zonnx installed (`go install github.com/zerfoo/zonnx/cmd/zonnx@latest`)
- A HuggingFace model directory containing `config.json` and `model.safetensors`

## Directory Structure

zonnx expects a directory as input for SafeTensors conversion. The directory must contain:

```
model-dir/
  config.json         # required -- model configuration
  model.safetensors   # required -- model weights
```

The `config.json` provides architecture metadata (hidden size, layer count, attention heads, etc.) that zonnx maps to GGUF metadata keys. The `model.safetensors` file contains the weight tensors.

## Step 1: Download a Model

Download a model from HuggingFace. For example, to get [FinBERT](https://huggingface.co/ProsusAI/finbert) for financial sentiment analysis:

```bash
# Create a directory for the model
mkdir -p ./models/finbert

# Download config.json and model.safetensors
# (use the HuggingFace CLI, git clone, or manual download)
huggingface-cli download ProsusAI/finbert \
  --include config.json model.safetensors \
  --local-dir ./models/finbert
```

Verify the directory contents:

```bash
ls ./models/finbert/
# config.json  model.safetensors
```

## Step 2: Convert to GGUF

Run the `convert` command with `--format safetensors` and the appropriate `--arch`:

```bash
zonnx convert \
  --format safetensors \
  --arch bert \
  --output ./models/finbert.gguf \
  ./models/finbert/
```

Note that the input argument is the **directory** path, not the `.safetensors` file path.

## config.json Fields and Metadata Mapping

zonnx reads `config.json` and maps fields to GGUF metadata. For BERT and RoBERTa models, the following fields are mapped:

### Standard Fields (All Architectures)

| config.json field | GGUF key |
|-------------------|----------|
| `hidden_size` | `{arch}.embedding_length` |
| `num_hidden_layers` | `{arch}.block_count` |
| `num_attention_heads` | `{arch}.attention.head_count` |
| `num_key_value_heads` | `{arch}.attention.head_count_kv` |
| `intermediate_size` | `{arch}.feed_forward_length` |
| `vocab_size` | `{arch}.vocab_size` |
| `max_position_embeddings` | `{arch}.context_length` |

### BERT/RoBERTa-Specific Fields

| config.json field | GGUF key |
|-------------------|----------|
| `layer_norm_eps` | `{arch}.attention.layer_norm_epsilon` |
| `num_labels` | `{arch}.num_labels` |
| (auto) | `{arch}.pooler_type` = `"cls"` |

If `num_labels` is not present in `config.json` but `id2label` is, zonnx derives the label count from the `id2label` mapping.
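
That fallback can be sketched as follows (the helper name is hypothetical; it encodes the documented precedence):

```python
def derive_num_labels(config):
    """Prefer num_labels; otherwise count id2label entries; else None."""
    if "num_labels" in config:
        return config["num_labels"]
    if "id2label" in config:
        return len(config["id2label"])
    return None

print(derive_num_labels({"id2label": {"0": "positive", "1": "negative", "2": "neutral"}}))  # 3
print(derive_num_labels({"num_labels": 2, "id2label": {"0": "a"}}))  # 2
```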

## Supported Data Types

zonnx handles these SafeTensors data types:

| SafeTensors dtype | GGUF dtype |
|-------------------|------------|
| `F32` | Float32 |
| `F16` | Float16 |
| `BF16` | BFloat16 |

Non-float tensors (e.g., `position_ids` with int64 dtype) are skipped automatically during conversion.
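
The dtype filter is easy to reproduce against the SafeTensors header itself: a file starts with a little-endian uint64 header length, followed by that many bytes of JSON describing each tensor's `dtype`, `shape`, and `data_offsets`. A sketch of listing the tensors zonnx would keep (illustrative, not zonnx's code):

```python
import json
import struct

FLOAT_DTYPES = {"F32", "F16", "BF16"}

def convertible_tensors(raw):
    """Return names of float tensors from a SafeTensors file's leading bytes."""
    (header_len,) = struct.unpack("<Q", raw[:8])
    header = json.loads(raw[8 : 8 + header_len])
    return [
        name
        for name, entry in header.items()
        if name != "__metadata__" and entry["dtype"] in FLOAT_DTYPES
    ]

# Simulated header with one float tensor and one int64 tensor (dummy offsets).
table = {
    "embeddings.word_embeddings.weight": {"dtype": "F32", "shape": [30522, 768], "data_offsets": [0, 0]},
    "embeddings.position_ids": {"dtype": "I64", "shape": [1, 512], "data_offsets": [0, 0]},
}
blob = json.dumps(table).encode()
raw = struct.pack("<Q", len(blob)) + blob
print(convertible_tensors(raw))  # ['embeddings.word_embeddings.weight']
```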

## End-to-End Example: FinBERT

This example converts [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert), a BERT model fine-tuned for financial sentiment classification.

### 1. Download the Model

```bash
mkdir -p ./models/finbert
huggingface-cli download ProsusAI/finbert \
  --include config.json model.safetensors \
  --local-dir ./models/finbert
```

### 2. Inspect config.json

A typical FinBERT `config.json` contains:

```json
{
  "architectures": ["BertForSequenceClassification"],
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "vocab_size": 30522,
  "max_position_embeddings": 512,
  "layer_norm_eps": 1e-12,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  }
}
```

zonnx maps these fields to GGUF metadata keys like `bert.embedding_length`, `bert.block_count`, `bert.attention.head_count`, etc. The three labels in `id2label` produce `bert.num_labels = 3`.

### 3. Convert

```bash
zonnx convert \
  --format safetensors \
  --arch bert \
  --output ./models/finbert.gguf \
  ./models/finbert/
```

### 4. Verify

```bash
zonnx inspect --pretty ./models/finbert.gguf
```

The output should show GGUF metadata with `bert.*` keys and all encoder layer tensors.

### 5. Use with Zerfoo

```bash
zerfoo predict ./models/finbert.gguf --input "Revenue exceeded expectations this quarter"
```

## RoBERTa Models

RoBERTa conversion follows the same steps. Use `--arch roberta`:

```bash
zonnx convert \
  --format safetensors \
  --arch roberta \
  --output ./models/roberta.gguf \
  ./models/roberta-dir/
```

RoBERTa uses the same encoder layer structure as BERT. The `--arch` flag ensures tensor names are mapped using the `roberta.encoder.layer.N.*` prefix pattern.
