Commit d76b81f

docs(zonnx): add zonnx overview, ONNX, and SafeTensors conversion guides
1 parent cb4d0b0 commit d76b81f

3 files changed

Lines changed: 388 additions & 0 deletions

File tree

content/docs/zonnx/onnx-to-gguf.md

Lines changed: 132 additions & 0 deletions
---
title: ONNX to GGUF
weight: 2
bookToc: true
---

# ONNX to GGUF Conversion

This guide walks through converting an ONNX model to GGUF format using zonnx. The resulting GGUF file can be loaded by zerfoo or llama.cpp.

## Prerequisites

- zonnx installed (`go install github.com/zerfoo/zonnx/cmd/zonnx@latest`)
- An ONNX model file, either local or on HuggingFace

## Step 1: Download a Model from HuggingFace

Use the `download` command to fetch an ONNX model and its tokenizer files:

```bash
zonnx download --model google/gemma-2-2b-it --output ./models
```

For gated models that require authentication:

```bash
# Via flag
zonnx download --model meta-llama/Llama-3-8B --output ./models --api-key YOUR_HF_TOKEN

# Via environment variable
export HF_API_KEY=YOUR_HF_TOKEN
zonnx download --model meta-llama/Llama-3-8B --output ./models
```

The `--api-key` flag takes precedence over the `HF_API_KEY` environment variable.
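
That precedence rule can be sketched in a few lines (the function name is hypothetical; zonnx resolves the token internally):

```python
import os

def resolve_hf_token(api_key_flag):
    """Return the HuggingFace token: the --api-key flag wins over HF_API_KEY."""
    if api_key_flag:
        return api_key_flag
    return os.environ.get("HF_API_KEY")

# Flag beats the environment variable when both are set.
os.environ["HF_API_KEY"] = "env-token"
print(resolve_hf_token("flag-token"))  # flag-token
print(resolve_hf_token(None))          # env-token
```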

After downloading, you should have at minimum:

```
models/
  model.onnx
  config.json       # optional but recommended for metadata
  tokenizer.json    # downloaded automatically if available
```

## Step 2: Convert to GGUF

Run the `convert` command with the appropriate `--arch` flag:

```bash
zonnx convert --arch gemma --output ./models/gemma-2b.gguf ./models/model.onnx
```

### The `--arch` Flag

The `--arch` flag selects the tensor name mapping and metadata mapping for the target architecture. If a `config.json` file exists alongside the ONNX file, zonnx reads it automatically and maps HuggingFace config fields to GGUF metadata keys.

If `--arch` is omitted, it defaults to `llama`.

### Convert Command Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--output` | `<input-dir>/<input-base>.gguf` | Output GGUF file path |
| `--arch` | `llama` | Model architecture for metadata and tensor mapping |
| `--format` | `onnx` | Input format: `onnx` or `safetensors` |
| `--quantize` | (none) | Quantize weights: `q4_0` or `q8_0` |

## Step 3: Quantize During Conversion (Optional)

To reduce model size, quantize weights during conversion:

```bash
# 4-bit quantization (smallest, some quality loss)
zonnx convert --arch gemma --quantize q4_0 --output ./models/gemma-2b-q4.gguf ./models/model.onnx

# 8-bit quantization (good balance of size and quality)
zonnx convert --arch gemma --quantize q8_0 --output ./models/gemma-2b-q8.gguf ./models/model.onnx
```

| Quantization | Bits per Weight | Use Case |
|-------------|-----------------|----------|
| (none) | 32 | Full precision, largest file |
| `q8_0` | 8 | Good quality, ~4x smaller than F32 |
| `q4_0` | 4 | Smallest, ~8x smaller than F32 |
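
The size savings come from block quantization: weights are grouped into blocks that share one float scale, and each weight is stored as a small integer. The sketch below illustrates the general idea behind an 8-bit blockwise scheme and its round-trip error; it shows the format family, not zonnx's exact implementation:

```python
def quantize_q8_block(weights):
    """Quantize a block of float weights to int8 with one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]  # each value fits in int8
    return scale, q

def dequantize_q8_block(scale, q):
    """Reconstruct approximate float weights from the scale and int8 values."""
    return [scale * v for v in q]

block = [0.5, -1.0, 0.25, 0.75] * 8          # 32 weights, one block
scale, q = quantize_q8_block(block)
restored = dequantize_q8_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(max_err)  # small reconstruction error, at most about scale / 2
```

Storing one byte per weight instead of four gives the ~4x reduction in the table; a 4-bit scheme halves that again at the cost of coarser rounding.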

## Step 4: Verify the Output

Inspect the generated GGUF file to confirm metadata and tensors:

```bash
zonnx inspect --pretty ./models/gemma-2b.gguf
```

## Supported Architectures

| Architecture | `--arch` value | Tensor Mapping | Notes |
|-------------|----------------|----------------|-------|
| Llama | `llama` (default) | Decoder layers (`model.layers.N.*`) | Llama 3, Code Llama |
| Gemma | `gemma` | Decoder layers (`model.layers.N.*`) | Gemma, Gemma 2, Gemma 3 |
| BERT | `bert` | Encoder layers (`bert.encoder.layer.N.*`) | Classification, embeddings |
| RoBERTa | `roberta` | Encoder layers (`roberta.encoder.layer.N.*`) | Same structure as BERT |
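
Both mapping styles key off the layer index embedded in the source tensor name. A minimal sketch of that extraction, using the prefix patterns from the table (the regex and helper are illustrative, not zonnx's actual code):

```python
import re

# Decoder (model.layers.N.*) and encoder ({bert,roberta}.encoder.layer.N.*) patterns.
LAYER_RE = re.compile(
    r"^(?:model\.layers|(?:bert|roberta)\.encoder\.layer)\.(\d+)\.(.+)$"
)

def layer_index(tensor_name):
    """Return (block_index, suffix) for layer tensors, or None otherwise."""
    m = LAYER_RE.match(tensor_name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

print(layer_index("model.layers.17.self_attn.q_proj.weight"))
print(layer_index("bert.encoder.layer.3.attention.self.query.weight"))
print(layer_index("embeddings.word_embeddings.weight"))  # None
```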

## Metadata Mapping

When a `config.json` file is present alongside the ONNX model, zonnx maps these HuggingFace fields to GGUF metadata:

| config.json field | GGUF key |
|-------------------|----------|
| `hidden_size` | `{arch}.embedding_length` |
| `num_hidden_layers` | `{arch}.block_count` |
| `num_attention_heads` | `{arch}.attention.head_count` |
| `num_key_value_heads` | `{arch}.attention.head_count_kv` |
| `intermediate_size` | `{arch}.feed_forward_length` |
| `vocab_size` | `{arch}.vocab_size` |
| `max_position_embeddings` | `{arch}.context_length` |
| `rms_norm_eps` | `{arch}.attention.layer_norm_rms_epsilon` |
| `rope_theta` | `{arch}.rope.freq_base` |
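
The table amounts to a key-rename pass over `config.json`, with `{arch}` substituted by the `--arch` value. A sketch of that mapping (illustrative only; zonnx's internal representation may differ):

```python
import json

# HuggingFace config field -> GGUF key template, as listed in the table above.
FIELD_MAP = {
    "hidden_size": "{arch}.embedding_length",
    "num_hidden_layers": "{arch}.block_count",
    "num_attention_heads": "{arch}.attention.head_count",
    "num_key_value_heads": "{arch}.attention.head_count_kv",
    "intermediate_size": "{arch}.feed_forward_length",
    "vocab_size": "{arch}.vocab_size",
    "max_position_embeddings": "{arch}.context_length",
    "rms_norm_eps": "{arch}.attention.layer_norm_rms_epsilon",
    "rope_theta": "{arch}.rope.freq_base",
}

def config_to_gguf_metadata(config, arch):
    """Map the config.json fields that are present to GGUF metadata keys."""
    return {
        FIELD_MAP[k].format(arch=arch): v
        for k, v in config.items()
        if k in FIELD_MAP
    }

config = json.loads('{"hidden_size": 2304, "num_hidden_layers": 26, "rope_theta": 10000.0}')
print(config_to_gguf_metadata(config, "gemma"))
# {'gemma.embedding_length': 2304, 'gemma.block_count': 26, 'gemma.rope.freq_base': 10000.0}
```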

## Using the GGUF File with Zerfoo

Once converted, load the model with zerfoo:

```bash
zerfoo run ./models/gemma-2b.gguf --prompt "Hello, world!"
```

Or serve it as an OpenAI-compatible API:

```bash
zerfoo serve ./models/gemma-2b.gguf
```

content/docs/zonnx/overview.md

Lines changed: 81 additions & 0 deletions
---
title: zonnx Overview
weight: 1
bookToc: true
---

# zonnx Overview

zonnx is a standalone command-line tool that converts machine learning models to [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format. It accepts ONNX and SafeTensors inputs and produces portable GGUF files compatible with both the zerfoo runtime and llama.cpp.

zonnx ships as a single static binary with no CGo dependency.

## Features

- **ONNX to GGUF conversion** -- convert decoder models (Llama, Gemma) from ONNX format
- **SafeTensors to GGUF conversion** -- convert encoder models (BERT, RoBERTa) from SafeTensors format
- **Conversion-time quantization** -- quantize weights to Q4_0 or Q8_0 during conversion
- **HuggingFace integration** -- download ONNX models and tokenizer files directly from the Hub
- **Model inspection** -- inspect ONNX and GGUF files for metadata, tensors, and structure
- **Architecture-aware mappings** -- tensor name and metadata mappings tuned per model family

## Installation

Requires Go 1.26 or later. Install with:

```bash
go install github.com/zerfoo/zonnx/cmd/zonnx@latest
```

Or build from source:

```bash
git clone https://github.com/zerfoo/zonnx.git
cd zonnx
go build -o zonnx ./cmd/zonnx
```

CGo is not required -- `CGO_ENABLED=0` works.

## Supported Architectures

| Architecture | `--arch` value | Input Formats | Notes |
|-------------|----------------|---------------|-------|
| Llama | `llama` (default) | ONNX | Llama 3, Code Llama |
| Gemma | `gemma` | ONNX | Gemma, Gemma 2, Gemma 3 |
| BERT | `bert` | ONNX, SafeTensors | Classification, embeddings |
| RoBERTa | `roberta` | ONNX, SafeTensors | Same layer structure as BERT |

Any architecture string can be passed via `--arch`. Generic metadata mapping applies to all architectures. Tensor name mapping currently covers Llama-style decoder models and BERT/RoBERTa encoder models.

## Basic Usage

```bash
# Download an ONNX model from HuggingFace
zonnx download --model google/gemma-2-2b-it --output ./models

# Convert ONNX to GGUF
zonnx convert --arch gemma --output ./models/model.gguf ./models/model.onnx

# Convert SafeTensors to GGUF
zonnx convert --format safetensors --arch bert --output ./models/model.gguf ./models/bert-dir/

# Convert with quantization
zonnx convert --quantize q4_0 --output ./models/model-q4.gguf ./models/model.onnx

# Inspect a model file
zonnx inspect --pretty ./models/model.onnx
```
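
A converted file can also be sanity-checked programmatically. Per the GGUF specification, every file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian uint32 format version. A minimal sketch of that check (independent of the `zonnx inspect` command):

```python
import struct

def read_gguf_header(data):
    """Parse the magic and version from the first 8 bytes of a GGUF file."""
    magic, version = struct.unpack("<4sI", data[:8])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version

# Simulated header bytes: magic "GGUF", version 3.
header = b"GGUF" + struct.pack("<I", 3)
print(read_gguf_header(header))  # 3
```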

## Commands

| Command | Description |
|---------|-------------|
| `convert` | Convert ONNX or SafeTensors models to GGUF |
| `download` | Download ONNX models and tokenizer files from HuggingFace Hub |
| `inspect` | Inspect ONNX or GGUF model files |

## Next Steps

- [ONNX to GGUF]({{< relref "onnx-to-gguf" >}}) -- step-by-step guide for converting ONNX models
- [SafeTensors to GGUF]({{< relref "safetensors-to-gguf" >}}) -- guide for converting SafeTensors models (BERT, RoBERTa)
content/docs/zonnx/safetensors-to-gguf.md

Lines changed: 175 additions & 0 deletions
---
title: SafeTensors to GGUF
weight: 3
bookToc: true
---

# SafeTensors to GGUF Conversion

This guide covers converting SafeTensors models (typically BERT and RoBERTa) to GGUF format using zonnx. SafeTensors is HuggingFace's preferred serialization format for model weights.

## Prerequisites

- zonnx installed (`go install github.com/zerfoo/zonnx/cmd/zonnx@latest`)
- A HuggingFace model directory containing `config.json` and `model.safetensors`

## Directory Structure

zonnx expects a directory as input for SafeTensors conversion. The directory must contain:

```
model-dir/
  config.json         # required -- model configuration
  model.safetensors   # required -- model weights
```

The `config.json` provides architecture metadata (hidden size, layer count, attention heads, etc.) that zonnx maps to GGUF metadata keys. The `model.safetensors` file contains the weight tensors.

## Step 1: Download a Model

Download a model from HuggingFace. For example, to get [FinBERT](https://huggingface.co/ProsusAI/finbert) for financial sentiment analysis:

```bash
# Create a directory for the model
mkdir -p ./models/finbert

# Download config.json and model.safetensors
# (use the HuggingFace CLI, git clone, or manual download)
huggingface-cli download ProsusAI/finbert \
  --include config.json model.safetensors \
  --local-dir ./models/finbert
```

Verify the directory contents:

```bash
ls ./models/finbert/
# config.json  model.safetensors
```

## Step 2: Convert to GGUF

Run the `convert` command with `--format safetensors` and the appropriate `--arch`:

```bash
zonnx convert \
  --format safetensors \
  --arch bert \
  --output ./models/finbert.gguf \
  ./models/finbert/
```

Note that the input argument is the **directory** path, not the `.safetensors` file path.

## config.json Fields and Metadata Mapping

zonnx reads `config.json` and maps fields to GGUF metadata. For BERT and RoBERTa models, the following fields are mapped:

### Standard Fields (All Architectures)

| config.json field | GGUF key |
|-------------------|----------|
| `hidden_size` | `{arch}.embedding_length` |
| `num_hidden_layers` | `{arch}.block_count` |
| `num_attention_heads` | `{arch}.attention.head_count` |
| `num_key_value_heads` | `{arch}.attention.head_count_kv` |
| `intermediate_size` | `{arch}.feed_forward_length` |
| `vocab_size` | `{arch}.vocab_size` |
| `max_position_embeddings` | `{arch}.context_length` |

### BERT/RoBERTa-Specific Fields

| config.json field | GGUF key |
|-------------------|----------|
| `layer_norm_eps` | `{arch}.attention.layer_norm_epsilon` |
| `num_labels` | `{arch}.num_labels` |
| (auto) | `{arch}.pooler_type` = `"cls"` |

If `num_labels` is not present in `config.json` but `id2label` is, zonnx derives the label count from the `id2label` mapping.
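
That fallback can be sketched as follows (the helper name is hypothetical; it encodes the documented precedence):

```python
def derive_num_labels(config):
    """Prefer num_labels; otherwise count id2label entries; else None."""
    if "num_labels" in config:
        return config["num_labels"]
    if "id2label" in config:
        return len(config["id2label"])
    return None

print(derive_num_labels({"id2label": {"0": "positive", "1": "negative", "2": "neutral"}}))  # 3
print(derive_num_labels({"num_labels": 2, "id2label": {"0": "a"}}))  # 2
```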

## Supported Data Types

zonnx handles these SafeTensors data types:

| SafeTensors dtype | GGUF dtype |
|-------------------|------------|
| `F32` | Float32 |
| `F16` | Float16 |
| `BF16` | BFloat16 |

Non-float tensors (e.g., `position_ids` with int64 dtype) are skipped automatically during conversion.
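
The dtype filter is easy to reproduce against the SafeTensors header itself: a file starts with a little-endian uint64 header length, followed by that many bytes of JSON describing each tensor's `dtype`, `shape`, and `data_offsets`. A sketch of listing the tensors zonnx would keep (illustrative, not zonnx's code):

```python
import json
import struct

FLOAT_DTYPES = {"F32", "F16", "BF16"}

def convertible_tensors(raw):
    """Return names of float tensors from a SafeTensors file's leading bytes."""
    (header_len,) = struct.unpack("<Q", raw[:8])
    header = json.loads(raw[8 : 8 + header_len])
    return [
        name
        for name, entry in header.items()
        if name != "__metadata__" and entry["dtype"] in FLOAT_DTYPES
    ]

# Simulated header with one float tensor and one int64 tensor (dummy offsets).
table = {
    "embeddings.word_embeddings.weight": {"dtype": "F32", "shape": [30522, 768], "data_offsets": [0, 0]},
    "embeddings.position_ids": {"dtype": "I64", "shape": [1, 512], "data_offsets": [0, 0]},
}
blob = json.dumps(table).encode()
raw = struct.pack("<Q", len(blob)) + blob
print(convertible_tensors(raw))  # ['embeddings.word_embeddings.weight']
```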

## End-to-End Example: FinBERT

This example converts [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert), a BERT model fine-tuned for financial sentiment classification.

### 1. Download the Model

```bash
mkdir -p ./models/finbert
huggingface-cli download ProsusAI/finbert \
  --include config.json model.safetensors \
  --local-dir ./models/finbert
```

### 2. Inspect config.json

A typical FinBERT `config.json` contains:

```json
{
  "architectures": ["BertForSequenceClassification"],
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "vocab_size": 30522,
  "max_position_embeddings": 512,
  "layer_norm_eps": 1e-12,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  }
}
```

zonnx maps these fields to GGUF metadata keys like `bert.embedding_length`, `bert.block_count`, `bert.attention.head_count`, etc. The three labels in `id2label` produce `bert.num_labels = 3`.

### 3. Convert

```bash
zonnx convert \
  --format safetensors \
  --arch bert \
  --output ./models/finbert.gguf \
  ./models/finbert/
```

### 4. Verify

```bash
zonnx inspect --pretty ./models/finbert.gguf
```

The output should show GGUF metadata with `bert.*` keys and all encoder layer tensors.

### 5. Use with Zerfoo

```bash
zerfoo predict ./models/finbert.gguf --input "Revenue exceeded expectations this quarter"
```

## RoBERTa Models

RoBERTa conversion follows the same steps. Use `--arch roberta`:

```bash
zonnx convert \
  --format safetensors \
  --arch roberta \
  --output ./models/roberta.gguf \
  ./models/roberta-dir/
```

RoBERTa uses the same encoder layer structure as BERT. The `--arch` flag ensures tensor names are mapped using the `roberta.encoder.layer.N.*` prefix pattern.
