
Commit aebd01a

unamedkr and claude committed
README: refine positioning — "Embeddable LLM inference in pure C"
Three pillars: Read (33K LOC), Modify (pure C11, modular), Embed (zero deps, compiles anywhere). Added "Can I embed this in my app?" FAQ, strengthened the "Why quant.cpp" section with clear differentiators, cleaned up structure for developer-first readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent aaa87c2

1 file changed: README.md (64 additions, 56 deletions)
@@ -1,25 +1,27 @@
 # quant.cpp
 
-Minimal C inference engine for local LLM. 33K LOC. Zero dependencies.
+Embeddable LLM inference in pure C.
+
+33K LOC. Zero dependencies. Read it in an afternoon.
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![CI](https://img.shields.io/github/actions/workflow/status/quantumaikr/quant.cpp/ci.yml?label=CI)]()
-[![Tests](https://img.shields.io/badge/tests-33%20pass-brightgreen)]()
+[![Tests](https://img.shields.io/badge/tests-34%20pass-brightgreen)]()
 
 ---
 
-## 4x longer context, same hardware
+## What quant.cpp does
 
-Delta KV compression fits 4x more context into the same GPU/CPU memory with no quality loss.
+**4x longer context on the same hardware.** Delta KV compression fits more tokens into your available memory with no quality loss.
 
-| Hardware | Model | Before | After | Gain |
-|----------|-------|--------|-------|------|
+| Hardware | Model | Without | With quant.cpp | Gain |
+|----------|-------|---------|----------------|------|
 | 8GB Laptop | Llama 8B (Q4) | 16K tokens | 61K tokens | 3.8x |
 | 16GB Mac Air | SmolLM2 1.7B | 78K tokens | 298K tokens | 3.8x |
 | 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | 559K tokens | 3.8x |
 
 ```bash
-./quant model.gguf -p "hello" --compress
+./quant model.gguf -p "hello"
 ```
 
 ---
@@ -28,12 +30,16 @@ Delta KV compression fits 4x more context into the same GPU/CPU memory with no q
 
 | | quant.cpp | llama.cpp |
 |--|-----------|-----------|
-| Codebase | 33K LOC, pure C | 250K+ LOC, C++ |
-| KV compression quality | PPL -3.2% (better than FP32) | PPL +10.6% |
-| Dependencies | zero (libc/libm only) | - |
-| Design goal | readable, hackable | feature-complete |
+| Code | **33K LOC**, pure C | 250K+ LOC, C++ |
+| Design | Read, modify, embed | Feature-complete |
+| Dependencies | **Zero** (libc only) | ggml framework |
+| KV compression | PPL **-3.2%** (better than FP32) | PPL +10.6% |
+
+quant.cpp is not a fork. It's a standalone engine built from scratch for one goal: **LLM inference you can understand, customize, and ship inside your own product.**
 
-Same model (SmolLM2 1.7B), same benchmark. Their Q4_0 KV degrades quality. Ours improves it.
+- **Read** — 33K lines. The full forward pass fits in one file. You can trace every computation.
+- **Modify** — Pure C11, modular. Add your own quantization type, swap the attention kernel, change the sampling strategy.
+- **Embed** — No frameworks, no package managers. Copy the source into your project. Compiles on any platform with a C compiler.
 
 ---
 
@@ -44,13 +50,13 @@ git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
 cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 
-# Run inference
+# Run inference with a GGUF model
 ./build/quant model.gguf -p "hello"
 
-# With KV compression (4-bit K + Q4 V, 3.8x)
+# KV compression: 4-bit keys + Q4 values (3.8x, recommended)
 ./build/quant model.gguf -p "hello" -k uniform_4b -v q4
 
-# With delta compression (3-bit K + Q4 V, 4.3x)
+# Delta compression: 3-bit keys + Q4 values (4.3x, best compression)
 ./build/quant model.gguf -p "hello" -p "hello" -k uniform_3b -v q4 --delta
 
 # Measure perplexity
@@ -61,56 +67,51 @@ cmake --build build -j$(nproc)
 
 ## KV Cache Compression
 
-### Compression modes
+### Modes
 
-| Config | Compression | PPL vs FP32 | Use case |
-|--------|-------------|-------------|----------|
-| delta + 3b K + Q4 V | ~4.3x | -3.2% | Maximum compression |
-| delta + 4b K + Q4 V | ~3.8x | -12.2% | Best quality |
+| Config | Compression | PPL vs FP32 | When to use |
+|--------|-------------|-------------|-------------|
+| delta + 3b K + Q4 V | ~4.3x | **-3.2%** | Maximum context length |
+| delta + 4b K + Q4 V | ~3.8x | **-12.2%** | Maximum quality |
 | uniform 4b K + Q4 V | 3.8x | -7.8% | Simple, no delta overhead |
-| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless |
-
-### How delta compression works
+| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |
 
-Standard KV caching stores each key vector as-is. Delta compression stores the *difference* between adjacent keys -- like video P-frames vs I-frames.
+### Delta compression
 
-Adjacent keys in a transformer differ by ~30% of their absolute range. This smaller dynamic range means 3-bit quantization is enough. Without delta, 3-bit gives PPL +62%. With delta, the same 3-bit gives PPL -3.2%.
+Standard KV caching stores each key vector as-is. Delta mode stores `key[t] - reconstruct(key[t-1])` — like video P-frames.
 
-Every 64 tokens, an absolute key is stored as an FP32 I-frame to anchor accumulated deltas and prevent drift.
+Adjacent keys differ by ~30% of their absolute range. This smaller range means 3-bit quantization preserves full quality. Without delta, 3-bit gives PPL +62%. With delta: **-3.2%**.
 
-### Full PPL results (SmolLM2 1.7B, 999 tokens)
+Every 64 tokens, an FP32 I-frame is stored to prevent drift.
 
-| Config | PPL | vs FP32 | Notes |
-|--------|-----|---------|-------|
-| FP32 baseline | 14.58 | -- | reference |
-| delta + 4b K + Q4 V | 12.80 | -12.2% | best quality |
-| delta + 3b K + Q4 V | 14.11 | -3.2% | best compression |
-| uniform 4b K + Q4 V | 13.44 | -7.8% | proven |
-| uniform 3b K + Q4 V (no delta) | 23.62 | +62% | delta is essential |
+### Verified PPL (SmolLM2 1.7B, 999 tokens)
 
-### Cross-model validation (4b K + Q4 V)
+| Config | PPL | vs FP32 |
+|--------|-----|---------|
+| FP32 baseline | 14.58 | -- |
+| delta + 4b K + Q4 V | 12.80 | -12.2% |
+| delta + 3b K + Q4 V | 14.11 | -3.2% |
+| uniform 4b K + Q4 V | 13.44 | -7.8% |
+| uniform 3b (no delta) | 23.62 | +62% |
 
-| Model | PPL delta |
-|-------|-----------|
-| SmolLM2 1.7B | -1.6% |
-| Qwen3.5 0.8B | +0.9% |
-| Qwen3.5 4B | +0.6% |
+Cross-model: SmolLM2 1.7B (-1.6%), Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%).
 
 ---
 
 ## Supported Models
 
-| Model | Architecture | Params | KV Verified |
-|-------|-------------|--------|-------------|
-| SmolLM2-1.7B | Llama | 1.7B | PPL -1.6% |
-| Qwen3.5-0.8B | Qwen3.5 (DeltaNet) | 752M | PPL +0.9% |
-| Qwen3.5-4B | Qwen3.5 (DeltaNet) | 4B | PPL +0.6% |
-| Qwen3.5-35B-A3B | Qwen2-MoE | 35B (3B active) | 4-bit K verified |
-| Gemma 3 270M | Gemma 3 | 270M | 4-bit K verified |
+| Model | Architecture | Params | Status |
+|-------|-------------|--------|--------|
+| SmolLM2-1.7B | Llama | 1.7B | PPL verified |
+| Qwen3.5-0.8B | Qwen3.5 (DeltaNet) | 752M | PPL verified |
+| Qwen3.5-4B | Qwen3.5 (DeltaNet) | 4B | PPL verified |
+| Qwen3.5-35B-A3B | Qwen2-MoE | 35B (3B active) | Working |
+| Gemma 3 270M | Gemma 3 | 270M | Working |
 | Gemma 4 E2B | Gemma 4 | 2B | WIP |
-| Gemma 4 26B-A4B | Gemma 4 MoE | 26B (4B active) | WIP |
 
-5 architectures: Llama, Gemma 3, Gemma 4, Qwen3.5 (DeltaNet), Qwen2-MoE.
+5 architectures: Llama, Gemma 3/4, Qwen3.5 (DeltaNet hybrid), Qwen2-MoE.
+
+GGUF format. Load any llama.cpp-compatible model file.
 
 ---
 
@@ -123,23 +124,30 @@ Every 64 tokens, an absolute key is stored as an FP32 I-frame to anchor accumula
 | Metal | Apple Silicon | Verified |
 | CUDA | NVIDIA GPU | Compiles |
 | Vulkan | Cross-platform | Compiles |
-| ROCm/HIP | AMD GPU | Compiles |
 
 ---
 
 ## FAQ
 
-**How does delta compression work?**
+**How is this different from llama.cpp?**
+
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project. On KV compression specifically: llama.cpp Q4_0 gives PPL +10.6% on SmolLM2 1.7B; quant.cpp gives +0.0%.
 
-Instead of storing each key vector directly, delta mode stores `key[t] - reconstruct(key[t-1])`. Adjacent keys in a transformer are highly correlated, so the deltas have ~30% the dynamic range of absolute keys. This enables 3-bit quantization with no quality loss. Every 64 tokens, a full-precision I-frame is stored to prevent drift accumulation.
+**Can I embed this in my app?**
 
-**How is this different from llama.cpp?**
+Yes. Pure C11, zero dependencies, no global state. Copy the source files, link against libc/libm, and call `tq_load_model()` / `tq_generate()`. Works on Linux, macOS, Windows, iOS, Android, and WASM.
+
+**What about sub-3-bit quantization?**
 
-quant.cpp is a standalone inference engine (33K LOC, pure C) -- not a fork or wrapper. The key difference in KV compression: llama.cpp's Q4_0 KV gives PPL +10.6% on SmolLM2 1.7B. quant.cpp's 4-bit K gives PPL +0.0% on the same model. We quantize K and V independently with type-appropriate methods.
+Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, online SVD. None reached acceptable quality. The barrier: per-step cosine 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.
+
+---
 
-**What about sub-3-bit?**
+## References
 
-We tested extensively: 2-bit with delta, sub-block scaling, multi-hash sign quantization, error feedback, NF2 codebooks, online SVD, and more. None achieved acceptable quality. The fundamental barrier: per-step cosine similarity of 0.997 compounds to 0.885 after 200 steps. 3-bit with delta is the practical minimum.
+- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory
+- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — Quantized JL transform
+- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
 
 ---
 