README: refine positioning — "Embeddable LLM inference in pure C"

Three pillars: Read (33K LOC), Modify (pure C11, modular), Embed (zero deps, compiles anywhere).

Added a "Can I embed this in my app?" FAQ, strengthened the "Why quant.cpp" section with clear differentiators, and cleaned up the structure for developer-first readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---
quant.cpp is not a fork. It's a standalone engine built from scratch for one goal: **LLM inference you can understand, customize, and ship inside your own product.**

Same model (SmolLM2 1.7B), same benchmark. Their Q4_0 KV degrades quality. Ours improves it.

- **Read** — 33K lines. The full forward pass fits in one file. You can trace every computation.
- **Modify** — Pure C11, modular. Add your own quantization type, swap the attention kernel, change the sampling strategy.
- **Embed** — No frameworks, no package managers. Copy the source into your project. Compiles on any platform with a C compiler.
---
| Config | KV compression | PPL vs FP32 | Use case |
|--------|----------------|-------------|----------|
| delta + 3b K + Q4 V | ~4.3x | **-3.2%** | Maximum context length |
| delta + 4b K + Q4 V | ~3.8x | **-12.2%** | Maximum quality |
| uniform 4b K + Q4 V | 3.8x | -7.8% | Simple, no delta overhead |
| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |

### Delta compression

Standard KV caching stores each key vector as-is. Delta mode stores the *difference* between adjacent keys, `key[t] - reconstruct(key[t-1])` — like video P-frames vs I-frames.

Adjacent keys in a transformer differ by ~30% of their absolute range. This smaller dynamic range means 3-bit quantization is enough to preserve quality: without delta, 3-bit gives PPL +62%; with delta, the same 3 bits give **-3.2%**.

Every 64 tokens, an absolute key is stored as an FP32 I-frame to anchor accumulated deltas and prevent drift.
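
To make the mechanics concrete, here is a minimal, self-contained sketch of the scheme just described. It is illustrative only, not quant.cpp's code: the symmetric 3-bit quantizer, the per-vector scale, and all names are assumptions; only "quantize the delta against the reconstructed previous key" and "FP32 I-frame every 64 tokens" come from this section. It compiles with `cc -std=c11 -O2 delta_sketch.c -lm`.

```c
/* Illustrative sketch of delta KV compression -- not quant.cpp's code. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define HEAD_DIM   64
#define IFRAME_GAP 64   /* absolute FP32 key every 64 tokens */

/* Quantize one vector to 3-bit signed levels (-4..3) with one scale
 * (an assumed scheme; quant.cpp's real quantizer may differ). */
static void quant3(const float *x, int8_t *q, float *scale, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    *scale = amax / 4.0f + 1e-12f;
    for (int i = 0; i < n; i++) {
        long v = lroundf(x[i] / *scale);
        q[i] = (int8_t)(v < -4 ? -4 : v > 3 ? 3 : v);
    }
}

/* Push key[t]. `prev` holds the reconstruction of key[t-1] and is
 * updated in place to the reconstruction of key[t]; deltas are taken
 * against the reconstruction so encoder and decoder stay in sync. */
static void push_key(const float *key, float *prev, int t) {
    if (t % IFRAME_GAP == 0) {                /* I-frame: store as-is */
        for (int i = 0; i < HEAD_DIM; i++) prev[i] = key[i];
        return;
    }
    float delta[HEAD_DIM], scale;             /* P-frame: 3-bit delta */
    int8_t q[HEAD_DIM];
    for (int i = 0; i < HEAD_DIM; i++) delta[i] = key[i] - prev[i];
    quant3(delta, q, &scale, HEAD_DIM);
    for (int i = 0; i < HEAD_DIM; i++) prev[i] += q[i] * scale;
}

int main(void) {
    float key[HEAD_DIM] = {0}, prev[HEAD_DIM] = {0};
    double err2 = 0.0, norm2 = 0.0;
    srand(1);
    for (int t = 0; t < 256; t++) {
        /* random walk: adjacent keys are highly correlated, so deltas
         * have a much smaller dynamic range than absolute keys */
        for (int i = 0; i < HEAD_DIM; i++)
            key[i] += 0.3f * ((float)rand() / RAND_MAX - 0.5f);
        push_key(key, prev, t);
        for (int i = 0; i < HEAD_DIM; i++) {
            err2  += (double)(prev[i] - key[i]) * (prev[i] - key[i]);
            norm2 += (double)key[i] * key[i];
        }
    }
    printf("relative reconstruction error: %.4f\n", sqrt(err2 / norm2));
    return 0;
}
```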
### Full PPL results (SmolLM2 1.7B, 999 tokens)

| Config | PPL | vs FP32 | Notes |
|--------|-----|---------|-------|
| FP32 baseline | 14.58 | -- | reference |
| delta + 4b K + Q4 V | 12.80 | -12.2% | best quality |
| delta + 3b K + Q4 V | 14.11 | -3.2% | best compression |
| uniform 4b K + Q4 V | 13.44 | -7.8% | proven |
| uniform 3b K + Q4 V (no delta) | 23.62 | +62% | delta is essential |
GGUF format. Load any llama.cpp-compatible model file.
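
For orientation, GGUF files begin with a small fixed little-endian header: the magic bytes `GGUF`, a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key-value count (per the public GGUF spec). A minimal header check, independent of quant.cpp's actual loader and assuming a little-endian host:

```c
/* Sketch: inspect a GGUF header per the public spec -- not quant.cpp's
 * loader. Assumes a little-endian host for brevity. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char magic[4];
    uint32_t version = 0;
    uint64_t n_tensors = 0, n_kv = 0;
    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    if (fread(&version, sizeof version, 1, f) != 1 ||
        fread(&n_tensors, sizeof n_tensors, 1, f) != 1 ||
        fread(&n_kv, sizeof n_kv, 1, f) != 1) {
        fprintf(stderr, "truncated header\n");
        fclose(f);
        return 1;
    }
    printf("GGUF v%u: %llu tensors, %llu metadata keys\n",
           (unsigned)version,
           (unsigned long long)n_tensors,
           (unsigned long long)n_kv);
    fclose(f);
    return 0;
}
```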
---
| Backend | Target | Status |
|---------|--------|--------|
| Metal | Apple Silicon | Verified |
| CUDA | NVIDIA GPU | Compiles |
| Vulkan | Cross-platform | Compiles |
| ROCm/HIP | AMD GPU | Compiles |
---
## FAQ
**How is this different from llama.cpp?**

llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal standalone engine (33K LOC, pure C), not a fork or wrapper, that you can read, modify, and embed in your own C/C++ project. On KV compression specifically: llama.cpp's Q4_0 KV gives PPL +10.6% on SmolLM2 1.7B; quant.cpp's 4-bit K gives +0.0% on the same model. K and V are quantized independently, with type-appropriate methods for each.

**Can I embed this in my app?**

Yes. Pure C11, zero dependencies, no global state. Copy the source files, link against libc/libm, and call `tq_load_model()` / `tq_generate()`. Works on Linux, macOS, Windows, iOS, Android, and WASM.
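
For illustration, here is a sketch of the embedding flow. Only the names `tq_load_model()` and `tq_generate()` appear in this README; the header name, the signatures, the model path, and the cleanup call below are assumptions, so check the actual headers before relying on them:

```c
/* Embedding sketch. Only tq_load_model()/tq_generate() are named in
 * this README; the header, signatures, model path, and tq_free_model()
 * are illustrative assumptions. */
#include <stdio.h>
#include "quant.h"                 /* hypothetical header name */

int main(void) {
    /* assumed signature: path in, opaque model handle out */
    tq_model *m = tq_load_model("smollm2-1.7b-q4.gguf");
    if (!m) { fprintf(stderr, "failed to load model\n"); return 1; }

    /* assumed signature: prompt in, completion into a caller buffer */
    char out[4096];
    tq_generate(m, "Explain delta KV compression in one sentence.",
                out, sizeof out);
    puts(out);

    tq_free_model(m);              /* assumed cleanup call */
    return 0;
}
```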
**What about sub-3-bit quantization?**

We tested extensively: 2-bit with delta, sub-block scaling, multi-hash sign quantization, error feedback, NF2 codebooks, and online SVD. None achieved acceptable quality. The fundamental barrier: a per-step cosine similarity of 0.997 compounds to 0.885 after 200 steps. 3-bit with delta is the practical minimum.

---
## References
- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory