# Show HN Post (2026-04-04)

## Title

Show HN: quant.cpp – 33K LOC C inference engine with lossless KV cache compression

## URL

https://github.com/quantumaikr/quant.cpp

## Text

quant.cpp is a minimal LLM inference engine in pure C. No dependencies beyond pthreads, 33K lines of code, builds with any C11 compiler.

The main feature is KV cache compression. On WikiText-2 with SmolLM2 1.7B:

- 4-bit keys: PPL +0.0% (lossless), 4x less KV memory
- 3-bit keys with delta compression: PPL +1.3%, 4.3x less KV memory

For comparison, llama.cpp's Q4_0 KV gives +10.6% PPL on the same model.

Delta compression stores the difference between adjacent keys instead of the keys themselves. Adjacent keys in a transformer differ by ~30% of their absolute range, so the deltas compress better at the same bit budget. Like video P-frames for KV cache. Full explanation in our blog post: https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md

On an 8GB laptop with Llama 8B Q4, this extends usable context from ~16K to ~61K tokens.

What quant.cpp is NOT: it's not a llama.cpp replacement. llama.cpp is a full-featured framework. quant.cpp is a small engine you can read in an afternoon and embed in your own project. Different use case.

Honest disclaimers: delta mode slows generation from 25 to 7 tok/s (sequential reconstruction). We don't support batched inference or speculative decoding. We initially claimed 1-bit worked, found a bug where FP32 was being read instead of quantized data, and retracted the claim.

---

# Reddit r/LocalLLaMA Post (2026-04-04)

## Title

quant.cpp — 33K LOC C inference engine, 4-bit KV is lossless on WikiText-2, delta compression pushes 3-bit to +1.3% PPL

## Body

Hey r/LocalLLaMA,

Some of you saw this project a few days ago as TurboQuant.cpp. We rebranded to quant.cpp and did a lot of cleanup based on your feedback. Here's what changed:

**What we fixed:**
- Ran WikiText-2 (standard benchmark) instead of our own test text
- Corrected overstated PPL numbers: the -3.2% we claimed earlier was on a non-standard dataset. WikiText-2 shows +1.3% for delta+3bit. 4-bit is genuinely lossless (14.57 vs the 14.63 FP32 baseline, i.e. at or slightly below it).
- Fixed a broken test that our rebrand script accidentally corrupted
- Removed "zero dependencies" claim (we need pthreads)
- Published speed tradeoffs: delta mode is 7 tok/s vs 25 tok/s for plain 4-bit

**WikiText-2 results (SmolLM2 1.7B):**

| Config | PPL | vs FP32 |
|--------|-----|---------|
| FP32 baseline | 14.63 | -- |
| 4-bit K + Q4 V (3.8x) | 14.57 | -0.4% |
| delta + 3-bit K + Q4 V (4.3x) | 14.82 | +1.3% |
| llama.cpp Q4_0 KV | 16.18 | +10.6% |

**What quant.cpp actually is:**
A 33K LOC pure C inference engine. Not a llama.cpp fork or wrapper. Loads GGUF models. The selling point isn't speed — it's that you can read the whole codebase in an afternoon and embed it in your own C project.

**Blog post** with the full story (including the FP32 fallback bug we found and disclosed): https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md

**Pre-built binaries** now available: https://github.com/quantumaikr/quant.cpp/releases/tag/v0.2.0

```
./quant model.gguf -p "hello" -k uniform_4b -v q4
```

Repo: https://github.com/quantumaikr/quant.cpp

We have 5 good-first-issues open if anyone wants to contribute. Feedback welcome — the last round of comments made the project significantly more honest.