
Commit 836e117

unamedkr and claude committed
Add HN + Reddit post drafts for v0.2.0 launch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4c7b7df commit 836e117

1 file changed: docs/pr/2026-04-04-show-hn.md

Lines changed: 73 additions & 0 deletions
# Show HN Post (2026-04-04)

## Title

Show HN: quant.cpp – 33K LOC C inference engine with lossless KV cache compression

## URL

https://github.com/quantumaikr/quant.cpp

## Text

quant.cpp is a minimal LLM inference engine in pure C. No external libraries beyond pthreads, 33K lines of code, builds with any C11 compiler.
The main feature is KV cache compression. On WikiText-2 with SmolLM2 1.7B:
- 4-bit keys: PPL +0.0% (lossless), 4x less KV memory
- 3-bit keys with delta compression: PPL +1.3%, 4.3x less KV memory
For comparison, llama.cpp's Q4_0 KV gives +10.6% PPL on the same model.
Delta compression stores the difference between adjacent keys instead of the keys themselves. Adjacent keys in a transformer differ by only ~30% of their absolute range, so the deltas quantize with less error at the same bit budget. It's like video P-frames, but for the KV cache. Full explanation in our blog post: https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md
On an 8GB laptop with Llama 8B Q4, this extends usable context from ~16K to ~61K tokens.
What quant.cpp is NOT: it's not a llama.cpp replacement. llama.cpp is a full-featured framework. quant.cpp is a small engine you can read in an afternoon and embed in your own project. Different use case.
Honest disclaimers: delta mode slows generation from 25 tok/s to 7 tok/s, because each key must be reconstructed sequentially from the previous one. We don't support batched inference or speculative decoding. We initially claimed 1-bit worked, found a bug where FP32 data was being read instead of the quantized data, and retracted the claim.

---

# Reddit r/LocalLLaMA Post (2026-04-04)

## Title

quant.cpp — 33K LOC C inference engine, 4-bit KV is lossless on WikiText-2, delta compression pushes 3-bit to +1.3% PPL

## Body

Hey r/LocalLLaMA,
Some of you saw this project a few days ago as TurboQuant.cpp. We rebranded to quant.cpp and did a lot of cleanup based on your feedback. Here's what changed:
**What we fixed:**
- Ran WikiText-2 (standard benchmark) instead of our own test text
- Corrected overstated PPL numbers: the -3.2% we claimed earlier was on a non-standard dataset. WikiText-2 shows +1.3% for delta+3bit. 4-bit is genuinely +0.0%.
- Fixed a broken test that our rebrand script accidentally corrupted
- Removed "zero dependencies" claim (we need pthreads)
- Published speed tradeoffs: delta mode is 7 tok/s vs 25 tok/s for plain 4-bit
**WikiText-2 results (SmolLM2 1.7B):**
| Config | PPL | vs FP32 |
|--------|-----|---------|
| FP32 baseline | 14.63 | -- |
| 4-bit K + Q4 V (3.8x) | 14.57 | -0.4% |
| delta + 3-bit K + Q4 V (4.3x) | 14.82 | +1.3% |
| llama.cpp Q4_0 KV | 16.18 | +10.6% |
**What quant.cpp actually is:**
A 33K LOC pure C inference engine. Not a llama.cpp fork or wrapper. Loads GGUF models. The selling point isn't speed — it's that you can read the whole codebase in an afternoon and embed it in your own C project.
**Blog post** with the full story (including the FP32 fallback bug we found and disclosed): https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md
**Pre-built binaries** now available: https://github.com/quantumaikr/quant.cpp/releases/tag/v0.2.0
```
./quant model.gguf -p "hello" -k uniform_4b -v q4
```
Repo: https://github.com/quantumaikr/quant.cpp
We have 5 good-first-issues open if anyone wants to contribute. Feedback welcome — the last round of comments made the project significantly more honest.

0 commit comments
