# Show HN Post (2026-04-04)

## Title

Show HN: quant.cpp – 33K LOC C inference engine with lossless KV cache compression

## URL

https://github.com/quantumaikr/quant.cpp

## Text

quant.cpp is a minimal LLM inference engine in pure C. No dependencies beyond pthreads, 33K lines of code, builds with any C11 compiler.

The main feature is KV cache compression. On WikiText-2 with SmolLM2 1.7B:

- 4-bit keys: PPL +0.0% (lossless), 4x less KV memory
- 3-bit keys with delta compression: PPL +1.3%, 4.3x less KV memory

For comparison, llama.cpp's Q4_0 KV gives +10.6% PPL on the same model.

Delta compression stores the difference between adjacent keys instead of the keys themselves. Adjacent keys in a transformer differ by ~30% of their absolute range, so the deltas compress better at the same bit budget. Like video P-frames for KV cache. Full explanation in our blog post: https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md

On an 8GB laptop with Llama 8B Q4, this extends usable context from ~16K to ~61K tokens.

What quant.cpp is NOT: it's not a llama.cpp replacement. llama.cpp is a full-featured framework. quant.cpp is a small engine you can read in an afternoon and embed in your own project. Different use case.

Honest disclaimers: delta mode slows generation from 25 to 7 tok/s (sequential reconstruction). We don't support batched inference or speculative decoding. We initially claimed 1-bit worked, found a bug where FP32 was being read instead of quantized data, and retracted the claim.

---

# Reddit r/LocalLLaMA Post (2026-04-04)

## Title

quant.cpp — 33K LOC C inference engine, 4-bit KV is lossless on WikiText-2, delta compression pushes 3-bit to +1.3% PPL

## Body

Hey r/LocalLLaMA,

Some of you saw this project a few days ago as TurboQuant.cpp. We rebranded to quant.cpp and did a lot of cleanup based on your feedback. Here's what changed:

**What we fixed:**
- Ran WikiText-2 (standard benchmark) instead of our own test text
- Corrected overstated PPL numbers: the -3.2% we claimed earlier was on a non-standard dataset. WikiText-2 shows +1.3% for delta+3bit. 4-bit is genuinely lossless (14.57 vs the 14.63 FP32 baseline, i.e. at or slightly below it).
- Fixed a broken test that our rebrand script accidentally corrupted
- Removed "zero dependencies" claim (we need pthreads)
- Published speed tradeoffs: delta mode is 7 tok/s vs 25 tok/s for plain 4-bit

**WikiText-2 results (SmolLM2 1.7B):**

| Config | PPL | vs FP32 |
|--------|-----|---------|
| FP32 baseline | 14.63 | -- |
| 4-bit K + Q4 V (3.8x) | 14.57 | -0.4% |
| delta + 3-bit K + Q4 V (4.3x) | 14.82 | +1.3% |
| llama.cpp Q4_0 KV | 16.18 | +10.6% |

**What quant.cpp actually is:**
A 33K LOC pure C inference engine. Not a llama.cpp fork or wrapper. Loads GGUF models. The selling point isn't speed — it's that you can read the whole codebase in an afternoon and embed it in your own C project.

**Blog post** with the full story (including the FP32 fallback bug we found and disclosed): https://github.com/quantumaikr/quant.cpp/blob/main/docs/blog/breaking-3bit-barrier.md

**Pre-built binaries** now available: https://github.com/quantumaikr/quant.cpp/releases/tag/v0.2.0

```
./quant model.gguf -p "hello" -k uniform_4b -v q4
```

Repo: https://github.com/quantumaikr/quant.cpp

We have 5 good-first-issues open if anyone wants to contribute. Feedback welcome — the last round of comments made the project significantly more honest.