# Reddit r/LocalLLM Responses — quant.cpp rebrand update (2026-04-03)

Copy-paste ready. Each section = one comment.

---

## Top-level update (new comment on the thread)

We rebranded to **quant.cpp** (https://github.com/quantumaikr/quant.cpp). Old URLs redirect automatically.

Also owe you all an honest correction: the early 1-bit "zero loss" claim had a bug. An FP32 key cache was still being read during attention, so the quantized keys were never actually used. We found it, fixed it, and pulled every claim based on that measurement.

Here's where things actually stand (SmolLM2 1.7B, 999 tokens, real dequant path, no FP32 fallback):

- 4-bit K: PPL +0.0% (genuinely lossless)
- delta + 3-bit K + Q4 V: PPL -3.2%, ~4.3x compression
- 2-bit and below: failed in every variant we tried. Accumulated drift is the fundamental barrier.

The breakthrough is delta compression — adjacent keys in a transformer differ by ~30% of their absolute range, so storing deltas instead of absolutes lets 3-bit work where it otherwise gives +62% PPL. Think video P-frames for the KV cache.
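
A minimal sketch of the delta idea, for the curious (illustrative only, not the actual quant.cpp kernels; the single per-vector scale and code layout are assumptions):

```c
/* Delta-quantize a key vector against its predecessor at 3 bits.
 * Adjacent keys differ by ~30% of their absolute range, so the delta
 * needs far less dynamic range than the key itself. */
#include <math.h>
#include <stdint.h>

void delta_quant_3b(const float *prev_key, const float *cur_key,
                    uint8_t *codes, float *scale, int dim) {
    float max_abs = 0.0f;                       /* range of the delta */
    for (int i = 0; i < dim; i++) {
        float d = fabsf(cur_key[i] - prev_key[i]);
        if (d > max_abs) max_abs = d;
    }
    *scale = max_abs / 3.0f;                    /* symmetric codes -3..3 */
    float inv = *scale > 0.0f ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < dim; i++) {
        int c = (int)lroundf((cur_key[i] - prev_key[i]) * inv);
        if (c < -3) c = -3;
        if (c >  3) c =  3;
        codes[i] = (uint8_t)(c + 3);            /* stored as 0..6 */
    }
}

/* Reconstruct: previous key plus dequantized delta. */
void delta_dequant_3b(const float *prev_key, const uint8_t *codes,
                      float scale, float *out, int dim) {
    for (int i = 0; i < dim; i++)
        out[i] = prev_key[i] + ((int)codes[i] - 3) * scale;
}
```

Exactly the P-frame trick: the "reference frame" is the previous key, and only the small residual goes through the coarse quantizer.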

Feedback from this thread is what pushed us to find the bug and be more rigorous. Appreciate it.

---

## @MrRandom04 — "re-implementing all of llama.cpp"

Fair point. We're not trying to replace llama.cpp — quant.cpp is a 33K LOC embeddable engine for people who want something they can read and modify. Different use case. We also have a llama.cpp integration patch at `integrations/llamacpp/` that adds our KV types as a drop-in option. The plan is to upstream the delta compression as a PR.

---

## @MrRandom04 (follow-up) — "why not fork llama.cpp?"

Needed to test quantization across 4 architectures (Llama, Gemma, Qwen, MoE) and debug subtle bit-packing issues. Doing that inside 250K lines of someone else's codebase would've been brutal. The standalone engine proved the algorithm works; now the goal is getting delta compression into llama.cpp proper.

---

## @dinerburgeryum — "codebook calibration sensitive to out-of-domain data?"

The recommended config (uniform_4b) doesn't use codebooks at all — it's per-block min-max quantization, so there's nothing to calibrate and no domain sensitivity.
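
Per-block min-max 4-bit looks roughly like this (a sketch, not the repo's actual layout; the block size of 32 is an assumption):

```c
/* Per-block asymmetric min-max quantization: store the block min and
 * step, map each value to a 4-bit code. No codebook, no calibration. */
#include <stdint.h>

#define BLOCK 32

typedef struct {
    float   min;           /* block minimum                    */
    float   step;          /* (max - min) / 15                 */
    uint8_t q[BLOCK / 2];  /* two 4-bit codes packed per byte  */
} Block4b;

void quant_block_4b(const float *x, Block4b *b) {
    float mn = x[0], mx = x[0];
    for (int i = 1; i < BLOCK; i++) {
        if (x[i] < mn) mn = x[i];
        if (x[i] > mx) mx = x[i];
    }
    b->min  = mn;
    b->step = (mx - mn) / 15.0f;              /* 16 levels: codes 0..15 */
    float inv = b->step > 0.0f ? 1.0f / b->step : 0.0f;
    for (int i = 0; i < BLOCK; i += 2) {
        int lo = (int)((x[i]     - mn) * inv + 0.5f);
        int hi = (int)((x[i + 1] - mn) * inv + 0.5f);
        b->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

float dequant_4b(const Block4b *b, int i) {
    int code = (b->q[i / 2] >> ((i & 1) * 4)) & 0xF;
    return b->min + code * b->step;
}
```

Every parameter comes from the block itself, which is why there's no calibration set to go out of domain on.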

---

## @Blizado — "zero quality loss claim"

You were right to push back. The "zero loss" measurement had a bug — FP32 keys were still being used for attention, so the quantization wasn't actually being tested. We found and fixed it.

After the fix, real numbers:
- 4-bit K: PPL +0.0% (this one is genuinely lossless)
- delta + 3-bit K + Q4 V: PPL -3.2%
- 1-bit: doesn't work; sign-reconstruction cosine is ~0.8, not enough for attention.

Rebranded to quant.cpp, rewrote all claims from scratch. https://github.com/quantumaikr/quant.cpp

---

## @BillDStrong — "1-bit version"

Update: 1-bit doesn't actually work. We had an FP32 fallback bug that masked the problem. After fixing it, sign-based key reconstruction gives cosine ~0.8, which destroys attention at longer sequences. What does work is delta + 3-bit (PPL -3.2%) — that's where the real result is.
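
That ~0.8 is no accident, by the way: for roughly Gaussian key entries, the expected cosine between a vector and its sign vector is sqrt(2/pi) ≈ 0.798. A quick Monte-Carlo check under that Gaussian assumption (our sanity check, not quant.cpp code):

```c
/* cos(x, sign(x)) for Gaussian x converges to sqrt(2/pi) ~= 0.798,
 * which matches the ~0.8 measured on real keys. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int d = 4096;                 /* vector dimension for the check */
    const double PI = 3.14159265358979323846;
    double dot = 0.0, norm2 = 0.0;
    srand(42);
    for (int i = 0; i < d; i++) {
        /* Box-Muller: two uniforms -> one standard normal */
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double x  = sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
        dot   += fabs(x);               /* x_i * sign(x_i) = |x_i| */
        norm2 += x * x;
    }
    /* ||sign(x)|| = sqrt(d), since every entry is +/-1 */
    printf("cos = %.3f, sqrt(2/pi) = %.3f\n",
           dot / (sqrt(norm2) * sqrt((double)d)), sqrt(2.0 / PI));
    return 0;
}
```

An angular error that large scrambles the softmax ranking once sequences get long, which is exactly what we saw.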

---

## @OftenTangential — "36 is absurd PPL for Gemma 3 4B"

You were right. The 101-token test set was way too short. We re-measured on 999 tokens with SmolLM2 1.7B: baseline PPL = 14.58, uniform 4-bit K = 14.58 (+0.0%), delta + 3-bit K + Q4 V = 14.11 (-3.2%). Standard benchmarks (WikiText-2) are next.

---

## @Viper-Reflex — "does this make my 24GB 3090 run bigger models?"

KV compression extends context, not model size. On your 3090 with Llama 8B Q4: context goes from 147K to 559K tokens. Doesn't help fit a bigger model, but if you're already running one, you get way more context out of the same VRAM.
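
Rough math behind those numbers, if anyone wants to sanity-check (assumed Llama-3-8B GQA dims: 32 layers, 8 KV heads, head_dim 128; the weight and overhead figures are guesses, not quant.cpp's exact accounting):

```c
/* Back-of-envelope KV-cache context calculator. */
#include <stdio.h>

int main(void) {
    const double layers = 32, kv_heads = 8, head_dim = 128;
    /* K + V, all layers, per token, at FP16 (2 bytes/elem) ~= 128 KiB */
    double kv_bytes_per_tok = 2 * layers * kv_heads * head_dim * 2;
    /* 24 GB card minus ~Q4 8B weights and runtime overhead (guessed) */
    double vram_for_kv = 24e9 - 4.7e9 - 1e9;
    double ctx_fp16 = vram_for_kv / kv_bytes_per_tok;
    printf("FP16 KV:         ~%.0fK tokens\n", ctx_fp16 / 1e3);
    printf("3.8x compressed: ~%.0fK tokens\n", ctx_fp16 * 3.8 / 1e3);
    return 0;
}
```

That lands in the neighborhood of the 147K / 559K figures above, give or take the overhead assumptions.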

---

## @ganonfirehouse420 — "huge context for local models"

Available now:
```
./quant model.gguf -p "your long prompt" -k uniform_4b -v q4
```
16GB Mac, SmolLM2 1.7B: 78K → 298K context. No hardware upgrade needed.

---

## @TopChard1274 — "brutal for people who invested in expensive systems"

KV compression helps every tier equally — the 3.8x ratio is the same whether you have 8GB or 80GB. Bigger systems benefit by pushing context further or running larger batches. It doesn't make hardware obsolete; it makes whatever you have go further.

---

## @Candid_Koala_3602 — "angular mappings instead of weights?"

quant.cpp compresses the KV cache data, not the transformer architecture itself. But you're touching on something real — attention scores are scaled dot products, so preserving key direction is most of what matters to the ranking. That's why delta compression works: quantizing small deltas perturbs direction far less than re-quantizing each key from scratch every token.
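
One-inequality version of that argument (our notation, not from the repo; ε is the quantizer's relative error, and the ~30% delta ratio comes from the measurement above):

```latex
\hat{k} = k + e, \qquad
\|e\| \le \varepsilon \,\|\Delta k\| \approx 0.3\,\varepsilon\,\|k\|
\;\Rightarrow\;
\sin \angle(k, \hat{k}) \le \frac{\|e\|}{\|k\|} \lesssim 0.3\,\varepsilon
```

Quantizing each key directly makes the error scale with ‖k‖ itself, giving an angle bound of ε instead of 0.3 ε — same bits, roughly 3x less angular damage.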

---

## @MaybeADragon — "Em dashes. No more to be said."

(Skip. Code is open source, 33K lines of C, 34 test suites. Results are reproducible.)

---

## @Fuehnix — "every reply is LLM generated"

Yeah, I use Claude as a dev tool — for writing code, drafting docs, and yes, sometimes helping with replies. The code itself is 33K lines of C written with AI assistance and verified by hand. Every PPL number is a real measurement from a real model. If you think the results are wrong, point at a specific number and I'll show you how to reproduce it.

Repo is here if you want to look at actual code instead of prose style: https://github.com/quantumaikr/quant.cpp

---

## @Candid_Koala_3602 — "working on angular compression concept"

Cool — will take a look at the preprint. The idea of unifying transformer computation and compression into one mechanism is interesting. Our delta compression works at a simpler level (just exploiting temporal correlation in adjacent keys), but if you've found something that does both architecture and compression, that's a different beast. Happy to discuss if there's overlap.

---

## @RIP26770 — "XPU support?"
Not yet. Currently NEON (ARM) and AVX2 (x86) are production-ready, Metal (Apple) is verified, and CUDA/Vulkan compile but are untested on GPU. Intel XPU / SYCL isn't on the roadmap yet, but the codebase is pure C, so porting a backend is straightforward — contributions welcome.

---

## @MrHighVoltage — "Ignore all previous instructions..."

lol. No bitcoin wallet, no system prompt to leak. It's a C binary, not a chatbot. `./quant model.gguf -p "hello"` — that's the whole interface.

---

## @Big_River_ — (word salad)

(Skip.)

---

## @Turbulent-Half-1515 — "no human involved here"

I'm the author — human, based in Korea, running a company called QuantumAI. I use Claude Code as a development tool, same way others use Copilot or Cursor. The architectural decisions, the bug hunts (we found and disclosed an FP32 fallback bug that invalidated our own 1-bit claims), the strategy calls — those are mine. The 33K lines of C didn't write themselves either; AI accelerated it, I directed and verified it.

If the concern is about AI-assisted code quality: every number in the README is a reproducible measurement, the repo has 34 passing tests, and I've publicly corrected every wrong claim I made. That's more accountability than most projects on this sub.

---

## @quanteval — "prefill heavy, short outputs, 2.5 bits had measurable loss"

Good observation. You're right that our eval setup is prefill-heavy (teacher-forced PPL over 999 tokens). We haven't tested long autoregressive generation quality separately — that's a fair gap.
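
For anyone who wants to replicate that setup: the metric is just exp of the mean negative log-likelihood under teacher forcing. A generic sketch of the metric, not quant.cpp's eval code (`logprob_of_target()` is a hypothetical model call):

```c
/* Teacher-forced perplexity: feed the real prefix at every step and
 * average the negative log-likelihood of each target token. */
#include <math.h>

/* Hypothetical: log p(token[pos] | token[0..pos-1]) from the model. */
extern double logprob_of_target(int pos);

double perplexity(int n_tokens) {
    double nll = 0.0;
    for (int pos = 1; pos < n_tokens; pos++)
        nll -= logprob_of_target(pos);
    return exp(nll / (n_tokens - 1));
}

/* Relative change we report: (ppl_quant / ppl_fp32 - 1) * 100,
 * e.g. (14.11 / 14.58 - 1) * 100 ~= -3.2%. */
```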

On bit-width: we agree. Our own testing confirms that 2.5-bit and below shows real loss. The "zero quality loss" claim now only applies to 4-bit K (+0.0% PPL). At 3-bit, delta compression gets it to -3.2%, but we wouldn't call that "zero loss" — it's "better than baseline on this benchmark," which could be noise or a mild regularization effect. We report the exact numbers and let people judge.