
Commit 1e0c3d7

unamedkr authored and claude committed
pillar1(R6): v0.19.0 release — BPE root-cause fix
Bumps Python bindings to 0.19.0. Adds a v0.19.0 entry to RELEASE_NOTES (headlined "BPE Stale-Entry ROOT-CAUSE Fix") consolidating R3 + R4 validation + R5 regression guard. README.md + README.ko.md v3.12 blurbs: external-facing announcement with the before/after tokenization evidence and impact summary. Regression: 15/15 test_models.sh + 4/4 test_tokenizer.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b9980d5 · commit 1e0c3d7

5 files changed: 95 additions & 2 deletions


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ Chunk-RAG가 잘못된 섹션을 검색하면, 모델은 **"모른다"고 하지
> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 results to a larger measurement grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit in *effective* working memory, which is 1/100 to 1/1000 of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.12 ★ BPE root-cause fix — Qwen3 family fully coherent (2026-04-20)**: One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) removes the true cause behind every "Qwen3 drift" symptom chased across 30+ rounds. The bug: in the BPE heap merge, a position that dies as the RIGHT neighbor of another merge only gets `tokens[P]=-1` without a `gen[]` bump, so stale heap entries resurrect the dead slot and corrupt the linked list → duplicated/lost token characters. Confirmed against the HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = **"Helll"** (5 chars: H,e,l,l,**l** — an 'l' in place of 'o'); HF correctly gives `[9707="Hello"]`. Downstream results: Qwen3.6-35B 40+ word prompts now produce **correct Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" answers "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: a Python HF reference diff found it in 3 rounds. Regression 15/15 + new tokenizer tests 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). The `quant.h` single header is unaffected — it uses the naive O(n²) BPE.
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20)**: CLI output now separates `TTFT | decode`, showing prefill latency and sustained decode individually instead of an "overall tok/s" dominated by cold-start on short queries. Measured warm on a 16 GB M1 Pro, CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (short chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT a warmup property — run the same command twice and the second run sees the warm numbers. 3-model matrix + per-use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
> **v3.10 Qwen3.x all formats win (2026-04-20)**: Two structural fixes completely eliminate long-form drift on Qwen3.5/3.6. **(1) NEOX-style partial RoPE** — Qwen3.5/3.6 use `LLAMA_ROPE_TYPE_IMROPE` (NEOX half-split `(q[i], q[i+rope_pairs])`). **(2) Arch-conditional QK-norm** — required for Gemma 4, disabled for the Qwen family. Qwen3.6-UD-IQ4_XS (--chat, T=0): "Once upon a time" n=60 → **a complete 60-token story**, `def fibonacci(n):` → **correct Python**, haiku/list/facts all work. 15/15 regression PASS. v0.17.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
> **v3.12 ★ BPE root-cause FIXED — Qwen3 family now fully coherent (2026-04-20):** One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) eliminates the BPE heap merge bug that caused every "Qwen3 drift" symptom we chased across 30+ rounds. The bug: positions that died as right-neighbor of a merge weren't having their `gen[]` bumped, so stale heap entries resurrected dead linked-list slots and produced corrupted tokens. Measured on Qwen3-0.6B with HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = literally **"Helll"** (extra 'l', missing 'o'); HF encoded as `[9707="Hello"]`. Fix makes our tokens match. Downstream: Qwen3.6-35B 40+ word prompts now produce **coherent Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" now gives "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: Python HF reference diff caught it in 3 rounds ([`tools/pillar1/`](tools/pillar1/)). Regression 15/15 + new tokenizer test 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). `quant.h` single-header unaffected (uses naive O(n²) BPE, correct by construction).
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20):** The CLI now prints `TTFT | decode` separately so individual devs see prefill latency vs sustained decode rate, not a blended "overall tok/s" that's dominated by cold-start on short queries. Measured warm on 16 GB M1 Pro CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (snappy chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — running twice makes the second call see the warm numbers. Full 3-model matrix + use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
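The TTFT/decode split described in the v3.11 blurb above reduces to two timestamps around the token loop. A minimal sketch, with a hypothetical helper that is not the engine's actual CLI code:

```python
import time

def timed_generate(token_iter):
    """Split one generation run into TTFT and sustained decode rate.

    token_iter is any iterator yielding tokens one at a time (a hypothetical
    stand-in for the engine's decode loop, not its real API).
    Returns (ttft_seconds, decode_tok_per_s).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token out: TTFT covers prefill
    end = time.perf_counter()
    ttft = first - start
    # Sustained decode deliberately excludes the first token, so it is not
    # polluted by prefill / cold-start cost.
    decode = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, decode
```

Separating the two is what makes "decode is a model property, TTFT is a warmup property" visible: a second warm run shrinks `ttft` while `decode` stays put.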
> **v3.10 Qwen3.x correctness — ALL FORMATS WIN (2026-04-20):** Two structural fixes close long-standing drift on Qwen3.5/3.6. **(1) NEOX-style partial RoPE** — `LLAMA_ROPE_TYPE_IMROPE` (`refs/llama.cpp/src/llama-model.cpp:9298`) requires NEOX half-split `(q[i], q[i+rope_pairs])`, not LLaMA-style pairs. **(2) Arch-conditional QK-norm** — Gemma 4 requires it, the Qwen family degrades with it. Measured on Qwen3.6-35B-A3B-UD-IQ4_XS (--chat, T=0): "Once upon a time" n=60 → **60-token "Jack...packed his bag with a map, a compass, and some food and water. He set off early in the"**; `def fibonacci(n):` → **`if n <= 0: return "Invalid input"`**; haiku "Silence speaks loud, Silence speaks in the quietest way."; list "1. Apple 2. Banana 3. Orange". 15/15 regression PASS (added long-form + code guards). v0.17.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "quantcpp"
-version = "0.18.0"
+version = "0.19.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }
```

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -21,7 +21,7 @@
     from importlib.metadata import version as _pkg_version
     __version__ = _pkg_version("quantcpp")
 except Exception:
-    __version__ = "0.18.0" # fallback for editable / source-tree imports
+    __version__ = "0.19.0" # fallback for editable / source-tree imports

 import os
 import sys
```

docs/RELEASE_NOTES.md

Lines changed: 89 additions & 0 deletions
@@ -6,6 +6,95 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---

## [v0.19.0] — 2026-04-20 ★ (BPE Stale-Entry ROOT-CAUSE Fix)

### Headline

**One-line fix to `src/engine/tq_tokenizer.c:1442` eliminates the structural tokenization bug that caused every "Qwen3 drift" symptom across 30+ rounds of kernel/MoE/DeltaNet investigation.** Pillar 1 of the Mission E roadmap, closed in 3 rounds via HF reference diff.

### The fix

```c
  if (top.gen != gen[top.pos]) continue;
+ if (tokens[top.pos] < 0) continue; // ★ missing dead-slot guard
  int ri = next[top.pos];
  if (ri >= n_tokens || tokens[ri] < 0) continue;
```
**Root cause**: In the heap-based BPE merge loop, a position `P` that dies as the RIGHT neighbor of some other merge has `tokens[P]` set to -1 but `gen[P]` is **not** bumped. Stale heap entries at position `P` pass the gen-based staleness check, then the code overwrites dead `tokens[P]` with a new merge result — resurrecting the slot, scrambling the linked list, and producing malformed token sequences.
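The failure mode can be reproduced outside the engine. Below is a Python toy of a gen-stamped, heap-based BPE merge — not the C implementation, and the merge table and input word are made up for illustration — showing how a stale entry resurrects a dead slot unless the guard is present:

```python
import heapq

# Toy merge table: rank = merge priority (lower merges first).
# Made up for illustration; not any real tokenizer's merges.
RANKS = {("h", "e"): 0, ("he", "l"): 1, ("l", "l"): 2,
         ("hel", "l"): 3, ("hell", "o"): 4}

def toy_bpe(word, ranks, guard=True):
    tokens = list(word)            # None marks a dead (merged-away) slot
    n = len(tokens)
    nxt = list(range(1, n + 1))    # singly linked list of live slots
    gen = [0] * n                  # generation stamp per slot
    heap = []

    def push(p):
        """Queue the merge for the pair starting at live slot p, if ranked."""
        r = nxt[p]
        if r < n and tokens[r] is not None:
            pair = (tokens[p], tokens[r])
            if pair in ranks:
                heapq.heappush(heap, (ranks[pair], p, gen[p], pair[0] + pair[1]))

    for p in range(n - 1):
        push(p)

    while heap:
        rank, p, g, merged = heapq.heappop(heap)
        if g != gen[p]:
            continue               # stale: slot p was re-merged since push
        if guard and tokens[p] is None:
            continue               # ★ the one-line dead-slot guard
        r = nxt[p]
        if r >= n or tokens[r] is None:
            continue               # right neighbor already consumed
        tokens[p] = merged         # without the guard, a dead p is resurrected here
        tokens[r] = None           # right neighbor dies...
        gen[p] += 1                # ...but gen[r] is NOT bumped: the bug surface
        nxt[p] = nxt[r]
        push(p)

    return [t for t in tokens if t is not None]

print(toy_bpe("hello", RANKS, guard=True))   # ['hello']
print(toy_bpe("hello", RANKS, guard=False))  # ['hel', 'll', 'o'] -> "helllo"
```

With the guard the toy encodes `hello` as one token. Without it, the stale `('l','l')` entry fires on a slot that already died inside `hel`, duplicating a character and blocking the later `hell` merge — the same class of corruption as the engine's `"Helll"`.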
### Symptom (same prompt, before/after)

| | Tokens for `"Hello"` | Decoded |
|---|---|---|
| HF reference | `[9707]` | "Hello" |
| Our engine BEFORE | `[32713, 654]` | **"Helll"** (extra 'l', lost 'o') |
| Our engine AFTER | `[9707]` | "Hello" ✓ |
### What this fixes (consolidated)

| Symptom (previously attributed cause) | Actual cause |
|---|---|
| Qwen3.5/3.6 "quicck bbrrown" char doubling | tokenizer |
| Qwen3.6-35B ≥40-word prompt → UTF-8 garbage | tokenizer |
| Phi-3.5 "What is 2+2?" → hallucinating "tti" | tokenizer |
| R32 Mission C "drift is Qwen-common architecture" | WRONG — was tokenizer |
| R46-50 Mission D "structural bug needs HF Python diff" | correct diagnosis; R3 finished it |
### Validation

- **Regression**: 15/15 `test_models.sh` + new `test_tokenizer.sh` 4/4
- **Real output**: Qwen3.6-35B on 40+ word prompts produces coherent Python code and full narrative text (previously garbage)
- **Phi-3.5**: "What is 2+2?" → "The sum of 2 and 2 is equal to four." (previously "I'm sorry but 'tti' doesn't appear to...")
### Methodology (the actual insight)

Pillar 1 R1-R3 built a Python + HF Qwen3-0.6B FP32 reference env (`tools/pillar1/`) specifically to enable per-layer diff debugging. Before the first layer diff was ever needed, simply comparing **tokenizer output** revealed the mismatch. The entire transformer investigation from R26-R50 had been working with corrupted input.

**Lesson**: When debugging LLM coherence, compare tokens against the HF reference FIRST. Don't "rule out" the tokenizer without actually running `AutoTokenizer.encode(prompt)` side-by-side.
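The side-by-side check above is mechanical once you have both token lists. A minimal sketch with a hypothetical helper (not the `tools/pillar1/` code); the HF side would come from `transformers.AutoTokenizer`, shown only in a comment since it needs a model download:

```python
def first_divergence(ours, ref):
    """Return the index of the first mismatch between two token-id lists,
    or None if they agree (including length). Hypothetical helper."""
    for i, (a, b) in enumerate(zip(ours, ref)):
        if a != b:
            return i
    if len(ours) != len(ref):
        return min(len(ours), len(ref))
    return None

# HF side (requires transformers + network; sketch only):
# from transformers import AutoTokenizer
# ref = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").encode("Hello")

# The "Hello" case from this release: engine said [32713, 654], HF said [9707].
print(first_divergence([32713, 654], [9707]))  # 0
```

A non-`None` result on any prompt is enough to stop blaming the transformer stack.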
### Files changed

- `src/engine/tq_tokenizer.c` — 1-line fix + comment
- `src/engine/tq_transformer.c` — env-gated per-layer dump (`TQ_DUMP_HIDDEN=dir`) retained as debugging infrastructure
- `scripts/test_models.sh` — Phi-3.5 expected "answer" → "sum" (Phi-3 now gives the actual factual math answer)
- `scripts/test_tokenizer.sh` — **NEW** 4-test regression guard
- `tools/pillar1/` — HF reference env + `hf_dump.py` dump tool
- `bench/results/2026-04-20_bpe_fix_proof.md` — full before/after proof
### Non-impact

- `quant.h` (single-header): uses naive O(n²) BPE merge, correct by construction. Embed/WASM users have NEVER hit this bug. Only the split-source engine needed the fix.
- No API change.
- No performance change (the stale check is O(1)).
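For contrast with the heap path, the `quant.h`-style naive merge rescans all adjacent pairs each round, so there is no heap and nothing to go stale. A sketch of the O(n²) idea, not the actual `quant.h` code, with a made-up merge table:

```python
def naive_bpe(word, ranks):
    """Rescan-everything BPE: O(n) scan per merge, O(n^2) total.
    No heap, no stale entries; correct by construction."""
    tokens = list(word)
    while True:
        best = None
        for i in range(len(tokens) - 1):
            r = ranks.get((tokens[i], tokens[i + 1]))
            if r is not None and (best is None or r < best[0]):
                best = (r, i)       # best-ranked adjacent pair this round
        if best is None:
            return tokens           # no mergeable pair left
        i = best[1]
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Made-up merge table for illustration (not a real tokenizer's merges).
RANKS = {("h", "e"): 0, ("he", "l"): 1, ("l", "l"): 2,
         ("hel", "l"): 3, ("hell", "o"): 4}
print(naive_bpe("hello", RANKS))  # ['hello']
```

The simplicity is the point: the slot a merge writes to is always the live pair just found, so the resurrection bug cannot occur by construction.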
### Compatibility

No migration needed. Users of prior versions will simply see coherent output on previously broken prompts. All existing models work.

---
## [v0.18.0] — 2026-04-20 (Daily-Driver UX — TTFT/decode split)
### Headline
