
Commit 1e0c3d7

unamedkr authored and claude committed
pillar1(R6): v0.19.0 release — BPE root-cause fix
Bumps Python bindings to 0.19.0. Adds a v0.19.0 entry to RELEASE_NOTES (headlined "BPE Stale-Entry ROOT-CAUSE Fix") consolidating R3 + R4 validation + R5 regression guard. README.md + README.ko.md v3.12 blurbs: external-facing announcement with the before/after tokenization evidence and impact summary. Regression: 15/15 test_models.sh + 4/4 test_tokenizer.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b9980d5 · commit 1e0c3d7

5 files changed: 95 additions & 2 deletions


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ Chunk-RAG가 잘못된 섹션을 검색하면, 모델은 **"모른다"고 하지
> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 results to a larger measurement grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit in *effective* working memory, which is 1/100 to 1/1000 of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.12 ★ BPE root-cause fix — Qwen3 family fully coherent (2026-04-20)**: One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) removes the true cause behind every "Qwen3 drift" symptom chased across 30+ rounds. The bug: in the BPE heap merge, a position that dies as the RIGHT neighbor of another merge only gets `tokens[P]=-1` without a `gen[]` bump, so stale heap entries resurrect the dead slot and corrupt the linked list → duplicated/lost token characters. Confirmed against the HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = **"Helll"** (5 chars: H,e,l,l,**l** — an 'l' in place of 'o'); HF correctly gives `[9707="Hello"]`. Downstream results: Qwen3.6-35B 40+ word prompts now produce **correct Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" answers "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: a Python HF reference diff found it in 3 rounds. Regression 15/15 + new tokenizer tests 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). The `quant.h` single header is unaffected — it uses the naive O(n²) BPE.
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20)**: CLI output now separates `TTFT | decode`, showing prefill latency and sustained decode individually instead of an "overall tok/s" dominated by cold-start on short queries. Measured warm on a 16 GB M1 Pro, CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (short chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT a warmup property — run the same command twice and the second run sees the warm numbers. 3-model matrix + per-use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
> **v3.10 Qwen3.x all formats win (2026-04-20)**: Two structural fixes completely eliminate long-form drift on Qwen3.5/3.6. **(1) NEOX-style partial RoPE** — Qwen3.5/3.6 use `LLAMA_ROPE_TYPE_IMROPE` (NEOX half-split `(q[i], q[i+rope_pairs])`). **(2) Arch-conditional QK-norm** — required for Gemma 4, disabled for the Qwen family. Qwen3.6-UD-IQ4_XS (--chat, T=0): "Once upon a time" n=60 → **a complete 60-token story**, `def fibonacci(n):` → **correct Python**, haiku/list/facts all work. 15/15 regression PASS. v0.17.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
> **v3.12 ★ BPE root-cause FIXED — Qwen3 family now fully coherent (2026-04-20):** One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) eliminates the BPE heap merge bug that caused every "Qwen3 drift" symptom we chased across 30+ rounds. The bug: positions that died as right-neighbor of a merge weren't having their `gen[]` bumped, so stale heap entries resurrected dead linked-list slots and produced corrupted tokens. Measured on Qwen3-0.6B with HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = literally **"Helll"** (extra 'l', missing 'o'); HF encoded as `[9707="Hello"]`. Fix makes our tokens match. Downstream: Qwen3.6-35B 40+ word prompts now produce **coherent Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" now gives "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: Python HF reference diff caught it in 3 rounds ([`tools/pillar1/`](tools/pillar1/)). Regression 15/15 + new tokenizer test 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). `quant.h` single-header unaffected (uses naive O(n²) BPE, correct by construction).
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20):** The CLI now prints `TTFT | decode` separately so individual devs see prefill latency vs sustained decode rate, not a blended "overall tok/s" that's dominated by cold-start on short queries. Measured warm on 16 GB M1 Pro CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (snappy chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — running twice makes the second call see the warm numbers. Full 3-model matrix + use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
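The TTFT/decode split described in the v3.11 blurb above reduces to two timestamps around the token loop. A minimal sketch, with a hypothetical helper that is not the engine's actual CLI code:

```python
import time

def timed_generate(token_iter):
    """Split one generation run into TTFT and sustained decode rate.

    token_iter is any iterator yielding tokens one at a time (a hypothetical
    stand-in for the engine's decode loop, not its real API).
    Returns (ttft_seconds, decode_tok_per_s).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token out: TTFT covers prefill
    end = time.perf_counter()
    ttft = first - start
    # Sustained decode deliberately excludes the first token, so it is not
    # polluted by prefill / cold-start cost.
    decode = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, decode
```

Separating the two is what makes "decode is a model property, TTFT is a warmup property" visible: a second warm run shrinks `ttft` while `decode` stays put.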
> **v3.10 Qwen3.x correctness — ALL FORMATS WIN (2026-04-20):** Two structural fixes close long-standing drift on Qwen3.5/3.6. **(1) NEOX-style partial RoPE** — `LLAMA_ROPE_TYPE_IMROPE` (`refs/llama.cpp/src/llama-model.cpp:9298`) requires NEOX half-split `(q[i], q[i+rope_pairs])`, not LLaMA-style pairs. **(2) Arch-conditional QK-norm** — Gemma 4 requires it, the Qwen family degrades with it. Measured on Qwen3.6-35B-A3B-UD-IQ4_XS (--chat, T=0): "Once upon a time" n=60 → **60-token "Jack...packed his bag with a map, a compass, and some food and water. He set off early in the"**; `def fibonacci(n):` → **`if n <= 0: return "Invalid input"`**; haiku "Silence speaks loud, Silence speaks in the quietest way."; list "1. Apple 2. Banana 3. Orange". 15/15 regression PASS (added long-form + code guards). v0.17.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "quantcpp"
-version = "0.18.0"
+version = "0.19.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }
```

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -21,7 +21,7 @@
     from importlib.metadata import version as _pkg_version
     __version__ = _pkg_version("quantcpp")
 except Exception:
-    __version__ = "0.18.0" # fallback for editable / source-tree imports
+    __version__ = "0.19.0" # fallback for editable / source-tree imports

 import os
 import sys
```

docs/RELEASE_NOTES.md

Lines changed: 89 additions & 0 deletions
@@ -6,6 +6,95 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---

## [v0.19.0] — 2026-04-20 ★ (BPE Stale-Entry ROOT-CAUSE Fix)

### Headline

**One-line fix to `src/engine/tq_tokenizer.c:1442` eliminates the structural tokenization bug that caused every "Qwen3 drift" symptom across 30+ rounds of kernel/MoE/DeltaNet investigation.** Pillar 1 of the Mission E roadmap, closed in 3 rounds via HF reference diff.

### The fix

```c
  if (top.gen != gen[top.pos]) continue;
+ if (tokens[top.pos] < 0) continue; // ★ missing dead-slot guard
  int ri = next[top.pos];
  if (ri >= n_tokens || tokens[ri] < 0) continue;
```
**Root cause**: In the heap-based BPE merge loop, a position `P` that dies as the RIGHT neighbor of some other merge has `tokens[P]` set to -1 but `gen[P]` is **not** bumped. Stale heap entries at position `P` pass the gen-based staleness check, then the code overwrites dead `tokens[P]` with a new merge result — resurrecting the slot, scrambling the linked list, and producing malformed token sequences.
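The failure mode can be reproduced outside the engine. Below is a Python toy of a gen-stamped, heap-based BPE merge — not the C implementation, and the merge table and input word are made up for illustration — showing how a stale entry resurrects a dead slot unless the guard is present:

```python
import heapq

# Toy merge table: rank = merge priority (lower merges first).
# Made up for illustration; not any real tokenizer's merges.
RANKS = {("h", "e"): 0, ("he", "l"): 1, ("l", "l"): 2,
         ("hel", "l"): 3, ("hell", "o"): 4}

def toy_bpe(word, ranks, guard=True):
    tokens = list(word)            # None marks a dead (merged-away) slot
    n = len(tokens)
    nxt = list(range(1, n + 1))    # singly linked list of live slots
    gen = [0] * n                  # generation stamp per slot
    heap = []

    def push(p):
        """Queue the merge for the pair starting at live slot p, if ranked."""
        r = nxt[p]
        if r < n and tokens[r] is not None:
            pair = (tokens[p], tokens[r])
            if pair in ranks:
                heapq.heappush(heap, (ranks[pair], p, gen[p], pair[0] + pair[1]))

    for p in range(n - 1):
        push(p)

    while heap:
        rank, p, g, merged = heapq.heappop(heap)
        if g != gen[p]:
            continue               # stale: slot p was re-merged since push
        if guard and tokens[p] is None:
            continue               # ★ the one-line dead-slot guard
        r = nxt[p]
        if r >= n or tokens[r] is None:
            continue               # right neighbor already consumed
        tokens[p] = merged         # without the guard, a dead p is resurrected here
        tokens[r] = None           # right neighbor dies...
        gen[p] += 1                # ...but gen[r] is NOT bumped: the bug surface
        nxt[p] = nxt[r]
        push(p)

    return [t for t in tokens if t is not None]

print(toy_bpe("hello", RANKS, guard=True))   # ['hello']
print(toy_bpe("hello", RANKS, guard=False))  # ['hel', 'll', 'o'] -> "helllo"
```

With the guard the toy encodes `hello` as one token. Without it, the stale `('l','l')` entry fires on a slot that already died inside `hel`, duplicating a character and blocking the later `hell` merge — the same class of corruption as the engine's `"Helll"`.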
### Symptom (same prompt, before/after)

| | Tokens for `"Hello"` | Decoded |
|---|---|---|
| HF reference | `[9707]` | "Hello" |
| Our engine BEFORE | `[32713, 654]` | **"Helll"** (extra 'l', lost 'o') |
| Our engine AFTER | `[9707]` | "Hello" ✓ |
### What this fixes (consolidated)

| Symptom (previously attributed cause) | Actual cause |
|---|---|
| Qwen3.5/3.6 "quicck bbrrown" char doubling | tokenizer |
| Qwen3.6-35B ≥40-word prompt → UTF-8 garbage | tokenizer |
| Phi-3.5 "What is 2+2?" → hallucinating "tti" | tokenizer |
| R32 Mission C "drift is Qwen-common architecture" | WRONG — was tokenizer |
| R46-50 Mission D "structural bug needs HF Python diff" | correct diagnosis; R3 finished it |
### Validation

- **Regression**: 15/15 `test_models.sh` + new `test_tokenizer.sh` 4/4
- **Real output**: Qwen3.6-35B on 40+ word prompts produces coherent Python code and full narrative text (previously garbage)
- **Phi-3.5**: "What is 2+2?" → "The sum of 2 and 2 is equal to four." (previously "I'm sorry but 'tti' doesn't appear to...")
### Methodology (the actual insight)

Pillar 1 R1-R3 built a Python + HF Qwen3-0.6B FP32 reference env (`tools/pillar1/`) specifically to enable per-layer diff debugging. Before the first layer diff was ever needed, simply comparing **tokenizer output** revealed the mismatch. The entire transformer investigation from R26-R50 had been working with corrupted input.

**Lesson**: When debugging LLM coherence, compare tokens against the HF reference FIRST. Don't "rule out" the tokenizer without actually running `AutoTokenizer.encode(prompt)` side-by-side.
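The side-by-side check above is mechanical once you have both token lists. A minimal sketch with a hypothetical helper (not the `tools/pillar1/` code); the HF side would come from `transformers.AutoTokenizer`, shown only in a comment since it needs a model download:

```python
def first_divergence(ours, ref):
    """Return the index of the first mismatch between two token-id lists,
    or None if they agree (including length). Hypothetical helper."""
    for i, (a, b) in enumerate(zip(ours, ref)):
        if a != b:
            return i
    if len(ours) != len(ref):
        return min(len(ours), len(ref))
    return None

# HF side (requires transformers + network; sketch only):
# from transformers import AutoTokenizer
# ref = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").encode("Hello")

# The "Hello" case from this release: engine said [32713, 654], HF said [9707].
print(first_divergence([32713, 654], [9707]))  # 0
```

A non-`None` result on any prompt is enough to stop blaming the transformer stack.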
### Files changed

- `src/engine/tq_tokenizer.c` — 1-line fix + comment
- `src/engine/tq_transformer.c` — env-gated per-layer dump (`TQ_DUMP_HIDDEN=dir`) retained as debugging infrastructure
- `scripts/test_models.sh` — Phi-3.5 expected "answer" → "sum" (Phi-3 now gives the actual factual math answer)
- `scripts/test_tokenizer.sh` — **NEW** 4-test regression guard
- `tools/pillar1/` — HF reference env + `hf_dump.py` dump tool
- `bench/results/2026-04-20_bpe_fix_proof.md` — full before/after proof
### Non-impact

- `quant.h` (single-header): uses naive O(n²) BPE merge, correct by construction. Embed/WASM users have NEVER hit this bug. Only the split-source engine needed the fix.
- No API change.
- No performance change (the stale check is O(1)).
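For contrast with the heap path, the `quant.h`-style naive merge rescans all adjacent pairs each round, so there is no heap and nothing to go stale. A sketch of the O(n²) idea, not the actual `quant.h` code, with a made-up merge table:

```python
def naive_bpe(word, ranks):
    """Rescan-everything BPE: O(n) scan per merge, O(n^2) total.
    No heap, no stale entries; correct by construction."""
    tokens = list(word)
    while True:
        best = None
        for i in range(len(tokens) - 1):
            r = ranks.get((tokens[i], tokens[i + 1]))
            if r is not None and (best is None or r < best[0]):
                best = (r, i)       # best-ranked adjacent pair this round
        if best is None:
            return tokens           # no mergeable pair left
        i = best[1]
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Made-up merge table for illustration (not a real tokenizer's merges).
RANKS = {("h", "e"): 0, ("he", "l"): 1, ("l", "l"): 2,
         ("hel", "l"): 3, ("hell", "o"): 4}
print(naive_bpe("hello", RANKS))  # ['hello']
```

The simplicity is the point: the slot a merge writes to is always the live pair just found, so the resurrection bug cannot occur by construction.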
### Compatibility

No migration needed. Users of prior versions will simply see coherent output on previously broken prompts. All existing models work.

---
## [v0.18.0] — 2026-04-20 (Daily-Driver UX — TTFT/decode split)
### Headline
