
Commit 2d0401a

unamedkr and claude committed
pillar1.5(R4): v0.20.0 release — NEOX RoPE + Qwen3 QK-norm root causes
Consolidates today's two transformer-level root-cause fixes into a public release on top of v0.19.0's BPE tokenizer fix:

- Pillar 1.5 R1 — pure Qwen3 QK-norm restored (R40 over-broad disable)
- Pillar 1.5 R3 — NEOX-ordering RoPE for Qwen3 full rotary + batched

Together with v0.19.0's BPE stale-entry fix, all three structural root causes of the 30+ round "Qwen3 drift" investigation (R26-R50) are now closed in 6 rounds via the HF reference-diff methodology.

Headline numbers:

- Qwen3-0.6B, 50-word synthetic — BEFORE: UTF-8 garbage; AFTER: " Let me try to understand"
- Qwen3.5-4B, 31-word prose — "Artificial intelligence is a field of computer science focused on..."
- Qwen3.6-35B, 8-prompt matrix — 0/8 garbage outputs (was 8/8 pre-v0.19.0)

Methodology: OpenMythos-inspired reference-diff principle — compare token-level + layer-level to HF ground truth BEFORE guessing at kernels. 6 rounds closed what 30+ empirical rounds hadn't.

Bumps Python bindings 0.19.0 → 0.20.0. README.md + README.ko.md v3.13 blurbs. Full RELEASE_NOTES.md v0.20.0 entry with before/after evidence. Regression: 15/15 test_models + 4/4 test_tokenizer PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d0b25df commit 2d0401a

5 files changed

Lines changed: 111 additions & 2 deletions


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model does not say **"I don't know"** but

> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 results to a larger measurement grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models hit a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit inside the *effective* working memory, which is 1/100th to 1/1000th of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).

> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20)**: Two more transformer-level root causes fixed on top of the v3.12 BPE fix. **(1)** `tq_transformer.c:1204` — pure Qwen3 (0.6B~32B) **requires q_norm/k_norm**. R40's disabling of QK-norm for every GGUF arch="qwen" was correct only for the Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). **(2)** New `tq_rope_neox` in `tq_ops.c` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), but our engine's `tq_rope` and `tq_forward_batch` used LLaMA-style interleaved pairs. R34 fixed only partial-rotary; pure Qwen3 full rotary + batched prefill were still wrong. The arch is now detected and the correct RoPE dispatched. Qwen3-0.6B, 50-word synthetic input: **before** `alyticsанcieaâ��à¹�…` UTF-8 garbage, **after** `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero UTF-8 garbage outputs. Methodology win: HF reference diff (`tools/pillar1/`); the "compare to ground truth first" principle from `refs/OpenMythos` was decisive. **6 rounds closed all three root causes that the 30+ empirical rounds R26-R50 had missed**. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

> **v3.12 ★ BPE root cause fixed — Qwen3 family fully coherent (2026-04-20)**: One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) removes the real cause behind every "Qwen3 drift" symptom chased across 30+ rounds. The bug: in the BPE heap merge, a position that dies as the RIGHT neighbor of another merge only gets `tokens[P]=-1` without a `gen[]` bump, so stale heap entries resurrect the dead slot and corrupt the linked list → duplicated/lost token characters. Confirmed against the HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = **"Helll"** (five characters: H,e,l,l,**l** — an 'l' instead of the 'o'); HF correctly gives `[9707="Hello"]`. Downstream: Qwen3.6-35B 40+ word prompts now produce **perfect Python code** and **complete narrative text** (previously garbage); Phi-3.5 "What is 2+2?" answers "The sum of 2 and 2 is equal to four." correctly (previously the hallucination "tti"). Methodology: the Python HF reference diff found it in 3 rounds. Regression 15/15 + new tokenizer tests 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). The `quant.h` single header is unaffected (it uses naive O(n²) BPE).

> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20)**: CLI output now splits into `TTFT | decode`, showing prefill latency and sustained decode separately instead of an "overall tok/s" dominated by cold-start on short queries. Measured warm on a 16 GB M1 Pro, CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (short chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — run the same command twice and the second run sees the warm numbers. Full 3-model matrix + per-use-case recommendations: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
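For context on what "batched matrix-matrix matmul" means here, a minimal standalone sketch of a row-major prompt projection through Accelerate's BLAS on macOS — shapes and names are illustrative, and the engine's actual kernel is only described as `cblas_sgemm`-inspired:

```c
#include <Accelerate/Accelerate.h>  /* provides cblas_sgemm on macOS (AMX-backed on Apple silicon) */

/* Project all n_tok prompt activations in one GEMM instead of n_tok separate
 * matrix-vector products — the core idea behind batched prefill.
 * X: [n_tok x d_model] row-major, W: [d_model x d_out] row-major,
 * Y = X * W: [n_tok x d_out] row-major. */
static void project_batch(const float *X, const float *W, float *Y,
                          int n_tok, int d_model, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n_tok, d_out, d_model,
                1.0f, X, d_model,
                      W, d_out,
                0.0f, Y, d_out);
}
```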
> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20):** Two more root causes closed on top of v3.12's BPE fix. (1) `src/engine/tq_transformer.c:1204` — pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm; R40 had disabled QK-norm for all GGUF arch="qwen", which was correct only for Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). (2) `tq_ops.c` gains `tq_rope_neox` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), while our engine's `tq_rope` and `tq_forward_batch` both used LLaMA-style interleaved pairs. R34 had fixed partial-rotary only; pure Qwen3 full rotary and batched prefill were still wrong. RoPE is now arch-detected and dispatched to the right variant. Measured Qwen3-0.6B on 50-word synthetic input: before `alyticsанcieaâ��à¹�…` UTF-8 garbage, after `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero garbage outputs. Methodology win: HF reference diff (`tools/pillar1/`), enabled by the `refs/OpenMythos` insight "compare to ground truth FIRST". 6 rounds closed what 30+ empirical rounds hadn't. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.12 ★ BPE root-cause FIXED — Qwen3 family now fully coherent (2026-04-20):** One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) eliminates the BPE heap merge bug that caused every "Qwen3 drift" symptom we chased across 30+ rounds. The bug: positions that died as right-neighbor of a merge weren't having their `gen[]` bumped, so stale heap entries resurrected dead linked-list slots and produced corrupted tokens. Measured on Qwen3-0.6B with HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = literally **"Helll"** (extra 'l', missing 'o'); HF encoded as `[9707="Hello"]`. Fix makes our tokens match. Downstream: Qwen3.6-35B 40+ word prompts now produce **coherent Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" now gives "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: Python HF reference diff caught it in 3 rounds ([`tools/pillar1/`](tools/pillar1/)). Regression 15/15 + new tokenizer test 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). `quant.h` single-header unaffected (uses naive O(n²) BPE, correct by construction).
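A minimal sketch of where such a stale-entry guard sits in a heap-driven BPE merge loop; all names here (`merge_entry`, `heap_pop`, `push_neighbor_merges`, `gen[]`, `next[]`) are hypothetical stand-ins, not the actual `tq_tokenizer.c` internals:

```c
/* Illustrative heap-based BPE merge loop. tokens[p] < 0 marks a slot killed by
 * an earlier merge; gen[p] versions a slot so entries pushed before a merge
 * can be recognized as stale. The dead-slot check is the v3.12 one-liner. */
typedef struct { int pos; int gen; int right_id; int merged_id; } merge_entry;

extern merge_entry heap_pop(merge_entry *heap, int *heap_size);
extern void push_neighbor_merges(merge_entry *heap, int *heap_size, int pos);

static void bpe_merge_loop(int *tokens, int *next, int *gen,
                           merge_entry *heap, int heap_size) {
    while (heap_size > 0) {
        merge_entry top = heap_pop(heap, &heap_size);

        /* stale by version: the slot was merged again after this entry was pushed */
        if (gen[top.pos] != top.gen) continue;

        /* stale by death: the slot was consumed as the RIGHT neighbor of another
         * merge (tokens set to -1, gen never bumped) — the missing guard */
        if (tokens[top.pos] < 0) continue;

        int right = next[top.pos];
        if (right < 0 || tokens[right] != top.right_id) continue;

        tokens[top.pos] = top.merged_id;  /* merged pair lives in the left slot */
        tokens[right]   = -1;             /* right slot dies                    */
        next[top.pos]   = next[right];    /* unlink the dead slot               */
        gen[top.pos]++;                   /* version-bump the survivor          */

        push_neighbor_merges(heap, &heap_size, top.pos);  /* try new pairs */
    }
}
```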

> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20):** The CLI now prints `TTFT | decode` separately so individual devs see prefill latency vs sustained decode rate, not a blended "overall tok/s" that's dominated by cold-start on short queries. Measured warm on 16 GB M1 Pro CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (snappy chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — running twice makes the second call see the warm numbers. Full 3-model matrix + use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "quantcpp"
-version = "0.19.0"
+version = "0.20.0"
description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
readme = "README.md"
license = { text = "Apache-2.0" }

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
    from importlib.metadata import version as _pkg_version
    __version__ = _pkg_version("quantcpp")
except Exception:
-    __version__ = "0.19.0" # fallback for editable / source-tree imports
+    __version__ = "0.20.0" # fallback for editable / source-tree imports

import os
import sys

docs/RELEASE_NOTES.md

Lines changed: 105 additions & 0 deletions
@@ -6,6 +6,111 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.20.0] — 2026-04-20 ★★ (NEOX RoPE ROOT-CAUSE — Qwen3 Long-Prompt Fix)

### Headline

**Two transformer-level bugs that blocked Qwen3 family long-prompt coherence are fixed.** Combined with v0.19.0's BPE tokenizer fix, all three root causes of the 30+ round "Qwen3 drift" investigation (R26-R50) are now closed. Discovered via HF reference diff methodology (`tools/pillar1/`) after `refs/OpenMythos` analysis crystallized the principle: compare to ground truth FIRST.

### Two fixes

**Fix 1 — Pure-Qwen3 QK-norm restored** (`tq_transformer.c:1204`):

R40 had disabled QK-norm for ALL GGUF arch strings matching "qwen". That was correct for Qwen3.5/3.6 HYBRID (DeltaNet + self-attn, `delta_n_heads > 0`) — those degrade with QK-norm applied. But pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm per HF config. Without them, the residual stream explodes at layer 2 (norm ~5400 vs HF ~10).

Fix: restrict the QK-norm disable to `delta_n_heads > 0` only. Pure Qwen3 now applies QK-norm as HF does.
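A minimal sketch of the scoping described above — function and field names are illustrative, not the engine's actual API; Qwen3's q_norm/k_norm is, per the HF implementation, a per-head RMSNorm over `head_dim`:

```c
#include <math.h>

/* Illustrative per-head RMSNorm used for QK-norm (not the engine's exact helper). */
static void rmsnorm_head(float *x, const float *w, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float inv = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++) x[i] = x[i] * inv * w[i];
}

/* QK-norm is applied only for pure Qwen3: the checkpoint ships q_norm/k_norm
 * weights and there is no DeltaNet hybrid path (delta_n_heads == 0). */
static void maybe_apply_qk_norm(float *q, float *k,
                                const float *q_norm_w, const float *k_norm_w,
                                int n_heads, int n_kv_heads, int head_dim,
                                int delta_n_heads, float eps) {
    if (delta_n_heads > 0) return;              /* Qwen3.5/3.6 hybrid: skip */
    for (int h = 0; h < n_heads; h++)
        rmsnorm_head(q + h * head_dim, q_norm_w, head_dim, eps);
    for (int h = 0; h < n_kv_heads; h++)
        rmsnorm_head(k + h * head_dim, k_norm_w, head_dim, eps);
}
```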
**Fix 2 — NEOX-ordering RoPE** (`tq_ops.c` + two sites in `tq_transformer.c`):

llama.cpp maps `LLM_ARCH_QWEN3 / QWEN3MOE / QWEN35 / QWEN35MOE` to `LLAMA_ROPE_TYPE_NEOX / IMROPE` — half-split pairs `(q[i], q[i+head_dim/2])`. Our engine used LLaMA-style interleaved pairs `(q[2i], q[2i+1])`. R34 had fixed this for the partial-rotary path (Qwen3.5/3.6 hybrid), but pure Qwen3 (full rotary) and `tq_forward_batch` were never converted.

Fix: new `tq_rope_neox` function + arch detection at all three relevant call sites: per-token full-rotary, batched learned-freq, batched fallback. `TQ_ROPE_PAIRS=1` opt-out for legacy LLaMA/Qwen2.
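For reference, a minimal standalone sketch of the two pairings — illustrative only, not the `tq_rope`/`tq_rope_neox` signatures; both variants use the standard RoPE angles `theta_i = pos * base^(-2i/head_dim)` and differ only in which elements are rotated together:

```c
#include <math.h>

/* LLaMA-style interleaved pairs: rotate (x[2i], x[2i+1]). */
static void rope_interleaved(float *x, int head_dim, int pos, float base) {
    for (int i = 0; i < head_dim / 2; i++) {
        float theta = pos * powf(base, -2.0f * i / head_dim);
        float c = cosf(theta), s = sinf(theta);
        float a = x[2 * i], b = x[2 * i + 1];
        x[2 * i]     = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}

/* NEOX-style half-split pairs: rotate (x[i], x[i + head_dim/2]).
 * This is the pairing llama.cpp's LLAMA_ROPE_TYPE_NEOX uses and what pure
 * Qwen3 full rotary expects. */
static void rope_neox(float *x, int head_dim, int pos, float base) {
    int half = head_dim / 2;
    for (int i = 0; i < half; i++) {
        float theta = pos * powf(base, -2.0f * i / head_dim);
        float c = cosf(theta), s = sinf(theta);
        float a = x[i], b = x[i + half];
        x[i]        = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```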
### Symptom (before/after, Qwen3-0.6B Q4, 50-word synthetic input)

| Path | Before | After |
|---|---|---|
| Batched prefill | `alyticsанcieaâ��à¹�…` UTF-8 garbage | `" Let me try to understand this"` |
| Per-token prefill | `lenameuously…catchØ�` | `" ... and so on… So, the problem is to find the number of possible ways"` |

### Natural prose — 31 words, "Summary:" continuation

| Model | Output (first 20 tok) |
|---|---|
| Qwen3-0.6B | `"The main features of AI technology are that it has the ability to process information…"` |
| Qwen3.5-4B | `"Artificial intelligence is a field of computer science that focuses on the development of intelligent machines…"` |

### Qwen3.6-35B broad validation (8-prompt matrix, 40+ words max)

- Zero UTF-8 garbage outputs (was 100% on 40+ words before v0.19.0).
- Short story, long essay, tech explanation, factual Q&A all coherent.
- Remaining weak spots are chat-template-induced early EOS (0 tokens on some raw-completion prompts) — model behavior, not engine bug.

### Methodology — OpenMythos insights applied

`refs/OpenMythos` (RDT / MLA / ACT architecture reconstruction) crystallized the principle that ENABLED this session's breakthroughs:

> Compare to ground truth (HF reference diff) BEFORE guessing at kernels or recurrence state. The 30+ rounds R26-R50 had all been empirical; Pillar 1 R1-R3 + Pillar 1.5 R1-R3 solved three distinct root causes in 6 rounds by diffing against HF output.

Saved as `memory/project_openmythos_insights.md` for future sessions.

### Files changed

- `src/engine/tq_tokenizer.c` — BPE stale-entry check (v0.19.0 fix retained)
- `src/engine/tq_transformer.c` — QK-norm scope + NEOX in 2 call sites
- `src/engine/tq_ops.c` — new `tq_rope_neox` function
- `include/turboquant/tq_engine.h` — export `tq_rope_neox`
- `scripts/test_models.sh` + `scripts/test_tokenizer.sh` — regression expanded
- `tools/pillar1/` — HF reference diff toolchain retained for follow-on
- `bench/results/2026-04-20_bpe_fix_proof.md` — before/after evidence
- `bench/results/2026-04-20_longseq_transformer_bug.md` — R7/R8 discovery trail

### Regression

- `test_models.sh`: **15/15 PASS** (unchanged through both fixes)
- `test_tokenizer.sh`: 4/4 PASS

### Known remaining

- **Qwen3.6-35B DeltaNet state accumulation** on 40+ word natural prose can sometimes trigger repetition-loop detection. This is separate from the RoPE/QK-norm bugs and needs OpenMythos Insight #2 (spectral-radius monitoring of recurrent state) applied as a diagnostic. Short-medium prompts are fully coherent.
- Chat-template interactions producing 0-token responses on some coding prompts (Qwen3.6's thinking-mode prefix consuming the tokens).

### Compatibility

No API change. Existing code using `tq_rope` continues to work for LLaMA/Qwen2. New `tq_rope_neox` opt-in for Qwen3 family (auto-detected via GGUF arch string).

---

## [v0.19.0] — 2026-04-20 ★ (BPE Stale-Entry ROOT-CAUSE Fix)

### Headline
