
Commit 2d0401a

unamedkr and claude committed
pillar1.5(R4): v0.20.0 release — NEOX RoPE + Qwen3 QK-norm root causes
Consolidates today's two transformer-level root-cause fixes into a public release on top of v0.19.0's BPE tokenizer fix:

- Pillar 1.5 R1 — pure Qwen3 QK-norm restored (R40 over-broad disable)
- Pillar 1.5 R3 — NEOX-ordering RoPE for Qwen3 full rotary + batched

Together with v0.19.0's BPE stale-entry fix, all three structural root causes of the 30+ round "Qwen3 drift" investigation (R26-R50) are now closed in 6 rounds via the HF reference-diff methodology.

Headline numbers:

- Qwen3-0.6B, 50-word synthetic — BEFORE: UTF-8 garbage; AFTER: " Let me try to understand"
- Qwen3.5-4B, 31-word prose — "Artificial intelligence is a field of computer science focused on..."
- Qwen3.6-35B, 8-prompt matrix — 0/8 garbage outputs (was 8/8 pre-v0.19.0)

Methodology: OpenMythos-inspired reference-diff principle — compare token-level + layer-level to HF ground truth BEFORE guessing at kernels. 6 rounds closed what 30+ empirical rounds hadn't.

Bumps Python bindings 0.19.0 → 0.20.0. README.md + README.ko.md v3.13 blurbs. Full RELEASE_NOTES.md v0.20.0 entry with before/after evidence. Regression: 15/15 test_models + 4/4 test_tokenizer PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d0b25df commit 2d0401a

5 files changed

Lines changed: 111 additions & 2 deletions


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model does not say **"I don't know"** but

> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 results to a larger measurement grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models hit a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit inside the *effective* working memory, which is 1/100th to 1/1000th of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).

> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20)**: Two more transformer-level root causes fixed on top of the v3.12 BPE fix. **(1)** `tq_transformer.c:1204` — pure Qwen3 (0.6B~32B) **requires q_norm/k_norm**. R40's disabling of QK-norm for every GGUF arch="qwen" was correct only for the Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). **(2)** New `tq_rope_neox` in `tq_ops.c` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), but our engine's `tq_rope` and `tq_forward_batch` used LLaMA-style interleaved pairs. R34 fixed only partial-rotary; pure Qwen3 full rotary + batched prefill were still wrong. The arch is now detected and the correct RoPE dispatched. Qwen3-0.6B, 50-word synthetic input: **before** `alyticsанcieaâ��à¹�…` UTF-8 garbage, **after** `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero UTF-8 garbage outputs. Methodology win: HF reference diff (`tools/pillar1/`); the "compare to ground truth first" principle from `refs/OpenMythos` was decisive. **6 rounds closed all three root causes that the 30+ empirical rounds R26-R50 had missed**. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

> **v3.12 ★ BPE root cause fixed — Qwen3 family fully coherent (2026-04-20)**: One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) removes the real cause behind every "Qwen3 drift" symptom chased across 30+ rounds. The bug: in the BPE heap merge, a position that dies as the RIGHT neighbor of another merge only gets `tokens[P]=-1` without a `gen[]` bump, so stale heap entries resurrect the dead slot and corrupt the linked list → duplicated/lost token characters. Confirmed against the HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = **"Helll"** (five characters: H,e,l,l,**l** — an 'l' instead of the 'o'); HF correctly gives `[9707="Hello"]`. Downstream: Qwen3.6-35B 40+ word prompts now produce **perfect Python code** and **complete narrative text** (previously garbage); Phi-3.5 "What is 2+2?" answers "The sum of 2 and 2 is equal to four." correctly (previously the hallucination "tti"). Methodology: the Python HF reference diff found it in 3 rounds. Regression 15/15 + new tokenizer tests 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). The `quant.h` single header is unaffected (it uses naive O(n²) BPE).

> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20)**: CLI output now splits into `TTFT | decode`, showing prefill latency and sustained decode separately instead of an "overall tok/s" dominated by cold-start on short queries. Measured warm on a 16 GB M1 Pro, CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (short chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — run the same command twice and the second run sees the warm numbers. Full 3-model matrix + per-use-case recommendations: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
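For context on what "batched matrix-matrix matmul" means here, a minimal standalone sketch of a row-major prompt projection through Accelerate's BLAS on macOS — shapes and names are illustrative, and the engine's actual kernel is only described as `cblas_sgemm`-inspired:

```c
#include <Accelerate/Accelerate.h>  /* provides cblas_sgemm on macOS (AMX-backed on Apple silicon) */

/* Project all n_tok prompt activations in one GEMM instead of n_tok separate
 * matrix-vector products — the core idea behind batched prefill.
 * X: [n_tok x d_model] row-major, W: [d_model x d_out] row-major,
 * Y = X * W: [n_tok x d_out] row-major. */
static void project_batch(const float *X, const float *W, float *Y,
                          int n_tok, int d_model, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n_tok, d_out, d_model,
                1.0f, X, d_model,
                      W, d_out,
                0.0f, Y, d_out);
}
```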
> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20):** Two more root causes closed on top of v3.12's BPE fix. (1) `src/engine/tq_transformer.c:1204` — pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm; R40 had disabled QK-norm for all GGUF arch="qwen", which was correct only for Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). (2) `tq_ops.c` gains `tq_rope_neox` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), while our engine's `tq_rope` and `tq_forward_batch` both used LLaMA-style interleaved pairs. R34 had fixed partial-rotary only; pure Qwen3 full rotary and batched prefill were still wrong. RoPE is now arch-detected and dispatched to the right variant. Measured Qwen3-0.6B on 50-word synthetic input: before `alyticsанcieaâ��à¹�…` UTF-8 garbage, after `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero garbage outputs. Methodology win: HF reference diff (`tools/pillar1/`), enabled by the `refs/OpenMythos` insight "compare to ground truth FIRST". 6 rounds closed what 30+ empirical rounds hadn't. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.12 ★ BPE root-cause FIXED — Qwen3 family now fully coherent (2026-04-20):** One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) eliminates the BPE heap merge bug that caused every "Qwen3 drift" symptom we chased across 30+ rounds. The bug: positions that died as right-neighbor of a merge weren't having their `gen[]` bumped, so stale heap entries resurrected dead linked-list slots and produced corrupted tokens. Measured on Qwen3-0.6B with HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = literally **"Helll"** (extra 'l', missing 'o'); HF encoded as `[9707="Hello"]`. Fix makes our tokens match. Downstream: Qwen3.6-35B 40+ word prompts now produce **coherent Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" now gives "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: Python HF reference diff caught it in 3 rounds ([`tools/pillar1/`](tools/pillar1/)). Regression 15/15 + new tokenizer test 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). `quant.h` single-header unaffected (uses naive O(n²) BPE, correct by construction).
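A minimal sketch of where such a stale-entry guard sits in a heap-driven BPE merge loop; all names here (`merge_entry`, `heap_pop`, `push_neighbor_merges`, `gen[]`, `next[]`) are hypothetical stand-ins, not the actual `tq_tokenizer.c` internals:

```c
/* Illustrative heap-based BPE merge loop. tokens[p] < 0 marks a slot killed by
 * an earlier merge; gen[p] versions a slot so entries pushed before a merge
 * can be recognized as stale. The dead-slot check is the v3.12 one-liner. */
typedef struct { int pos; int gen; int right_id; int merged_id; } merge_entry;

extern merge_entry heap_pop(merge_entry *heap, int *heap_size);
extern void push_neighbor_merges(merge_entry *heap, int *heap_size, int pos);

static void bpe_merge_loop(int *tokens, int *next, int *gen,
                           merge_entry *heap, int heap_size) {
    while (heap_size > 0) {
        merge_entry top = heap_pop(heap, &heap_size);

        /* stale by version: the slot was merged again after this entry was pushed */
        if (gen[top.pos] != top.gen) continue;

        /* stale by death: the slot was consumed as the RIGHT neighbor of another
         * merge (tokens set to -1, gen never bumped) — the missing guard */
        if (tokens[top.pos] < 0) continue;

        int right = next[top.pos];
        if (right < 0 || tokens[right] != top.right_id) continue;

        tokens[top.pos] = top.merged_id;  /* merged pair lives in the left slot */
        tokens[right]   = -1;             /* right slot dies                    */
        next[top.pos]   = next[right];    /* unlink the dead slot               */
        gen[top.pos]++;                   /* version-bump the survivor          */

        push_neighbor_merges(heap, &heap_size, top.pos);  /* try new pairs */
    }
}
```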

> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20):** The CLI now prints `TTFT | decode` separately so individual devs see prefill latency vs sustained decode rate, not a blended "overall tok/s" that's dominated by cold-start on short queries. Measured warm on 16 GB M1 Pro CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (snappy chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — running twice makes the second call see the warm numbers. Full 3-model matrix + use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "quantcpp"
-version = "0.19.0"
+version = "0.20.0"
description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
readme = "README.md"
license = { text = "Apache-2.0" }

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
    from importlib.metadata import version as _pkg_version
    __version__ = _pkg_version("quantcpp")
except Exception:
-    __version__ = "0.19.0" # fallback for editable / source-tree imports
+    __version__ = "0.20.0" # fallback for editable / source-tree imports

import os
import sys

docs/RELEASE_NOTES.md

Lines changed: 105 additions & 0 deletions
@@ -6,6 +6,111 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.20.0] — 2026-04-20 ★★ (NEOX RoPE ROOT-CAUSE — Qwen3 Long-Prompt Fix)

### Headline

**Two transformer-level bugs that blocked Qwen3 family long-prompt coherence are fixed.** Combined with v0.19.0's BPE tokenizer fix, all three root causes of the 30+ round "Qwen3 drift" investigation (R26-R50) are now closed. Discovered via HF reference diff methodology (`tools/pillar1/`) after `refs/OpenMythos` analysis crystallized the principle: compare to ground truth FIRST.

### Two fixes

**Fix 1 — Pure-Qwen3 QK-norm restored** (`tq_transformer.c:1204`):

R40 had disabled QK-norm for ALL GGUF arch strings matching "qwen". That was correct for Qwen3.5/3.6 HYBRID (DeltaNet + self-attn, `delta_n_heads > 0`) — those degrade with QK-norm applied. But pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm per HF config. Without them, the residual stream explodes at layer 2 (norm ~5400 vs HF ~10).

Fix: restrict the QK-norm disable to `delta_n_heads > 0` only. Pure Qwen3 now applies QK-norm as HF does.
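A minimal sketch of the scoping described above — function and field names are illustrative, not the engine's actual API; Qwen3's q_norm/k_norm is, per the HF implementation, a per-head RMSNorm over `head_dim`:

```c
#include <math.h>

/* Illustrative per-head RMSNorm used for QK-norm (not the engine's exact helper). */
static void rmsnorm_head(float *x, const float *w, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float inv = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++) x[i] = x[i] * inv * w[i];
}

/* QK-norm is applied only for pure Qwen3: the checkpoint ships q_norm/k_norm
 * weights and there is no DeltaNet hybrid path (delta_n_heads == 0). */
static void maybe_apply_qk_norm(float *q, float *k,
                                const float *q_norm_w, const float *k_norm_w,
                                int n_heads, int n_kv_heads, int head_dim,
                                int delta_n_heads, float eps) {
    if (delta_n_heads > 0) return;              /* Qwen3.5/3.6 hybrid: skip */
    for (int h = 0; h < n_heads; h++)
        rmsnorm_head(q + h * head_dim, q_norm_w, head_dim, eps);
    for (int h = 0; h < n_kv_heads; h++)
        rmsnorm_head(k + h * head_dim, k_norm_w, head_dim, eps);
}
```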
**Fix 2 — NEOX-ordering RoPE** (`tq_ops.c` + two sites in `tq_transformer.c`):

llama.cpp maps `LLM_ARCH_QWEN3 / QWEN3MOE / QWEN35 / QWEN35MOE` to `LLAMA_ROPE_TYPE_NEOX / IMROPE` — half-split pairs `(q[i], q[i+head_dim/2])`. Our engine used LLaMA-style interleaved pairs `(q[2i], q[2i+1])`. R34 had fixed this for the partial-rotary path (Qwen3.5/3.6 hybrid), but pure Qwen3 (full rotary) and `tq_forward_batch` were never converted.

Fix: new `tq_rope_neox` function + arch detection at all three relevant call sites: per-token full-rotary, batched learned-freq, batched fallback. `TQ_ROPE_PAIRS=1` opt-out for legacy LLaMA/Qwen2.
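For reference, a minimal standalone sketch of the two pairings — illustrative only, not the `tq_rope`/`tq_rope_neox` signatures; both variants use the standard RoPE angles `theta_i = pos * base^(-2i/head_dim)` and differ only in which elements are rotated together:

```c
#include <math.h>

/* LLaMA-style interleaved pairs: rotate (x[2i], x[2i+1]). */
static void rope_interleaved(float *x, int head_dim, int pos, float base) {
    for (int i = 0; i < head_dim / 2; i++) {
        float theta = pos * powf(base, -2.0f * i / head_dim);
        float c = cosf(theta), s = sinf(theta);
        float a = x[2 * i], b = x[2 * i + 1];
        x[2 * i]     = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}

/* NEOX-style half-split pairs: rotate (x[i], x[i + head_dim/2]).
 * This is the pairing llama.cpp's LLAMA_ROPE_TYPE_NEOX uses and what pure
 * Qwen3 full rotary expects. */
static void rope_neox(float *x, int head_dim, int pos, float base) {
    int half = head_dim / 2;
    for (int i = 0; i < half; i++) {
        float theta = pos * powf(base, -2.0f * i / head_dim);
        float c = cosf(theta), s = sinf(theta);
        float a = x[i], b = x[i + half];
        x[i]        = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```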
### Symptom (before/after, Qwen3-0.6B Q4, 50-word synthetic input)

| Path | Before | After |
|---|---|---|
| Batched prefill | `alyticsанcieaâ��à¹�…` UTF-8 garbage | `" Let me try to understand this"` |
| Per-token prefill | `lenameuously…catchØ�` | `" ... and so on… So, the problem is to find the number of possible ways"` |

### Natural prose — 31 words, "Summary:" continuation

| Model | Output (first 20 tok) |
|---|---|
| Qwen3-0.6B | `"The main features of AI technology are that it has the ability to process information…"` |
| Qwen3.5-4B | `"Artificial intelligence is a field of computer science that focuses on the development of intelligent machines…"` |

### Qwen3.6-35B broad validation (8-prompt matrix, 40+ words max)

- Zero UTF-8 garbage outputs (was 100% on 40+ words before v0.19.0).
- Short story, long essay, tech explanation, factual Q&A all coherent.
- Remaining weak spots are chat-template-induced early EOS (0 tokens on some raw-completion prompts) — model behavior, not engine bug.

### Methodology — OpenMythos insights applied

`refs/OpenMythos` (RDT / MLA / ACT architecture reconstruction) crystallized the principle that ENABLED this session's breakthroughs:

> Compare to ground truth (HF reference diff) BEFORE guessing at kernels or recurrence state. The 30+ rounds R26-R50 had all been empirical; Pillar 1 R1-R3 + Pillar 1.5 R1-R3 solved three distinct root causes in 6 rounds by diffing against HF output.

Saved as `memory/project_openmythos_insights.md` for future sessions.

### Files changed

- `src/engine/tq_tokenizer.c` — BPE stale-entry check (v0.19.0 fix retained)
- `src/engine/tq_transformer.c` — QK-norm scope + NEOX in 2 call sites
- `src/engine/tq_ops.c` — new `tq_rope_neox` function
- `include/turboquant/tq_engine.h` — export `tq_rope_neox`
- `scripts/test_models.sh` + `scripts/test_tokenizer.sh` — regression expanded
- `tools/pillar1/` — HF reference diff toolchain retained for follow-on
- `bench/results/2026-04-20_bpe_fix_proof.md` — before/after evidence
- `bench/results/2026-04-20_longseq_transformer_bug.md` — R7/R8 discovery trail

### Regression

- `test_models.sh`: **15/15 PASS** (unchanged through both fixes)
- `test_tokenizer.sh`: 4/4 PASS

### Known remaining

- **Qwen3.6-35B DeltaNet state accumulation** on 40+ word natural prose can sometimes trigger repetition-loop detection. This is separate from the RoPE/QK-norm bugs and needs OpenMythos Insight #2 (spectral-radius monitoring of recurrent state) applied as a diagnostic. Short-medium prompts are fully coherent.
- Chat-template interactions producing 0-token responses on some coding prompts (Qwen3.6's thinking-mode prefix consuming the tokens).

### Compatibility

No API change. Existing code using `tq_rope` continues to work for LLaMA/Qwen2. New `tq_rope_neox` opt-in for Qwen3 family (auto-detected via GGUF arch string).

---

## [v0.19.0] — 2026-04-20 ★ (BPE Stale-Entry ROOT-CAUSE Fix)

### Headline
