Consolidates today's two transformer-level root-cause fixes into a
public release on top of v0.19.0's BPE tokenizer fix:
Pillar 1.5 R1 — pure Qwen3 QK-norm restored (R40 over-broad disable)
Pillar 1.5 R3 — NEOX-ordering RoPE for Qwen3 full rotary + batched
Together with v0.19.0's BPE stale-entry fix, all three structural
root causes of the 30+ round "Qwen3 drift" investigation (R26-R50)
are now closed in 6 rounds via HF reference-diff methodology.
Headline numbers:
- Qwen3-0.6B, 50-word synthetic: BEFORE UTF-8 garbage; AFTER " Let me try to understand"
- Qwen3.5-4B, 31-word prose: "Artificial intelligence is a field of computer science focused on..."
- Qwen3.6-35B, 8-prompt matrix: 0/8 garbage outputs (was 8/8 pre-v0.19.0)
Methodology: OpenMythos-inspired reference diff principle — compare
token-level + layer-level to HF ground truth BEFORE guessing at
kernels. 6 rounds closed what 30+ empirical rounds hadn't.
Bumps Python bindings 0.19.0 → 0.20.0.
README.md + README.ko.md v3.13 blurbs.
Full RELEASE_NOTES.md v0.20.0 entry with before/after evidence.
Regression: 15/15 test_models + 4/4 test_tokenizer PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.ko.md (+2 lines changed: 2 additions & 0 deletions)
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model, rather than saying **"I don't know"**,
> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 results to a larger measurement grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit inside the *effective* working memory, which is one hundredth to one thousandth of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20)**: Two more transformer-level root causes closed on top of the v3.12 BPE fix. **(1)** `tq_transformer.c:1204` — pure Qwen3 (0.6B~32B) **requires q_norm/k_norm**. R40's disabling of QK-norm for all GGUF arch="qwen" was correct only for the Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). **(2)** New `tq_rope_neox` in `tq_ops.c` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), while our engine's `tq_rope` and `tq_forward_batch` used LLaMA-style interleaved pairs. R34 fixed partial-rotary only; pure Qwen3 full rotary + batched prefill were still wrong. The engine now detects the arch and dispatches the correct RoPE. Qwen3-0.6B on a 50-word synthetic input: **before** `alyticsанcieaâ��à¹�…` UTF-8 garbage, **after** `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero UTF-8 garbage. Methodology win: HF reference diff (`tools/pillar1/`); the "compare to ground truth first" principle from `refs/OpenMythos` was decisive. **6 rounds closed all 3 root causes that 30+ empirical rounds (R26-R50) hadn't caught**. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.12 ★ BPE root cause FIXED — Qwen3 family fully coherent (2026-04-20)**: One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) removes the real cause behind every symptom of the 30+ round "Qwen3 drift". The bug: in the BPE heap merge, a position that died as the RIGHT neighbor of another merge only got `tokens[P]=-1` without a `gen[]` bump, so stale heap entries resurrected the dead slot and corrupted the linked list → duplicated/lost token characters. Confirmed against the HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = **"Helll"** (five characters: H,e,l,l,**l** — an 'l' instead of 'o'); HF correctly gives `[9707="Hello"]`. Downstream: Qwen3.6-35B 40+ word prompts now produce **perfect Python code** + **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" answers "The sum of 2 and 2 is equal to four." correctly (previously hallucinated "tti"). Methodology: the Python HF reference diff found it in 3 rounds. Regression 15/15 + new tokenizer tests 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). The `quant.h` single header is unaffected (it uses naive O(n²) BPE).
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20)**: The CLI output is now split into `TTFT | decode`, showing prefill latency and sustained decode individually instead of an "overall tok/s" dominated by cold-start on short queries. Measured warm on a 16 GB M1 Pro, CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (short chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT a warmup property — run the same command twice and the second run shows the warm numbers. 3-model matrix + per-use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
README.md (+2 lines changed: 2 additions & 0 deletions)
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
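The batched-prefill idea above can be sketched without BLAS: instead of T separate matrix-vector products (one per prompt token), all T token activations are multiplied against the weight in a single matrix-matrix pass, which is what `cblas_sgemm` would accelerate. A minimal sketch, assuming row-major layouts and an illustrative function name (`prefill_gemm` is not the engine's actual API):

```c
#include <stddef.h>

/* One GEMM for the whole prompt: Y[T][d_out] = X[T][d_in] * W^T, where
 * W is stored row-major as d_out rows of d_in weights. A naive triple
 * loop stands in for the BLAS sgemm call the engine would make. */
static void prefill_gemm(const float *X, const float *W, float *Y,
                         size_t T, size_t d_in, size_t d_out) {
    for (size_t t = 0; t < T; t++) {
        for (size_t o = 0; o < d_out; o++) {
            float acc = 0.0f;
            for (size_t i = 0; i < d_in; i++)
                acc += X[t * d_in + i] * W[o * d_in + i];
            Y[t * d_out + o] = acc;
        }
    }
}
```

The speedup comes not from this loop itself but from handing the same shape to an AMX-backed GEMM, which amortizes weight loads across all T tokens.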
> **v3.13 ★★ NEOX RoPE + QK-norm — Qwen3 long-prompt coherence restored (2026-04-20):** Two more root causes closed on top of v3.12's BPE fix. (1) `src/engine/tq_transformer.c:1204` — pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm; R40 had disabled QK-norm for all GGUF arch="qwen", a disable that was correct only for Qwen3.5/3.6 DeltaNet HYBRID. Without QK-norm the residual stream explodes at layer 2 (norm 5400 vs HF 10). (2) `tq_ops.c` new `tq_rope_neox` — llama.cpp maps `LLM_ARCH_QWEN3*` to `LLAMA_ROPE_TYPE_NEOX/IMROPE` (half-split pairs), while our engine's `tq_rope` and `tq_forward_batch` both used LLaMA-style interleaved pairs. R34 had fixed partial-rotary only; pure Qwen3 full rotary and batched prefill were still wrong. The RoPE variant is now arch-detected and dispatched correctly. Measured Qwen3-0.6B on 50-word synthetic input: before `alyticsанcieaâ��à¹�…` UTF-8 garbage, after `" Let me try to understand this"` coherent. Qwen3.5-4B natural prose: `"Artificial intelligence is a field of computer science…"`. Qwen3.6-35B 8-prompt matrix: zero garbage outputs. Methodology win: HF reference diff (`tools/pillar1/`), enabled by the `refs/OpenMythos` insight "compare to ground truth FIRST". 6 rounds closed what 30+ empirical rounds hadn't. Regression 15/15 + tokenizer 4/4. v0.20.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.12 ★ BPE root-cause FIXED — Qwen3 family now fully coherent (2026-04-20):** One line added to `src/engine/tq_tokenizer.c:1442` (`if (tokens[top.pos] < 0) continue;`) eliminates the BPE heap merge bug that caused every "Qwen3 drift" symptom we chased across 30+ rounds. The bug: positions that died as right-neighbor of a merge weren't having their `gen[]` bumped, so stale heap entries resurrected dead linked-list slots and produced corrupted tokens. Measured on Qwen3-0.6B with HF reference: our engine encoded `"Hello"` as `[32713="Hel", 654="ll"]` = literally **"Helll"** (extra 'l', missing 'o'); HF encoded as `[9707="Hello"]`. Fix makes our tokens match. Downstream: Qwen3.6-35B 40+ word prompts now produce **coherent Python code** and **full narrative text** (previously garbage); Phi-3.5 "What is 2+2?" now gives "The sum of 2 and 2 is equal to four." (previously hallucinated "tti"). Methodology: Python HF reference diff caught it in 3 rounds ([`tools/pillar1/`](tools/pillar1/)). Regression 15/15 + new tokenizer test 4/4. Full before/after proof: [`bench/results/2026-04-20_bpe_fix_proof.md`](bench/results/2026-04-20_bpe_fix_proof.md). `quant.h` single-header unaffected (uses naive O(n²) BPE, correct by construction).
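The stale-entry guard can be shown in a stripped-down form (this toy is not the engine's BPE: the heap, `gen[]` counters, and real merge ranks are elided, and all names here are illustrative). Merge candidates reference token positions; merging kills the consumed right neighbor by writing -1. Without the one-line guard, a later candidate still pointing at the dead slot would "resurrect" it and corrupt the token stream:

```c
#include <stddef.h>

typedef struct { size_t pos; } candidate_t;

/* Apply merge candidates in order. tokens[p] < 0 marks a dead slot;
 * the guard skips candidates whose position died in an earlier merge.
 * The "merged id" arithmetic is a toy stand-in for a vocab lookup. */
static size_t apply_merges(int *tokens, size_t n,
                           const candidate_t *cands, size_t ncands) {
    size_t merges = 0;
    for (size_t c = 0; c < ncands; c++) {
        size_t p = cands[c].pos;
        if (tokens[p] < 0) continue;   /* the one-line fix: skip dead slots */
        size_t r = p + 1;              /* find the live right neighbor */
        while (r < n && tokens[r] < 0) r++;
        if (r >= n) continue;
        tokens[p] = tokens[p] * 1000 + tokens[r];
        tokens[r] = -1;                /* right neighbor dies */
        merges++;
    }
    return merges;
}
```

In the real tokenizer the candidates live in a heap keyed by merge rank, so a position can appear in many queued entries; the guard is what makes popping a stale entry a harmless no-op instead of a linked-list corruption.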
> **v3.11 TTFT/decode split + daily-driver picks (2026-04-20):** The CLI now prints `TTFT | decode` separately so individual devs see prefill latency vs sustained decode rate, not a blended "overall tok/s" that's dominated by cold-start on short queries. Measured warm on 16 GB M1 Pro CPU-only: **Phi-3.5 Q4_K_M** TTFT **2.3s** / decode **14.5 t/s** (snappy chat), **Llama-3.2-3B Q8→Q4** TTFT **0.97s** / decode **29.0 t/s** (one-shot code/math), **Qwen3.6-35B IQ4_XS** TTFT **1.83s** / decode **10.5 t/s** (35B MoE quality long-form). Decode is a model property, TTFT is a warmup property — running twice makes the second call see the warm numbers. Full 3-model matrix + use-case picks: [`bench/results/2026-04-20_ttft_daily_driver.md`](bench/results/2026-04-20_ttft_daily_driver.md).
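The split itself is simple bookkeeping; a minimal sketch with illustrative names (the CLI's actual timing code is not shown in this text): TTFT is the time from request start to the first generated token, and the decode rate counts only the remaining tokens over the remaining time, so prefill cost cannot dilute the sustained t/s figure.

```c
#include <stddef.h>

typedef struct {
    double ttft_s;     /* time to first token, seconds */
    double decode_tps; /* sustained decode rate, tokens/second */
} gen_stats_t;

/* Split one generation into TTFT and decode rate. The first token is
 * attributed to prefill, so decode averages over the other n-1 tokens. */
static gen_stats_t split_stats(double t_start, double t_first_token,
                               double t_end, size_t n_tokens) {
    gen_stats_t s;
    s.ttft_s = t_first_token - t_start;
    double decode_time = t_end - t_first_token;
    s.decode_tps = (n_tokens > 1 && decode_time > 0.0)
                       ? (double)(n_tokens - 1) / decode_time
                       : 0.0;
    return s;
}
```

This is why a blended "overall tok/s" misleads on short queries: with a 2.3s TTFT and a 20-token reply, most of the wall time is prefill, not decode.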