
Commit 657f203

unamedkr authored and claude committed
docs(README): v3.20 BPE UTF-8 fix blurb + 35B user recipe
v3.20 announces the v0.27.0 BPE encode/decode double-UTF-8 fix with the measured HF token parity table and its scope (the GPT-2 byte-level BPE family).

Also adds a practical recipe callout: Qwen3.6-35B Q5_K_M + `--rep-penalty 1.3` is the best user-facing config today for long-form coherence on a 16 GB Mac (full 200-token budget vs. 117-token default drift, from R11 validation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent: fba7ff9

1 file changed: README.md (4 additions, 0 deletions)
@@ -167,6 +167,10 @@ The bug was using the same tool for both. The fix is using each for what it's good at.

> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix multiplication via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, in both FP32 KV and the default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output is bit-identical to the per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
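
A minimal sketch of the idea, assuming Accelerate's CBLAS on macOS and row-major FP32 weights; the function names and layout here are illustrative, not the actual `tq_forward_batch` signature:

```c
#include <Accelerate/Accelerate.h> /* cblas_sgemv / cblas_sgemm */
#include <stddef.h>

/* Per-token prefill: T independent matrix-vector products. Each call
 * streams the whole weight matrix W[d_out][d_in] from memory for a
 * single token, so the loop is memory-bound. */
static void prefill_per_token(const float *W, const float *X, float *Y,
                              int T, int d_in, int d_out) {
    for (int t = 0; t < T; t++)
        cblas_sgemv(CblasRowMajor, CblasNoTrans, d_out, d_in, 1.0f,
                    W, d_in, X + (size_t)t * d_in, 1,
                    0.0f, Y + (size_t)t * d_out, 1);
}

/* Batched prefill: one matrix-matrix product Y = X * W^T over all T
 * prompt tokens at once. BLAS tiles this through the AMX units and
 * reuses each weight row across the whole batch. */
static void prefill_batched(const float *W, const float *X, float *Y,
                            int T, int d_in, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in, 1.0f, X, d_in, W, d_in,
                0.0f, Y, d_out);
}
```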

> **v3.20 ★★ BPE encode/decode UTF-8 fix — international-text silent quality disaster resolved (2026-04-21):** Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and every output containing non-ASCII characters (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 / Qwen3 family models. **Encode** emitted raw bytes ≥ 0x80 (invalid UTF-8) for GPT-2 direct-byte codepoints, so `str_lookup` never matched and characters silently fell back to wrong low-id tokens. **Decode** emitted the UTF-8 encoding of codepoints U+0080-U+00FF instead of the raw byte they represent, so output got double-encoded ("café" → "cafÃ©"). Token-level HF parity verified: `café`/`naïve`/`日本語`/`привет` all now tokenize byte-for-byte identically to HF `AutoTokenizer` on Qwen3. Discovered via the new [`tools/refparity/`](tools/refparity/) framework's A/B output diff. Also synced to the `quant.h` single-header. Added `scripts/test_tokenizer.sh` fixtures so future refactors fail loudly. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); the Gemma/Phi-3 SentencePiece path is unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.
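
A minimal sketch of the fixed round-trip for the direct-byte range, assuming the simplified case where a byte maps to the Unicode codepoint of the same value (the real GPT-2 table additionally remaps non-printable bytes to U+0100+n, omitted here); the helpers below echo the commit's function names but are illustrative, not the repo's code:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode: raw byte -> UTF-8 bytes of its BPE character. The bug emitted
 * the raw byte itself, which for b >= 0x80 is not valid UTF-8 on its
 * own, so vocab lookups (str_lookup) never matched. */
static size_t byte_to_bpe_utf8(uint8_t b, char out[2]) {
    if (b < 0x80) { out[0] = (char)b; return 1; }
    out[0] = (char)(0xC0 | (b >> 6));   /* lead byte: 0xC2 or 0xC3 */
    out[1] = (char)(0x80 | (b & 0x3F)); /* continuation byte */
    return 2;                           /* e.g. 0xC3 -> "Ã" = C3 83 */
}

/* Decode: a BPE character in U+0080-U+00FF -> the raw byte it stands
 * for. The bug copied the two UTF-8 bytes through unchanged, so the
 * output stream was UTF-8-encoded twice ("café" -> "cafÃ©"). */
static size_t bpe_utf8_to_byte(const uint8_t *s, uint8_t *out) {
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    *out = (uint8_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
    return 2;                           /* C3 83 -> raw byte 0xC3 */
}
```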

> **Practical Qwen3.6-35B recipe on 16 GB Mac**: for best long-form coherence, pair `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` with `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): the default config hits a repeat loop at 117 tokens, while Q5_K_M + rep-penalty runs the full 200-token budget with graceful degradation only near the end. 35B DeltaNet drift remains an open architectural investigation; this is the best user-facing config today.
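
As a usage sketch: the binary name and the `-m`/`-p` flag spellings below are placeholders; only the model file, `-n 200`, and `--rep-penalty 1.3` come from the recipe above:

```sh
# Placeholder CLI name and -m/-p flags; the model path, -n, and
# --rep-penalty values are from the recipe above.
./tq -m Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
     -p "Once upon a time in a faraway land" \
     -n 200 --rep-penalty 1.3
```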
173+
170174
> **v3.19 ★ DeltaNet L2-norm formulation matches ggml — Qwen3.6 +36% coherence (2026-04-21):** R26's "eps fix" had the right diagnosis but wrong formulation. We used `1/sqrt(ss + eps)` but llama.cpp's `ggml_l2_norm` uses `1/max(sqrt(ss), eps)` — for near-zero inputs these differ by **3 orders of magnitude** (1e3 vs 1e6). Over 30 DeltaNet layers × position, systematic K/Q under-scaling compounds into decode-length degradation. Fix: match ggml exactly. Measured on Qwen3.6-35B IQ4_XS auto-serial "Write a 300-word essay": **117 → 160 tokens** (+36%), coherent content 45 → 110 tokens before drift. Discovered via direct diff of `refs/llama.cpp/ggml/src/ggml-cpu/ops.cpp::ggml_compute_forward_l2_norm_f32` against our `l2_normalize`. 15/15 regression PASS. v0.26.0.
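
A self-contained comparison of the two formulations, with an illustrative `eps = 1e-6f` (assumed here; the entry does not state the actual constant, but this choice reproduces the 1e3-vs-1e6 example):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const float eps = 1e-6f; /* illustrative, not the repo's constant */
    float ss = 0.0f;         /* sum of squares of a near-zero vector  */

    float old_scale  = 1.0f / sqrtf(ss + eps);       /* 1/sqrt(ss+eps)      -> 1e3 */
    float ggml_scale = 1.0f / fmaxf(sqrtf(ss), eps); /* 1/max(sqrt(ss),eps) -> 1e6 */

    /* At ss = 0 the two scales differ by 3 orders of magnitude; applied
     * to K/Q across 30 DeltaNet layers per position, that systematic
     * under-scaling compounds into decode-length degradation. */
    printf("old=%g ggml=%g ratio=%g\n",
           old_scale, ggml_scale, ggml_scale / old_scale);
    return 0;
}
```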

> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20):** Discovery: Qwen3.6-35B multi-thread matmul is **non-deterministic at T=0** (same prompt, two runs, different output). Variance in parallel FP reduction order compounds over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect the qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: repetition patterns differ run-to-run, with degradation by 60-70 tokens. **After**: deterministic, with the coherent window extended to ~95 tokens. Cost: ~2-3× slower decode (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. Honest limit: **still not a full fix for 1000+ char generation** — numerical precision accumulation over 40 layers × 8-expert weighted sum × IQ4_XS quantization drifts into repetition eventually. Session-arc day summary: 7 releases closing 7 distinct Qwen3.6 bug classes. Still worth shipping because deterministic output is usable and non-deterministic output was not. v0.25.0.
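
A toy demonstration of the mechanism (not the repo's matmul, and the values are contrived): the same four floats summed in two different reduction orders give different answers, which is exactly what varies with thread count:

```c
#include <stdio.h>

int main(void) {
    float vals[4] = {1e8f, 1.0f, -1e8f, 1.0f};

    /* Sequential left-to-right reduction: each +1.0f is absorbed into
     * 1e8f (below one ulp of 1e8), so the running sum ends at 1.0f. */
    float seq = 0.0f;
    for (int i = 0; i < 4; i++) seq += vals[i];

    /* Pairwise reduction, as a parallel tree might combine partials:
     * (1e8 + -1e8) + (1.0 + 1.0) = 2.0f. */
    float pair = (vals[0] + vals[2]) + (vals[1] + vals[3]);

    /* Same inputs, different reduction tree, different result. Feed a
     * discrepancy like this into logits and a T=0 argmax can flip. */
    printf("sequential=%g pairwise=%g\n", seq, pair);
    return 0;
}
```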
