Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.27.0] — 2026-04-21 ★★ (BPE encode+decode UTF-8 fix — international text silent quality disaster RESOLVED)

### Headline

**Two symmetric BPE bugs were silently corrupting every prompt and every
output containing international characters (accents, CJK, Cyrillic,
byte-fallback emoji) on all Llama-3 and Qwen3 family models.** Both sides
of the GPT-2 byte-to-unicode mapping are now fixed; token-level parity
with the HF reference tokenizer is 100% on tested inputs.

### The bugs

Both `tq_tokenizer.c` (split-source) and `quant.h` (single-header) had
mirrored bugs on the encode and decode paths for GPT-2-style BPE vocabs.

**Encode** (`encode_byte_to_bpe_char`): for "direct" bytes in the range
0xA1-0xAC and 0xAE-0xFF, we emitted the raw byte into the lookup string.
Standalone bytes ≥ 0x80 are invalid UTF-8, so `str_lookup` never matched
the vocab (which stores these as proper UTF-8 strings: byte 0xC3 → "Ã" =
UTF-8 `c3 83`). The character silently fell back to a wrong low-id token.

**Decode** (`decode_bpe_token`): for codepoints U+00A1-U+00AC and
U+00AE-U+00FF found in vocab pieces, we emitted the UTF-8 encoding of the
codepoint (2 bytes `c3 83` for U+00C3 "Ã") instead of the raw byte 0xC3
that the codepoint represents in GPT-2's byte-to-unicode mapping. Output
got double-encoded: "café" (5 bytes `63 61 66 c3 a9`) became
`63 61 66 c3 83 c2 a9` (7 bytes, renders as "café").

### Measured impact

HF Qwen3 reference tokenization vs ours, before and after the fix:

| Input | HF reference | Before | After |
|---|---|---|---|
| café | [924, 58858] | [68796] | [924, 58858] ✓ |
| naïve | [3376, 37572, 586] | [77, 523] | [3376, 37572, 586] ✓ |
| 日本語 | [101059, 102819] | [245, 250, 252] | [101059, 102819] ✓ |
| привет | [124436, 26991, 8178] | [222, 224] | [124436, 26991, 8178] ✓ |

All four strings now tokenize byte-for-byte identically to the HF
tokenizer. Before the fix, the model saw a completely different sequence
than its training distribution — silent quality degradation proportional
to the share of non-ASCII content in the prompt.

### Discovery

Both bugs surfaced from the `tools/refparity/` framework added earlier
this session. The decode bug was flagged first by an A/B output diff
(the "café" artifact on Llama-3.2-1B); once it was fixed, a targeted
encode comparison against the HF tokenizer surfaced the symmetric
encode bug.

### Files changed

- `src/engine/tq_tokenizer.c`: `encode_byte_to_bpe_char` and
  `decode_bpe_token` each get a direct-byte branch
- `quant.h`: synced (single-header had identical bugs)
- Regression: 15/15 PASS unchanged

### Scope

- **Affected**: Llama-3.x, Qwen2.5, Qwen3.x, Qwen3.5, Qwen3.6, and any
  model using GPT-2-style byte-level BPE (log line shows
  `is_sentencepiece=0`)
- **Not affected**: Gemma (SentencePiece), Phi-3 (SentencePiece path)

This was a latent silent-quality bug for any user whose prompts touch
international text. Now unblocked.

---

## [v0.26.0] — 2026-04-21 ★ (L2-norm formulation matches ggml — Qwen3.6 +36% coherence window)

### Headline