fix(tokenizer): BPE encode symmetric bug for direct byte codepoints

unamedkr · claude · unamedkr · commit 58d3925ce741 · 2026-04-21T14:17:33.000+09:00
encode_byte_to_bpe_char was emitting the raw byte for GPT-2 direct-byte
codepoints 0x80-0xFF (byte 0xC3 → output 0xC3). But the vocab stores these
as proper UTF-8 (0xC3 → 'Ã' = UTF-8 c3 83). A standalone 0x80+ byte is
invalid UTF-8, so str_lookup failed and the character dropped or fell back
to a wrong low-id token.

HF vs ours before/after:

  café    HF [924, 58858]           ours [68796]        → [924, 58858]    ✓
  naïve   HF [3376, 37572, 586]     ours [77, 523]      → [3376,37572,586] ✓
  日本語   HF [101059, 102819]      ours [245,250,252]  → [101059,102819]  ✓
  привет  HF [124436, 26991, 8178]  ours [222, 224]     → [124436,26991,8178] ✓

100% token-level parity with HF on Qwen3 for international input after fix.

Symmetric to R6's decode fix — both sides now correctly apply GPT-2's
byte-to-unicode mapping for codepoints in the direct-byte ranges
(U+00A1-U+00AC and U+00AE-U+00FF).

Prior impact: silent quality disaster. Any Llama-3/Qwen model fed a prompt
with accented chars, CJK, Cyrillic, or byte-fallback emoji saw a completely
different token sequence than training distribution.

Regression: 15/15 PASS unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/.claude/state.md b/.claude/state.md
@@ -3,6 +3,33 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★★ Phase 1 R7 — BPE ENCODE symmetric bug FIXED (2026-04-21) ★★
+
+The decode fix from R6 had a **symmetric encode-side bug**. For input text
+containing international chars, `encode_byte_to_bpe_char` emitted raw byte
+for GPT-2 direct-byte codepoints 0x80-0xFF, producing INVALID UTF-8 that
+couldn't match the vocab's proper UTF-8 entries.
+
+Result: international text got dropped / mis-matched via byte-fallback.
+
+HF vs ours tokenization, BEFORE:
+
+| Input | HF reference | Ours (broken) |
+|---|---|---|
+| café | [924, 58858] | [68796] |
+| naïve | [3376, 37572, 586] | [77, 523] |
+| 日本語 | [101059, 102819] | [245, 250, 252] |
+| привет | [124436, 26991, 8178] | [222, 224] |
+
+AFTER fix: **all four match HF exactly** (100% token-level parity on Qwen3).
+
+Impact: any prompt with accented chars / CJK / Cyrillic / emoji previously
+fed the model a completely different token sequence than it was trained on.
+Silent quality disaster. Combined with R6 (decode fix) now full round-trip
+clean for international text.
+
+Regression: 15/15 PASS unchanged.
+
 ## ★ Phase 1 R6 — BPE decode double-UTF-8 bug FIXED (2026-04-21) ★
 
 `src/engine/tq_tokenizer.c:1089-1093`: decode_bpe_token for codepoints
diff --git a/src/engine/tq_tokenizer.c b/src/engine/tq_tokenizer.c
@@ -1141,9 +1141,23 @@ static int encode_byte_to_bpe_char(unsigned char byte, char* out) {
     if (byte >= 174) direct = 1; /* upper range always fits in uint8 */
 
     if (direct) {
-        out[0] = (char)byte;
-        out[1] = '\0';
-        return 1;
+        /* Codepoint = byte value. For bytes < 0x80 emit as 1-byte UTF-8;
+         * for bytes >= 0x80 (161-172, 174-255) emit the 2-byte UTF-8 encoding
+         * of the same codepoint (e.g. byte 0xC3 -> UTF-8 'Ã' c3 83). The
+         * vocab stores these as UTF-8 strings, so str_lookup only matches
+         * with the proper UTF-8 form. Emitting the raw byte (a standalone
+         * 0x80+ byte is invalid UTF-8) silently dropped international
+         * characters via byte-fallback mismatch. */
+        if (byte < 0x80) {
+            out[0] = (char)byte;
+            out[1] = '\0';
+            return 1;
+        } else {
+            out[0] = (char)(0xC0 | (byte >> 6));
+            out[1] = (char)(0x80 | (byte & 0x3F));
+            out[2] = '\0';
+            return 2;
+        }
     }
 
     /* Indirect bytes -> codepoint 256 + index */