Commit 58d3925
fix(tokenizer): BPE encode symmetric bug for direct byte codepoints
encode_byte_to_bpe_char was emitting the raw byte for GPT-2 direct-byte
codepoints in 0x80-0xFF (e.g. byte 0xC3 → output 0xC3). But the vocab stores
these as proper UTF-8 (0xC3 → 'Ã' = UTF-8 c3 83). A standalone 0x80+ byte is
invalid UTF-8, so str_lookup failed and the character was either dropped or
fell back to a wrong low-id token.
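A minimal Python sketch of the bug, for illustration only. It is a reconstruction from the description above, not the code touched by this commit; `TABLE`, `buggy_encode`, `fixed_encode` and `is_valid_utf8` are hypothetical names. It builds the standard GPT-2 byte-to-unicode table and contrasts emitting the raw byte with emitting the UTF-8 encoding of the mapped codepoint.

```python
def gpt2_byte_to_codepoint() -> dict:
    """Build GPT-2's byte -> Unicode codepoint table (256 entries)."""
    direct = (
        list(range(0x21, 0x7F))     # '!'..'~'
        + list(range(0xA1, 0xAD))   # U+00A1..U+00AC
        + list(range(0xAE, 0x100))  # U+00AE..U+00FF
    )
    table = {b: b for b in direct}  # direct bytes keep their own codepoint
    next_cp = 0x100
    for b in range(256):            # remaining bytes are remapped to U+0100 and up
        if b not in table:
            table[b] = next_cp
            next_cp += 1
    return table


TABLE = gpt2_byte_to_codepoint()


def buggy_encode(b: int) -> bytes:
    """Assumed pre-fix behaviour: direct-byte codepoints are emitted as one raw byte."""
    cp = TABLE[b]
    if cp < 0x100:
        return bytes([cp])          # a lone byte >= 0x80 is not valid UTF-8
    return chr(cp).encode("utf-8")


def fixed_encode(b: int) -> bytes:
    """Post-fix behaviour: always emit the UTF-8 encoding of the mapped codepoint."""
    return chr(TABLE[b]).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(buggy_encode(0xC3).hex(), is_valid_utf8(buggy_encode(0xC3)))
print(fixed_encode(0xC3).hex(), is_valid_utf8(fixed_encode(0xC3)))
```

For ASCII direct bytes (0x21-0x7E) both variants agree, which is presumably why the bug only showed up on non-ASCII input.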
HF vs ours before/after:
café    HF [924, 58858]           ours [68796]         → [924, 58858] ✓
naïve   HF [3376, 37572, 586]     ours [77, 523]       → [3376, 37572, 586] ✓
日本語  HF [101059, 102819]       ours [245, 250, 252] → [101059, 102819] ✓
привет  HF [124436, 26991, 8178]  ours [222, 224]      → [124436, 26991, 8178] ✓
100% token-level parity with HF on Qwen3 for international input after fix.
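To see why these inputs are affected, it can help to look at the byte-level string that a GPT-2 style vocab actually keys on. The sketch below is an illustration under the standard GPT-2 mapping, with hypothetical helper names (`to_bpe_string`, `direct_high_bytes`); it does not call the project's tokenizer or HF.

```python
def gpt2_byte_to_codepoint() -> dict:
    """Standard GPT-2 byte -> codepoint table."""
    direct = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    table = {b: b for b in direct}
    next_cp = 0x100
    for b in range(256):
        if b not in table:
            table[b] = next_cp
            next_cp += 1
    return table


TABLE = gpt2_byte_to_codepoint()


def to_bpe_string(text: str) -> str:
    """Map each UTF-8 byte of `text` to its GPT-2 byte-level character."""
    return "".join(chr(TABLE[b]) for b in text.encode("utf-8"))


def direct_high_bytes(text: str) -> list:
    """Bytes >= 0x80 that map to themselves, i.e. the ones the bug mishandled."""
    return [b for b in text.encode("utf-8") if b >= 0x80 and TABLE[b] == b]


for word in ["café", "naïve", "日本語", "привет"]:
    highs = [hex(b) for b in direct_high_bytes(word)]
    print(word, "->", to_bpe_string(word), highs)
```

Every test word contains at least one direct byte at or above 0x80, so each of them hit the faulty path before the fix.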
Symmetric to R6's decode fix — both sides now correctly apply GPT-2's
byte-to-unicode mapping for codepoints in the direct-byte ranges
(U+00A1-U+00AC and U+00AE-U+00FF).
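The encode/decode symmetry can be checked with a round trip over all 256 byte values. This is a sketch assuming the standard GPT-2 table; `encode_bytes` and `decode_str` are hypothetical stand-ins, not the project's functions.

```python
def gpt2_byte_to_codepoint() -> dict:
    """Standard GPT-2 byte -> codepoint table."""
    direct = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    table = {b: b for b in direct}
    next_cp = 0x100
    for b in range(256):
        if b not in table:
            table[b] = next_cp
            next_cp += 1
    return table


BYTE_TO_CP = gpt2_byte_to_codepoint()
CP_TO_BYTE = {cp: b for b, cp in BYTE_TO_CP.items()}  # inverse table used by decode


def encode_bytes(data: bytes) -> str:
    """Encode side: raw bytes -> byte-level string."""
    return "".join(chr(BYTE_TO_CP[b]) for b in data)


def decode_str(s: str) -> bytes:
    """Decode side: byte-level string -> raw bytes."""
    return bytes(CP_TO_BYTE[ord(ch)] for ch in s)


all_bytes = bytes(range(256))
assert decode_str(encode_bytes(all_bytes)) == all_bytes

# Bytes in the direct ranges named above map to the codepoint of the same value.
for b in list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)):
    assert BYTE_TO_CP[b] == b
```

The mapping is a bijection, so a correct encoder and decoder must invert each other for every byte, including the direct ranges.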
Prior impact: silent quality disaster. Any Llama-3/Qwen model fed a prompt
with accented chars, CJK, Cyrillic, or byte-fallback emoji saw a completely
different token sequence from the one in its training distribution.
Regression: 15/15 PASS unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9c53491 commit 58d3925
2 files changed
Lines changed: 44 additions & 3 deletions
File 1 (name not captured): 27 lines added (new lines 6-32), none removed.
File 2 (name not captured): lines 1144-1146 replaced by 17 lines (new lines 1144-1160).