Commit 58d3925

unamedkr and claude committed
fix(tokenizer): BPE encode symmetric bug for direct byte codepoints
encode_byte_to_bpe_char was emitting the raw byte for GPT-2 direct-byte codepoints 0x80-0xFF (byte 0xC3 → output 0xC3). But the vocab stores these as proper UTF-8 (0xC3 → 'Ã' = UTF-8 c3 83). A standalone 0x80+ byte is invalid UTF-8, so str_lookup failed and the character dropped or fell back to a wrong low-id token.

HF vs ours before/after:

  café  HF [924, 58858]  ours [68796]  → [924, 58858] ✓
  naïve  HF [3376, 37572, 586]  ours [77, 523]  → [3376, 37572, 586] ✓
  日本語  HF [101059, 102819]  ours [245, 250, 252]  → [101059, 102819] ✓
  привет  HF [124436, 26991, 8178]  ours [222, 224]  → [124436, 26991, 8178] ✓

100% token-level parity with HF on Qwen3 for international input after fix.

Symmetric to R6's decode fix: both sides now correctly apply GPT-2's byte-to-unicode mapping for codepoints in the direct-byte ranges (U+00A1-U+00AC and U+00AE-U+00FF).

Prior impact: silent quality disaster. Any Llama-3/Qwen model fed a prompt with accented chars, CJK, Cyrillic, or byte-fallback emoji saw a completely different token sequence than training distribution.

Regression: 15/15 PASS unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9c53491 commit 58d3925
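
Reviewer's note: the core of the fix described in the commit message can be sketched in isolation. The helper names below (`latin1_cp_to_utf8`, `is_wellformed_utf8_1or2`) are hypothetical and do not appear in the repository; this is a minimal illustration, assuming only what the message states: a direct-byte codepoint in 0x80-0xFF must be emitted as its 2-byte UTF-8 form, because a lone byte in that range is not well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the repository's function: encode a codepoint in
 * U+0000..U+00FF as UTF-8. Writes 1 or 2 bytes plus a NUL terminator into
 * out (which needs room for 3 chars) and returns the byte count. */
int latin1_cp_to_utf8(unsigned char cp, char *out) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (char)cp;               /* ASCII: codepoint is the byte */
        out[1] = '\0';
        return 1;
    }
    out[0] = (char)(0xC0 | (cp >> 6));   /* lead byte, 110xxxxx */
    out[1] = (char)(0x80 | (cp & 0x3F)); /* continuation byte, 10xxxxxx */
    out[2] = '\0';
    return 2;
}

/* Hypothetical check: returns 1 if s[0..len) is exactly one well-formed
 * UTF-8 sequence of length 1 or 2 (enough for codepoints below U+0800). */
int is_wellformed_utf8_1or2(const unsigned char *s, size_t len) {
    assert(s != NULL);
    if (len == 1) return s[0] < 0x80;
    if (len == 2) return s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80;
    return 0;
}
```

With these definitions, byte 0xC3 encodes to the pair c3 83, which passes the well-formedness check, whereas the single raw byte 0xC3 that the old code emitted does not.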

2 files changed

Lines changed: 44 additions & 3 deletions

.claude/state.md

Lines changed: 27 additions & 0 deletions
@@ -3,6 +3,33 @@
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.

+## ★★ Phase 1 R7 — BPE ENCODE symmetric bug FIXED (2026-04-21) ★★
+
+The decode fix from R6 had a **symmetric encode-side bug**. For input text
+containing international chars, `encode_byte_to_bpe_char` emitted raw byte
+for GPT-2 direct-byte codepoints 0x80-0xFF, producing INVALID UTF-8 that
+couldn't match the vocab's proper UTF-8 entries.
+
+Result: international text got dropped / mis-matched via byte-fallback.
+
+HF vs ours tokenization, BEFORE:
+
+| Input | HF reference | Ours (broken) |
+|---|---|---|
+| café | [924, 58858] | [68796] |
+| naïve | [3376, 37572, 586] | [77, 523] |
+| 日本語 | [101059, 102819] | [245, 250, 252] |
+| привет | [124436, 26991, 8178] | [222, 224] |
+
+AFTER fix: **all four match HF exactly** (100% token-level parity on Qwen3).
+
+Impact: any prompt with accented chars / CJK / Cyrillic / emoji previously
+fed the model a completely different token sequence than it was trained on.
+Silent quality disaster. Combined with R6 (decode fix) now full round-trip
+clean for international text.
+
+Regression: 15/15 PASS unchanged.
+
## ★ Phase 1 R6 — BPE decode double-UTF-8 bug FIXED (2026-04-21) ★

`src/engine/tq_tokenizer.c:1089-1093`: decode_bpe_token for codepoints

src/engine/tq_tokenizer.c

Lines changed: 17 additions & 3 deletions
@@ -1141,9 +1141,23 @@ static int encode_byte_to_bpe_char(unsigned char byte, char* out) {
     if (byte >= 174) direct = 1; /* upper range always fits in uint8 */

     if (direct) {
-        out[0] = (char)byte;
-        out[1] = '\0';
-        return 1;
+        /* Codepoint = byte value. For bytes < 0x80 emit as 1-byte UTF-8;
+         * for bytes >= 0x80 (161-172, 174-255) emit the 2-byte UTF-8 encoding
+         * of the same codepoint (e.g. byte 0xC3 -> UTF-8 'Ã' c3 83). The
+         * vocab stores these as UTF-8 strings, so str_lookup only matches
+         * with the proper UTF-8 form. Emitting the raw byte (a standalone
+         * 0x80+ byte is invalid UTF-8) silently dropped international
+         * characters via byte-fallback mismatch. */
+        if (byte < 0x80) {
+            out[0] = (char)byte;
+            out[1] = '\0';
+            return 1;
+        } else {
+            out[0] = (char)(0xC0 | (byte >> 6));
+            out[1] = (char)(0x80 | (byte & 0x3F));
+            out[2] = '\0';
+            return 2;
+        }
     }

     /* Indirect bytes -> codepoint 256 + index */
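
Reviewer's note: the hunk above shows only the direct branch, so the following is a self-contained sketch of the whole byte-to-text mapping. It is not the code in `tq_tokenizer.c`. It assumes the standard GPT-2 byte-to-unicode table: bytes 33-126, 161-172 and 174-255 map to the codepoint equal to the byte, and the remaining 68 bytes map to 256 + index in ascending byte order. The function names `gpt2_byte_to_codepoint` and `gpt2_map_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed standard GPT-2 table: direct bytes keep their value as the
 * codepoint; the other 68 bytes get 256 + their rank among indirect bytes. */
unsigned gpt2_byte_to_codepoint(unsigned char b) {
    if ((b >= 33 && b <= 126) || (b >= 161 && b <= 172) || b >= 174)
        return b;
    if (b <= 32)  return 256u + b;                 /* bytes 0..32    -> ranks 0..32  */
    if (b <= 160) return 256u + 33u + (b - 127u);  /* bytes 127..160 -> ranks 33..66 */
    return 256u + 67u;                             /* byte 173       -> rank 67      */
}

/* Map raw input bytes to the UTF-8 text that vocab lookups are keyed on.
 * Every mapped codepoint is below U+0800, so each input byte produces at
 * most 2 output bytes. Returns the number of bytes written (excluding the
 * NUL terminator), or -1 if dst is too small. */
int gpt2_map_bytes(const unsigned char *src, size_t n, char *dst, size_t cap) {
    assert(src != NULL && dst != NULL);
    size_t w = 0;
    if (cap == 0) return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned cp = gpt2_byte_to_codepoint(src[i]);
        size_t need = (cp < 0x80) ? 1 : 2;
        if (w + need + 1 > cap) return -1;         /* keep room for the NUL */
        if (cp < 0x80) {
            dst[w++] = (char)cp;
        } else {
            dst[w++] = (char)(0xC0 | (cp >> 6));
            dst[w++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    dst[w] = '\0';
    return (int)w;
}
```

Under that assumption, the UTF-8 input bytes for "café" (63 61 66 c3 a9) map to 63 61 66 c3 83 c2 a9, which renders as "cafÃ©". This is the form the commit message says the vocab stores. A space (0x20) maps to U+0120 'Ġ', encoded as c4 a0.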
