Commit 2a1d40d
test(tokenizer): add 4-byte UTF-8 emoji + CJK fixtures
Expands v0.27.0 regression from 4 to 7 international fixtures. Adds:
'🎉' → [144841] (4-byte UTF-8 emoji)
'I❤️code' → [40, 141390, 30543, 1851] (mixed ASCII + emoji; ❤️ is U+2764 + U+FE0F, two 3-byte sequences)
'한글 테스트' → [23573, 83291, 10764, 72509, 53189] (Korean, 3-byte UTF-8)
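The byte widths quoted for the fixtures can be checked independently of any tokenizer. The following is a minimal sketch, not part of the commit: `utf8_profile` is a hypothetical helper name, and the heart fixture is assumed to be U+2764 followed by the variation selector U+FE0F.

```python
from typing import List, Tuple


def utf8_profile(text: str) -> List[Tuple[str, int]]:
    """Return (code point, UTF-8 byte length) for each code point in text."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


# Fixture strings from the commit message, written with escapes where the
# exact code points matter (the heart is assumed to carry U+FE0F).
FIXTURES = ["\U0001F389", "I\u2764\ufe0fcode", "한글 테스트"]

for text in FIXTURES:
    print(repr(text), utf8_profile(text))
```

Under that assumption, only the party-popper fixture contains a 4-byte sequence; the heart and the Hangul syllables are 3-byte sequences, and the ASCII letters and the space are single bytes.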
Exercises every UTF-8 branch in encode_byte_to_bpe_char: the
direct-byte-in-multibyte case (previously silently broken) and the
3/4-byte sequences (previously correct but untested).
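The commit does not show the body of encode_byte_to_bpe_char. As a rough illustration only, the sketch below assumes it follows the GPT-2 style byte-level BPE convention, where printable Latin-1 bytes map to themselves ("direct") and the remaining bytes are shifted to code points starting at U+0100. The names `bytes_to_unicode` and `to_bpe_chars` are taken from or modeled on the GPT-2 reference code, not from this repository.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2 style table mapping every byte value to a printable character."""
    # Bytes that map to themselves: '!'..'~', 0xA1..0xAC, 0xAE..0xFF.
    direct = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = direct[:]
    code_points = direct[:]
    shift = 0
    for b in range(256):
        if b not in direct:
            # Remaining bytes (controls, space, 0x7F..0xA0, 0xAD) are remapped
            # to 256, 257, ... in increasing byte order.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_bpe_chars(text: str) -> str:
    """Encode text as UTF-8, then map each byte through the table."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


# U+1F389 is F0 9F 8E 89 in UTF-8: the lead byte 0xF0 is a direct byte,
# while the three continuation bytes fall in the remapped range.
print([hex(ord(c)) for c in to_bpe_chars("\U0001F389")])
```

If the real function works this way, a 4-byte sequence mixes a direct lead byte with remapped continuation bytes, which is presumably the "direct-byte-in-multibyte" path the fixtures are meant to cover.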
All three match HF AutoTokenizer byte-for-byte on Qwen3-0.6B vocab.
11/11 PASS overall.
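A comparison against the Hugging Face reference could be sketched as follows. This is an assumption-laden outline rather than the project's test harness: `find_mismatches` and `load_hf_encode` are hypothetical names, the Hub id "Qwen/Qwen3-0.6B" is assumed from the vocab named above, and the heart fixture is again assumed to be U+2764 + U+FE0F. The expected ids are the ones listed in the commit message.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Expected token ids, copied from the commit message.
FIXTURE_IDS: Dict[str, List[int]] = {
    "\U0001F389": [144841],
    "I\u2764\ufe0fcode": [40, 141390, 30543, 1851],
    "한글 테스트": [23573, 83291, 10764, 72509, 53189],
}


def find_mismatches(
    encode: Callable[[str], Sequence[int]],
    fixtures: Dict[str, List[int]],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, expected, got) for every fixture the encoder gets wrong."""
    mismatches = []
    for text, expected in fixtures.items():
        got = list(encode(text))
        if got != expected:
            mismatches.append((text, expected, got))
    return mismatches


def load_hf_encode(model_id: str = "Qwen/Qwen3-0.6B") -> Callable[[str], List[int]]:
    """Build a reference encoder from Hugging Face (needs transformers + network)."""
    from transformers import AutoTokenizer  # imported lazily on purpose

    tok = AutoTokenizer.from_pretrained(model_id)
    return lambda text: tok.encode(text, add_special_tokens=False)
```

Calling `find_mismatches(load_hf_encode(), FIXTURE_IDS)` would then be expected to return an empty list if the ids in the commit message match the reference tokenizer; the same function can be pointed at the tokenizer under test.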
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f912c32
1 file changed
Lines changed: 8 additions & 0 deletions
@@ -90,6 +90,14 @@ (8 lines added at new-file lines 93-100; diff content not captured)