Commit 2a1d40d

unamedkr and claude committed
test(tokenizer): add 4-byte UTF-8 emoji + CJK fixtures
Expands the v0.27.0 regression from 4 to 7 international fixtures. Adds:

  '🎉'         → [144841]                             (4-byte UTF-8 emoji)
  'I❤️code'    → [40, 141390, 30543, 1851]            (mixed ASCII + 4-byte emoji)
  '한글 테스트' → [23573, 83291, 10764, 72509, 53189]  (Korean, 3-byte UTF-8)

Exercises every UTF-8 branch in encode_byte_to_bpe_char: the direct-byte-in-multibyte case (previously silently broken) and the 3/4-byte sequences (previously correct but untested).

All three match HF AutoTokenizer byte-for-byte on Qwen3-0.6B vocab. 11/11 PASS overall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
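
The "direct byte inside a multibyte character" case can be illustrated with a small sketch. It assumes encode_byte_to_bpe_char follows the GPT-2 style byte-to-unicode table common to byte-level BPE tokenizers, where bytes 33-126, 161-172 and 174-255 map to the code point of the same value and every other byte is remapped to a code point from U+0100 upward. The commit does not show the function itself, so that is an assumption, and classify_byte and classify_string are hypothetical helpers that are not part of scripts/test_tokenizer.sh.

```shell
#!/bin/sh
# Hypothetical helper: classify one byte value (0-255, decimal) under the
# assumed GPT-2 style byte-to-unicode table.
#   direct   -> the byte maps to the code point with the same value
#   remapped -> the byte is assigned a code point from U+0100 upward
classify_byte() {
    b=$1
    if { [ "$b" -ge 33 ] && [ "$b" -le 126 ]; } ||
       { [ "$b" -ge 161 ] && [ "$b" -le 172 ]; } ||
       { [ "$b" -ge 174 ] && [ "$b" -le 255 ]; }; then
        echo direct
    else
        echo remapped
    fi
}

# Print one "<byte> <class>" line per UTF-8 byte of the argument.
classify_string() {
    for b in $(printf '%s' "$1" | od -An -v -tu1); do
        printf '%s %s\n' "$b" "$(classify_byte "$b")"
    done
}

# U+AE00 (글) is EA B8 80 in UTF-8, written here as octal escapes.
classify_string "$(printf '\352\270\200')"
# 234 direct
# 184 direct
# 128 remapped
```

Under that assumption, the lead byte 0xEA and the continuation byte 0xB8 of 글 both take the direct path while 0x80 is remapped, so a single Korean character from the new fixture covers both paths.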
1 parent f912c32 commit 2a1d40d

1 file changed

Lines changed: 8 additions & 0 deletions

scripts/test_tokenizer.sh
@@ -90,6 +90,14 @@ check_tokens "Qwen3-0.6B-Q4_K_M.gguf" "日本語" "101059 102819" \
     "TQ_NO_METAL=1 TQ_NO_MLOCK=1"
 check_tokens "Qwen3-0.6B-Q4_K_M.gguf" "привет" "124436 26991 8178" \
     "TQ_NO_METAL=1 TQ_NO_MLOCK=1"
+# 4-byte UTF-8 (emoji) and 3-byte (CJK) — exercises every branch in
+# encode_byte_to_bpe_char including direct bytes inside multibyte chars.
+check_tokens "Qwen3-0.6B-Q4_K_M.gguf" "🎉" "144841" \
+    "TQ_NO_METAL=1 TQ_NO_MLOCK=1"
+check_tokens "Qwen3-0.6B-Q4_K_M.gguf" "I❤️code" "40 141390 30543 1851" \
+    "TQ_NO_METAL=1 TQ_NO_MLOCK=1"
+check_tokens "Qwen3-0.6B-Q4_K_M.gguf" "한글 테스트" "23573 83291 10764 72509 53189" \
+    "TQ_NO_METAL=1 TQ_NO_MLOCK=1"
 
 echo ""
 echo "--- Summary --- PASS=$PASS FAIL=$FAIL SKIP=$SKIP"
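
To see which UTF-8 branch each new fixture reaches, the sequence length can be read off the lead byte. The sketch below uses the standard UTF-8 lead-byte ranges; utf8_seq_len is a hypothetical helper written for illustration, not a function from the test script or the tokenizer.

```shell
#!/bin/sh
# Hypothetical helper: length of a UTF-8 sequence given its lead byte
# (decimal). Prints 0 for continuation bytes and invalid lead bytes.
utf8_seq_len() {
    b=$1
    if   [ "$b" -lt 128 ]; then echo 1                        # 0x00-0x7F, ASCII
    elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then echo 2    # 0xC2-0xDF
    elif [ "$b" -ge 224 ] && [ "$b" -le 239 ]; then echo 3    # 0xE0-0xEF
    elif [ "$b" -ge 240 ] && [ "$b" -le 244 ]; then echo 4    # 0xF0-0xF4
    else echo 0
    fi
}

# 73 = 'I', 237 = 0xED (lead byte of 한, U+D55C), 240 = 0xF0 (lead byte of 🎉, U+1F389)
for lead in 73 237 240; do
    printf '%s -> %s\n' "$lead" "$(utf8_seq_len "$lead")"
done
# 73 -> 1
# 237 -> 3
# 240 -> 4
```

If the heart in the second fixture is U+2764 followed by the variation selector U+FE0F, as it appears on this page, its lead bytes are 0xE2 (226) and 0xEF (239), which fall in the 3-byte range. In that case 🎉 is the only 4-byte sequence among the new fixtures.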
