Skip to content

Commit 58a9d48

Browse files
unamedkrclaude
andcommitted
fix(quant.h): sync BPE encode/decode UTF-8 fix from split-source
quant.h single-header had the same two BPE bugs as split-source (tq_tokenizer.c): - decode: codepoints U+00A1-U+00FF emitted as raw UTF-8 bytes → double encoding - encode: direct bytes ≥ 0x80 emitted as raw byte → invalid UTF-8 mismatch Synced both fixes. scripts/check_sync.sh passes. single_header_example builds and runs clean. See commits 9c53491 (decode) and 58d3925 (encode) for root-cause analysis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 58d3925 commit 58a9d48

1 file changed

Lines changed: 25 additions & 4 deletions

File tree

quant.h

Lines changed: 25 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8279,8 +8279,15 @@ static const char* decode_bpe_token(const char* piece) {
82798279
decode_buf[out++] = (char)p[0];
82808280
decode_buf[out++] = (char)p[1];
82818281
}
8282+
} else if ((cp >= 0xA1 && cp <= 0xAC) || (cp >= 0xAE && cp <= 0xFF)) {
8283+
/* GPT-2 direct-byte mapping: codepoints U+00A1-U+00AC and
8284+
* U+00AE-U+00FF represent raw bytes of the same value. The
8285+
* BPE vocab stores these as UTF-8 (e.g. 'Ã' for byte 0xC3)
8286+
* so emit the raw byte to reconstruct the intended UTF-8
8287+
* character (e.g. 'Ã'+'©' → bytes 0xC3 0xA9 = 'é'). */
8288+
decode_buf[out++] = (char)(unsigned char)cp;
82828289
} else {
8283-
/* Regular 2-byte UTF-8 char (e.g., accented letters) */
8290+
/* Regular 2-byte UTF-8 char (rare in GPT-2-style BPE) */
82848291
decode_buf[out++] = (char)p[0];
82858292
decode_buf[out++] = (char)p[1];
82868293
}
@@ -8327,9 +8334,23 @@ static int encode_byte_to_bpe_char(unsigned char byte, char* out) {
83278334
if (byte >= 174) direct = 1; /* upper range always fits in uint8 */
83288335

83298336
if (direct) {
8330-
out[0] = (char)byte;
8331-
out[1] = '\0';
8332-
return 1;
8337+
/* Codepoint = byte value. For bytes < 0x80 emit as 1-byte UTF-8;
8338+
* for bytes >= 0x80 (161-172, 174-255) emit the 2-byte UTF-8 encoding
8339+
* of the same codepoint (e.g. byte 0xC3 -> UTF-8 'Ã' c3 83). The
8340+
* vocab stores these as UTF-8 strings, so str_lookup only matches
8341+
* with the proper UTF-8 form. Emitting the raw byte (a standalone
8342+
* 0x80+ byte is invalid UTF-8) silently dropped international
8343+
* characters via byte-fallback mismatch. */
8344+
if (byte < 0x80) {
8345+
out[0] = (char)byte;
8346+
out[1] = '\0';
8347+
return 1;
8348+
} else {
8349+
out[0] = (char)(0xC0 | (byte >> 6));
8350+
out[1] = (char)(0x80 | (byte & 0x3F));
8351+
out[2] = '\0';
8352+
return 2;
8353+
}
83338354
}
83348355

83358356
/* Indirect bytes -> codepoint 256 + index */

0 commit comments

Comments
 (0)