
Commit 34661a8

unamedkr and claude committed
docs: v0.27.0 release notes — BPE UTF-8 fix bundle
Bundles commits 9c53491 (decode), 58d3925 (encode), and 58a9d48 (quant.h sync) into a single version bump. Details the two symmetric BPE bugs, the 100% HF-reference token parity result, and the scope (Llama-3/Qwen3 family). Bumps python bindings pyproject.toml 0.26.0 → 0.27.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 58a9d48 commit 34661a8

2 files changed

Lines changed: 70 additions & 1 deletion


bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "quantcpp"
-version = "0.26.0"
+version = "0.27.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }

docs/RELEASE_NOTES.md

Lines changed: 69 additions & 0 deletions
@@ -6,6 +6,75 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.27.0] — 2026-04-21 ★★ (BPE encode+decode UTF-8 fix — international text silent quality disaster RESOLVED)

### Headline

**Two symmetric BPE bugs were silently corrupting every prompt and every output containing international characters (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 and Qwen3 family models.** Both sides of the GPT-2 byte-to-unicode mapping are now fixed. Token-level parity with the HF reference tokenizer is 100% on tested inputs.

### The bugs

Both `tq_tokenizer.c` (split-source) and `quant.h` (single-header) had mirrored bugs on the encode and decode paths for GPT-2-style BPE vocabs.

**Encode** (`encode_byte_to_bpe_char`): for "direct" bytes in the ranges 0xA1–0xAC and 0xAE–0xFF, we emitted the raw byte into the lookup string. A standalone byte ≥ 0x80 is invalid UTF-8, so `str_lookup` never matched the vocab (which stores these characters as proper UTF-8 strings: byte 0xC3 → "Ã" = UTF-8 `c3 83`). The character silently fell back to a wrong low-id token.

**Decode** (`decode_bpe_token`): for codepoints U+00A1–U+00AC and U+00AE–U+00FF found in vocab pieces, we emitted the UTF-8 encoding of the codepoint (two bytes, `c3 83` for U+00C3 "Ã") instead of the raw byte 0xC3 that the codepoint represents in GPT-2's byte-to-unicode mapping. Output got double-encoded: "café" (5 bytes `63 61 66 c3 a9`) became `63 61 66 c3 83 c2 a9` (7 bytes, renders as "cafÃ©").

### Measured impact

HF Qwen3 reference tokenization vs ours, before/after:

| Input | HF reference | Before | After |
|---|---|---|---|
| café | [924, 58858] | [68796] | [924, 58858] |
| naïve | [3376, 37572, 586] | [77, 523] | [3376, 37572, 586] |
| 日本語 | [101059, 102819] | [245, 250, 252] | [101059, 102819] |
| привет | [124436, 26991, 8178] | [222, 224] | [124436, 26991, 8178] |

All four strings now tokenize byte-for-byte identically to the HF tokenizer. Before the fix, the model saw a completely different sequence than its training distribution — silent quality degradation proportional to the share of non-ASCII content in the prompt.

### Discovery

Both bugs surfaced from the `tools/refparity/` framework added earlier this session. The decode bug was flagged first by an A/B output diff (the "cafÃ©" artifact on Llama-3.2-1B); once it was fixed, a targeted encode comparison against the HF tokenizer surfaced the symmetric encode bug.

### Files changed

- `src/engine/tq_tokenizer.c`: `encode_byte_to_bpe_char` and `decode_bpe_token` each get a direct-byte branch
- `quant.h`: synced (the single-header build had identical bugs)
- Regression suite: 15/15 PASS, unchanged

### Scope

- **Affected**: Llama-3.x, Qwen2.5, Qwen3.x, Qwen3.5, Qwen3.6, and any model using GPT-2-style byte-level BPE (the load log shows `is_sentencepiece=0`)
- **Not affected**: Gemma (SentencePiece), Phi-3 (SentencePiece path)

This was a latent silent-quality bug for users whose prompts touch international text. Now unblocked.

---

## [v0.26.0] — 2026-04-21 ★ (L2-norm formulation matches ggml — Qwen3.6 +36% coherence window)

### Headline
