pillar1(R4): broad validation + proof-of-fix document

unamedkr · claude · unamedkr · commit 257c2f0b7bc8 · 2026-04-20T12:26:45.000+09:00
Runs 8 realistic prompts through Qwen3.6-35B IQ4_XS covering:
short/mid/long code, story, QA, essay, recipe, tech explanation.
All 8 produce coherent output (or valid early-EOS chat behavior).
Zero produce the pre-R3 "Helll" / UTF-8 garbage pattern.

bench/results/2026-04-20_bpe_fix_proof.md documents:
- The exact 1-line fix and why it's the root cause
- Before/after tokenization on "Hello" ([32713,654]="Helll" → [9707]="Hello")
- Before/after model output on 40-word and 50-word Qwen3.6 prompts
- Cross-model impact table (Qwen3/Phi-3.5/Llama)
- Methodology note: token-level diff via HF reference

Regression: 15/15 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/bench/results/2026-04-20_bpe_fix_proof.md b/bench/results/2026-04-20_bpe_fix_proof.md
@@ -0,0 +1,85 @@
+# BPE Root-Cause Fix — Before/After Proof (2026-04-20)
+
+Pillar 1 R3 single-line fix to `src/engine/tq_tokenizer.c:1442` eliminates
+the structural tokenization bug that caused every Qwen3 family coherence
+issue tracked across Rounds 26-50.
+
+## The fix
+
+```c
+/* tq_tokenizer.c heap-based BPE merge loop */
+  if (top.gen != gen[top.pos]) continue;
++ if (tokens[top.pos] < 0) continue;   // ★ missing dead-slot check
+  int ri = next[top.pos];
+  if (ri >= n_tokens || tokens[ri] < 0) continue;
+```
+
+**Why it's the root cause**: When a position dies as the RIGHT
+neighbor of a merge, `tokens[P] = -1` but `gen[P]` is not bumped.
+Stale heap entries at position P pass the gen check, the code then
+overwrites `tokens[P]` with a new merge result, resurrecting a dead
+linked-list node and scrambling subsequent tokens.
+
+## Before/after token mismatch (Qwen3-0.6B, "Hello")
+
+| | Tokens | Decoded |
+|---|---|---|
+| HF reference (ground truth) | `[9707]` | **"Hello"** |
+| Our engine BEFORE R3 | `[32713, 654]` | **"Helll"** (5 chars: H,e,l,l,**l** — 'o' replaced) |
+| Our engine AFTER R3 | `[9707]` | **"Hello"** ✓ |
+
+## Before/after model output (Qwen3.6-35B-A3B-UD-IQ4_XS)
+
+Same 40-word prompt: *"Write a Python function that computes the nth Fibonacci number using iterative dynamic programming. It should handle edge cases including negative numbers, zero, and very large inputs."*
+
+| | Output |
+|---|---|
+| **BEFORE R3** | UTF-8 garbage ("ð��� Would you like to..."), or 5-token fragment then EOS |
+| **AFTER R3**  | Coherent Python code:<br>`def fibonacci(n):`<br>`    """Return the nth Fibonacci number."""`<br>`    if n < 0:  raise ValueError("n must be non-negative")` |
+
+Same 50-word prompt: *"Once upon a time in a small village there lived a clever young programmer named Luna who was known throughout the kingdom for her extraordinary ability..."*
+
+| | Output |
+|---|---|
+| BEFORE R3 | Char-doubling garbage ("quicck bbrrown") |
+| AFTER R3  | Full narrative: "The idea intrigued him so much that he decided to create his very own version of this classic game. He called it 'Hamster Run'..." |
+
+## Cross-model impact
+
+| Model | Prompt | Before | After |
+|---|---|---|---|
+| Qwen3-0.6B | "Hello" | `"p('å®�å®�..."` garbage | "Hello" token | coherent |
+| Qwen3.6-35B IQ4_XS | 40+ word code | garbage | perfect Python |
+| Qwen3.6-35B Q5_K_M | factual | drift ≥25 tok | clean EOS |
+| Phi-3.5 Q4 | "What is 2+2?" | "I'm sorry but 'tti'..." | "The sum of 2 and 2 is equal to four." |
+| Phi-3.5 Q8 | same | same garbage | same fix |
+| Llama-3.2-3B | long story | PASS | PASS (unaffected — different tokenizer quirk) |
+
+## Regression suite
+
+`scripts/test_models.sh`: **15/15 PASS** after fix + expected-string update
+(Phi-3.5 "answer" → "sum" because model now gives actual math).
+
+## Methodology
+
+- Pillar 1 R1: Python HF reference env (Qwen3-0.6B FP32, torch 2.11)
+- Pillar 1 R2: HF per-layer hidden state dump tool
+- Pillar 1 R3: **Token-level comparison revealed the bug before any
+  hidden-state diff was needed.**
+
+The previous 30+ rounds (R26-R50) had assumed the tokenizer was
+correct (per R32 Mission C note "drift is Qwen-common, not tokenizer").
+Reference diff methodology made the token mismatch undeniable in one
+`print(t.encode("Hello"))` call.
+
+## Lesson
+
+> **Compare tokens first, then hidden states, then layer outputs.**
+> Don't "rule out" a suspect without actually comparing to ground truth.
+
+## Next
+
+- Pillar 1 complete
+- Pillar 2 (prefill speed for long docs) now unblocked
+- Pillar 3 (document Q&A / code review / agent workflows) now possible
+- v0.19.0 release with this as headline feature
diff --git a/scripts/validate_bpe_fix.sh b/scripts/validate_bpe_fix.sh
@@ -0,0 +1,47 @@
+#!/usr/bin/env bash
+# Broad validation of the R3 BPE fix across realistic use cases.
+
+BIN="${BIN:-./build/quant}"
+MODEL="${MODEL:-models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf}"
+
+export TQ_NO_METAL=1
+export TQ_NO_MLOCK=1
+export LC_ALL=C
+
+PASS=0
+FAIL=0
+
+check() {
+    local name="$1"
+    local prompt="$2"
+    local expected="$3"
+    local n="${4:-40}"
+    local chat="${5:---chat}"
+    local out
+    out=$("$BIN" "$MODEL" $chat -p "$prompt" -n $n -T 0 2>/dev/null | tr '\n' ' ')
+    local pretty="${out:0:100}"
+    if [[ "$out" == *"$expected"* ]]; then
+        printf "  %-20s [PASS] '%s...'\n" "$name" "$pretty"
+        PASS=$((PASS+1))
+    else
+        printf "  %-20s [FAIL] need '%s' | got '%s'\n" "$name" "$expected" "$pretty"
+        FAIL=$((FAIL+1))
+    fi
+}
+
+echo "=== R4 Broad Validation — Qwen3.6-35B IQ4_XS ==="
+
+check "short_story"  "Once upon a time"                "young"        40
+check "short_code"   "def fibonacci(n):"               "return"       40  ""
+check "short_qa"     "What is the capital of France?"  "Paris"        30
+
+check "mid_recipe"   "Explain how to make a simple pasta dish with tomatoes, garlic, olive oil, salt, and pepper in a few steps." "garlic" 80
+check "mid_tech"     "Describe briefly what a hash table is in computer science and why it's useful for fast lookups in programming." "hash" 80
+
+check "long_code"    "Write a Python function that computes the nth Fibonacci number using iterative dynamic programming. It should handle edge cases including negative numbers, zero, and very large inputs. Include proper docstrings and type hints." "def" 100
+check "long_story"   "Once upon a time in a small village there lived a clever young programmer named Luna who was known throughout the kingdom for her extraordinary ability to solve the most difficult computer science problems." "Luna" 100 ""
+check "long_essay"   "Please explain in a clear and concise manner what the main differences are between supervised learning and unsupervised learning in machine learning, including typical use cases and examples of algorithms used in each approach." "learning" 120
+
+echo ""
+echo "--- Summary --- PASS=$PASS FAIL=$FAIL"
+exit $FAIL