# Long-Sequence Transformer Bug — Discovered 2026-04-20

**After the R3 BPE fix**, tokenization is now correct. But a **SEPARATE
transformer-forward bug** manifests on inputs of ~100+ tokens. This
document records the reproducer for follow-on debugging.

## Reproducer (Qwen3-0.6B Q4_K_M, deterministic)

```bash
# Generate 50 synthetic words = ~144 tokens after tokenization
WORDS=$(python -c "print(' '.join(f'word{i}' for i in range(1,51)))")
./build/quant models/Qwen3-0.6B-Q4_K_M.gguf -p "$WORDS. Continue:" -n 12 -T 0
```

## Observed behavior (binary search by prompt length)

| Prompt words | Token count | Our output | HF output (reference) |
|---|---:|---|---|
| 20 | 54 | `"word., check ( (-lang PL"` | (coherent) |
| 30 | 84 | `"nahme innocence thisds..."` | (coherent) |
| 40 | 114 | **UTF-8 garbage** (`� � �egan...`) | (coherent) |
| 50 | 144 | UTF-8 garbage | `" word51 word52 word53 word54"` ✓ |
| 70 | 204 | UTF-8 garbage | (coherent) |
| 90 | 264 | UTF-8 garbage | (coherent) |

**Break threshold: ~100-120 prompt tokens**, where Qwen3-0.6B Q4_K_M
transitions from semi-coherent output to UTF-8 byte garbage.
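
The length sweep above can be rerun with a small driver script. This is a
minimal sketch, not part of the engine: it assumes the same CLI flags as the
reproducer and uses the UTF-8 replacement character (U+FFFD) in the decoded
output as a crude garbage heuristic.

```python
"""Sweep prompt length to locate the garbling threshold (sketch)."""
import subprocess

MODEL = "models/Qwen3-0.6B-Q4_K_M.gguf"

def generate(n_words: int) -> str:
    words = " ".join(f"word{i}" for i in range(1, n_words + 1))
    proc = subprocess.run(
        ["./build/quant", MODEL, "-p", f"{words}. Continue:", "-n", "12", "-T", "0"],
        capture_output=True, text=True, errors="replace",
    )
    return proc.stdout

for n in (20, 30, 40, 50, 70, 90):
    out = generate(n)
    status = "GARBAGE" if "\ufffd" in out else "ok"
    print(f"{n:3d} words -> {status:7s} {out[-60:]!r}")
```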

## Why this is NOT the R3 BPE bug

The R3 fix (tq_tokenizer.c:1442) made our tokens match HF, and token IDs
are correct across all prompt lengths. So any remaining divergence lies in
the **transformer forward** or **KV cache** pipeline.

Confirmed: our Qwen3-0.6B produces token 9707 for "Hello", matching HF.
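
To re-check tokenization parity at the longer lengths, the HF reference token
IDs can be dumped with a short script and compared against the engine's token
dump. A minimal sketch, assuming the `transformers` package and hub access to
`Qwen/Qwen3-0.6B`:

```python
"""Dump HF reference token IDs for the reproducer prompts (sketch)."""
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

for n_words in (20, 30, 40, 50, 70, 90):
    prompt = " ".join(f"word{i}" for i in range(1, n_words + 1)) + ". Continue:"
    ids = tok.encode(prompt)
    print(f"{n_words:2d} words -> {len(ids)} tokens: {ids[:8]} ...")

print(tok.encode("Hello"))  # expected to contain 9707, per the check above
```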

## Why this is NOT just a small-model artifact

- Qwen3-0.6B HF at FP32 produces coherent output on the 144-token input.
- Qwen3.6-35B also garbles on a 235-word clean-English prompt (caught by the
  repetition-loop detector).
- Both models fail at similar relative thresholds when run through our engine.

## Candidate root causes (for follow-on debugging)

1. **KV cache quantization degradation past N tokens**
   - Default `turbo_kv_4b` KV compression. Test: set `TQ_KV_TYPE=fp32` to
     isolate (see the isolation sketch after this list).
2. **Batched prefill path** (`tq_forward_batch`)
   - Check against the per-token baseline: set `TQ_NO_BATCH_PREFILL=1` to
     force single-token forward and see whether the threshold moves.
3. **Partial rotary RoPE with growing position** at pos ≥ the break threshold
   - Qwen3 has `rope_theta=1000000`; large `pos` values may hit a numerical
     issue in the sin/cos table (see the precision sketch after this list).
4. **Attention dispatch branch at seq_len > 128 / > 256**
   - Some kernels have special-case paths for long sequences.
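
Candidates 1 and 2 can be separated quickly by rerunning the 144-token
reproducer with the two environment overrides toggled. A minimal driver
sketch (the env vars are the ones named above; the garbage check is the same
U+FFFD heuristic as the sweep script):

```python
"""Toggle KV cache type and batched prefill on the 144-token reproducer (sketch)."""
import os
import subprocess

MODEL = "models/Qwen3-0.6B-Q4_K_M.gguf"
PROMPT = " ".join(f"word{i}" for i in range(1, 51)) + ". Continue:"

CONFIGS = {
    "baseline": {},
    "fp32 KV": {"TQ_KV_TYPE": "fp32"},
    "no batch prefill": {"TQ_NO_BATCH_PREFILL": "1"},
    "both": {"TQ_KV_TYPE": "fp32", "TQ_NO_BATCH_PREFILL": "1"},
}

for name, overrides in CONFIGS.items():
    env = {**os.environ, **overrides}
    proc = subprocess.run(
        ["./build/quant", MODEL, "-p", PROMPT, "-n", "12", "-T", "0"],
        capture_output=True, text=True, errors="replace", env=env,
    )
    status = "GARBAGE" if "\ufffd" in proc.stdout else "ok"
    print(f"{name:18s} -> {status}")
```

Whichever override restores coherent output points at the corresponding
candidate; if neither does, candidates 3 and 4 move up the list.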
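
For candidate 3, the table-precision question can be probed offline by
comparing float32 and float64 sin/cos of the RoPE angles at increasing
positions. A minimal sketch; `rope_theta=1000000` comes from the text above,
while the head_dim of 128 is an assumption about the Qwen3-0.6B layout, not a
value verified against our engine:

```python
"""Float32 vs float64 error in RoPE cos values at large positions (sketch)."""
import numpy as np

head_dim = 128                 # assumed Qwen3-0.6B head_dim
theta = 1_000_000.0            # rope_theta per the text above
inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)   # float64

for pos in (64, 128, 256, 1024, 4096):
    a64 = pos * inv_freq
    a32 = (np.float32(pos) * inv_freq.astype(np.float32)).astype(np.float64)
    err = np.max(np.abs(np.cos(a64) - np.cos(a32)))
    print(f"pos={pos:5d}  max |cos_f64 - cos_f32| = {err:.3e}")
```

If the error stays near float32 rounding (~1e-7) even at pos 4096, a
sin/cos-table precision problem is unlikely and candidate 3 can be
deprioritized.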

## Next steps (methodology)

Apply the Pillar 1 methodology to the transformer forward:
- Run HF Qwen3-0.6B on the 144-token reproducer and capture all 28
  post-layer hidden states plus logits at the LAST position (see the capture
  sketch below).
- Add `TQ_DUMP_POS=last` to our engine so the dump fires at the LAST
  prefill position instead of pos=0.
- Diff layer by layer. The first layer whose L2 diff exceeds 1% of the HF
  norm is the divergence point; bisect further into the attn/FFN/norm
  sub-steps (see the diff sketch below).
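
The HF side of that capture can be done with a short script. A minimal
sketch, assuming `torch` and `transformers` are available; the
`dumps/hf_layer{i}.npy` naming is ours, chosen only for illustration:

```python
"""Capture HF post-layer hidden states and last-position logits (sketch)."""
import os
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-0.6B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).eval()

prompt = " ".join(f"word{i}" for i in range(1, 51)) + ". Continue:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

os.makedirs("dumps", exist_ok=True)
# hidden_states[0] is the embedding output; [1:] are the post-layer states.
for i, h in enumerate(out.hidden_states[1:]):
    np.save(f"dumps/hf_layer{i}.npy", h[0, -1].numpy())
np.save("dumps/hf_logits.npy", out.logits[0, -1].numpy())
```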
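
Once our engine writes matching dumps, the layer-by-layer comparison itself is
a few lines. A minimal sketch; the `dumps/tq_layer{i}.npy` names are
hypothetical placeholders for whatever format `TQ_DUMP_POS=last` ends up
producing:

```python
"""Report the first layer whose relative L2 diff against HF exceeds 1% (sketch)."""
import numpy as np

N_LAYERS = 28
for i in range(N_LAYERS):
    hf = np.load(f"dumps/hf_layer{i}.npy").astype(np.float64)
    tq = np.load(f"dumps/tq_layer{i}.npy").astype(np.float64)  # hypothetical engine dump
    rel = np.linalg.norm(tq - hf) / np.linalg.norm(hf)
    marker = "  <-- divergence point" if rel > 0.01 else ""
    print(f"layer {i:2d}: relative L2 diff = {rel:.4%}{marker}")
    if rel > 0.01:
        break  # bisect into attn / FFN / norm sub-steps from here
```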

## Impact

The R3 BPE fix still delivered real value: short-prompt coherence on all
Qwen3 models and on Phi-3 math prompts, plus improved overall engine quality.

Long-document use cases (document Q&A, code review, long narrative
continuation) remain blocked by this separate bug until it is fixed.

Pillar 2 (long prefill speed) is also affected: even a 10× prefill speedup
would still produce garbage output, so that speed work should wait until this
bug is fixed.