
Commit db4d87a

unamedkr and claude committed

pillar1(R7): real-use validation uncovers long-sequence transformer bug

Real-world test: Qwen3.6-35B prefill of a 235-word English document → UTF-8 garbage + repetition-loop detection. TTFT 112s (also slow). Qwen3-0.6B reproduces the pattern at ~100-120 prompt tokens via binary search (n=20/30/40/50/70/90 words). The HF reference handles the same 144-token input cleanly. Tokens match HF after the R3 fix, so this is a SEPARATE bug in the transformer forward / KV cache path, not the tokenizer.

bench/results/2026-04-20_longseq_transformer_bug.md records the reproducer, the threshold evidence, candidate root causes (KV quant, batched prefill path, RoPE at growing pos, long-seq attention branch), and the next-step methodology: apply the same HF reference diff to transformer outputs (not just tokens) to find the divergence layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 1e0c3d7 commit db4d87a

1 file changed: bench/results/2026-04-20_longseq_transformer_bug.md (79 additions, 0 deletions)

# Long-Sequence Transformer Bug — Discovered 2026-04-20

**After R3 BPE fix**, tokenization is now correct. But a **SEPARATE transformer-forward bug** manifests on inputs of ~100+ tokens. This document records the reproducer for follow-on debugging.

## Reproducer (Qwen3-0.6B Q4_K_M, deterministic)

```bash
# Generate 50 synthetic words = ~144 tokens after tokenization
WORDS=$(python -c "print(' '.join(f'word{i}' for i in range(1,51)))")
./build/quant models/Qwen3-0.6B-Q4_K_M.gguf -p "$WORDS. Continue:" -n 12 -T 0
```

## Observed behavior (binary search by prompt length)

| Prompt words | Token count | Our output | HF output (reference) |
|---|---:|---|---|
| 20 | 54 | `"word., check ( (-lang PL"` | (coherent) |
| 30 | 84 | `"nahme innocence thisds..."` | (coherent) |
| 40 | 114 | **UTF-8 garbage** (`� � �egan...`) | (coherent) |
| 50 | 144 | UTF-8 garbage | `" word51 word52 word53 word54"` |
| 70 | 204 | UTF-8 garbage | (coherent) |
| 90 | 264 | UTF-8 garbage | (coherent) |

**Break threshold: ~100-120 prompt tokens**, where Qwen3-0.6B Q4_K_M transitions from semi-coherent output to UTF-8 byte garbage.
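
For reference, a minimal sweep along these lines reproduces the table (a sketch: it assumes the `./build/quant` binary and model path from the reproducer above, and simply prints each run for manual inspection):

```python
# Sweep prompt length to bracket the breakage threshold.
import subprocess

MODEL = "models/Qwen3-0.6B-Q4_K_M.gguf"

for n in (20, 30, 40, 50, 70, 90):
    words = " ".join(f"word{i}" for i in range(1, n + 1))
    result = subprocess.run(
        ["./build/quant", MODEL, "-p", f"{words}. Continue:",
         "-n", "12", "-T", "0"],
        capture_output=True, text=True,
    )
    print(f"--- {n} words ---\n{result.stdout.strip()}")
```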

## Why this is NOT the R3 BPE bug

The R3 fix (tq_tokenizer.c:1442) made our tokens match HF. Token IDs are correct across all prompt lengths. So any remaining divergence is in the **transformer forward** or **KV cache** pipeline.

Confirmed: our Qwen3-0.6B produces token 9707 for "Hello" (matching HF).
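
The HF side of that token-ID check is easy to rerun (a sketch, assuming the `transformers` package and the `Qwen/Qwen3-0.6B` checkpoint; compare the printed IDs against the engine's token dump):

```python
# Tokenize the reproducer prompt with the HF tokenizer for comparison.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

print(tok("Hello")["input_ids"])  # expect [9707], per the check above

words = " ".join(f"word{i}" for i in range(1, 51))
ids = tok(f"{words}. Continue:")["input_ids"]
print(len(ids), ids[:10])  # ~144 tokens; must match the engine's IDs exactly
```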

## Why this is NOT just a small-model artifact

- Qwen3-0.6B HF at FP32 produces coherent output on the 144-token input.
- Qwen3.6-35B on a 235-word clean-English document also garbles (caught via repetition-loop detection).
- Both models fail at similar relative thresholds when run through our engine.

## Candidate root causes (for follow-on debugging)

1. **KV cache quantization degradation past N tokens**
   - Default is `turbo_kv_4b` KV compression. Test: set `TQ_KV_TYPE=fp32` to isolate.
2. **Batched prefill path** (`tq_forward_batch`)
   - Check against a per-token baseline: set `TQ_NO_BATCH_PREFILL=1` to force single-token forward and see whether the threshold moves. The sketch after this list exercises both of these toggles.
3. **Partial-rotary RoPE with growing position** at pos ≥ threshold
   - Qwen3 has `rope_theta=1000000`; large `pos` values may hit a numerical issue in the sin/cos table.
4. **Attention dispatch branch at seq_len > 128 / > 256**
   - Some kernels have special-case paths for long sequences.
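
A quick isolation harness for candidates 1 and 2 (a sketch, assuming the engine reads the env vars exactly as named above; if either toggle restores coherent output at 50 words, that path is implicated):

```python
# Re-run the 50-word reproducer under each isolation toggle.
import os
import subprocess

words = " ".join(f"word{i}" for i in range(1, 51))
cmd = ["./build/quant", "models/Qwen3-0.6B-Q4_K_M.gguf",
       "-p", f"{words}. Continue:", "-n", "12", "-T", "0"]

toggles = {
    "baseline": {},
    "fp32 KV cache (candidate 1)": {"TQ_KV_TYPE": "fp32"},
    "single-token prefill (candidate 2)": {"TQ_NO_BATCH_PREFILL": "1"},
}

for name, extra in toggles.items():
    result = subprocess.run(cmd, env={**os.environ, **extra},
                            capture_output=True, text=True)
    print(f"--- {name} ---\n{result.stdout.strip()}")
```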

## Next steps (methodology)

Apply the Pillar 1 methodology to the transformer forward:

- Run HF Qwen3-0.6B on the 144-token reproducer; capture all 28 post-layer hidden states + logits (at the LAST position).
- Add `TQ_DUMP_POS=last` to our engine so the dump fires at the LAST prefill position instead of pos=0.
- Diff layer-by-layer. The first layer with an L2 diff > 1% of the HF norm is the divergence point; bisect further into attn/FFN/norm sub-steps. A sketch of the HF capture step follows this list.
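
A sketch of the HF capture step (assumes `transformers` + `torch` with FP32 weights; the `hf_ref_144tok.npz` filename and the relative-L2 criterion in the trailing comment are illustrative, not fixed):

```python
# Capture per-layer hidden states + logits at the LAST prefill position.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", torch_dtype=torch.float32
)
model.eval()

words = " ".join(f"word{i}" for i in range(1, 51))
inputs = tok(f"{words}. Continue:", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [1..28] are the 28 layer outputs.
ref = {f"layer{i:02d}": h[0, -1].numpy()
       for i, h in enumerate(out.hidden_states)}
ref["logits"] = out.logits[0, -1].numpy()
np.savez("hf_ref_144tok.npz", **ref)

# Compare against our engine's TQ_DUMP_POS=last dumps:
#   rel = ||ours - ref|| / ||ref||; the first layer with rel > 0.01 diverges.
```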

## Impact

The R3 BPE fix still delivered real value: short-prompt coherence on all Qwen3 models, Phi-3 math, and improved overall engine quality.

Long-document use cases (document Q&A, code review, long narrative continuation) remain blocked by this separate bug until it is fixed.

Pillar 2 (long prefill speed) is also affected: even if we sped prefill up 10×, the output would still be garbage, so speed work should wait until this bug is fixed.
