**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.

## Phase 1 R3 — FFN magnitude error correlates with activation magnitude (2026-04-21)

Extended diagnosis: the FFN magnitude drift **scales with input activation magnitude**.

| layer | preffn norm | ffn_out norm ratio (ours/HF) | ffn_out cos |
|---:|---:|---:|---:|
| 0 | 14.1 | 0.977 | 0.9765 |
| 13 | 2.5 | 1.090 | 0.9178 |
| 26 | 63.6 | 0.813 | 0.9758 |
| **27** | **480.4** | **0.527** | 0.8915 |

Direction (cosine) is mostly preserved; **magnitude loss is the primary symptom**
and it correlates with preffn norm. This fits classic Q8 activation quantization
saturation: when a 32-element block contains outlier magnitudes, the absmax-per-32
scale is sized for the outlier, so the smaller companion values collapse onto a
few coarse quantization steps.
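
The saturation mechanism is easy to reproduce numerically. A minimal numpy sketch of Q8_0-style absmax-per-block rounding (`q8_block_roundtrip` is an illustration, not the engine's kernel):

```python
import numpy as np

def q8_block_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantize/dequantize one block with a single absmax-derived
    scale (Q8_0-style: int8 values sharing one fp scale per 32 elems)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 1.0, 32)  # typical small activations
block[0] = 500.0                  # one outlier dominates the absmax scale

out = q8_block_roundtrip(block)
# The outlier round-trips almost exactly; every companion below half a
# grid step is rounded to zero, a net loss of magnitude.
small_err = np.abs(out[1:] - block[1:]).max()
print(f"grid step {500/127:.2f}, worst small-value error {small_err:.2f}")
```

With the outlier present the effective grid step is absmax/127 ≈ 3.9, so companions below ~1.97 quantize to zero: the same magnitude-biased loss seen at L27.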

Preffn input cosine is 0.9999 at L27 → divergence is purely inside the FFN
matmul chain (gate/up/silu/down), not upstream.
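
For reference, the two per-layer metrics tabulated above (norm ratio and cosine) can be computed with a small helper; this is an illustrative sketch, not the actual tools/refparity/ interface:

```python
import numpy as np

def layer_diff(ours: np.ndarray, ref: np.ndarray) -> tuple[float, float]:
    """L2-norm ratio (ours/ref) and cosine similarity of two flattened
    activation tensors (hypothetical helper, not the refparity API)."""
    a, b = ours.ravel(), ref.ravel()
    ratio = float(np.linalg.norm(a) / np.linalg.norm(b))
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return ratio, cos

# A uniformly shrunk tensor shows the L27 signature in its pure form:
# cosine stays ~1.0 while the norm ratio drops.
ref = np.arange(1.0, 9.0).reshape(2, 4)
ratio, cos = layer_diff(0.527 * ref, ref)
```

The real L27 cosine (0.8915) is below 1.0, so direction is partially lost too, but the dominant term is the ratio.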

### Why this matters (strategic)

- The bug is quant-method-level, not per-layer logic. Q4 internal recompression
  from Q4_K/Q6_K GGUF loses precision asymmetrically with activation range.
- `TQ_NO_Q4=1` swings the opposite direction (1.54× HF) — native GGUF dequant
  is also systematically off. Both paths bias magnitude.
- Not a one-line fix. Candidates for the next round:
  - Per-block scale + `min` tracking (not just absmax) to preserve small values
  - Selective bypass: use FP32 matmul for high-magnitude-activation layers
  - Q4_K-native matmul keeping the 6-bit sub-block scales (avoids recompression)
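
The first candidate can be sketched as follows: tracking the block minimum as well as the maximum gives an asymmetric grid roughly half as coarse when a one-sided outlier stretches the range. A uint8 min/max scheme is assumed purely for illustration; the engine's actual block format may differ:

```python
import numpy as np

def absmax_q8(x: np.ndarray) -> np.ndarray:
    # Symmetric: one absmax scale, int8 grid in [-127, 127].
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -127, 127) * s

def minmax_q8(x: np.ndarray) -> np.ndarray:
    # Asymmetric: track the block min too; uint8 grid in [0, 255].
    lo, hi = x.min(), x.max()
    s = (hi - lo) / 255.0
    return np.clip(np.round((x - lo) / s), 0, 255) * s + lo

rng = np.random.default_rng(1)
block = rng.normal(0.0, 1.0, 32)
block[0] = 500.0  # one-sided outlier stretches the symmetric grid twice as far

err_sym = np.abs(absmax_q8(block) - block).max()
err_asym = np.abs(minmax_q8(block) - block).max()
print(f"max roundtrip error: absmax {err_sym:.2f} vs min/max {err_asym:.2f}")
```

This only buys ~2× on one-sided blocks; symmetric outlier patterns would still need the FP32 bypass or the Q4_K-native path.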

### What does NOT affect 35B

Load-time Q4 recompression is already auto-skipped for the 35B MoE (Q8_0 attn
path), so this specific Qwen3-0.6B bug does not cause the 35B long-gen drift.
The same failure mode (activation-magnitude sensitivity) may still apply to 35B's
DeltaNet recurrent state though — worth testing once refparity is extended
to 4B-class hybrid models.

## ★ Phase 1 R2 — Intermediate FFN dumps + Qwen3-0.6B FFN bug signature (2026-04-21) ★

### Finding