
Commit 5bc50b1

unamedkr and claude committed
refparity: TQ_DUMP_INTERMEDIATE + FFN magnitude drift diagnosis
Added per-layer sub-stage dumps (h{l}_in/postattn/preffn/ffnout) gated behind TQ_DUMP_INTERMEDIATE=1 env. Zero impact on default dump output; opt-in for finer-grained bisection of per-layer divergences. Applied to Qwen3-0.6B Q4_K_M "Hello" prompt: attention and pre-FFN norm all match HF at cos≥0.9996; divergence isolated to FFN matmul chain. FFN output magnitude ratio vs HF: L0/L1/L13: ~1.0 (matches HF) L26: 0.81 (drifting) L27: 0.53 (catastrophic — causes post_norm cos=0.24) us.h27 ≈ hf.h27 + 0.334·hf.h26 (residual-leak fit). TQ_NO_Q4=1 swings error to 1.54× — both Q4-converted and GGUF-native paths systematically wrong beyond quantization noise. Not a one-liner; tracked for follow-up round. The methodology (not the fix) transfers to 35B: extending refparity to Qwen3.5-4B (DeltaNet hybrid, fits 16GB HF FP32) would diagnose the Qwen3.6 long-gen DeltaNet drift the same way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 161a218 commit 5bc50b1

3 files changed

Lines changed: 98 additions & 4 deletions


.claude/state.md

Lines changed: 38 additions & 0 deletions
@@ -3,6 +3,44 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R2 — Intermediate FFN dumps + Qwen3-0.6B FFN bug signature (2026-04-21) ★
+
+### Finding
+
+Extended refparity with a `TQ_DUMP_INTERMEDIATE=1` env var that produces 5
+sub-layer dumps per layer (`h{l}_in/postattn/preffn/ffnout` + final `h{l}`).
+Default off; no impact on existing dump output.
+
+Using these on Qwen3-0.6B Q4_K_M ("Hello" prompt):
+- Attention output matches HF within noise at all layers.
+- Pre-FFN (`h27_preffn`) matches HF at cos 0.9999 (5.9% L2_rel).
+- **FFN output magnitude drifts layer-wise**:
+  - Layers 0-13: ratio ~1.0 vs HF
+  - Layer 26: 0.81× HF
+  - Layer 27: **0.53× HF** (catastrophic; causes post_norm cosine 0.24)
+
+Residual-leak regression fit: `us.h27 ≈ hf.h27 + 0.334·hf.h26`.
+
+`TQ_NO_Q4=1` flips the error: ratio 1.54 instead of 0.53. Both paths are
+systematically off ⇒ beyond pure Q4 noise.
+
+### Why this matters (per grow loop)
+
+- Confirmed the framework reveals a class of bugs invisible to test_models.sh
+  (engine output still parses as English).
+- For the 35B mission: this specific bug is Q4_K_M-load-time-conversion-related
+  and doesn't touch 35B (which auto-skips Q4 conversion due to Q8_0 attn).
+  But the METHODOLOGY transfers — if we extend refparity to Qwen3.5-4B
+  (DeltaNet hybrid, still fits in 16 GB), we can diagnose the Qwen3.6 35B
+  DeltaNet drift in FP32.
+
+### Not yet landed
+
+Fix for the FFN magnitude drift on Qwen3-0.6B Q4_K_M. Bisected to the FFN
+matmul chain (the input norm matches HF). The Q4-converted and GGUF-native
+paths have opposite-sign errors ⇒ deeper investigation needed in a later
+round — not a one-liner.
+
 ## ★ Phase 1 — Reference-parity framework (2026-04-21) ★
 
 ### Delivered

src/engine/tq_transformer.c

Lines changed: 16 additions & 0 deletions
@@ -2774,6 +2774,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         for (int _i = 0; _i < dim; _i++) xs += s->x[_i] * s->x[_i];
         fprintf(stderr, "[WQ-DBG] L0 PRE-NORM s->x rms=%.4f dim=%d\n", sqrtf((float)(xs/dim)), dim);
     }
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_in", l);
+        tq_dump_hidden(_slot, s->x, dim, pos);
+    }
     tq_rmsnorm(s->xb, s->x, layer->attn_norm, dim, c->rms_norm_eps);
 
     /* Begin layer-level GPU batch scope: all GGUF matmuls in this layer
@@ -2840,6 +2844,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
     /* else: skip (should not happen for valid models) */
 
     /* FFN Block — MoE or Dense SwiGLU/GeGLU */
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_postattn", l);
+        tq_dump_hidden(_slot, s->x, dim, pos);
+    }
 
     /* Gemma 4 dual-FFN: Dense (shared MLP) and MoE run IN PARALLEL from same input,
      * outputs summed, then final post_ffw_norm, then residual add.
@@ -3008,6 +3016,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         ffn_norm_w = layer->pre_ffn_norm;
     }
     tq_rmsnorm(s->xb, s->x, ffn_norm_w, dim, c->rms_norm_eps);
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_preffn", l);
+        tq_dump_hidden(_slot, s->xb, dim, pos);
+    }
 
     /* Per-layer intermediate dim (Gemma 4 E2B has variable FFN dim) */
     int inter = c->per_layer_inter_dim ? c->per_layer_inter_dim[l] : c->intermediate_dim;
@@ -3108,6 +3120,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         tq_rmsnorm(s->xb2, s->xb2, dense_post_norm, dim, c->rms_norm_eps);
     }
 
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_ffnout", l);
+        tq_dump_hidden(_slot, s->xb2, dim, pos);
+    }
     tq_add(s->x, s->x, s->xb2, dim);
 }
 
tools/refparity/README.md

Lines changed: 44 additions & 4 deletions
@@ -70,6 +70,20 @@ prompt) on failure.
 - `1` — divergence detected; diff report identifies the offending layer
 - `2` — environment or configuration error
 
+## Intermediate dumps (finer bisection)
+
+Set `TQ_DUMP_INTERMEDIATE=1` in addition to `TQ_DUMP_HIDDEN=dir` to get
+per-layer sub-stage dumps:
+
+- `h{l}_in.bin` — residual stream entering layer l
+- `h{l}_postattn.bin` — after self-attention + its residual add
+- `h{l}_preffn.bin` — after post-attention RMSNorm (= FFN input)
+- `h{l}_ffnout.bin` — FFN output, pre-residual-add
+- `h{l}.bin` — final (post-FFN-residual) — always dumped
+
+Useful to isolate whether a per-layer divergence comes from attention,
+FFN norm, or the FFN matmul chain.
+
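The sub-stage dumps are plain binary vectors, so comparing a pair of them takes only a few lines. A minimal sketch (illustrative, not the actual refparity tooling), assuming the `.bin` files are raw little-endian float32 vectors and hypothetical `us/` and `hf/` dump directories:

```python
import math
import struct

def load_f32(path):
    """Load one dump, assuming a raw little-endian float32 layout."""
    with open(path, "rb") as f:
        raw = f.read()
    return struct.unpack("<%df" % (len(raw) // 4), raw)

def cos_and_l2rel(us, hf):
    """Cosine similarity and relative L2 error between two hidden-state dumps."""
    dot = sum(a * b for a, b in zip(us, hf))
    n_us = math.sqrt(sum(a * a for a in us))
    n_hf = math.sqrt(sum(b * b for b in hf))
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(us, hf)))
    return dot / (n_us * n_hf), l2 / n_hf

# e.g. cos, l2rel = cos_and_l2rel(load_f32("us/h27_preffn.bin"),
#                                 load_f32("hf/h27_preffn.bin"))
```

Run per slot (`h{l}_in`, `h{l}_postattn`, …) this localizes a divergence to the sub-stage where cos first drops.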
 ## Methodology notes
 
 - **Quantization noise baseline**: expect ~1-3% L2_rel per layer due to Q4/Q5
@@ -99,7 +113,33 @@ First end-to-end run on `Qwen3-0.6B-Q4_K_M.gguf`, "Hello" prompt:
 | post_norm | ~100% | 0.24 | **real divergence — needs investigation** |
 | logits | | 0.51 | top-1 mismatch (HF 21806 vs engine 11) |
 
-Framework correctly identifies the post_norm + logits divergence as a genuine
-engine bug (cannot be explained by Q4 quantization alone — mid-layer stays
-at 3.9%). This is tracked as a separate investigation; Phase 1's goal is only
-to ship the detection infrastructure.
+### Layer-27 FFN magnitude divergence (diagnosed via intermediate dumps)
+
+With `TQ_DUMP_INTERMEDIATE=1`, comparing our FFN output magnitude to HF's
+manual per-layer replay across layers:
+
+| layer | us_ffn_norm | hf_ffn_norm | ratio |
+|---:|---:|---:|---:|
+| 0 | 5.390 | 5.517 | 0.977 |
+| 1 | 15.170 | 14.920 | 1.017 |
+| 13 | 0.209 | 0.192 | 1.090 |
+| 26 | 245.9 | 302.4 | 0.813 |
+| 27 | **2648.0** | **5022.8** | **0.527** |
+
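The ratio column is just the L2-norm ratio of the paired FFN-output dumps; a small sketch of that metric (illustrative only, not the refparity script itself):

```python
import math

def l2norm(v):
    """Euclidean norm of a hidden-state vector."""
    return math.sqrt(sum(x * x for x in v))

def ffn_ratio(us_ffnout, hf_ffnout):
    """us/hf magnitude ratio for one layer's FFN output dump."""
    return l2norm(us_ffnout) / l2norm(hf_ffnout)
```

Applied per layer to the two `h{l}_ffnout.bin` vectors, this reproduces the ratio column (≈1.0 when healthy, 0.527 at layer 27).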
+FFN magnitude ratio is clean for early layers but drifts toward the end of
+the stack. At layer 27 it is roughly half of HF's, and since layer 27's FFN
+output in HF cancels most of the residual stream (6794 → 2214), the lost
+magnitude means the engine fails to reproduce that collapse. Net result:
+engine h27 ≈ hf.h27 + 0.33·hf.h26 (residual-leak signature, α=0.334).
+
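The α=0.334 figure is the closed-form least-squares coefficient for `us.h27 ≈ hf.h27 + α·hf.h26`. A sketch of that one-parameter regression (illustrative, not the actual tooling):

```python
def residual_leak_alpha(us_h, hf_h, hf_prev):
    """Least-squares alpha minimizing ||us_h - (hf_h + alpha * hf_prev)||^2.

    Closed form: alpha = <us_h - hf_h, hf_prev> / <hf_prev, hf_prev>.
    """
    num = sum((u - h) * p for u, h, p in zip(us_h, hf_h, hf_prev))
    den = sum(p * p for p in hf_prev)
    return num / den
```

An alpha well above the quantization-noise floor, as here, is the "residual leak" signature: part of the previous layer's residual stream surviving into the next hidden state.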
+Pre-FFN input (`h27_preffn`) matches HF at cos=0.9999. The divergence is
+inside the FFN matmul chain (gate/up/silu/down).
+
+`TQ_NO_Q4=1` (skip load-time Q4 recompression) swings the error the other
+way: the FFN ratio becomes ~1.54. Both paths are systematically off,
+suggesting a secondary kernel-level bug distinct from quantization noise.
+Tracked as follow-up.
+
+Framework's value: it identified a real bug that all prior test harnesses
+(test_models.sh, PPL, cosine checks) missed, because the output still parses
+as English ("Hello, I need to find the value of…" vs HF's "Hello Answer").
