
Commit 5bc50b1

unamedkr and claude committed
refparity: TQ_DUMP_INTERMEDIATE + FFN magnitude drift diagnosis
Added per-layer sub-stage dumps (h{l}_in/postattn/preffn/ffnout) gated behind TQ_DUMP_INTERMEDIATE=1 env. Zero impact on default dump output; opt-in for finer-grained bisection of per-layer divergences. Applied to Qwen3-0.6B Q4_K_M "Hello" prompt: attention and pre-FFN norm all match HF at cos≥0.9996; divergence isolated to FFN matmul chain. FFN output magnitude ratio vs HF: L0/L1/L13: ~1.0 (matches HF) L26: 0.81 (drifting) L27: 0.53 (catastrophic — causes post_norm cos=0.24) us.h27 ≈ hf.h27 + 0.334·hf.h26 (residual-leak fit). TQ_NO_Q4=1 swings error to 1.54× — both Q4-converted and GGUF-native paths systematically wrong beyond quantization noise. Not a one-liner; tracked for follow-up round. The methodology (not the fix) transfers to 35B: extending refparity to Qwen3.5-4B (DeltaNet hybrid, fits 16GB HF FP32) would diagnose the Qwen3.6 long-gen DeltaNet drift the same way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 161a218 commit 5bc50b1

3 files changed

Lines changed: 98 additions & 4 deletions


.claude/state.md

Lines changed: 38 additions & 0 deletions
@@ -3,6 +3,44 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R2 — Intermediate FFN dumps + Qwen3-0.6B FFN bug signature (2026-04-21) ★
+
+### Finding
+
+Extended refparity with a `TQ_DUMP_INTERMEDIATE=1` env var that produces 5
+sub-layer dumps per layer (`h{l}_in/postattn/preffn/ffnout` + final `h{l}`).
+Default off; no impact on existing dump output.
+
+Using these on Qwen3-0.6B Q4_K_M ("Hello" prompt):
+- Attention output matches HF within noise at all layers.
+- Pre-FFN (`h27_preffn`) matches HF at cos 0.9999 (5.9% L2_rel).
+- **FFN output magnitude drifts layer-wise**:
+  - Layers 0-13: ratio ~1.0 vs HF
+  - Layer 26: 0.81× HF
+  - Layer 27: **0.53× HF** (catastrophic; causes post_norm cosine 0.24)
+
+Residual-leak regression fit: `us.h27 ≈ hf.h27 + 0.334·hf.h26`.
+
+`TQ_NO_Q4=1` flips the error: ratio 1.54 instead of 0.53. Both paths are
+systematically off ⇒ beyond pure Q4 noise.
+
+### Why this matters (per grow loop)
+
+- Confirmed the framework reveals a class of bugs invisible to test_models.sh
+  (engine output still parses as English).
+- For the 35B mission: this specific bug is Q4_K_M-load-time-conversion-related
+  and doesn't touch 35B (which auto-skips Q4 conversion due to Q8_0 attn).
+  But the METHODOLOGY transfers — if we extend refparity to Qwen3.5-4B
+  (DeltaNet hybrid, still fits in 16 GB), we can diagnose the Qwen3.6 35B
+  DeltaNet drift in FP32.
+
+### Not yet landed
+
+Fix for the FFN magnitude drift on Qwen3-0.6B Q4_K_M. Bisected to the FFN
+matmul chain (the input norm matches HF). The Q4-converted and GGUF-native
+paths have opposite-sign errors ⇒ deeper investigation needed in a later
+round — not a one-liner.
+
 ## ★ Phase 1 — Reference-parity framework (2026-04-21) ★
 
 ### Delivered

src/engine/tq_transformer.c

Lines changed: 16 additions & 0 deletions
@@ -2774,6 +2774,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         for (int _i = 0; _i < dim; _i++) xs += s->x[_i] * s->x[_i];
         fprintf(stderr, "[WQ-DBG] L0 PRE-NORM s->x rms=%.4f dim=%d\n", sqrtf((float)(xs/dim)), dim);
     }
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_in", l);
+        tq_dump_hidden(_slot, s->x, dim, pos);
+    }
     tq_rmsnorm(s->xb, s->x, layer->attn_norm, dim, c->rms_norm_eps);
 
     /* Begin layer-level GPU batch scope: all GGUF matmuls in this layer
@@ -2840,6 +2844,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
     /* else: skip (should not happen for valid models) */
 
     /* FFN Block — MoE or Dense SwiGLU/GeGLU */
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_postattn", l);
+        tq_dump_hidden(_slot, s->x, dim, pos);
+    }
 
     /* Gemma 4 dual-FFN: Dense (shared MLP) and MoE run IN PARALLEL from same input,
      * outputs summed, then final post_ffw_norm, then residual add.
@@ -3008,6 +3016,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         ffn_norm_w = layer->pre_ffn_norm;
     }
     tq_rmsnorm(s->xb, s->x, ffn_norm_w, dim, c->rms_norm_eps);
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_preffn", l);
+        tq_dump_hidden(_slot, s->xb, dim, pos);
+    }
 
     /* Per-layer intermediate dim (Gemma 4 E2B has variable FFN dim) */
     int inter = c->per_layer_inter_dim ? c->per_layer_inter_dim[l] : c->intermediate_dim;
@@ -3108,6 +3120,10 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         tq_rmsnorm(s->xb2, s->xb2, dense_post_norm, dim, c->rms_norm_eps);
     }
 
+    if (getenv("TQ_DUMP_INTERMEDIATE")) {
+        char _slot[24]; snprintf(_slot, sizeof(_slot), "h%d_ffnout", l);
+        tq_dump_hidden(_slot, s->xb2, dim, pos);
+    }
     tq_add(s->x, s->x, s->xb2, dim);
 }
 
tools/refparity/README.md

Lines changed: 44 additions & 4 deletions
@@ -70,6 +70,20 @@ prompt) on failure.
 - `1` — divergence detected; diff report identifies the offending layer
 - `2` — environment or configuration error
 
+## Intermediate dumps (finer bisection)
+
+Set `TQ_DUMP_INTERMEDIATE=1` in addition to `TQ_DUMP_HIDDEN=dir` to get
+per-layer sub-stage dumps:
+
+- `h{l}_in.bin` — residual stream entering layer l
+- `h{l}_postattn.bin` — after self-attention + its residual add
+- `h{l}_preffn.bin` — after post-attention RMSNorm (= FFN input)
+- `h{l}_ffnout.bin` — FFN output, pre-residual-add
+- `h{l}.bin` — final (post-FFN-residual) — always dumped
+
+Useful to isolate whether a per-layer divergence comes from attention,
+FFN norm, or the FFN matmul chain.
+
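The sub-stage dumps are plain binary vectors, so comparing a pair of them takes only a few lines. A minimal sketch (illustrative, not the actual refparity tooling), assuming the `.bin` files are raw little-endian float32 vectors and hypothetical `us/` and `hf/` dump directories:

```python
import math
import struct

def load_f32(path):
    """Load one dump, assuming a raw little-endian float32 layout."""
    with open(path, "rb") as f:
        raw = f.read()
    return struct.unpack("<%df" % (len(raw) // 4), raw)

def cos_and_l2rel(us, hf):
    """Cosine similarity and relative L2 error between two hidden-state dumps."""
    dot = sum(a * b for a, b in zip(us, hf))
    n_us = math.sqrt(sum(a * a for a in us))
    n_hf = math.sqrt(sum(b * b for b in hf))
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(us, hf)))
    return dot / (n_us * n_hf), l2 / n_hf

# e.g. cos, l2rel = cos_and_l2rel(load_f32("us/h27_preffn.bin"),
#                                 load_f32("hf/h27_preffn.bin"))
```

Run per slot (`h{l}_in`, `h{l}_postattn`, …) this localizes a divergence to the sub-stage where cos first drops.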
 ## Methodology notes
 
 - **Quantization noise baseline**: expect ~1-3% L2_rel per layer due to Q4/Q5
@@ -99,7 +113,33 @@ First end-to-end run on `Qwen3-0.6B-Q4_K_M.gguf`, "Hello" prompt:
 | post_norm | ~100% | 0.24 | **real divergence — needs investigation** |
 | logits | | 0.51 | top-1 mismatch (HF 21806 vs engine 11) |
 
-Framework correctly identifies the post_norm + logits divergence as a genuine
-engine bug (cannot be explained by Q4 quantization alone — mid-layer stays
-at 3.9%). This is tracked as a separate investigation; Phase 1's goal is only
-to ship the detection infrastructure.
+### Layer-27 FFN magnitude divergence (diagnosed via intermediate dumps)
+
+With `TQ_DUMP_INTERMEDIATE=1`, comparing our FFN output magnitude to HF's
+manual per-layer replay across layers:
+
+| layer | us_ffn_norm | hf_ffn_norm | ratio |
+|---:|---:|---:|---:|
+| 0 | 5.390 | 5.517 | 0.977 |
+| 1 | 15.170 | 14.920 | 1.017 |
+| 13 | 0.209 | 0.192 | 1.090 |
+| 26 | 245.9 | 302.4 | 0.813 |
+| 27 | **2648.0** | **5022.8** | **0.527** |
+
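The ratio column is just the L2-norm ratio of the paired FFN-output dumps; a small sketch of that metric (illustrative only, not the refparity script itself):

```python
import math

def l2norm(v):
    """Euclidean norm of a hidden-state vector."""
    return math.sqrt(sum(x * x for x in v))

def ffn_ratio(us_ffnout, hf_ffnout):
    """us/hf magnitude ratio for one layer's FFN output dump."""
    return l2norm(us_ffnout) / l2norm(hf_ffnout)
```

Applied per layer to the two `h{l}_ffnout.bin` vectors, this reproduces the ratio column (≈1.0 when healthy, 0.527 at layer 27).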
+FFN magnitude ratio is clean for early layers but drifts toward the end of
+the stack. At layer 27 it is roughly half of HF's, and since layer 27's FFN
+output in HF cancels most of the residual stream (6794 → 2214), the lost
+magnitude means the engine fails to reproduce that collapse. Net result:
+engine h27 ≈ hf.h27 + 0.33·hf.h26 (residual-leak signature, α=0.334).
+
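The α=0.334 figure is the closed-form least-squares coefficient for `us.h27 ≈ hf.h27 + α·hf.h26`. A sketch of that one-parameter regression (illustrative, not the actual tooling):

```python
def residual_leak_alpha(us_h, hf_h, hf_prev):
    """Least-squares alpha minimizing ||us_h - (hf_h + alpha * hf_prev)||^2.

    Closed form: alpha = <us_h - hf_h, hf_prev> / <hf_prev, hf_prev>.
    """
    num = sum((u - h) * p for u, h, p in zip(us_h, hf_h, hf_prev))
    den = sum(p * p for p in hf_prev)
    return num / den
```

An alpha well above the quantization-noise floor, as here, is the "residual leak" signature: part of the previous layer's residual stream surviving into the next hidden state.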
+Pre-FFN input (`h27_preffn`) matches HF at cos=0.9999. The divergence is
+inside the FFN matmul chain (gate/up/silu/down).
+
+`TQ_NO_Q4=1` (skip load-time Q4 recompression) swings the error the other
+way: the FFN ratio becomes ~1.54. Both paths are systematically off,
+suggesting a secondary kernel-level bug distinct from quantization noise.
+Tracked as follow-up.
+
+Framework's value: it identified a real bug that all prior test harnesses
+(test_models.sh, PPL, cosine checks) missed, because the output still parses
+as English ("Hello, I need to find the value of…" vs HF's "Hello Answer").
