Commit f612c57

unamedkr and claude committed
state: R3 FFN drift correlates with activation magnitude (diagnosis only)
Pattern confirmed: FFN output magnitude ratio (us/hf) decreases as preffn activation norm increases. Direction is preserved (cos ≥ 0.89). Root cause is quantization-method-level — absmax-per-32 block scales lose small companion values when outliers dominate. Not a one-line fix; candidate solutions for follow-up rounds are documented in state.md. 35B is unaffected (it auto-skips load-time Q4 conversion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5bc50b1 commit f612c57

1 file changed: .claude/state.md (38 additions, 0 deletions)
@@ -3,6 +3,44 @@
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
## Phase 1 R3 — FFN magnitude error correlates with activation magnitude (2026-04-21)

Extended diagnosis: the FFN magnitude drift **scales with input activation magnitude**.
| layer | preffn norm | ffn_out ratio (us/hf) | ffn_out cos |
|---:|---:|---:|---:|
| 0 | 14.1 | 0.977 | 0.9765 |
| 13 | 2.5 | 1.090 | 0.9178 |
| 26 | 63.6 | 0.813 | 0.9758 |
| **27** | **480.4** | **0.527** | 0.8915 |
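The ratio and cosine columns can be reproduced with a small helper. This is a hedged sketch of the metric definitions only (hypothetical function name), not the actual tools/refparity implementation:

```python
import numpy as np

def magnitude_ratio_and_cos(ours: np.ndarray, ref: np.ndarray) -> tuple[float, float]:
    """L2-norm ratio (ours/ref) and cosine similarity of two flattened activations."""
    a = ours.ravel().astype(np.float64)
    b = ref.ravel().astype(np.float64)
    ratio = np.linalg.norm(a) / np.linalg.norm(b)
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(ratio), cos

# A tensor at half the reference magnitude but identical direction
# gives ratio ~0.5 and cosine ~1.0 -- the L27 signature is exactly
# this shape: cosine near 1, ratio well below 1.
ref = np.ones((4, 8), dtype=np.float32)
ours = 0.5 * ref
ratio, cos = magnitude_ratio_and_cos(ours, ref)
print(ratio, cos)
```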
Direction (cosine) is mostly preserved; **magnitude loss is the primary symptom**,
and it correlates with preffn norm. This fits classic Q8 activation-quantization
saturation: when a 32-element block contains outlier magnitudes, the absmax-per-32
scale is set by the outlier and truncates the smaller companion values.

Preffn input cosine is 0.9999 at L27, so the divergence arises purely inside the
FFN matmul chain (gate/up/silu/down), not upstream.
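The saturation mechanism can be demonstrated in a few lines. This is a toy NumPy sketch (assumed names; one shared int8 absmax scale per 32-element block), not the engine's actual kernel: as the block's outlier grows, the shared scale coarsens and the small companion values collapse toward zero.

```python
import numpy as np

def q8_absmax_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize a block to int8 with one shared absmax scale, then dequantize."""
    scale = np.abs(block).max() / 127.0
    if scale == 0.0:
        return np.zeros_like(block)
    q = np.clip(np.round(block / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
small = rng.normal(0.0, 0.05, size=31)      # 31 small "companion" values
errors = {}
for outlier in (1.0, 10.0, 100.0):          # one growing outlier per 32-element block
    block = np.append(small, outlier)
    deq = q8_absmax_roundtrip(block)
    # Mean relative error on the 31 small values only:
    errors[outlier] = np.abs(deq[:31] - small).mean() / np.abs(small).mean()
    print(f"outlier={outlier:>5}: mean relative error on small values = {errors[outlier]:.3f}")
```

With the outlier at 100x the companions' scale, the quantization step exceeds the companions themselves and they round to zero, which matches the observed magnitude loss without a direction change.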
### Why this matters (strategic)

- The bug is quant-method-level, not per-layer logic. Q4 internal recompression
  from Q4_K/Q6_K GGUF loses precision asymmetrically with activation range.
- `TQ_NO_Q4=1` swings the opposite direction (1.54× HF) — native GGUF dequant is
  also systematically off. Both paths bias magnitude.
- Not a one-line fix. Candidates for the next round:
  - Per-block scale + `min` tracking (not just absmax) to preserve small values
  - Selective bypass: use FP32 matmul for layers with high-magnitude activations
  - Q4_K native matmul with 6-bit sub-block scales preserved (avoids recompression)
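The first candidate (per-block scale + `min` tracking) can be sketched as follows. These are hypothetical helpers, not the engine's code: the symmetric path mimics a simplified Q4_0-style absmax scheme, the affine path a Q4_1-style scale+min scheme, compared on a one-sided block where absmax-only quantization wastes half its range.

```python
import numpy as np

def dequant_absmax(block: np.ndarray, levels: int = 7) -> np.ndarray:
    # Symmetric: single absmax scale, signed range [-levels, levels] (simplified Q4_0-style).
    scale = np.abs(block).max() / levels
    return np.clip(np.round(block / scale), -levels, levels) * scale

def dequant_affine(block: np.ndarray, levels: int = 15) -> np.ndarray:
    # Scale + min tracking: unsigned range [0, levels] offset by the block min (Q4_1-style).
    lo = block.min()
    scale = (block.max() - lo) / levels
    if scale == 0.0:
        return np.full_like(block, lo)
    return np.clip(np.round((block - lo) / scale), 0, levels) * scale + lo

rng = np.random.default_rng(1)
block = rng.uniform(10.0, 12.0, size=32)   # one-sided block, far from zero

err_absmax = np.abs(dequant_absmax(block) - block).mean()
err_affine = np.abs(dequant_affine(block) - block).mean()
print(f"absmax mean abs error: {err_absmax:.4f}")
print(f"affine mean abs error: {err_affine:.4f}")
```

Tracking the block minimum spends the code range on the values actually present instead of anchoring it at zero, which is the same idea behind preserving small companions next to an outlier.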
### What does NOT affect 35B

Load-time Q4 recompression is already auto-skipped for the 35B MoE (Q8_0 attn
path), so this specific Qwen3-0.6B bug does not cause the 35B long-generation
drift. The underlying mechanism (activation-magnitude sensitivity) may still
apply to the 35B's DeltaNet recurrent state; worth testing once refparity is
extended to 4B-class hybrid models.
## ★ Phase 1 R2 — Intermediate FFN dumps + Qwen3-0.6B FFN bug signature (2026-04-21) ★
### Finding
