
Commit b6b5f09

unamedkr and claude committed
pillar1.5(R1): restore QK-norm for pure Qwen3 (R40 was over-broad)
R40 disabled QK-norm for all "qwen" arch GGUFs. That was correct for
Qwen3.5/3.6 HYBRID (DeltaNet + self-attn, delta_n_heads > 0) — those
degrade when QK-norm is applied to their 10 self-attn layers. But pure
Qwen3 (0.6B..32B, self-attn only) REQUIRES q_norm/k_norm. Without them,
long-prompt attention (pos>=1) has unnormalized Q·K scores, the residual
stream explodes at layer 2 (norm 5396 vs HF 10), and the output is UTF-8
garbage.

Found via HF reference diff methodology (tools/pillar1/diff_layers.py,
also added here):
- Layer-by-layer cosine/L2 at pos=1 with 144-token input
- Layer 0 cosine 0.98 at pos=0 but 0.74 at pos=1 → attention at pos>=1 broken
- h2 norm: ours 5396 vs HF 10 → catastrophic residual stream
- With TQ_FORCE_QK_NORM=1: h2 norm normalizes to ~11 (close to HF)

Fix: restrict the QK-norm disable to `delta_n_heads > 0` only. Drop the
over-broad GGUF-arch name match. Pure Qwen3 now applies q_norm/k_norm
per-head as HF does.

Real-world output (per-token prefill, 50-word synthetic prompt):
- BEFORE: "lenameuously... catchØ�Williamson" (UTF-8 garbage)
- AFTER: " word11223: word3: Word length?" (pattern-matching English)

Regression: 15/15 test_models + 4/4 test_tokenizer PASS.

Known remaining: the batched prefill path (tq_forward_batch) is still
broken independently. That path DOES apply QK-norm unconditionally
(line 3615) but still produces garbage — separate bug for follow-on R2+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1adc6d2 commit b6b5f09

2 files changed

Lines changed: 70 additions & 6 deletions


src/engine/tq_transformer.c

Lines changed: 11 additions & 6 deletions
@@ -1201,14 +1201,19 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
      * TQ_NO_QK_NORM=1 forces off (diagnostic).
      * TQ_FORCE_QK_NORM=1 forces on (Gemma fallback for Qwen if ever
      * the convention is fixed). */
+    /* Pillar 1.5 R1 fix: the R40 arch-conditional disable was too broad.
+     * Pure Qwen3 (0.6B..32B, self-attn only) REQUIRES q_norm/k_norm —
+     * without them, long-prompt (pos>=1) attention corrupts via
+     * un-normalized Q·K scores, producing norm explosion at layer 2+
+     * and UTF-8 garbage output.
+     *
+     * Qwen3.5/3.6 HYBRID (DeltaNet + self-attn, delta_n_heads > 0) was
+     * empirically shown at R40 to degrade with QK-norm applied to the
+     * 10 self-attn layers. Keep that disabled. */
     int _qknorm_disabled = (getenv("TQ_NO_QK_NORM") != NULL);
-    int _is_qwen = (c->delta_n_heads > 0); /* Qwen hybrid: always */
-    if (model->gguf_ctx) {
-        tq_gguf_ctx_t* gctx = (tq_gguf_ctx_t*)model->gguf_ctx;
-        if (strstr(gctx->arch, "qwen") != NULL) _is_qwen = 1;
-    }
+    int _is_qwen_hybrid = (c->delta_n_heads > 0); /* Qwen3.5/3.6 hybrid */
     int _apply_qknorm = !_qknorm_disabled;
-    if (_is_qwen && !getenv("TQ_FORCE_QK_NORM")) _apply_qknorm = 0;
+    if (_is_qwen_hybrid && !getenv("TQ_FORCE_QK_NORM")) _apply_qknorm = 0;

     if (layer->q_norm && _apply_qknorm) {
         for (int h = 0; h < n_heads; h++) {

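For orientation, the following is a minimal numpy sketch of the per-head QK-norm that pure Qwen3 expects and that this fix restores (illustration only, not the engine code): it assumes the HF Qwen3 convention of an RMSNorm over head_dim with a learned per-channel weight, applied to every query and key head before RoPE; the eps value of 1e-6 is an assumption.

import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMS-normalize over the last axis (head_dim), then scale by the learned weight.
    var = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(var + eps) * weight

def apply_qk_norm(q, k, q_norm_w, k_norm_w):
    # q: (n_heads, head_dim), k: (n_kv_heads, head_dim) for one position.
    # Each head is normalized independently; q_norm_w / k_norm_w have length head_dim.
    return rms_norm(q, q_norm_w), rms_norm(k, k_norm_w)

Omitting this step leaves the unnormalized Q·K scores the commit message describes, which is what drove the layer-2 residual norm to 5396 vs HF's 10.
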
tools/pillar1/diff_layers.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+#!/usr/bin/env python3
+"""Layer-by-layer diff between HF reference and our engine's dumps.
+
+Input:
+  tools/pillar1/hf_dump_long.npz  (emb, h0..h27, logits per-position)
+  /tmp/qdump/*.bin                (our engine's pos=143 dumps, raw float32)
+
+Output: per-layer table of cosine, max_abs_diff, L2_relative."""
+import numpy as np, os, sys, glob
+
+HF_NPZ = sys.argv[1] if len(sys.argv) > 1 else "tools/pillar1/hf_dump_long.npz"
+US_DIR = sys.argv[2] if len(sys.argv) > 2 else "/tmp/qdump"
+POS = int(sys.argv[3]) if len(sys.argv) > 3 else 143
+
+hf = np.load(HF_NPZ)
+print(f"HF npz keys: {list(hf.keys())[:5]}... shape_h0={hf['h0'].shape}")
+print(f"Reading our dumps from {US_DIR} at position {POS}")
+print()
+print(f"{'slot':<12} {'dim':>6} {'our_norm':>10} {'hf_norm':>10} {'max_abs':>10} {'L2_rel':>10} {'cosine':>8}")
+print("-" * 70)
+
+def read_bin(path):
+    return np.fromfile(path, dtype=np.float32)
+
+slots = ["emb"] + [f"h{i}" for i in range(28)] + ["post_norm"]
+for slot in slots:
+    bin_path = os.path.join(US_DIR, f"{slot}.bin")
+    if not os.path.exists(bin_path):
+        continue
+    ours = read_bin(bin_path)
+    if slot not in hf.files:
+        continue
+    hf_arr = hf[slot]
+    if hf_arr.ndim == 2:
+        ref = hf_arr[POS]  # last-position vector for this layer
+    else:
+        ref = hf_arr
+    if ours.shape != ref.shape:
+        print(f"{slot}: shape mismatch us={ours.shape} hf={ref.shape}")
+        continue
+    diff = ours - ref
+    max_abs = np.max(np.abs(diff))
+    l2 = np.linalg.norm(diff)
+    hf_norm = np.linalg.norm(ref)
+    us_norm = np.linalg.norm(ours)
+    l2_rel = l2 / max(hf_norm, 1e-9)
+    cos = np.dot(ours, ref) / max(us_norm * hf_norm, 1e-9)
+    print(f"{slot:<12} {len(ours):>6} {us_norm:>10.3f} {hf_norm:>10.3f} {max_abs:>10.4f} {l2_rel:>10.4%} {cos:>8.4f}")
+
+# Compare top-5 logits (our dump logits.bin is FP32 full-vocab)
+print()
+logits_path = os.path.join(US_DIR, "logits.bin")
+if os.path.exists(logits_path):
+    ours_l = read_bin(logits_path)
+    hf_l = hf["logits"][POS] if hf["logits"].ndim == 2 else hf["logits"]
+    top5_us = np.argsort(-ours_l)[:5]
+    top5_hf = np.argsort(-hf_l)[:5]
+    print(f"HF top-5 logits: {[(int(t), f'{hf_l[t]:.2f}') for t in top5_hf]}")
+    print(f"Us top-5 logits: {[(int(t), f'{ours_l[t]:.2f}') for t in top5_us]}")
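The HF-side dump that produces tools/pillar1/hf_dump_long.npz is not part of this commit. A hypothetical sketch of how such a file could be generated with transformers follows; the script itself, the model id "Qwen/Qwen3-0.6B", and the prompt handling are assumptions, not the actual tooling.

#!/usr/bin/env python3
# Hypothetical companion script (not in this commit): dump HF hidden states
# into the npz layout diff_layers.py expects (emb, h0..h27, logits, per position).
import numpy as np, sys, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = sys.argv[1] if len(sys.argv) > 1 else "Qwen/Qwen3-0.6B"   # assumed model
prompt = open(sys.argv[2]).read() if len(sys.argv) > 2 else "Hello"  # assumed prompt source

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i+1] is layer i's output.
arrays = {"emb": out.hidden_states[0][0].numpy()}
for i, h in enumerate(out.hidden_states[1:]):
    arrays[f"h{i}"] = h[0].numpy()        # shape (seq_len, hidden_dim)
arrays["logits"] = out.logits[0].numpy()  # shape (seq_len, vocab)
np.savez("tools/pillar1/hf_dump_long.npz", **arrays)

With the engine's per-position dumps written to /tmp/qdump, the table above is then produced by something like `python3 tools/pillar1/diff_layers.py tools/pillar1/hf_dump_long.npz /tmp/qdump 143`.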
