debug(deltanet): TQ_DELTA_PROBE — locates L0 as the outlier layer

unamedkr · claude · unamedkr · commit d1c605716725 · 2026-04-21T16:27:11.000+09:00
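Adds the TQ_DELTA_PROBE env, which dumps each layer's recurrent-state L2
norm at the listed layer-0 call counts (comma-separated list). The call
counter is thread-local; when the env is unset the only cost is one
getenv() per call.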
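
For reference, the list matching can be sketched standalone as below. This
is a minimal sketch, not the code in the diff: probe_list_contains is a
hypothetical helper name, and it uses strtol() instead of atoi() so that
empty or non-numeric fields are skipped rather than read as 0.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of this commit): returns 1 if `target`
 * appears in the comma-separated integer list `list`, else 0.
 * Fields that do not parse as a number are skipped. */
static int probe_list_contains(const char *list, long target) {
    assert(list != NULL);
    const char *p = list;
    while (*p) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end != p && v == target) return 1;  /* parsed a number and it matches */
        p = end;
        while (*p && *p != ',') p++;            /* advance to the next separator */
        if (*p == ',') p++;                     /* step over the separator */
    }
    return 0;
}
```

With TQ_DELTA_PROBE=50,100,115,120 this would report a match at call 115
and none at call 117.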

Applied on Qwen3.6-35B IQ4_XS "Once upon a time in a faraway land":

  call=50  L0=127.0  L1=41.3  L2=17.9  L8=31.1  others 7-24
  call=100 L0=150.7  L1=42.4  L2=16.7  L8=31.5  others 7-24
  call=115 L0=154.9  L1=40.8  L2=16.6  L8=30.6  others 7-24  ← 117-tok loop start
  call=120 L0=154.5  L1=40.8  L2=16.7  L8=30.5  others 7-24

L0's DeltaNet recurrent state sits at 3-10× every other layer's norm.
It grew 127→155 from call 50 to call 115 (+22%) while the others stayed
within ±10%.

R16 proved the 117-tok repetition loop IS state-driven. R17 localizes the
suspect to L0's recurrent state specifically: either the a_log decay param
is scaled differently at L0, or our implementation has an L0-specific
bug. Next round: inspect the a_log weights and compare our decay math to
refs/llama.cpp.
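
Why a decay difference alone could explain the gap, as a toy model only.
The scalar recurrence below is an assumption for illustration; it is NOT
the engine's DeltaNet update and does not use a_log. It only shows that,
for equal-sized updates, a decay factor closer to 1 yields a much higher
steady-state norm.

```c
#include <assert.h>

/* Toy scalar model (assumption, not the real DeltaNet state update):
 *   n_t = alpha * n_{t-1} + c,   0 <= alpha < 1
 * Its fixed point is c / (1 - alpha). */
static double steady_state_norm(double alpha, double c) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return c / (1.0 - alpha);
}

/* Iterates the same recurrence from n_0 = 0 for `steps` steps. */
static double simulate_norm(double alpha, double c, int steps) {
    double n = 0.0;
    for (int t = 0; t < steps; t++) n = alpha * n + c;
    return n;
}
```

In this model alpha=0.99 settles near 100 and alpha=0.95 near 20 for c=1,
a 5× gap from the decay factor alone, which is why comparing L0's decay
against L1's is the next step.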

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.claude/state.md b/.claude/state.md
@@ -3,6 +3,38 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★★ Phase 1 R17 — L0 DeltaNet state is 10× the others (2026-04-21) ★★
+
+Added `TQ_DELTA_PROBE=pos1,pos2,...` env in `deltanet_forward` to dump
+per-layer state L2 norm at listed layer-0 call counts.
+
+Measurement on Qwen3.6-35B IQ4_XS "Once upon a time in a faraway land":
+
+| call | L0 | L1 | L2 | L8 | L14 | L26 | typical rest |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 50 | **127.0** | 41.3 | 17.9 | 31.1 | 24.7 | 21.5 | 7-17 |
+| 100 | **150.7** | 42.4 | 16.7 | 31.5 | 24.1 | 21.3 | 7-18 |
+| 115 | **154.9** | 40.8 | 16.6 | 30.6 | 23.9 | 21.1 | 8-17 |
+| 118 | **154.7** | 42.3 | 17.7 | 30.8 | 23.9 | 21.0 | 8-17 |
+| 120 | **154.5** | 40.8 | 16.7 | 30.5 | 23.8 | 20.8 | 8-17 |
+
+**L0 is 3-10× everything else.** This is not a transient spike at the
+drift boundary: L0 sat at ~151-155 for calls 100-120, while the "It could
+do math!" loop kicked in at token 117. So L0's high steady-state IS the
+chronic condition; it is presumably interacting badly with attention's KV
+or downstream layers.
+
+From call 50 to call 115, L0 grew 127→155 (+22%), while the other layers
+stayed within ±10%. So L0 also appears to decay less than the other layers,
+though its growth has slowed by call 100 (suggesting a partial steady state).
+
+**Hypothesis**: L0's decay param (`a_log`) either has a different scale
+than the other layers', OR our implementation applies decay incorrectly at
+L0 specifically. Next round: dump L0's `a_log` vs L1's, and compare our
+decay math to refs/llama.cpp qwen3_next DeltaNet.
+
+`TQ_DELTA_PROBE` stays as a permanent diagnostic env.
+
 ## ★ Phase 1 R16 — DeltaNet state CAUSALLY proves 35B drift (2026-04-21) ★
 
 Added `TQ_DELTA_RESET_EVERY=N` env ablation in `deltanet_forward` — zeroes
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -633,6 +633,32 @@ static void deltanet_forward(tq_model_t* model, tq_state_t* s, int l) {
         static __thread int _delta_call_count = 0;
         const char* _rst = getenv("TQ_DELTA_RESET_EVERY");
         if (l == 0) _delta_call_count++;
+
+        /* Probe: TQ_DELTA_PROBE=pos1,pos2,... prints per-layer state L2 norm
+         * at the listed layer-0 call counts. Helps localize which layer's
+         * recurrent state explodes first ahead of the drift cliff. */
+        const char* _probe = getenv("TQ_DELTA_PROBE");
+        if (_probe) {
+            int match = 0;
+            const char* p = _probe;
+            while (*p) {
+                int v = atoi(p);
+                if (v == _delta_call_count) { match = 1; break; }
+                while (*p && *p != ',') p++;
+                if (*p == ',') p++;
+            }
+            if (match) {
+                size_t layer_size = (size_t)dn * dk * dv;
+                double ss = 0.0;
+                for (size_t i = 0; i < layer_size; i++) {
+                    float v = state[i];
+                    ss += (double)v * v;
+                }
+                float nrm = (float)sqrt(ss);
+                fprintf(stderr, "[delta-probe] call=%d L%d state_norm=%.4f\n",
+                        _delta_call_count, l, nrm);
+            }
+        }
         if (_rst && l == 0) {
             int rst_n = atoi(_rst);
             if (rst_n > 0 && _delta_call_count > 1 && (_delta_call_count % rst_n) == 0) {