debug(deltanet): TQ_DELTA_RESET_EVERY ablation proves state-driven drift

unamedkr · claude · unamedkr · commit b061e7dbec36 · 2026-04-21T16:21:33.000+09:00
Adds an env-gated ablation hook in deltanet_forward that periodically
zeroes the recurrent state (delta_state + conv_state) across all layers.
Default off; a thread-local counter increments once per layer-0 call and
the reset fires on every N-th count.
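
For reference, the firing rule can be isolated as below. This is a
minimal sketch, not code from this commit: reset_period_from_env and
should_reset are hypothetical helper names, and the condition simply
mirrors the one in the hook (multiple of N, never on the first call).

```c
#include <stdlib.h>

/* Hypothetical helper (not in the commit): turn the raw env string into a
 * reset period. Returns 0 (disabled) for NULL, non-numeric, or <= 0 input. */
static int reset_period_from_env(const char *value) {
    if (value == NULL) return 0;
    int n = atoi(value);
    return n > 0 ? n : 0;
}

/* Mirrors the hook's condition: fire when the layer-0 call count is a
 * multiple of the period, but never on the very first call. */
static int should_reset(int call_count, int period) {
    return period > 0 && call_count > 1 && (call_count % period) == 0;
}
```

With a period of 50 this fires at layer-0 calls 50, 100, 150, and so on;
with a period of 1 it fires from the second call onward.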

Ablation on Qwen3.6-35B IQ4_XS, prompt "Once upon a time in a faraway land",
-n 200, T=0:

  unset:               117 tokens → "It could do math!" loop (baseline)
  RESET_EVERY=50:      content diverges early, no 'math' loop but degrades
  RESET_EVERY=120:     identical 0-117, post-117 loop REPLACED with
                       different incoherent text ("0 Comments | Views: 4,986")

Causal conclusion: the specific 117-token repetition loop IS driven by
DeltaNet state accumulation. Resetting the state changes the post-drift
content entirely, so KV cache, MoE scatter, and attention are not the
mechanism. This is NOT a fix (reset throws away useful context), but it
gives us a diagnostic lever for future per-head/per-layer norm probing to
localize the exploding subtensor.
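
A possible shape for that norm probe, as a sketch only (nothing below is
in this commit). It assumes delta_state is laid out contiguously as
[layer][head][dk][dv] with dn heads per layer, which is inferred from the
pointer arithmetic in deltanet_forward and not confirmed elsewhere. The
function names are hypothetical. It reports squared L2 norms, which rank
heads the same way as true norms and avoid a libm dependency.

```c
#include <stddef.h>

/* Assumed layout: state[layer][head][dk][dv], contiguous floats.
 * Writes one squared L2 norm per (layer, head) into out_sqnorms,
 * which must hold n_layers * dn floats. */
static void delta_state_head_sqnorms(const float *state, int n_layers, int dn,
                                     int dk, int dv, float *out_sqnorms) {
    size_t head_elems = (size_t)dk * (size_t)dv;
    for (int l = 0; l < n_layers; l++) {
        for (int h = 0; h < dn; h++) {
            const float *p = state + ((size_t)l * dn + h) * head_elems;
            double acc = 0.0; /* accumulate in double to limit rounding */
            for (size_t i = 0; i < head_elems; i++)
                acc += (double)p[i] * (double)p[i];
            out_sqnorms[(size_t)l * dn + h] = (float)acc;
        }
    }
}

/* Index of the largest entry; layer = idx / dn, head = idx % dn.
 * Returns 0 when count is 0. */
static size_t argmax_index(const float *values, size_t count) {
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (values[i] > values[best]) best = i;
    return best;
}
```

Dumping these values for a few tokens before the 117-token cliff, with
and without TQ_DELTA_RESET_EVERY set, should show which (layer, head)
pairs grow fastest.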

Regression 15/15 PASS (env-gated, default behavior unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.claude/state.md b/.claude/state.md
@@ -3,6 +3,35 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R16 — DeltaNet state CAUSALLY proves 35B drift (2026-04-21) ★
+
+Added `TQ_DELTA_RESET_EVERY=N` env ablation in `deltanet_forward` — zeroes
+`s->delta_state` + `s->conv_state` across all layers at every N-th
+layer-0 call. Default off; thread-local counter, no API change.
+
+Ablation on Qwen3.6-35B IQ4_XS, "Once upon a time in a faraway land", -n 200, T=0:
+
+| TQ_DELTA_RESET_EVERY | 0-117 tokens | post-117 behavior |
+|---|---|---|
+| unset (baseline) | Alex finds ENIAC book (narrative) | "It could do math! It could do math!" loop at 117 |
+| 50 | different content (premature reset mid-story) | degrades to "a a a" but NO "It could do math" loop |
+| 120 | identical narrative 0-117 | breaks loop — "2017-05-02 17:35 0 Comments \| Views: 4,986 views" |
+
+**Causal conclusion**: the specific repetition loop at 117 tokens IS
+DeltaNet-recurrent-state-driven. Reset at 120 produces incoherent but
+different post-drift output, proving state accumulation is the driver,
+not KV cache or MoE scatter or attention.
+
+Reset is NOT the fix: it throws away useful state and the model
+becomes incoherent in a different way. But it gives us a diagnostic lever
+to probe WHICH part of the state is blowing up (next round: per-layer,
+per-head norm dump right before the 117-token cliff).
+
+`TQ_DELTA_RESET_EVERY` stays as a permanent debug env — future DeltaNet
+bug-hunt rounds can A/B against it to localize the exploding subtensor.
+
+Regression 15/15 PASS (ablation is env-gated, no default behavior change).
+
 ## Phase 1 R11 — BPE fix does NOT move 35B long-gen drift (2026-04-21)
 
 Post-v0.27.0 validation on `Once upon a time in a faraway land` (ASCII):
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -623,6 +623,29 @@ static void deltanet_forward(tq_model_t* model, tq_state_t* s, int l) {
     float* state = s->delta_state + (size_t)l * dn * dk * dv;
     float* conv_st = s->conv_state + (size_t)l * qkv_dim * conv_buf_len;
 
+    /* Ablation hook: periodically reset recurrent state to test the hypothesis
+     * that long-gen drift is caused by DeltaNet state accumulation. Env
+     *   TQ_DELTA_RESET_EVERY=N  (default: off)
+     * zeroes delta_state + conv_state across ALL layers when the layer-0
+     * call count hits a multiple of N. Thread-local counter so this is safe
+     * with the existing single-threaded deltanet layer dispatch. */
+    {
+        static __thread int _delta_call_count = 0;
+        const char* _rst = getenv("TQ_DELTA_RESET_EVERY");
+        if (l == 0) _delta_call_count++;
+        if (_rst && l == 0) {
+            int rst_n = atoi(_rst);
+            if (rst_n > 0 && _delta_call_count > 1 && (_delta_call_count % rst_n) == 0) {
+                size_t total_delta = (size_t)c->n_layers * dn * dk * dv;
+                memset(s->delta_state, 0, total_delta * sizeof(float));
+                size_t total_conv = (size_t)c->n_layers * qkv_dim * conv_buf_len;
+                memset(s->conv_state, 0, total_conv * sizeof(float));
+                if (getenv("TQ_DEBUG"))
+                    fprintf(stderr, "[delta-reset] call=%d\n", _delta_call_count);
+            }
+        }
+    }
+
     /* Pre-quantize activation to Q8 once for all Q2/Q4 projections in this layer.
      * This eliminates redundant tq_quantize_row_q8 + malloc/free cycles. */
     int dn_has_q2 = (layer->delta_in_proj_qkv_q2 != NULL);