
Commit b061e7d

unamedkr and claude committed
debug(deltanet): TQ_DELTA_RESET_EVERY ablation proves state-driven drift
Adds env-gated ablation hook in deltanet_forward to periodically zero the recurrent state (delta_state + conv_state) across all layers. Default off; thread-local counter fires once per layer-0 call.

Ablation on Qwen3.6-35B IQ4_XS "Once upon a time in a faraway land":
  unset: 117 tokens → "It could do math!" loop (baseline)
  RESET_EVERY=50: content diverges early, no 'math' loop but degrades
  RESET_EVERY=120: identical 0-117, post-117 loop REPLACED with different incoherent text ("0 Comments | Views: 4,986")

Causal conclusion: the specific 117-token repetition loop IS driven by DeltaNet state accumulation. Reset changes post-drift content entirely, so KV / MoE / attention aren't the mechanism.

This is NOT a fix (reset throws useful context), but gives us a diagnostic lever for future per-head/per-layer norm probing to localize the exploding subtensor.

Regression 15/15 PASS (env-gated, default behavior unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2a1d40d commit b061e7d

2 files changed

Lines changed: 52 additions & 0 deletions

.claude/state.md

Lines changed: 29 additions & 0 deletions
@@ -3,6 +3,35 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R16 — DeltaNet state CAUSALLY proves 35B drift (2026-04-21) ★
+
+Added `TQ_DELTA_RESET_EVERY=N` env ablation in `deltanet_forward` — zeroes
+`s->delta_state` + `s->conv_state` across all layers at every N-th
+layer-0 call. Default off; thread-local counter, no API change.
+
+Ablation on Qwen3.6-35B IQ4_XS, "Once upon a time in a faraway land", -n 200, T=0:
+
+| TQ_DELTA_RESET_EVERY | 0-117 tokens | post-117 behavior |
+|---|---|---|
+| unset (baseline) | Alex finds ENIAC book (narrative) | "It could do math! It could do math!" loop at 117 |
+| 50 | different content (premature reset mid-story) | degrades to "a a a" but NO "It could do math" loop |
+| 120 | identical narrative 0-117 | breaks loop — "2017-05-02 17:35 0 Comments \| Views: 4,986 views" |
+
+**Causal conclusion**: the specific repetition loop at 117 tokens IS
+DeltaNet-recurrent-state-driven. Reset at 120 produces incoherent-but-
+different post-drift output, proving state accumulation is the driver,
+not KV cache or MoE scatter or attention.
+
+Reset is NOT the fix — it throws away useful state and the model
+goes incoherent differently. But now we have a diagnostic lever to
+probe WHICH part of the state is blowing up (next round: per-layer
+per-head norm dump right before the 117-token cliff).
+
+`TQ_DELTA_RESET_EVERY` stays as a permanent debug env — future DeltaNet
+bug-hunt rounds can A/B against it to localize the exploding subtensor.
+
+Regression 15/15 PASS (ablation is env-gated, no default behavior change).
+
 ## Phase 1 R11 — BPE fix does NOT move 35B long-gen drift (2026-04-21)
 
 Post-v0.27.0 validation on `Once upon a time in a faraway land` (ASCII):
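
The R16 note names a per-layer, per-head norm dump as the next probe. A minimal sketch of what such a probe could compute is below. It is not part of this commit: `head_sq_norm` and `argmax_head` are hypothetical names, and the `[layer][head][dk][dv]` ordering inside each layer is an assumption. Only the per-layer stride `l * dn * dk * dv` is confirmed by the diff. The squared norm is used so the sketch needs no math library.

```c
#include <stddef.h>

/* Hypothetical probe (not in the commit): squared L2 norm of one head's
 * slice of the recurrent state.
 * ASSUMPTION: layout is [layer][head][dk][dv]. The diff only confirms the
 * per-layer stride l * dn * dk * dv; the per-head ordering is a guess. */
double head_sq_norm(const float *delta_state, int l, int h,
                    int dn, int dk, int dv)
{
    size_t head_elems = (size_t)dk * dv;
    const float *p = delta_state + ((size_t)l * dn + h) * head_elems;
    double acc = 0.0;
    for (size_t i = 0; i < head_elems; i++)
        acc += (double)p[i] * (double)p[i];
    return acc;
}

/* Index of the head with the largest squared norm in layer l
 * (the first such head wins on ties). */
int argmax_head(const float *delta_state, int l, int dn, int dk, int dv)
{
    int best = 0;
    double best_v = head_sq_norm(delta_state, l, 0, dn, dk, dv);
    for (int h = 1; h < dn; h++) {
        double v = head_sq_norm(delta_state, l, h, dn, dk, dv);
        if (v > best_v) {
            best_v = v;
            best = h;
        }
    }
    return best;
}
```

Logging these values for each layer on the tokens leading up to the 117-token cliff, with and without `TQ_DELTA_RESET_EVERY`, would be one way to see which slice grows, provided the layout assumption holds.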

src/engine/tq_transformer.c

Lines changed: 23 additions & 0 deletions
@@ -623,6 +623,29 @@ static void deltanet_forward(tq_model_t* model, tq_state_t* s, int l) {
     float* state = s->delta_state + (size_t)l * dn * dk * dv;
     float* conv_st = s->conv_state + (size_t)l * qkv_dim * conv_buf_len;
 
+    /* Ablation hook: periodically reset recurrent state to test the hypothesis
+     * that long-gen drift is caused by DeltaNet state accumulation. Env
+     * TQ_DELTA_RESET_EVERY=N (default: off)
+     * zeroes delta_state + conv_state across ALL layers when the layer-0
+     * call count hits a multiple of N. Thread-local counter so this is safe
+     * with the existing single-threaded deltanet layer dispatch. */
+    {
+        static __thread int _delta_call_count = 0;
+        const char* _rst = getenv("TQ_DELTA_RESET_EVERY");
+        if (l == 0) _delta_call_count++;
+        if (_rst && l == 0) {
+            int rst_n = atoi(_rst);
+            if (rst_n > 0 && _delta_call_count > 1 && (_delta_call_count % rst_n) == 0) {
+                size_t total_delta = (size_t)c->n_layers * dn * dk * dv;
+                memset(s->delta_state, 0, total_delta * sizeof(float));
+                size_t total_conv = (size_t)c->n_layers * qkv_dim * conv_buf_len;
+                memset(s->conv_state, 0, total_conv * sizeof(float));
+                if (getenv("TQ_DEBUG"))
+                    fprintf(stderr, "[delta-reset] call=%d\n", _delta_call_count);
+            }
+        }
+    }
+
     /* Pre-quantize activation to Q8 once for all Q2/Q4 projections in this layer.
      * This eliminates redundant tq_quantize_row_q8 + malloc/free cycles. */
     int dn_has_q2 = (layer->delta_in_proj_qkv_q2 != NULL);
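
To make the reset cadence of the hook easier to reason about, the gating condition can be pulled out into a standalone sketch. The helper names `should_reset`, `parse_reset_every`, and `count_resets` are hypothetical and do not exist in the engine. The condition itself is copied from the diff, including the `call_count > 1` guard that keeps the first layer-0 call from ever triggering a reset.

```c
#include <stdlib.h>

/* Mirrors the hook's gating condition: fire when the layer-0 call count
 * is a multiple of rst_n, but never on the very first call. */
int should_reset(int call_count, int rst_n)
{
    return rst_n > 0 && call_count > 1 && (call_count % rst_n) == 0;
}

/* Parses the env value the same way the hook does (atoi). An unset (NULL)
 * value yields 0; zero and negative values leave the hook disabled
 * because should_reset requires rst_n > 0. */
int parse_reset_every(const char *val)
{
    return val ? atoi(val) : 0;
}

/* Counts how many resets would fire over n_calls layer-0 calls. */
int count_resets(int n_calls, int rst_n)
{
    int fired = 0;
    for (int call = 1; call <= n_calls; call++)
        fired += should_reset(call, rst_n);
    return fired;
}
```

Under this condition, N=50 fires at calls 50, 100, 150, and so on, while N=120 fires first at call 120. How layer-0 calls map onto generated-token positions depends on how the prompt is processed, which this diff does not show.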
