Commit 65e4a2d

unamedkr and claude committed
debug(deltanet): TQ_DELTA_RESET_LAYER — per-layer reset ablation
Bisection result on Qwen3.6-35B 117-tok repetition loop:

  reset L0 only  @ call=120: STILL loops at 117 ("anything"→"math")
  reset L8 only  @ call=120: STILL loops at 117
  reset L20 only @ call=120: STILL loops at 117
  reset L38 only @ call=120: STILL loops at 117
  reset ALL      @ call=120: breaks loop (R16 baseline, different output)

Conclusion: the drift pathology is distributed across all 30 DeltaNet
layers, not localized to any single layer. No one-liner fix. Keep the
diagnostic envs for future reference-port work.

Strategic direction: either a 4B-class DeltaNet HF reference for
full-parity diff, or a reimplementation port from the PyTorch reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a05d4e4 commit 65e4a2d

2 files changed

Lines changed: 50 additions & 5 deletions

.claude/state.md

Lines changed: 32 additions & 0 deletions
@@ -3,6 +3,38 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## Phase 1 R19 — Single-layer reset is not enough — drift is distributed (2026-04-21)
+
+Added `TQ_DELTA_RESET_LAYER=N` env to bisect which DeltaNet layer drives
+the 117-tok repetition loop. Combined with `TQ_DELTA_RESET_EVERY=120` to
+force reset right at the drift boundary.
+
+Tested on Qwen3.6-35B IQ4_XS "Once upon a time in a faraway land":
+
+| reset layer | post-117 text |
+|:---:|:---|
+| L0 only | "It could do math! It could do math! It could do anything! It could" → STILL loop at 117 |
+| L8 only | "It could do math! It could do math! It could do math!" → loop at 117 |
+| L20 only | "It could do math!" ×3 → loop at 117 |
+| L38 only | "It could do math!" ×3 → loop at 117 |
+| ALL layers (R16 baseline) | "0 Comments \| Views: 4,986 views — 'The Great Adventure'" |
+
+R19 conclusion: **no single DeltaNet layer carries the drift signal alone**
+— clearing any one leaves the repetition cliff intact. Only the full
+30-layer reset breaks the "It could do math!" lock. So the 117-tok
+pathology is a **distributed multi-layer interaction**, not a single-layer
+bug amenable to a one-liner fix.
+
+Diagnostic infrastructure stays: `TQ_DELTA_PROBE`, `TQ_DELTA_RESET_EVERY`,
+`TQ_DELTA_RESET_LAYER` — future rounds can reset *ranges* or chain
+ablations more surgically.
+
+**Strategic step-back**: 35B DeltaNet drift looks unlikely to yield to
+short ablation rounds. Bigger-hammer approaches (full refparity against a
+4B-class DeltaNet HF model, or reference reimplementation port) are the
+remaining paths. Leave the diagnostic envs in place and move on to other
+deliverables this session.
+
 ## Phase 1 R18 — False alarm on a_log double-transform (2026-04-21)
 
 Dug into `ssm_a` values to test whether our `-expf(delta_a_log)` was a
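
The R19 note in this hunk leaves range resets as future work. A minimal sketch of what such a helper could look like follows, assuming the same flat per-layer layout that the commit's `memset` calls rely on (layer `l` owns `per_layer` consecutive floats starting at offset `l * per_layer`). The name `reset_layer_range` and its parameters are hypothetical; nothing like it exists in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the engine: zero the state slices of
 * layers lo..hi (inclusive) in a flat buffer where each layer owns
 * per_layer consecutive floats. Out-of-range bounds are clamped.
 * Returns the number of layers cleared (0 if the clamped range is empty). */
static int reset_layer_range(float* state, size_t per_layer,
                             int n_layers, int lo, int hi) {
    assert(state != NULL);
    if (lo < 0) lo = 0;
    if (hi >= n_layers) hi = n_layers - 1;
    if (lo > hi) return 0;
    size_t count = (size_t)(hi - lo + 1);
    /* One contiguous memset covers the whole range because the
     * per-layer slices sit back to back in the buffer. */
    memset(state + (size_t)lo * per_layer, 0,
           count * per_layer * sizeof(float));
    return (int)count;
}
```

Under these assumptions, a range ablation would call the helper once for `delta_state` with `per_layer = dn * dk * dv` and once for `conv_state` with `per_layer = qkv_dim * conv_buf_len`, mirroring the two buffers cleared in the C change.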

src/engine/tq_transformer.c

Lines changed: 18 additions & 5 deletions
@@ -662,12 +662,25 @@ static void deltanet_forward(tq_model_t* model, tq_state_t* s, int l) {
         if (_rst && l == 0) {
             int rst_n = atoi(_rst);
             if (rst_n > 0 && _delta_call_count > 1 && (_delta_call_count % rst_n) == 0) {
-                size_t total_delta = (size_t)c->n_layers * dn * dk * dv;
-                memset(s->delta_state, 0, total_delta * sizeof(float));
-                size_t total_conv = (size_t)c->n_layers * qkv_dim * conv_buf_len;
-                memset(s->conv_state, 0, total_conv * sizeof(float));
+                /* TQ_DELTA_RESET_LAYER=N (default -1 = all layers).
+                 * Clears just that layer's slice so we can bisect which
+                 * layer actually drives the drift. */
+                const char* _only = getenv("TQ_DELTA_RESET_LAYER");
+                int only_layer = _only ? atoi(_only) : -1;
+                if (only_layer >= 0 && only_layer < c->n_layers) {
+                    memset(s->delta_state + (size_t)only_layer * dn * dk * dv,
+                           0, (size_t)dn * dk * dv * sizeof(float));
+                    memset(s->conv_state + (size_t)only_layer * qkv_dim * conv_buf_len,
+                           0, (size_t)qkv_dim * conv_buf_len * sizeof(float));
+                } else {
+                    size_t total_delta = (size_t)c->n_layers * dn * dk * dv;
+                    memset(s->delta_state, 0, total_delta * sizeof(float));
+                    size_t total_conv = (size_t)c->n_layers * qkv_dim * conv_buf_len;
+                    memset(s->conv_state, 0, total_conv * sizeof(float));
+                }
                 if (getenv("TQ_DEBUG"))
-                    fprintf(stderr, "[delta-reset] call=%d\n", _delta_call_count);
+                    fprintf(stderr, "[delta-reset] call=%d layer=%d\n",
+                            _delta_call_count, only_layer);
             }
         }
     }
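
The hunk above reads `TQ_DELTA_RESET_LAYER` with `atoi`, which returns 0 when the string contains no leading number. A typo such as `TQ_DELTA_RESET_LAYER=L8` would therefore select layer 0 instead of falling back to the all-layers reset. A stricter parse could be sketched with `strtol`; the helper name `parse_reset_layer` is hypothetical and is not part of the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a layer index from an env-style string.
 * Returns the index if text is a whole decimal integer in [0, n_layers),
 * otherwise -1, the value the commit's code treats as "reset all layers". */
static int parse_reset_layer(const char* text, int n_layers) {
    assert(n_layers >= 0);
    if (text == NULL || *text == '\0') return -1;
    char* end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    /* Reject overflow, empty conversions, and trailing garbage. */
    if (errno != 0 || end == text || *end != '\0') return -1;
    if (v < 0 || v >= (long)n_layers) return -1;
    return (int)v;
}
```

If adopted, the call site would reduce to `int only_layer = parse_reset_layer(getenv("TQ_DELTA_RESET_LAYER"), c->n_layers);`, and the existing range check would then only ever see a valid index or -1.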
