Commit 449080f
paper(working-memory-cliff): v0.4 — ReDeEP-Cliff hypothesis tested + new failure mode named
Phase 2B Karpathy loop. Tested whether the cliff failure mode shares a
mechanism with RAG silent hallucination as described by ReDeEP (Sun et
al., ICLR 2025), where hallucination shows a low External Context Score
(ECS) and a high Parametric Knowledge Score (PKS). We don't have direct
access to attention heads and FFN residuals from the quant.cpp forward
pass, so we used surface-level proxies on the existing 198 NIAH trial
responses:
    copy_score(response, haystack) = 4-gram overlap    (proxy for ECS)
    novel_score = 1 - copy_score                        (proxy for PKS)
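A minimal sketch of how the two proxies could be computed. The commit only says "4-gram overlap", so the whitespace tokenisation, the set-based overlap, and the helper name `ngrams` are assumptions for illustration, not the contents of redeep_proxy.py:

```python
def ngrams(tokens, n=4):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_score(response: str, haystack: str, n: int = 4) -> float:
    """Fraction of the response's n-grams that also occur in the haystack.

    Assumption: whitespace tokens and set overlap; the commit message
    does not specify either choice.
    """
    resp = ngrams(response.split(), n)
    if not resp:
        return 0.0  # response shorter than n tokens: nothing to compare
    hay = ngrams(haystack.split(), n)
    return len(resp & hay) / len(resp)


def novel_score(response: str, haystack: str, n: int = 4) -> float:
    """Complement of copy_score, used here as the PKS proxy."""
    return 1.0 - copy_score(response, haystack, n)
```

Under this reading, a response copied verbatim from the haystack scores 1.0 and a response sharing no 4-gram with it scores 0.0.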
Results across 4 Karpathy rounds on the existing data, no new compute:
R1 — pooled across all 198 trials, the 115 cliff failures (FAIL) have a
HIGHER copy_score than PASS (0.532 vs 0.440). This is the OPPOSITE of
ReDeEP's RAG hallucination signature. Hypothesis rejected at top level.
R2 — subtype classification of all 115 failures:
      CONTINUE   97   84%   literal wikitext continuation
      OTHER       9    8%   other forms of continuation
      HEADER      8    7%   wikitext markup echo (= = =, etc.)
      SYNTH       1   <1%   needle + subject fusion (the v0.3 example)
The "Boulter was hired as CFO" synthesised hallucination that
v0.3 cited as the dominant failure mode is one trial out of 115.
The dominant mode is literal continuation, and it has the
OPPOSITE ReDeEP signature from RAG hallucination.
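The R2 shares can be re-derived from the raw counts. The sketch below uses only the numbers quoted in this message; the variable names are illustrative and not taken from redeep_subtype.py:

```python
# Failure subtype counts as reported in R2 of this commit message.
counts = {"CONTINUE": 97, "OTHER": 9, "HEADER": 8, "SYNTH": 1}

total = sum(counts.values())  # should equal the 115 cliff failures
shares = {name: n / total for name, n in counts.items()}

# Print one aligned row per subtype: name, count, share of all failures.
for name, share in shares.items():
    print(f"{name:<9}{counts[name]:>4}{share:>8.1%}")
```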
R3 — position of longest matching substring across the haystack:
      Q1 (0-25%)     87   81%
      Q2 (25-50%)    11   10%
      Q3 (50-75%)     6    6%
      Q4 (75-100%)    3    3%
81% of cliff continuations (87 of the 107 counted above) resume from
the FIRST quartile of the haystack, not from the end of the prompt.
The model is jumping back to the beginning of the document, not
autocompleting where the assistant turn would have started.
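One way to locate where a continuation resumes is to find the longest substring shared by the response and the haystack, then bin its start offset. This is a sketch under assumptions: character-level matching via the standard-library `difflib`, and the function name `continuation_origin`, are illustrative and may differ from continuation_origin.py:

```python
from difflib import SequenceMatcher


def continuation_origin(response: str, haystack: str, n_bins: int = 4):
    """Return (bin_index, match_length) for the longest common substring.

    bin_index is 0-based over n_bins equal-width slices of the haystack
    (n_bins=4 gives quartiles, n_bins=10 gives deciles). Returns None
    when the two strings share no characters at all.
    """
    if not response or not haystack:
        return None
    matcher = SequenceMatcher(None, response, haystack, autojunk=False)
    m = matcher.find_longest_match(0, len(response), 0, len(haystack))
    if m.size == 0:
        return None
    rel = m.b / len(haystack)  # start of the match, as a fraction of the haystack
    return min(int(rel * n_bins), n_bins - 1), m.size
```

Calling it with `n_bins=10` yields the decile view. `SequenceMatcher` can be slow on very long haystacks, so a suffix-automaton or n-gram index may be preferable at scale.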
R4 — decile precision: 70% of continuations resume specifically from
the 10-20% sub-region of the haystack — the start of the article
BODY in wikitext, just after the title and lead paragraph. The
model recognises "this is a Wikipedia article" from the title
markup and emits the body content from its canonical body-start
position.
We name this new failure mode:
PRIMACY-BIASED DOCUMENT CONTINUATION OVERFLOW
It is mechanistically distinct from RAG silent hallucination (opposite
signature), from "Lost in the Middle" (which is about retrieval
position, not generation source), and from attention sink collapse
(BOS sink is being overruled, not lost). It is also distinct from
parametric hallucination — the model is not inventing from internal
memory, it is literally copying from the loaded context.
Implications for mitigation:
- ReDeEP's AARF (Add Attention, Reduce FFN) is designed for the
parametric-takeover regime. For our cliff failure it would either
be ineffective or counterproductive — the cliff has the opposite
imbalance.
- The correct mitigation direction is anchor strengthening: increase
the chat-template anchor's effective attention weight to outcompete
the document-continuation prior. Phase 2C candidates outlined in §8:
PQRI (periodic question re-injection), conversational chunking,
QASI (SinkTrack-style instruction injection into BOS sink).
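To make the anchor-strengthening idea concrete, here is one possible reading of PQRI: repeat the question at fixed intervals inside the document. Everything in this sketch is hypothetical (the function name `build_pqri_prompt`, the word-based chunking, the interval, and the reminder wording); the commit names the technique but does not specify it:

```python
def build_pqri_prompt(haystack: str, question: str, every_n_words: int = 200) -> str:
    """Interleave a question reminder after every chunk of the haystack.

    Hypothetical sketch of 'periodic question re-injection': the chunk
    size and reminder format are assumptions, not the paper's design.
    """
    words = haystack.split()
    parts = []
    for start in range(0, len(words), every_n_words):
        # One chunk of document text, followed by a restatement of the task.
        parts.append(" ".join(words[start:start + every_n_words]))
        parts.append(f"[Reminder: the question to answer is: {question}]")
    return "\n\n".join(parts)
```

Whether such reminders actually outcompete the document-continuation prior is exactly what the Phase 2C experiments would need to test.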
Self-correction of v0.3:
- v0.3 §4.6 cited the Boulter+CFO synthesised hallucination example
as the most consequential cliff failure mode and equated it with
RAG silent hallucination. This was based on a single visually
striking example. Subtype analysis on all 115 failures shows it's
one trial out of 115 (<1%), not the dominant mode.
- v0.4 §4.6 is rewritten with the corrected taxonomy, the explicit
ReDeEP signature comparison, the position-quartile analysis, and a
clear "honest correction" note for v0.3 readers.
- v0.4 §1 TL;DR and §6 Discussion are updated for the corrected
mechanism understanding.
- v0.4 §8 Future Work is rewritten with the concrete next-step
mitigation experiments suggested by the new mechanism.
Files added (Phase 2B Karpathy loop):
- bench/results/niah/redeep_proxy.{py,json}
    R1 — pooled copy/novel proxy on all 198 trials
- bench/results/niah/redeep_subtype.{py,json}
    R2 — failure subtype classification + per-subtype ReDeEP comparison
- bench/results/niah/continuation_origin.{py,json}
    R3+R4 — quartile/decile position of longest haystack match
The result: a stronger, more publishable finding than the original
hypothesis would have been. We discovered a *new failure mode at edge
scale* rather than confirming an existing mechanism.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a895624 commit 449080f
9 files changed
Lines changed: 3325 additions & 36 deletions