
Commit 449080f

unamedkr and claude committed
paper(working-memory-cliff): v0.4 — ReDeEP-Cliff hypothesis tested + new failure mode named
Phase 2B Karpathy loop. Tested whether the cliff failure mode shares a
mechanism with RAG silent hallucination as described by ReDeEP (Sun et
al., ICLR 2025): hallucination shows a low External Context Score (ECS)
and a high Parametric Knowledge Score (PKS). We don't have direct access
to attention heads and FFN residuals from the quant.cpp forward pass, so
we used surface-level proxies on the existing 198 NIAH trial responses:

- copy_score(response, haystack) = 4-gram overlap — proxy for ECS
- novel_score = 1 - copy_score — proxy for PKS

Results across 4 Karpathy rounds on the existing data, no new compute:

R1 — pooled across all 115 cliff failures, FAIL has HIGHER copy_score
than PASS (0.532 vs 0.440). This is the OPPOSITE of ReDeEP's RAG
hallucination signature. Hypothesis rejected at the top level.

R2 — subtype classification of all 115 failures:

  CONTINUE  97  84%  literal wikitext continuation
  OTHER      9   8%  other forms of continuation
  HEADER     8   7%  wikitext markup echo (= = =, etc)
  SYNTH      1  <1%  needle + subject fusion (the v0.3 example)

The "Boulter was hired as CFO" synthesised hallucination that v0.3 cited
as the dominant failure mode is one trial out of 115. The dominant mode
is literal continuation, and it has the OPPOSITE ReDeEP signature from
RAG hallucination.

R3 — position of the longest matching substring across the haystack:

  Q1 (0-25%)    87  81%
  Q2 (25-50%)   11  10%
  Q3 (50-75%)    6   6%
  Q4 (75-100%)   3   3%

81% of cliff continuations resume from the FIRST quartile of the
haystack, not from the end of the prompt. The model is jumping back to
the beginning of the document, not autocompleting where the assistant
turn would have started.

R4 — decile precision: 70% of continuations resume specifically from the
10-20% sub-region of the haystack — the start of the article BODY in
wikitext, just after the title and lead paragraph. The model recognises
"this is a Wikipedia article" from the title markup and emits the body
content from its canonical body-start position.
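The copy/novel proxies above can be sketched in a few lines. This is an
illustrative reconstruction, not the actual redeep_proxy.py; the real
script's tokenisation and normalisation may differ:

```python
def ngrams(words: list, n: int = 4) -> set:
    """All word n-grams of a token list, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def copy_score(response: str, haystack: str, n: int = 4) -> float:
    """Proxy for ReDeEP's External Context Score: fraction of the
    response's word 4-grams that also appear in the haystack."""
    resp = ngrams(response.lower().split(), n)
    hay = ngrams(haystack.lower().split(), n)
    if not resp:
        return 0.0
    return len(resp & hay) / len(resp)

def novel_score(response: str, haystack: str) -> float:
    """Proxy for Parametric Knowledge Score: the share NOT copied."""
    return 1.0 - copy_score(response, haystack)
```

A response copied verbatim from the haystack scores copy_score = 1.0;
pure parametric invention scores 0.0, which is the axis the R1
comparison (0.532 FAIL vs 0.440 PASS) is measured on.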
We name this new failure mode: PRIMACY-BIASED DOCUMENT CONTINUATION
OVERFLOW. It is mechanistically distinct from RAG silent hallucination
(opposite signature), from "Lost in the Middle" (which is about
retrieval position, not generation source), and from attention sink
collapse (the BOS sink is being overruled, not lost). It is also
distinct from parametric hallucination — the model is not inventing from
internal memory, it is literally copying from the loaded context.

Implications for mitigation:

- ReDeEP's AARF (Add Attention, Reduce FFN) is designed for the
  parametric-takeover regime. For our cliff failure it would be either
  ineffective or counterproductive — the cliff has the opposite
  imbalance.
- The correct mitigation direction is anchor strengthening: increase the
  chat-template anchor's effective attention weight to outcompete the
  document-continuation prior. Phase 2C candidates outlined in §8:
  PQRI (periodic question re-injection), conversational chunking, and
  QASI (SinkTrack-style instruction injection into the BOS sink).

Self-correction of v0.3:

- v0.3 §4.6 cited the Boulter+CFO synthesised hallucination example as
  the most consequential cliff failure mode and equated it with RAG
  silent hallucination. This was based on a single visually striking
  example. Subtype analysis of all 115 failures shows it is one trial
  out of 115 (<1%), not the dominant mode.
- v0.4 §4.6 is rewritten with the corrected taxonomy, the explicit
  ReDeEP signature comparison, the position-quartile analysis, and a
  clear "honest correction" note for v0.3 readers.
- v0.4 §1 TL;DR and §6 Discussion are updated for the corrected
  mechanism understanding.
- v0.4 §8 Future Work is rewritten with the concrete next-step
  mitigation experiments suggested by the new mechanism.
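Of the Phase 2C candidates, PQRI is the simplest to sketch. The helper
below is a hypothetical illustration of the idea only; the function
name, injection format, and interval are all assumptions, not the
project's final design:

```python
# Hypothetical PQRI (periodic question re-injection) sketch: re-insert
# the user's question every `interval` characters of haystack, so the
# instruction anchor is never far from the generation point and can
# compete with the document-continuation prior.

def build_pqri_prompt(haystack: str, question: str, interval: int = 8000) -> str:
    parts = []
    for start in range(0, len(haystack), interval):
        parts.append(haystack[start:start + interval])
        parts.append(f"\n[Reminder: answer this question: {question}]\n")
    parts.append(f"\nQuestion: {question}\nAnswer:")
    return "".join(parts)
```

The trade-off is token overhead: each re-injection costs a fixed number
of tokens per interval, so the interval would need tuning against the
cliff onset observed in the NIAH sweep.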
Files added (Phase 2B Karpathy loop):

- bench/results/niah/redeep_proxy.{py,json} — R1: pooled copy/novel
  proxy on all 198 trials
- bench/results/niah/redeep_subtype.{py,json} — R2: failure subtype
  classification + per-subtype ReDeEP comparison
- bench/results/niah/continuation_origin.{py,json} — R3+R4:
  quartile/decile position of the longest haystack match

The result: a stronger, more publishable finding than the original
hypothesis would have been. We discovered a *new failure mode at edge
scale* rather than confirming an existing mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a895624 commit 449080f

9 files changed

Lines changed: 3325 additions & 36 deletions

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -67,3 +67,5 @@ build_metal/
 
 .env
 
+
+blog/
bench/results/niah/continuation_origin.json

Lines changed: 12 additions & 0 deletions

{
  "n_failures": 115,
  "n_with_match": 107,
  "buckets": {
    "Q1 (0-25%)": 87,
    "Q2 (25-50%)": 11,
    "Q3 (50-75%)": 6,
    "Q4 (75-100%)": 3
  },
  "mean_position": 0.23035950770630845,
  "mean_offset_end": -0.7696404922936915
}
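As a sanity check, the headline 81% Q1 figure follows directly from the
bucket counts in this JSON:

```python
# Recompute the Q1 share of matched failures from the saved buckets.
buckets = {"Q1 (0-25%)": 87, "Q2 (25-50%)": 11,
           "Q3 (50-75%)": 6, "Q4 (75-100%)": 3}
n_with_match = sum(buckets.values())            # 107 of 115 failures matched
q1_share = buckets["Q1 (0-25%)"] / n_with_match
print(f"{q1_share:.0%}")                        # prints "81%"
```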
bench/results/niah/continuation_origin.py

Lines changed: 192 additions & 0 deletions

#!/usr/bin/env python3
"""R3: where does the continuation start?

R1+R2 found that 84% of cliff failures are literal continuation of the
haystack ("Doctors, followed by a role in..."). This script checks
WHERE in the haystack the continuation matches, to test the hypothesis
that the model is *resuming the wikitext from a specific position*
rather than generating from a random place.

Hypotheses to distinguish:

H1 — RESUME-AT-END: continuation starts from where the assistant turn
     was supposed to begin. The chat template anchor was too weak; the
     model defaulted to "auto-complete from end of input".

H2 — RESUME-FROM-NEEDLE: continuation starts from somewhere near the
     inserted needle. The model partially noticed the needle but then
     resumed copying haystack from there.

H3 — JUMP-TO-EARLY: continuation starts from somewhere early in the
     haystack (model uses primacy bias to grab a fluent passage).

We measure: for each CONTINUE failure, where is the longest matching
substring in the haystack? Then bucket by haystack position quartile.
"""
import json
import re
from collections import Counter
from pathlib import Path

REPO = Path(__file__).resolve().parent.parent.parent.parent
PROXY = REPO / "bench/results/niah/redeep_proxy.json"
WIKITEXT = REPO / "bench/data/wikitext2_test.txt"

NEEDLES = [
    "The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023.",
    "The launch date for Project Aurora is November 14th in San Francisco.",
    "The reactor cooling tank at the Helios facility holds exactly eight thousand liters of distilled water.",
]


def normalize(s: str) -> str:
    return re.sub(r"[^a-z0-9 ]+", " ", s.lower()).strip()


def reconstruct_haystack(ctx_tokens: int, needle_idx: int) -> tuple:
    """Return (haystack_with_needle, needle_position_chars)."""
    raw = WIKITEXT.read_text(encoding="utf-8", errors="replace")
    target_chars = int(ctx_tokens * 3.6)
    hay = raw[:target_chars]
    end = hay.rfind(". ")
    if end > 0:
        hay = hay[:end + 1]
    needle = NEEDLES[needle_idx]
    desired = len(hay) // 2
    sb = hay.rfind(". ", 0, max(desired, 2))
    sb = 0 if sb < 0 else sb + 2
    return hay[:sb] + needle + " " + hay[sb:], sb


def longest_match_position(response: str, haystack: str, min_len: int = 30) -> int:
    """Return the haystack character position where the longest substring
    of `response` (length >= min_len) appears, or -1 if none found."""
    rn = normalize(response)
    hn = normalize(haystack)
    best_pos = -1
    best_len = min_len - 1
    # Walk windows of decreasing length
    for L in range(min(len(rn), 80), min_len - 1, -5):
        for i in range(0, max(1, len(rn) - L)):
            sub = rn[i:i + L]
            if not sub.strip():
                continue
            pos = hn.find(sub)
            if pos >= 0 and L > best_len:
                best_pos = pos
                best_len = L
                break  # found longest at this L
        if best_pos >= 0:
            break
    return best_pos


def main():
    with open(PROXY) as f:
        proxy = json.load(f)

    # Load all FAIL trials with their failed responses
    failures = [r for r in proxy["trials"] if r["pass"] == 0]

    # Reconstruct haystacks once per (ctx, needle)
    haystack_cache = {}

    def get_hay(ctx, needle):
        key = (ctx, needle)
        if key not in haystack_cache:
            haystack_cache[key] = reconstruct_haystack(ctx, needle)
        return haystack_cache[key]

    # For each failure, find the longest match position
    results = []
    for r in failures:
        haystack, needle_pos = get_hay(r["context"], r["needle"])
        pos = longest_match_position(r["response"], haystack)
        haylen = len(normalize(haystack))
        if pos < 0 or haylen == 0:
            position_pct = None
            relative_to_needle = None
            relative_to_end = None
        else:
            position_pct = pos / haylen
            # Approx needle position in normalized chars
            needle_pct = needle_pos / max(len(haystack), 1)
            relative_to_needle = position_pct - needle_pct
            relative_to_end = position_pct - 1.0  # negative = before end
        results.append({
            **r,
            "match_pos_pct": position_pct,
            "rel_to_needle": relative_to_needle,
            "rel_to_end": relative_to_end,
        })

    # Distribution of match positions across all failures
    valid = [r for r in results if r["match_pos_pct"] is not None]
    print(f"Failures with detectable haystack match: {len(valid)} / {len(failures)}")
    print("  (those without a match — model invented enough that no 30-char substring matches)\n")

    # Bucket by quartile
    buckets = Counter()
    for r in valid:
        p = r["match_pos_pct"]
        if p < 0.25:
            buckets["Q1 (0-25%)"] += 1
        elif p < 0.50:
            buckets["Q2 (25-50%)"] += 1
        elif p < 0.75:
            buckets["Q3 (50-75%)"] += 1
        else:
            buckets["Q4 (75-100%)"] += 1

    print("Continuation start position (longest matching substring) — quartile of haystack:")
    for q in ["Q1 (0-25%)", "Q2 (25-50%)", "Q3 (50-75%)", "Q4 (75-100%)"]:
        n = buckets[q]
        bar = "█" * (n * 30 // max(len(valid), 1))
        print(f"  {q:<14} {n:>3} {bar}")
    print()

    # Mean position relative to needle and end
    if valid:
        mean_pos = sum(r["match_pos_pct"] for r in valid) / len(valid)
        mean_to_end = sum(r["rel_to_end"] for r in valid) / len(valid)
        mean_to_needle = sum(r["rel_to_needle"] for r in valid) / len(valid)
        print(f"Mean match position (% of haystack): {mean_pos:.2f}")
        print(f"Mean offset from end of haystack: {mean_to_end:+.2f} (negative = before end)")
        print(f"Mean offset from needle position: {mean_to_needle:+.2f} (positive = after needle)")
        print()

    # Hypothesis verdicts
    print("--- Hypothesis verdicts ---\n")
    if valid:
        q4_frac = buckets["Q4 (75-100%)"] / len(valid)
        q1_frac = buckets["Q1 (0-25%)"] / len(valid)
        if q4_frac > 0.5:
            print(f"H1 (RESUME-AT-END) — SUPPORTED: {q4_frac:.0%} of continuations match")
            print("   the last quartile of the haystack.")
            print("   → chat template anchor failed; model")
            print("   defaulted to autocomplete from prompt end.")
        elif q1_frac > 0.5:
            print(f"H3 (JUMP-TO-EARLY) — SUPPORTED: {q1_frac:.0%} of continuations match")
            print("   the first quartile of the haystack.")
        else:
            # Distribution is broader — check needle proximity
            near_needle = sum(1 for r in valid if abs(r["rel_to_needle"]) < 0.1)
            if near_needle / len(valid) > 0.5:
                print(f"H2 (RESUME-FROM-NEEDLE) — SUPPORTED: {near_needle/len(valid):.0%}")
                print("   of continuations match within ±10% of needle position.")
            else:
                print("MIXED — no single position dominates. Continuations are spread across the haystack.")

    # Save
    out = REPO / "bench/results/niah/continuation_origin.json"
    with open(out, "w") as f:
        json.dump({
            "n_failures": len(failures),
            "n_with_match": len(valid),
            "buckets": dict(buckets),
            "mean_position": mean_pos if valid else None,
            "mean_offset_end": mean_to_end if valid else None,
        }, f, indent=2)
    print(f"\nSaved to {out}")


if __name__ == "__main__":
    main()
