
Commit f527939

unamedkr and claude committed
paper(working-memory-cliff): Phase 1B — FP32-weights control + arXiv-ready
Closes the quantization confound loop on the working memory cliff finding and ships an arXiv-submission-ready tech report draft.

R4 (FP32-weights control, 6 trials): the cliff sits in the same place when on-the-fly Q4 weight requantization is disabled.

  Weight precision      ctx=1024   ctx=1280
  Q4 (default)          100%       0%
  FP32 (TQ_NO_Q4=1)     100%       0%

Going from Q4 to FP32 weights eliminates any quantization artifact but does not move the transition. The cliff is therefore a model property (chat-template-anchored instruction-following robustness), not a weight-quantization artifact and not a KV-cache artifact.

The above-cliff failure mode is identical between Q4 and FP32 weights: all three FP32 ctx=1280 trials produced wikitext continuation ("Doctors , followed by a role in How to Curse..."), the same dominant failure mode as the Q4 grid.

CLI bug discovered during the seed-sweep attempt (R5): tools/quant.c documents `-s <seed>` in --help but does not implement it. There is no parser case for `-s`, no rng_seed config field, and the underlying tq_sample_topp call hardcodes rng_state=42 per CLI invocation. As a result, in all 60 attempted seed-sweep trials the seed value was taken as the model path ("model path = <seed>", e.g., `Loading model from 42... cannot open '42'`). Documented in §5.6 as a limitation. Filing the CLI fix as a separate quant.cpp issue. The seed-sweep artifacts are committed for transparency (they are evidence of the bug discovery, not data).

Tech report draft v0.3 (docs/paper/working-memory-cliff.md, 292 lines):
- §1: TL;DR with concrete cliff numbers and the FP32 control finding
- §2: Related Work (KIVI, H2O, SnapKV, PyramidKV, NIAH, Lost in the Middle, RULER, LongBench, MLC-LLM) with explicit comparison
- §3: Method - protocol, models, KV configs, grid
- §4: Results - six tables (1B, 3B, summary, neutrality, FP32 control) + failure mode taxonomy
- §5: Negative findings (prompt format trap, panic output, 8B problem, single-language scope, single prompt format, seed-sweep CLI bug)
- §6: Discussion - what "long-context replaces RAG" means at the edge, with the 0.4–0.78% effective-window numbers
- §7: Reproducibility - exact CLI commands, CSV file references, git commit hash for fixed-version reproduction
- §8: Future work (8B+, mechanistic interpretability, cross-lingual)
- §9: References (9 citations, all open arXiv)

Submission package (docs/paper/):
- working-memory-cliff.md: single-source markdown
- working-memory-cliff.tex: auto-generated LaTeX (517 lines)
- md2tex.py: pure-Python markdown → arXiv LaTeX converter, no pandoc dependency
- build.sh: pandoc-or-fallback build script
- arxiv-metadata.md: abstract (280 words), classification, keywords, submission checklist
- hf-blog-draft.md: HuggingFace blog post (151 lines, friendly tone, ready for publication)
- twitter-thread.md: 10-tweet launch thread + 5 anticipated criticism responses

Next-step option for the user: the user can submit working-memory-cliff.tex to arXiv directly, publish hf-blog-draft.md to HuggingFace, and queue the twitter thread for simultaneous launch. The CLI seed bug is a separate small fix tracked outside this commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
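The `model path = <seed>` symptom from the seed-sweep attempt is what one would expect if an unparsed flag's value falls through to the positional model-path slot. The sketch below illustrates that mechanism with a toy parser. It is an assumption made for illustration only: `parse_args`, `KNOWN_VALUE_FLAGS`, and `KNOWN_BOOL_FLAGS` are hypothetical names, and none of this is taken from the actual `tools/quant.c` source.

```python
# Toy argument parser (hypothetical; NOT the real tools/quant.c code).
# It shows how a flag that is documented but never parsed can leak its
# value into the positional "model path" slot.

KNOWN_VALUE_FLAGS = {"-p", "-n", "-T", "-j", "--ctx", "-k"}  # each consumes one value
KNOWN_BOOL_FLAGS = {"--chat"}                                # take no value


def parse_args(argv):
    """Return (model_path, options) for a quant-style command line."""
    model_path = None
    options = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in KNOWN_VALUE_FLAGS:
            options[arg] = argv[i + 1]
            i += 2
        elif arg in KNOWN_BOOL_FLAGS:
            options[arg] = True
            i += 1
        elif arg.startswith("-"):
            # Unknown flag: silently skipped, and its value is NOT consumed.
            i += 1
        else:
            # Bare token: treated as the model path; the last one wins.
            model_path = arg
            i += 1
    return model_path, options


argv = ["models/Llama-3.2-1B-Instruct-Q8_0.gguf", "-p", "hello", "-n", "32",
        "-T", "0.0", "-s", "42", "-j", "8", "--chat", "--ctx", "1280", "-k", "fp32"]
model_path, options = parse_args(argv)
print(f"Loading model from {model_path}...")
```

In this toy model the stray `42` overwrites the real model path because `-s` is skipped without consuming its value. Either rejecting unknown flags or adding an explicit `-s` case would surface the problem at parse time instead of at model load.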
1 parent 56d750b commit f527939

14 files changed

Lines changed: 2226 additions & 11 deletions

bench/niah_fp32_control.sh

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+# Minimal FP32-weights control experiment for the working memory cliff
+# tech report. Reuses bench/niah_test.sh's prompt format and scoring but
+# only runs the 6 cells that bracket the cliff transition (the full
+# 36-run grid is infeasible because the FP32-weights path runs at
+# ~0.02 tok/s on Metal; see TQ_NO_Q4 in tools/quant.c).
+#
+# Goal: measure whether the cliff location depends on weight precision.
+# If 3B FP32 at ctx=1024 passes and ctx=1280 fails (matching the Q4
+# default), the cliff is *independent* of weight quantization, meaning
+# the ceiling is a property of the model's instruction-following
+# robustness rather than its weight precision.
+
+set -e
+export LC_ALL=C
+export LANG=C
+
+TQ=${TQ:-./build_metal/quant}
+MODEL=${MODEL:-models/Llama-3.2-3B-Instruct-Q8_0.gguf}
+THREADS=${THREADS:-8}
+OUT_DIR=bench/results/niah
+RUN_ID=$(date -u +%Y%m%dT%H%M%S)
+RAW_LOG="$OUT_DIR/raw_fp32ctrl_${RUN_ID}.log"
+RESULT_CSV="$OUT_DIR/results_fp32ctrl_${RUN_ID}.csv"
+
+mkdir -p "$OUT_DIR"
+echo "method,context,depth,needle_idx,pass,response" > "$RESULT_CSV"
+
+# Same three needles as the main grid
+NEEDLE_0="The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023."
+QUESTION_0="Who is the chief financial officer of Northwind Logistics? Answer with the full name."
+KEYWORD_0="Sarah\|Chen"
+
+NEEDLE_1="The launch date for Project Aurora is November 14th in San Francisco."
+QUESTION_1="When and where will Project Aurora launch? Answer in one sentence."
+KEYWORD_1="November\|San Francisco"
+
+NEEDLE_2="The reactor cooling tank at the Helios facility holds exactly eight thousand liters of distilled water."
+QUESTION_2="How much distilled water does the reactor cooling tank at Helios hold?"
+KEYWORD_2="eight thousand\|8000\|8,000"
+
+NEEDLES=("$NEEDLE_0" "$NEEDLE_1" "$NEEDLE_2")
+QUESTIONS=("$QUESTION_0" "$QUESTION_1" "$QUESTION_2")
+KEYWORDS=("$KEYWORD_0" "$KEYWORD_1" "$KEYWORD_2")
+
+# Just the cliff transition cells
+CONTEXTS=(1024 1280)
+DEPTH=0.5 # mid-document only; depth sensitivity already characterised
+
+build_prompt() {
+  local ctx_tokens="$1" needle="$2" question="$3"
+  python3 - "$ctx_tokens" "$needle" "$question" <<'PYEOF'
+import sys
+ctx_tokens=int(sys.argv[1]); needle=sys.argv[2]; question=sys.argv[3]
+with open("bench/data/wikitext2_test.txt") as f:
+    raw=f.read()
+target=int(ctx_tokens*3.6)
+hay=raw[:target]
+end=hay.rfind(". ")
+if end>0: hay=hay[:end+1]
+sb=hay.rfind(". ", 0, max(len(hay)//2,2))
+sb = 0 if sb<0 else sb+2
+h=hay[:sb]+needle+" "+hay[sb:]
+sys.stdout.write(h+"\n\nQuestion: "+question)
+PYEOF
+}
+
+run_idx=0
+total=$(( ${#CONTEXTS[@]} * ${#NEEDLES[@]} ))
+
+echo "==> FP32-weights control experiment"
+echo " binary: $TQ"
+echo " model: $MODEL"
+echo " flag: TQ_NO_Q4=1 (loads weights as FP32)"
+echo " cells: contexts=${CONTEXTS[*]} depth=$DEPTH needles=${#NEEDLES[@]}"
+echo " raw: $RAW_LOG"
+echo " results: $RESULT_CSV"
+echo ""
+
+for ctx in "${CONTEXTS[@]}"; do
+  cli_ctx=$(( ctx + 256 ))
+  for ni in "${!NEEDLES[@]}"; do
+    run_idx=$(( run_idx + 1 ))
+    needle="${NEEDLES[$ni]}"
+    question="${QUESTIONS[$ni]}"
+    keyword="${KEYWORDS[$ni]}"
+
+    prompt=$(build_prompt "$ctx" "$needle" "$question")
+    printf "[%d/%d] fp32-w ctx=%d needle=%d " "$run_idx" "$total" "$ctx" "$ni"
+
+    out=$(TQ_NO_Q4=1 "$TQ" "$MODEL" -p "$prompt" -n 32 -T 0.0 -j "$THREADS" \
+      --chat --ctx "$cli_ctx" -k fp32 2>&1 || true)
+
+    resp=$(echo "$out" | awk '
+      /^---$/ { n++; next }
+      n==1 && /^\[tokenizer\]/ { next }
+      n==1 { print }
+    ' || true)
+    if [ -z "$resp" ]; then resp=$(echo "$out" | tail -3 | head -1); fi
+
+    resp_csv=$(echo "$resp" | tr '\n' ' ' | sed 's/"/""/g')
+    if echo "$resp" | grep -qiE "$(echo "$keyword" | sed 's/\\|/|/g')"; then
+      pass=1; echo "PASS"
+    else
+      pass=0; echo "FAIL: ${resp:0:60}"
+    fi
+
+    echo "fp32-weights,$ctx,$DEPTH,$ni,$pass,\"$resp_csv\"" >> "$RESULT_CSV"
+    echo "===== fp32-weights ctx=$ctx needle=$ni =====" >> "$RAW_LOG"
+    echo "$out" >> "$RAW_LOG"
+    echo "" >> "$RAW_LOG"
+  done
+done
+
+echo ""
+echo "==> Summary by context:"
+for ctx in "${CONTEXTS[@]}"; do
+  pass=$(awk -F, -v c="$ctx" 'NR>1 && $2==c {p+=$5; t++} END{printf "%d/%d", p, t}' "$RESULT_CSV")
+  pct=$(awk -F, -v c="$ctx" 'NR>1 && $2==c {p+=$5; t++} END{if(t>0)printf "%.0f%%", 100*p/t}' "$RESULT_CSV")
+  printf " ctx=%-5d %s (%s)\n" "$ctx" "$pass" "$pct"
+done
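The heredoc inside `build_prompt` is dense, so here is a sketch of the same haystack construction as a standalone Python function. It takes the corpus text as an argument instead of reading `bench/data/wikitext2_test.txt`. The `depth` parameter and the `CHARS_PER_TOKEN` name are assumptions added for illustration: the script itself hardcodes the midpoint and only records `DEPTH` in the CSV.

```python
CHARS_PER_TOKEN = 3.6  # same rough chars-per-token estimate the script uses


def build_prompt(raw, ctx_tokens, needle, question, depth=0.5):
    """Cut a haystack from `raw` and insert `needle` at a sentence boundary near `depth`."""
    hay = raw[: int(ctx_tokens * CHARS_PER_TOKEN)]
    end = hay.rfind(". ")
    if end > 0:
        hay = hay[: end + 1]            # trim to the last complete sentence
    # Last sentence boundary that lies before the requested depth.
    cut = hay.rfind(". ", 0, max(int(len(hay) * depth), 2))
    cut = 0 if cut < 0 else cut + 2     # start of the sentence after that boundary
    return hay[:cut] + needle + " " + hay[cut:] + "\n\nQuestion: " + question


# Synthetic corpus so the example runs without the wikitext file.
raw = " ".join(f"Filler sentence number {i}." for i in range(200))
needle = "The launch date for Project Aurora is November 14th in San Francisco."
question = "When and where will Project Aurora launch? Answer in one sentence."
prompt = build_prompt(raw, 100, needle, question)
print(prompt[:200])
```

Because the haystack length is a character estimate, the real token count of the prompt can differ from `ctx_tokens`; the script's `cli_ctx=$(( ctx + 256 ))` headroom suggests the author allowed for that.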

bench/niah_seed_sweep.sh

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+#!/usr/bin/env bash
+# Cliff-cell seed sweep for the working memory cliff tech report.
+# Runs the two cliff transition cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280)
+# with 5 random seeds × 3 needles × 2 methods = 30 trials per cell (60 in total).
+#
+# Goal: confirm whether the cliff cell's mid-failure rate (1B fp32 4/9
+# at ctx=1024) is statistically distinguishable from random or whether
+# it's binomial noise. With 5 seeds × 3 needles = 15 samples per
+# (model, ctx, method) combination we can compute proper Wilson
+# confidence intervals.
+
+set -e
+export LC_ALL=C
+export LANG=C
+
+TQ=${TQ:-./build_metal/quant}
+THREADS=${THREADS:-8}
+OUT_DIR=bench/results/niah
+RUN_ID=$(date -u +%Y%m%dT%H%M%S)
+RAW_LOG="$OUT_DIR/raw_seedsweep_${RUN_ID}.log"
+RESULT_CSV="$OUT_DIR/results_seedsweep_${RUN_ID}.csv"
+
+mkdir -p "$OUT_DIR"
+echo "model,method,context,depth,needle_idx,seed,pass,response" > "$RESULT_CSV"
+
+NEEDLE_0="The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023."
+QUESTION_0="Who is the chief financial officer of Northwind Logistics? Answer with the full name."
+KEYWORD_0="Sarah\|Chen"
+
+NEEDLE_1="The launch date for Project Aurora is November 14th in San Francisco."
+QUESTION_1="When and where will Project Aurora launch? Answer in one sentence."
+KEYWORD_1="November\|San Francisco"
+
+NEEDLE_2="The reactor cooling tank at the Helios facility holds exactly eight thousand liters of distilled water."
+QUESTION_2="How much distilled water does the reactor cooling tank at Helios hold?"
+KEYWORD_2="eight thousand\|8000\|8,000"
+
+NEEDLES=("$NEEDLE_0" "$NEEDLE_1" "$NEEDLE_2")
+QUESTIONS=("$QUESTION_0" "$QUESTION_1" "$QUESTION_2")
+KEYWORDS=("$KEYWORD_0" "$KEYWORD_1" "$KEYWORD_2")
+
+# (model, ctx) cliff cells to sample
+CELL_MODELS=(
+  "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
+  "models/Llama-3.2-3B-Instruct-Q8_0.gguf"
+)
+CELL_CONTEXTS=(1024 1280)
+CELL_NAMES=("1B" "3B")
+DEPTH=0.5
+SEEDS=(42 1337 7 2024 31415)
+
+METHOD_NAMES=("fp32" "turbo_q4_w128")
+METHOD_FLAGS=("-k fp32" "-k turbo_kv_4b -v q4 --k-window 128")
+
+build_prompt() {
+  local ctx_tokens="$1" needle="$2" question="$3"
+  python3 - "$ctx_tokens" "$needle" "$question" <<'PYEOF'
+import sys
+ctx_tokens=int(sys.argv[1]); needle=sys.argv[2]; question=sys.argv[3]
+with open("bench/data/wikitext2_test.txt") as f:
+    raw=f.read()
+target=int(ctx_tokens*3.6)
+hay=raw[:target]
+end=hay.rfind(". ")
+if end>0: hay=hay[:end+1]
+sb=hay.rfind(". ", 0, max(len(hay)//2,2))
+sb = 0 if sb<0 else sb+2
+h=hay[:sb]+needle+" "+hay[sb:]
+sys.stdout.write(h+"\n\nQuestion: "+question)
+PYEOF
+}
+
+total=$(( ${#CELL_MODELS[@]} * ${#NEEDLES[@]} * ${#SEEDS[@]} * ${#METHOD_NAMES[@]} ))
+run_idx=0
+
+echo "==> NIAH cliff-cell seed sweep"
+echo " binary: $TQ"
+echo " cells: ${CELL_NAMES[*]}"
+echo " seeds: ${SEEDS[*]}"
+echo " total: $total trials"
+echo " raw: $RAW_LOG"
+echo " csv: $RESULT_CSV"
+echo ""
+
+for ci in "${!CELL_MODELS[@]}"; do
+  model="${CELL_MODELS[$ci]}"
+  ctx="${CELL_CONTEXTS[$ci]}"
+  cell_name="${CELL_NAMES[$ci]}"
+  cli_ctx=$(( ctx + 256 ))
+
+  for mi in "${!METHOD_NAMES[@]}"; do
+    mname="${METHOD_NAMES[$mi]}"
+    mflags="${METHOD_FLAGS[$mi]}"
+    for ni in "${!NEEDLES[@]}"; do
+      needle="${NEEDLES[$ni]}"
+      question="${QUESTIONS[$ni]}"
+      keyword="${KEYWORDS[$ni]}"
+      prompt=$(build_prompt "$ctx" "$needle" "$question")
+
+      for seed in "${SEEDS[@]}"; do
+        run_idx=$(( run_idx + 1 ))
+        printf "[%3d/%d] %-2s %-14s ctx=%-5d needle=%d seed=%-5d " \
+          "$run_idx" "$total" "$cell_name" "$mname" "$ctx" "$ni" "$seed"
+
+        out=$( "$TQ" "$model" -p "$prompt" -n 32 -T 0.0 -s "$seed" -j "$THREADS" \
+          --chat --ctx "$cli_ctx" $mflags 2>&1 || true )
+
+        resp=$(echo "$out" | awk '
+          /^---$/ { n++; next }
+          n==1 && /^\[tokenizer\]/ { next }
+          n==1 { print }
+        ' || true)
+        if [ -z "$resp" ]; then resp=$(echo "$out" | tail -3 | head -1); fi
+        resp_csv=$(echo "$resp" | tr '\n' ' ' | sed 's/"/""/g')
+
+        if echo "$resp" | grep -qiE "$(echo "$keyword" | sed 's/\\|/|/g')"; then
+          pass=1; echo "PASS"
+        else
+          pass=0; echo "FAIL: ${resp:0:50}"
+        fi
+
+        echo "$cell_name,$mname,$ctx,$DEPTH,$ni,$seed,$pass,\"$resp_csv\"" >> "$RESULT_CSV"
+        echo "===== $cell_name $mname ctx=$ctx needle=$ni seed=$seed =====" >> "$RAW_LOG"
+        echo "$out" >> "$RAW_LOG"
+        echo "" >> "$RAW_LOG"
+      done
+    done
+  done
+done
+
+echo ""
+echo "==> Summary by (model × method):"
+for ci in "${!CELL_MODELS[@]}"; do
+  cell_name="${CELL_NAMES[$ci]}"
+  for mname in "${METHOD_NAMES[@]}"; do
+    pass=$(awk -F, -v cn="$cell_name" -v m="$mname" 'NR>1 && $1==cn && $2==m {p+=$7; t++} END{printf "%d/%d", p, t}' "$RESULT_CSV")
+    pct=$(awk -F, -v cn="$cell_name" -v m="$mname" 'NR>1 && $1==cn && $2==m {p+=$7; t++} END{if(t>0)printf "%.0f%%", 100*p/t}' "$RESULT_CSV")
+    printf " %-2s %-14s %s (%s)\n" "$cell_name" "$mname" "$pass" "$pct"
+  done
+done
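The script header mentions Wilson confidence intervals but does not compute them. A minimal sketch of the standard Wilson score interval follows, assuming a two-sided 95% level (`z = 1.96`). The function name `wilson_interval` is hypothetical and not part of the repository, and the pass counts in the loop are illustrative inputs (the 4/9 figure is from the script comment), not new measurements.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95% two-sided)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative pass counts: the 4/9 cliff cell, a 3-trial cell, a 15-trial cell.
for passes, trials in [(4, 9), (0, 3), (3, 3), (0, 15), (15, 15)]:
    low, high = wilson_interval(passes, trials)
    print(f"{passes}/{trials}: [{low:.2f}, {high:.2f}]")
```

The Wilson interval stays informative at 0% and 100% pass rates, where the simpler normal-approximation interval collapses to zero width. That matters here because most cells in this grid sit at one extreme or the other.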

bench/results/niah/master_table.md

Lines changed: 20 additions & 3 deletions
@@ -1,10 +1,10 @@
-# NIAH Master Table: Phase 1 Working Memory Cliff Measurements
+# NIAH Master Table: Phase 1B Working Memory Cliff Measurements
 
 **Date**: 2026-04-11
 **Hardware**: Apple M-series, Metal kernel path (`build_metal/quant`)
-**Protocol**: 3 needles × 3 depths (0.1, 0.5, 0.9) per (model, ctx, KV-config) cell.
+**Protocol**: 3 needles × 3 depths (0.1, 0.5, 0.9) per (model, ctx, KV-config) cell, plus a 6-trial FP32-weights control at the cliff transition.
 **Scoring**: case-insensitive ERE grep for keywords, against 32-token greedy generation.
-**Total trials**: 198 (90 + 36 + 72).
+**Total trials**: 204 (R1=36 + R2=90 + R3=72 + R4=6).
 
 ---
 
@@ -60,6 +60,23 @@ The headline result remains: **KV compression preserves whatever the model can a
 
 ---
 
+## Weight-precision control (R4): the cliff is invariant to weight quantization
+
+The default `quant.cpp` loader silently re-quantizes Q8_0 GGUF weights to Q4 in memory. To eliminate the possibility that the cliff is an artifact of this requantization, we re-ran the cliff transition cells with the FP32-weights loader path (`TQ_NO_Q4=1`).
+
+| Weight precision | ctx=1024 | ctx=1280 |
+|---|---:|---:|
+| Q4 (default loader) | **100%** (18/18) | **0%** (0/18) |
+| **FP32** (`TQ_NO_Q4=1`) | **100% (3/3)** | **0% (0/3)** |
+
+**Identical cliff location at 8× per-parameter precision.** The model's instruction-following collapse happens at the same context length whether weights are Q4 or FP32: the cliff is a *model* property, not a weight quantization artifact.
+
+Above-cliff failure mode is also identical between Q4 and FP32 weights: all three FP32 ctx=1280 trials produced wikitext continuation ("Doctors , followed by a role in the 2007 theatre production of How to Curse..."), the same dominant failure mode as the Q4 grid.
+
+Source: `bench/results/niah/results_fp32ctrl_20260411T091023.csv` (6 trials, 60 minutes wall time on Metal; FP32-weights inference runs at ~10 min per trial, hence the small grid).
+
+---
+
 ## Failure mode taxonomy (qualitative)
 
 When the model fails above the cliff, it does not say "I don't know." It produces one of:
