Commit 56d750b
bench(niah)+paper: working memory cliff measured for 1B Q8 + 3B Q4 (162 trials)
Phase 1 of the arXiv tech report Karpathy loop. Measures the effective
working memory of two edge-device quantized LLMs and shows that 6.4× KV
cache compression is orthogonal to the cliff: it preserves whatever the
model can already retrieve without shifting where the model collapses.
Llama-3.2-1B-Instruct-Q8_0

  Method          ctx=256  ctx=512  ctx=1024  ctx=1536  ctx=2048
  fp32            8/9      9/9      4/9       0/9       0/9
  turbo_q4_w128   8/9      9/9      2/9       0/9       0/9

  Cliff: 512 → 1024 (graded), zero by 1536.
Llama-3.2-3B-Instruct-Q4 (default CLI loader)

  Method          ctx=512  ctx=1024  ctx=1280  ctx=1536  ctx=1792  ctx=2048
  fp32            9/9      9/9       0/9       0/9       0/9       0/9
  turbo_q4_w128   9/9      9/9       0/9       0/9       0/9       0/9

  Cliff: 1024 → 1280, step function with no degradation interval.
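
The cliff annotations can be read off the table rows mechanically. The sketch below is illustrative only: it assumes a hypothetical definition (the cliff begins right after the last context that reaches the row's best score), and the names `Cliff` and `locate_cliff` are not taken from bench/niah_test.sh or aggregate.py.

```python
from typing import NamedTuple, Optional, Sequence


class Cliff(NamedTuple):
    last_good: int           # largest context still at the row's best score
    first_bad: int           # next context in the tested grid
    graded: bool             # True if first_bad still retrieves something
    zero_by: Optional[int]   # first context past last_good that scores 0


def locate_cliff(contexts: Sequence[int], passes: Sequence[int]) -> Optional[Cliff]:
    """Hypothetical cliff definition: the drop right after the last peak score."""
    best = max(passes)
    # Index of the last context that still reaches the best score.
    peak = max(i for i, p in enumerate(passes) if p == best)
    if peak == len(passes) - 1:
        return None  # no drop observed inside the tested grid
    tail = zip(contexts[peak + 1:], passes[peak + 1:])
    zero_by = next((c for c, p in tail if p == 0), None)
    return Cliff(contexts[peak], contexts[peak + 1], passes[peak + 1] > 0, zero_by)


# Pass counts (out of 9) copied from the two tables in this commit message.
ctx_1b = [256, 512, 1024, 1536, 2048]
ctx_3b = [512, 1024, 1280, 1536, 1792, 2048]

print(locate_cliff(ctx_1b, [8, 9, 4, 0, 0]))      # 1B fp32
print(locate_cliff(ctx_1b, [8, 9, 2, 0, 0]))      # 1B turbo_q4_w128
print(locate_cliff(ctx_3b, [9, 9, 0, 0, 0, 0]))   # 3B, both methods
```

Under this definition both 1B rows give the same interval (512 → 1024, graded, zero by 1536) and the 3B row gives a step at 1024 → 1280, which matches the annotations in the commit message.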
Both models exhaust their effective working memory at <1% of their
nominal 128K context window. The "long-context replaces RAG" framing
holds at the memory-allocation level (fits in 9.5 GB on a 16 GB Mac) but
breaks at the retrieval level: the model stops following the
chat-template instruction long before the KV cache is full.
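
The "<1%" figure can be checked with simple arithmetic. This is a minimal sketch that assumes "128K" means 131072 tokens and uses the first failing context of each model from the tables; the variable and function names are illustrative.

```python
NOMINAL_CTX = 128 * 1024  # assumption: "128K" is read as 131072 tokens

# First context length at which retrieval degrades, per the tables above.
first_failing_ctx = {
    "Llama-3.2-1B-Instruct-Q8_0": 1024,
    "Llama-3.2-3B-Instruct-Q4": 1280,
}


def percent_of_nominal(ctx: int, nominal: int = NOMINAL_CTX) -> float:
    """Context length expressed as a percentage of the nominal window."""
    return 100.0 * ctx / nominal


for model, ctx in first_failing_ctx.items():
    print(f"{model}: ctx={ctx} is {percent_of_nominal(ctx):.2f}% of {NOMINAL_CTX}")
```

Both values come out below 1%, and the last fully passing contexts (512 and 1024) are smaller still.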
Compression neutrality (apples-to-apples vs FP32 baseline):
- 3B Q4: 0/10 cells disagree, +0.0 pp overall
- 1B Q8: 1/10 cells disagree (the cliff cell, both at noise floor),
  -4.4 pp overall (within binomial noise at n=9)
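
The overall deltas above can be recomputed from the per-context pass counts. The sketch below is a hedged illustration, not the logic in aggregate.py: the function names are hypothetical, and the exact two-sided Fisher test is one possible way to make "within binomial noise" concrete, not necessarily the check the author used. The commit message counts 10 cells per model; the tables only show per-context columns, so this sketch reports disagreements per column and does not try to reproduce that denominator.

```python
from math import comb

N = 9  # trials per table cell


def delta_pp(baseline, compressed, n=N):
    """Overall accuracy difference (compressed - baseline) in percentage points."""
    total = n * len(baseline)
    return 100.0 * (sum(compressed) - sum(baseline)) / total


def disagreeing_columns(baseline, compressed):
    """Number of per-context columns where the two methods score differently."""
    return sum(b != c for b, c in zip(baseline, compressed))


def fisher_two_sided(a, b, n=N):
    """Exact two-sided Fisher p-value for a/n versus b/n successes."""
    k_total = a + b

    def weight(k):
        # Unnormalised hypergeometric weight of seeing k successes in group one.
        return comb(n, k) * comb(n, k_total - k)

    observed = weight(a)
    lo, hi = max(0, k_total - n), min(n, k_total)
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(2 * n, k_total)


# Pass counts copied from the tables in this commit message.
fp32_1b, turbo_1b = [8, 9, 4, 0, 0], [8, 9, 2, 0, 0]
fp32_3b, turbo_3b = [9, 9, 0, 0, 0, 0], [9, 9, 0, 0, 0, 0]

print(f"1B: {delta_pp(fp32_1b, turbo_1b):+.1f} pp, "
      f"{disagreeing_columns(fp32_1b, turbo_1b)} column(s) differ")
print(f"3B: {delta_pp(fp32_3b, turbo_3b):+.1f} pp, "
      f"{disagreeing_columns(fp32_3b, turbo_3b)} column(s) differ")
print(f"cliff cell 4/9 vs 2/9: p = {fisher_two_sided(4, 2):.2f}")
```

The 1B delta is (19 - 21) / 45, which rounds to -4.4 pp, and the only differing column is ctx=1024. A p-value far above any usual threshold is consistent with treating that difference as noise at n=9.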
Failure mode taxonomy above the cliff (qualitative): wikitext
continuation, section header echo, and -- most consequential --
synthesised hallucinations that fuse the needle into the haystack
subject's biography ("In 2023 Boulter was hired as the chief
financial officer..."). This is the same silent-hallucination failure
that vector RAG produces on retrieval miss, happening in the regime
that was supposed to eliminate it.
Files added:
- docs/paper/working-memory-cliff.md
    arXiv-style tech report draft v0.2 with all results sections filled
- bench/results/niah/master_table.md
    Per-model cliff tables + compression-neutrality summary + failure
    mode taxonomy + reproduction commands
- bench/results/niah/results_2026041{1T043236,1T052319}.{csv,md}
    R2 (1B Q8 sweep, 90 trials) and R3 (3B Q4 ceiling, 72 trials) raw
    data + per-run aggregates
- bench/results/niah/raw_2026041{1T043236,1T052319}.log
    Per-run CLI outputs for audit
Files modified:
- bench/niah_test.sh
    + LC_ALL=C / LANG=C export so multibyte UTF-8 in model responses
      doesn't crash awk and abort the grid (macOS awk towc failure)
    + NIAH_CONTEXTS / NIAH_DEPTHS env-var override for ad-hoc grids
      without editing the case-based GRID modes
- bench/results/niah/aggregate.py
    + UTF-8 errors='replace' on CSV read so garbage bytes from model
      responses don't fail aggregation
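
The `errors='replace'` change to the CSV read can be illustrated as follows. This is a minimal sketch and not the code in aggregate.py: the function name `read_results` and the column names `ctx` and `response` are hypothetical.

```python
import csv
import os
import tempfile


def read_results(path):
    """Read a results CSV, substituting U+FFFD for undecodable bytes."""
    # errors="replace" keeps the read from raising UnicodeDecodeError when a
    # model response contains bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace", newline="") as fh:
        return list(csv.DictReader(fh))


# Demo: one valid multibyte character followed by two invalid bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"ctx,response\n512,caf\xc3\xa9 \xff\xfe\n")
    demo_path = tmp.name

rows = read_results(demo_path)
os.remove(demo_path)
print(rows)
```

Valid UTF-8 passes through unchanged, and each invalid byte becomes a replacement character, so one malformed response no longer aborts the whole aggregation.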
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 6db5943 commit 56d750b
10 files changed
Lines changed: 9720 additions & 17 deletions