
Commit 45f5d58

unamedkr and claude committed
paper(working-memory-cliff): v0.5 — Phase 2C complete, RLV promoted to primary Phase 3
Phase 2C anchor mitigation control: 36 trials testing two prompt-level anchor-strengthening interventions against the cliff. Both failed.

| arm       | 1024 | 1280 | 1536 | 2048 | total |
|-----------|-----:|-----:|-----:|-----:|------:|
| baseline  | 2/3  | 0/3  | 0/3  | 0/3  | 2/12  |
| PQRI      | 0/3  | 0/3  | 0/3  | 0/3  | 0/12  |
| convchunk | 0/3  | 0/3  | 0/3  | 0/3  | 0/12  |

PQRI inserts "[REMINDER: <question>]" markers every ~256 tokens at sentence boundaries inside the haystack. convchunk wraps the haystack as 4 separate <|user|> turns, each repeating the question.

Both interventions performed *worse* than baseline at the pre-cliff control cell (ctx=1024, 0/3 vs 2/3) — the added prompt overhead pushed the borderline cell over the cliff edge — and neither moved the cliff at ctx >= 1280.

This is a strong negative result. It implies the cliff is *not* at the prompt format level: even when chat-template tokens are physically present at multiple locations in the prompt, the model's attention to them collapses below the threshold needed to override the document-continuation prior.

The next viable mitigation directions split into two classes:

(a) Attention-level interventions (e.g., SinkTrack-style instruction injection into the BOS sink, or attention head re-weighting at the cliff layers). These require model-internal access; quant.cpp would need attention hook extensions. Multi-week C/Metal work.

(b) Cliff-avoidance architectures that respect the measured cliff as a hard budget and never ask the model to retrieve from a region larger than its effective working memory. Pure orchestration above the existing CLI; no model changes.
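Both arms are pure prompt transforms. A minimal Python sketch of the two interventions (the function names, the whitespace-split token approximation, and the message-dict shape are illustrative assumptions, not the actual benchmark harness):

```python
import re

def insert_reminders(haystack: str, question: str, stride: int = 256) -> str:
    """PQRI arm (sketch): drop a [REMINDER: <question>] marker roughly every
    `stride` tokens, snapping to sentence boundaries. Token counts are
    approximated by whitespace splitting, not by the model tokenizer."""
    marker = f"[REMINDER: {question}]"
    out, since_marker = [], 0
    for sent in re.split(r"(?<=[.!?])\s+", haystack):
        out.append(sent)
        since_marker += len(sent.split())
        if since_marker >= stride:
            out.append(marker)
            since_marker = 0
    return " ".join(out)

def convchunk(haystack: str, question: str, n_turns: int = 4) -> list[dict]:
    """convchunk arm (sketch): split the haystack into n_turns separate user
    messages for the chat template, each repeating the question."""
    words = haystack.split()
    step = -(-len(words) // n_turns)  # ceiling division
    return [{"role": "user",
             "content": f"{' '.join(words[i:i + step])}\n\n{question}"}
            for i in range(0, len(words), step)]
```

Note that both transforms add prompt overhead, which is exactly what pushed the borderline ctx=1024 cell over the edge.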
This paper update promotes (b) to the primary Phase 3 candidate under the name Read-Locate-Verify (RLV), modelled on the human cognitive retrieval pattern:

  Stage 1: GIST     — chunked summarisation pass, structured outline
  Stage 2: LOCATOR  — outline + question -> region pointer
  Stage 3: LOOKUP   — load targeted region KV, answer question
  Stage 4: VERIFY   — gist + answer -> {confident, unsure, contradicted}
  Stage 5: RESEARCH — retry with a different region if verify fails
  Stage 6: HONEST   — explicit uncertainty if all retries fail

Each stage runs with the haystack kept below the measured cliff budget (1024 tokens for Llama-3.2-3B-Q4 on the default loader path). The architecture maps directly to how humans look up half-remembered facts in documents larger than their working memory: build an index, locate the rough region, read it in detail, verify against the index, re-search if needed, and output calibrated uncertainty.

Why RLV is feasible *now* with quant.cpp:

- save_context/load_context (.kv file persistence) lets each region be precomputed once and mmap-loaded in milliseconds — Stage 3 inference is generation-bound, not prefill-bound.
- The Phase 1B cliff measurement gives a concrete *parameter* for the architecture: region size in Stage 3 must stay below the measured effective working memory, which we have already determined per model.
- The Phase 2B failure-mode characterisation tells us *why* RLV avoids the cliff: each LLM call keeps the haystack small enough that the chat-template anchor stays loud. The continuation prior never has the document mass it would need to overpower the anchor.
- The Phase 2C negative result on PQRI/convchunk tells us *why we should not try to fight the cliff*. Avoidance is the productive direction.
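Because RLV is pure orchestration, the six-stage flow can be sketched without any quant.cpp changes. A minimal Python skeleton (the callable signatures, the `CLIFF_BUDGET` constant, and the retry policy are illustrative assumptions; in the real prototype each callable would shell out to the existing CLI):

```python
from dataclasses import dataclass
from typing import Callable, Optional

CLIFF_BUDGET = 1024  # measured effective working memory in tokens (Phase 1B)

@dataclass
class RLVResult:
    answer: Optional[str]
    verdict: str  # "confident" | "unsure" | "contradicted" | "honest-unknown"

def rlv(question: str,
        regions: list[str],                      # pre-chunked below CLIFF_BUDGET
        gist: Callable[[str], str],              # Stage 1: region -> outline entry
        locate: Callable[[str, str, set], int],  # Stage 2: outline, question, tried -> index
        lookup: Callable[[str, str], str],       # Stage 3: region, question -> answer
        verify: Callable[[str, str, str], str],  # Stage 4: outline, question, answer -> verdict
        max_retries: int = 2) -> RLVResult:
    outline = "\n".join(gist(r) for r in regions)    # Stage 1: GIST
    tried: set = set()
    for _ in range(max_retries + 1):
        idx = locate(outline, question, tried)       # Stage 2: LOCATOR
        if not (0 <= idx < len(regions)) or idx in tried:
            break
        tried.add(idx)
        answer = lookup(regions[idx], question)      # Stage 3: LOOKUP
        verdict = verify(outline, question, answer)  # Stage 4: VERIFY
        if verdict == "confident":
            return RLVResult(answer, verdict)
        # Stage 5: RESEARCH: loop around and locate a different region
    return RLVResult(None, "honest-unknown")         # Stage 6: HONEST
```

In the real Stage 3, `lookup` would mmap-load the region's precomputed .kv file via load_context, so each retry pays only generation cost, not prefill.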
Phase 3 prototype plan (1 week, ~2-3 hours of compute):

- Day 1-2: Python orchestrator + 5-stage flow controller
- Day 3: Reproduce the v0.12 Acme document QA benchmark (7 questions) with RLV; expected to match the 7/7 from pure long-context, but with explicit verification
- Day 4-5: Stress test on an 8000-token wikitext article with 10 questions (single-hop + multi-hop). Compare 3 systems: vector RAG, pure long-context, RLV. Predicted result: vector RAG fails on multi-hop, pure long-context fails entirely (above the cliff), RLV succeeds with explicit uncertainty for the rare cases it cannot resolve.
- Day 6-7: Write-up + commit + community release

Files updated:

- docs/paper/working-memory-cliff.md (v0.4 -> v0.5)
  + new §4.6 anchor mitigation control with the 36-trial table
  + reframed §8 promoting RLV to primary Phase 3 candidate
  + comparison table: RAG vs long-context vs agentic vs RLV
  + concrete 1-week prototype plan with day-by-day breakdown
- docs/paper/working-memory-cliff.tex (regenerated from md)
- bench/results/niah/master_table.md
  + new section: anchor mitigation control with the failure table
  + total trial count 204 -> 240
- bench/results/niah/results_anchor_20260411T141243.csv (final 36-row)
- bench/results/niah/raw_anchor_20260411T141243.log (full per-run logs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 78b007c commit 45f5d58

5 files changed

Lines changed: 636 additions & 19 deletions

File tree

bench/results/niah/master_table.md

Lines changed: 19 additions & 1 deletion
@@ -4,7 +4,7 @@
 **Hardware**: Apple M-series, Metal kernel path (`build_metal/quant`)
 **Protocol**: 3 needles × 3 depths (0.1, 0.5, 0.9) per (model, ctx, KV-config) cell, plus a 6-trial FP32-weights control at the cliff transition.
 **Scoring**: case-insensitive ERE grep for keywords, against 32-token greedy generation.
-**Total trials**: 204 (R1=36 + R2=90 + R3=72 + R4=6).
+**Total trials**: 240 (R1=36 + R2=90 + R3=72 + R4=6 + R5=36 anchor mitigation control).

 ---

@@ -77,6 +77,24 @@ Source: `bench/results/niah/results_fp32ctrl_20260411T091023.csv` (6 trials, 60

 ---

+## Anchor mitigation control (R5): prompt-level interventions fail to move the cliff
+
+Phase 2C tested whether two intuitive prompt-level interventions could move the cliff by spatially refreshing the chat-template anchor. Both failed.
+
+| Arm | ctx=1024 | ctx=1280 | ctx=1536 | ctx=2048 | Total |
+|---|---:|---:|---:|---:|---:|
+| baseline | 2/3 | 0/3 | 0/3 | 0/3 | **2/12** |
+| **PQRI** (`[REMINDER:]` every ~256 tok) | **0/3** | **0/3** | **0/3** | **0/3** | **0/12** |
+| **convchunk** (4 user turns × question each) | **0/3** | **0/3** | **0/3** | **0/3** | **0/12** |
+
+Both interventions performed *worse* than baseline at the pre-cliff control cell (ctx=1024) — the added prompt overhead pushed the borderline cell over the cliff edge. Neither moved the cliff itself.
+
+**Implication**: the cliff is not at the prompt format level. Even when the chat-template tokens are physically present at multiple locations in the prompt, the model's attention to them collapses below the threshold needed to override the document-continuation prior. The next viable mitigation directions are either (a) attention-mechanism-level interventions (SinkTrack-style instruction injection into the BOS sink, or attention head re-weighting), which require model-internal access, or (b) cliff-avoidance architectures (Read-Locate-Verify) that respect the measured cliff as a hard budget and never ask the model to retrieve from a region larger than its effective working memory.
+
+Source: `bench/results/niah/results_anchor_20260411T141243.csv` (36 trials).
+
+---
+
 ## Failure mode taxonomy (qualitative)

 When the model fails above the cliff, it does not say "I don't know." It produces one of:
