Commit 45f5d58
paper(working-memory-cliff): v0.5 — Phase 2C complete, RLV promoted to primary Phase 3
Phase 2C anchor mitigation control: 36 trials testing two prompt-level
anchor-strengthening interventions against the cliff. Both failed.
arm        1024  1280  1536  2048  total
baseline    2/3   0/3   0/3   0/3   2/12
PQRI        0/3   0/3   0/3   0/3   0/12
convchunk   0/3   0/3   0/3   0/3   0/12
PQRI inserts "[REMINDER: <question>]" markers every ~256 tokens at
sentence boundaries inside the haystack. convchunk wraps the haystack
as 4 separate <|user|> turns, each repeating the question. Both
interventions performed *worse* than baseline at the pre-cliff control
cell (ctx=1024, 0/3 vs 2/3) — the added prompt overhead pushed the
borderline cell over the cliff edge — and neither moved the cliff at
ctx >= 1280.
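The PQRI intervention can be sketched as a small preprocessing step. This is an illustrative reconstruction, not the benchmark code: the ~256-token spacing is approximated by word count rather than real tokenisation, and `insert_reminders` is a hypothetical name.

```python
import re

def insert_reminders(haystack: str, question: str, every_tokens: int = 256) -> str:
    """Insert [REMINDER: <question>] markers at sentence boundaries,
    roughly every `every_tokens` tokens (approximated here as words)."""
    sentences = re.split(r"(?<=[.!?])\s+", haystack)
    out, since_last = [], 0
    for sentence in sentences:
        out.append(sentence)
        since_last += len(sentence.split())
        if since_last >= every_tokens:           # sentence boundary reached the budget
            out.append(f"[REMINDER: {question}]")
            since_last = 0
    return " ".join(out)
```

Note the overhead this adds: every marker costs prompt tokens, which is exactly what pushed the borderline ctx=1024 cell over the cliff.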
This is a strong negative result. It implies the cliff is *not* a
prompt-format-level phenomenon: even when chat-template tokens are physically
present at multiple locations in the prompt, the model's attention to
them collapses below the threshold needed to override the
document-continuation prior. The next viable mitigation directions
split into two classes:
(a) Attention-level intervention (e.g., SinkTrack-style instruction
injection into the BOS sink, or attention head re-weighting at
the cliff layers). These require model-internal access; quant.cpp
would need attention hook extensions. Multi-week C/Metal work.
(b) Cliff-avoidance architectures that respect the measured cliff as
a hard budget and never ask the model to retrieve from a region
larger than its effective working memory. Pure orchestration above
the existing CLI; no model changes.
This paper update promotes (b) to the primary Phase 3 candidate
under the name Read-Locate-Verify (RLV), modelled on the human
cognitive retrieval pattern:
Stage 1: GIST — chunked summarisation pass, structured outline
Stage 2: LOCATOR — outline + question -> region pointer
Stage 3: LOOKUP — load targeted region KV, answer question
Stage 4: VERIFY — gist + answer -> {confident, unsure, contradicted}
Stage 5: RESEARCH — retry with different region if verify fails
Stage 6: HONEST — explicit uncertainty if all retries fail
Each stage runs with the haystack kept below the measured cliff
budget (1024 tokens for Llama-3.2-3B-Q4 on the default loader path).
The architecture maps directly to how humans look up half-remembered
facts in documents larger than their working memory: build an index,
locate the rough region, read it in detail, verify against the index,
re-search if needed, output calibrated uncertainty.
Why RLV is feasible *now* with quant.cpp:
- save_context/load_context (.kv file persistence) lets each region
be precomputed once and mmap-loaded in milliseconds — Stage 3
inference is generation-bound, not prefill-bound.
- The Phase 1B cliff measurement gives a concrete *parameter* for
the architecture: region size in Stage 3 must stay below the
measured effective working memory, which we've already determined
per-model.
- The Phase 2B failure-mode characterisation tells us *why* RLV
avoids the cliff: each LLM call keeps the haystack small enough
that the chat-template anchor stays loud. The continuation prior
never has the document mass it would need to overpower the anchor.
- The Phase 2C negative result on PQRI/convchunk tells us *why we
should not try to fight the cliff*. Avoidance is the productive
direction.
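The precompute-once property of the first bullet can be sketched as an orchestrator helper. Assumptions are flagged inline: `run_prefill` is a hypothetical callable that would wrap the actual quant.cpp save_context invocation (whose CLI surface is not specified here), and the file layout is illustrative.

```python
from pathlib import Path
from typing import Callable

def precompute_regions(regions: list[str], cache_dir: Path,
                       run_prefill: Callable[[str, Path], None]) -> list[Path]:
    """Prefill each sub-cliff region once and persist its KV cache (.kv),
    so Stage 3 lookups are generation-bound, not prefill-bound.
    `run_prefill` is a placeholder for the real quant.cpp save_context call."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    kv_paths = []
    for i, region in enumerate(regions):
        kv = cache_dir / f"region_{i:03d}.kv"   # illustrative naming scheme
        if not kv.exists():                     # compute once, mmap-load thereafter
            run_prefill(region, kv)
        kv_paths.append(kv)
    return kv_paths
```

On a warm cache every Stage 3 call skips prefill entirely, which is what keeps the per-question latency dominated by generation.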
Phase 3 prototype plan (1 week, ~2-3 hours of compute):
- Day 1-2: Python orchestrator + 5-stage flow controller
- Day 3: Reproduce v0.12 Acme document QA benchmark (7 questions)
with RLV; expected to match the 7/7 from pure long-context
but with explicit verification
- Day 4-5: Stress test on 8000-token wikitext article with 10 questions
(single-hop + multi-hop). Compare 3 systems: vector RAG,
pure long-context, RLV. Predicted result: vector RAG fails
on multi-hop, pure long-context fails entirely (above
cliff), RLV succeeds with explicit uncertainty for the
rare cases it cannot resolve.
- Day 6-7: Write up + commit + community release
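The Day 4-5 three-system comparison can be expressed as a small scoring harness. A sketch under stated assumptions: the system callables and the `correct` scorer are hypothetical stand-ins, and questions are tagged single-hop vs multi-hop so the predicted failure modes separate cleanly.

```python
from typing import Callable

def compare_systems(questions: list[dict],
                    systems: dict[str, Callable[[str], str]],
                    correct: Callable[[str, dict], bool]) -> dict[str, dict]:
    """Score each QA system, tallying single-hop and multi-hop separately."""
    results = {}
    for name, ask in systems.items():
        tally = {"single-hop": [0, 0], "multi-hop": [0, 0]}
        for q in questions:
            got, total = tally[q["hops"]]
            tally[q["hops"]] = [got + correct(ask(q["text"]), q), total + 1]
        results[name] = {hops: f"{g}/{t}" for hops, (g, t) in tally.items()}
    return results
```

Reporting per-hop-count fractions (rather than a single accuracy) is what would let the predicted pattern show up: vector RAG dropping only the multi-hop column, pure long-context dropping both.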
Files updated:
- docs/paper/working-memory-cliff.md (v0.4 → v0.5)
+ new §4.6 anchor mitigation control with the 36-trial table
+ reframed §8 promoting RLV to primary Phase 3 candidate
+ comparison table: RAG vs long-context vs agentic vs RLV
+ concrete 1-week prototype plan with day-by-day breakdown
- docs/paper/working-memory-cliff.tex (regenerated from md)
- bench/results/niah/master_table.md
+ new section: anchor mitigation control with the failure table
+ total trial count 204 -> 240
- bench/results/niah/results_anchor_20260411T141243.csv (final 36-row)
- bench/results/niah/raw_anchor_20260411T141243.log (full per-run logs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 files changed: 636 additions, 19 deletions