import { Authors, Badges } from '@/components/utils'

# Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

<Authors
  authors="Ngoc Bui, Yale University; Shubham Sharma, JPMorganChase; Simran Lamba, JPMorganChase; Saumitra Mishra, JPMorganChase; Rex Ying, Yale University"
/>

<Badges
  venue="ICLR 2026"
  github="https://github.com/ngocbh/trimkv/"
  arxiv="https://arxiv.org/abs/2512.03324"
  pdf="https://arxiv.org/abs/2512.03324"
/>

## TL;DR

Large language models are getting better at handling long contexts, but there is still a major systems bottleneck: the **KV cache**. As a model generates more tokens, it stores more keys and values from previous steps, which increases memory use and slows inference.

Our paper introduces **TRIM-KV**, a new approach for memory-bounded inference that learns **which tokens are worth keeping** in the KV cache.
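
To make that growth concrete, here is a back-of-envelope estimate of the cache footprint for a Llama-3-8B-style configuration. The dimensions (32 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions, not numbers from the paper:

```python
# Rough KV cache footprint (assumed Llama-3-8B-like dims, fp16; illustrative only).
layers, kv_heads, head_dim, bytes_per_scalar = 32, 8, 128, 2

# Each token stores one key and one value vector per layer and KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_scalar

print(f"{bytes_per_token / 1024:.0f} KiB per token")           # 128 KiB
print(f"{32_000 * bytes_per_token / 1024**3:.1f} GiB at 32K")  # ~3.9 GiB
```

Every cached token keeps paying this cost until it is dropped, which is why bounding the cache matters.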

## Why this matters

Most existing KV eviction methods rely on recent attention patterns to decide what to keep. In practice, that can be brittle. A token that is not useful right now may still become important much later, especially in long reasoning or long-generation settings.

TRIM-KV takes a different view: instead of asking what the model attended to recently, we ask whether a token is **intrinsically important at the moment it is created**.

## The core idea

TRIM-KV adds a lightweight **retention gate** that assigns each token a score representing its long-term importance for a particular layer and head. This score gradually decays over time. When the cache reaches its memory budget, the model evicts the token with the lowest retention score.

This makes eviction a learned decision rather than a hand-designed heuristic.
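
As a rough mental model, here is a minimal PyTorch sketch of such a gate. The module shape, the sigmoid scorer, and the exponential decay form are illustrative assumptions on our part; see the paper and repository for the actual design.

```python
import torch
import torch.nn as nn

class RetentionGate(nn.Module):
    """Scores each token's long-term importance for one layer/head
    (illustrative sketch, not the paper's exact architecture)."""

    def __init__(self, head_dim: int, decay_rate: float = 0.01):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(head_dim, head_dim), nn.SiLU(), nn.Linear(head_dim, 1)
        )
        self.decay_rate = decay_rate

    def forward(self, keys: torch.Tensor, ages: torch.Tensor) -> torch.Tensor:
        # keys: (num_tokens, head_dim) -- token representations at creation time
        # ages: (num_tokens,)          -- decoding steps since each token was created
        base = torch.sigmoid(self.scorer(keys)).squeeze(-1)        # score at creation
        return base * torch.exp(-self.decay_rate * ages.float())   # decays over time
```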

## Why it works

Not all tokens contribute equally to future computation. Some carry key task information, such as instructions, facts, or core problem statements. Others are much less useful. TRIM-KV learns this distinction directly from token representations and keeps the most valuable tokens under a fixed memory budget.
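
Under a fixed budget, eviction then reduces to keeping the highest-scoring cache entries whenever the cache overflows. A hypothetical helper in the same spirit (the names and the top-k policy are ours, not the paper's):

```python
import torch

def evict_to_budget(keys, values, scores, budget: int):
    """Keep the `budget` tokens with the highest retention scores.
    keys/values: (num_tokens, head_dim); scores: (num_tokens,)."""
    if scores.numel() <= budget:
        return keys, values, scores
    keep = torch.topk(scores, budget).indices.sort().values  # keep token order stable
    return keys[keep], values[keep], scores[keep]
```

In a real decoder this check would run per layer and head after each generation step, so the cache never grows past its budget.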

Interestingly, the learned policy naturally recovers behaviors that resemble well-known heuristics such as sink tokens, sliding windows, and gist-like compression, without explicitly hard-coding them.

## Results

We evaluate TRIM-KV on a diverse set of long-context and long-generation benchmarks. Across these settings, TRIM-KV consistently outperforms strong KV eviction and learnable retrieval baselines, especially in low-memory regimes. In several cases, it even performs better than full-cache inference, suggesting that selective retention can also act as a form of regularization by filtering out noisy or uninformative tokens.

- Long Mathematical Reasoning (AIME24, GSM8K, MATH-500)
- Long Procedural Generation (LongProc)
- Long Memory Conversation (LongMemEval)
- Long-Context Understanding (SCBench, LongBench, LongBenchV2)

## Takeaway

TRIM-KV shows that efficient long-context inference is not just about compressing memory; it is about **learning what to remember**.

By predicting token importance at creation time, TRIM-KV turns KV cache eviction into a simple, trainable, and effective mechanism for scaling LLM inference under memory constraints.

---

## Citation

```bibtex
@article{bui2025cache,
  title={Cache What Lasts: Token Retention for Memory-Bounded {KV} Cache in {LLM}s},
  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}
```