
Commit f0dfdab

Merge pull request #24 from ngocbh/master
add trimkv
2 parents 785db53 + 87a587f commit f0dfdab

7 files changed

Lines changed: 94 additions & 1 deletion

File tree

5 image assets (88.4 KB, 181 KB, 78.4 KB, 540 KB, 90.8 KB)
app/projects/trimkv/page.mdx
config/publications.ts

app/projects/trimkv/page.mdx

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
import { Authors, Badges } from '@/components/utils'

# Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

<Authors
  authors="Ngoc Bui, Yale University; Shubham Sharma, JPMorganChase; Simran Lamba, JPMorganChase; Saumitra Mishra, JPMorganChase; Rex Ying, Yale University"
/>

<Badges
  venue="ICLR 2026"
  github="https://github.com/ngocbh/trimkv/"
  arxiv="https://arxiv.org/abs/2512.03324"
  pdf="https://arxiv.org/abs/2512.03324"
/>

## TL;DR

Large language models are getting better at handling long contexts, but there is still a major systems bottleneck: the **KV cache**. As a model generates more tokens, it stores more keys and values from previous steps, which increases memory use and slows inference.
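
To put rough numbers on it, here is a back-of-the-envelope sketch. The configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16) is an assumed Llama-3-8B-style setup chosen for illustration, not a figure from the paper. The cache grows linearly with sequence length and again with batch size, which is why bounding it matters:

```python
# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
print(f"per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

for context_len in (8_192, 32_768, 131_072):
    total_gib = kv_bytes_per_token * context_len / 1024**3
    print(f"{context_len:>7} tokens -> {total_gib:.1f} GiB of KV cache per sequence")
```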
Our paper introduces **TRIM-KV**, a new approach for memory-bounded inference that learns **which tokens are worth keeping** in the KV cache.

## Why this matters

Most existing KV eviction methods rely on recent attention patterns to decide what to keep. In practice, that can be brittle. A token that is not useful right now may still become important much later, especially in long reasoning or long-generation settings.

TRIM-KV takes a different view: instead of asking what the model attended to recently, we ask whether a token is **intrinsically important at the moment it is created**.

## The core idea

TRIM-KV adds a lightweight **retention gate** that assigns each token a score representing its long-term importance for a particular layer and head. This score gradually decays over time. When the cache reaches its memory budget, the model evicts the token with the lowest retention score.

This makes eviction a learned decision rather than a hand-designed heuristic.
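
As a minimal sketch of what such a gate could look like (our own PyTorch illustration, not the paper's implementation; the single linear projection per head and the exponential decay form are assumptions), the module below scores each token once at creation time and decays that score with age:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetentionGate(nn.Module):
    """Toy per-head retention gate: scores each token's long-term importance at
    creation time, then decays that score as the token ages. Illustrative only;
    the actual TRIM-KV parameterization may differ."""

    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.score_proj = nn.Linear(hidden_dim, num_heads)     # one score per head
        self.log_decay = nn.Parameter(torch.zeros(num_heads))  # learned per-head decay rate

    def forward(self, hidden: torch.Tensor, created_at: torch.Tensor) -> torch.Tensor:
        # hidden: [seq, hidden_dim] token representations at creation time
        # created_at: [seq] generation step at which each token entered the cache
        base = torch.sigmoid(self.score_proj(hidden))          # [seq, heads], in (0, 1)
        age = (created_at.max() - created_at).float()          # older tokens decay longer
        rate = F.softplus(self.log_decay)                      # positive decay rate per head
        return base * torch.exp(-age[:, None] * rate)          # retention score per token and head
```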
## Why it works

Not all tokens contribute equally to future computation. Some carry key task information, such as instructions, facts, or core problem statements. Others are much less useful. TRIM-KV learns this distinction directly from token representations and keeps the most valuable tokens under a fixed memory budget.
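
The eviction step itself can then be as simple as keeping the highest-scoring entries once a head's cache exceeds its budget. The sketch below is again our own simplification: the post above describes evicting the single lowest-scoring token when the budget is reached, while here that is batched into one top-k selection for clarity:

```python
import torch


def evict_to_budget(keys, values, scores, budget: int):
    """Keep at most `budget` cached tokens per head, dropping the lowest retention
    scores. keys/values: [heads, seq, dim], scores: [heads, seq]. A simplified
    sketch of budget-bounded eviction, not the exact TRIM-KV rule."""
    heads, seq, dim = keys.shape
    if seq <= budget:
        return keys, values, scores
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # keep token order
    gather = keep.unsqueeze(-1).expand(-1, -1, dim)
    return keys.gather(1, gather), values.gather(1, gather), scores.gather(1, keep)


# Toy usage with random tensors (hypothetical sizes).
k, v = torch.randn(8, 1024, 128), torch.randn(8, 1024, 128)
s = torch.rand(8, 1024)
k, v, s = evict_to_budget(k, v, s, budget=256)
print(k.shape)  # torch.Size([8, 256, 128])
```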
Interestingly, the learned policy naturally recovers behaviors that resemble well-known heuristics such as sink tokens, sliding windows, and gist-like compression, without explicitly hard-coding them.

## Results

We evaluate TRIM-KV on a diverse set of long-context and long-generation benchmarks. Across these settings, TRIM-KV consistently outperforms strong KV eviction and learnable retrieval baselines, especially in low-memory regimes. In several cases, it even performs better than full-cache inference, suggesting that selective retention can also act as a form of regularization by filtering out noisy or uninformative tokens.

- Long Mathematical Reasoning (AIME24, GSM8K, MATH-500)

![TrimKV on Math Reasoning|scale=0.8](./assets/math.png)

- Long Procedural Generation (LongProc)

![TrimKV on Long Procedural Generation|scale=0.5](./assets/longproc.png)

- Long Memory Conversation (LongMemEval)

![TrimKV on LongMemEval|scale=0.8](./assets/longmemeval.png)

- Long-Context Understanding (SCBench, LongBench, LongBenchV2)

![TrimKV on Long-Context|scale=0.8](./assets/scbench.png)
![TrimKV on Long-Context|scale=0.8](./assets/longbench.png)
## Takeaway

TRIM-KV shows that efficient long-context inference is not just about compressing memory; it is about **learning what to remember**.

By predicting token importance at creation time, TRIM-KV turns KV cache eviction into a simple, trainable, and effective mechanism for scaling LLM inference under memory constraints.

---

## Citation

```bibtex
@article{bui2025cache,
  title={Cache what lasts: Token retention for memory-bounded kv cache in llms},
  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}
```

config/publications.ts

Lines changed: 12 additions & 1 deletion
@@ -20,6 +20,17 @@ export interface Publication {
 }
 
 export const publications: Publication[] = [
+  {
+    title: "Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs",
+    authors: "Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying",
+    venue: "ICLR 2026",
+    page: "trimkv",
+    code: "https://github.com/ngocbh/trimkv",
+    paper: "https://arxiv.org/abs/2512.03324",
+    abstract: "We propose TRIM-KV, a learnable KV cache eviction method for long-context and long-horizon LLM inference. Instead of relying on recent attention as a proxy for importance, TRIM-KV predicts each token’s intrinsic long-term utility at creation time using a lightweight retention gate whose score decays over time. Under a fixed memory budget, the model evicts tokens with the lowest retention scores, preserving the most useful context with negligible inference overhead.",
+    impact: "TRIM-KV reframes KV cache eviction as a trainable memory-retention problem rather than a hand-crafted heuristic. It consistently improves memory-bounded LLM inference across reasoning, procedural generation, conversational memory, and long-context understanding benchmarks, often outperforming strong eviction baselines and in some cases even full-cache inference, while also exposing interpretable token-retention patterns.",
+    tags: [Tag.GenerativeModel],
+  },
   {
     title: "HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data",
     authors: "Hiren Madhu, João Felipe Rocha, Tinglin Huang, Siddharth Viswanath, Smita Krishnaswamy, Rex Ying",
@@ -84,7 +95,7 @@ export const publications: Publication[] = [
     paper: "https://arxiv.org/abs/2504.05019",
     abstract: "We tackle the challenge of simulating diverse human behaviors using large language models (LLMs), which often struggle to reflect the variability across individuals and subpopulations. We introduce Mixture of Personas (MoP), a probabilistic prompting approach that models population diversity through a contextual mixture of persona-based language model agents.",
     impact: "Our work shows that probabilistic persona modeling offers a powerful mechanism for capturing population-level diversity in LLM simulations, opening up new possibilities for social science research and data augmentation.",
-    tags: [],
+    tags: [Tag.Applications],
   },
   {
     title: "Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning",
