Commit e06faa5
feat: chat-mode KV cache reuse — O(history^2) → O(new tokens) per turn
User-reported issue: chat mode gets exponentially slower as history
accumulates. Each turn re-prefills the entire conversation through
all transformer layers because both quant_generate (single-header)
and the HTTP server's tq_generate were freeing the KV state on every
call. Result: turn N's prefill cost was O(N * total_history_tokens),
which is O(N²) cumulative.
Fix: introduce tq_generate_continue / quant_chat that:
1. Keeps the KV state alive across calls (caller-managed)
2. Tracks the token IDs currently committed to the KV cache
3. On each call, computes the longest common prefix (LCP) between
the cached tokens and the new prompt, and only prefills the
diverging suffix [LCP, n_new)
4. Updates the cache record with the prompt + generated tokens
Three layers wired up:
1. quant.h (single-header / Python wheel)
- quant_ctx now stores cached_tokens / n_cached / cached_capacity
- new public quant_chat(ctx, prompt, cb, ud) — pass NULL prompt
to reset the session
- existing quant_generate unchanged for backwards compat
2. src/engine/tq_generate.c (library build)
- new tq_generate_continue(model, tok, state, prompt, config,
**cached, *n_cached, *cap, output, size)
- same prefix-match logic, mirrors the single-header impl
3. src/server/tq_server.c (HTTP server)
- tq_server now holds a persistent kv_state + cached_tokens
- both /v1/chat/completions paths (streaming + non-streaming)
call tq_generate_continue instead of tq_generate
- state freed on tq_server_free
4. bindings/python/quantcpp
- _binding.py: optional binding for quant_chat (gracefully
missing on older single-header builds)
- Model.chat(prompt) — generator with KV reuse, falls back to
generate() if symbol unavailable
- Model.reset_chat() — wipes the session
- cli.py: `quantcpp run` interactive loop now accumulates ChatML
history and uses Model.chat() for cheap re-sends
Measured (SmolLM2-135M, M1 Pro, single thread, 10 turns of accumulating
synthetic chat history, max_tokens=8/turn):
quant_generate (no reuse): 295 → 681 → 1105 → 1581 → 2105 → 2660
→ 3245 → 3926 → 4679 → 5386 ms
quant_chat (with reuse): 294 → 430 → 451 → 509 → 545 → 608
→ 693 → 750 → 796 → 902 ms
Turn 10 speedup: 5386 → 902 ms (5.97x)
Identical-prompt repeat (perfect LCP): 366 → 91/91/91/91 ms (4x)
Caveat: when assistant responses contain text that re-tokenizes
differently in the larger context (BPE merge non-roundtripping),
LCP truncates and the suffix re-prefills. Real-world chat clients
that replay the exact assistant response see >90% of the speedup.
Worst-case is still better than the no-reuse baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>1 parent 45f5d58 commit e06faa5
6 files changed
Lines changed: 566 additions & 7 deletions
File tree
- bindings/python/quantcpp
- src
- engine
- server
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
383 | 383 | | |
384 | 384 | | |
385 | 385 | | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
| 406 | + | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
| 410 | + | |
| 411 | + | |
| 412 | + | |
| 413 | + | |
| 414 | + | |
| 415 | + | |
| 416 | + | |
| 417 | + | |
| 418 | + | |
| 419 | + | |
| 420 | + | |
| 421 | + | |
| 422 | + | |
| 423 | + | |
| 424 | + | |
| 425 | + | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
| 439 | + | |
| 440 | + | |
| 441 | + | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
| 457 | + | |
| 458 | + | |
386 | 459 | | |
387 | 460 | | |
388 | 461 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
132 | 132 | | |
133 | 133 | | |
134 | 134 | | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
135 | 149 | | |
136 | 150 | | |
137 | 151 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
152 | 152 | | |
153 | 153 | | |
154 | 154 | | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
155 | 159 | | |
156 | 160 | | |
157 | 161 | | |
158 | 162 | | |
159 | 163 | | |
| 164 | + | |
160 | 165 | | |
161 | | - | |
| 166 | + | |
| 167 | + | |
162 | 168 | | |
| 169 | + | |
163 | 170 | | |
| 171 | + | |
164 | 172 | | |
165 | 173 | | |
166 | 174 | | |
| |||
0 commit comments