---
title: Llama Class
module_name: llama_cpp.llama
source_file: llama_cpp/llama.py
class_name: Llama
last_updated: 2026-05-06
version_target: "latest"
---
## Overview
The `Llama` class is the core, high-level Python wrapper for a `llama.cpp` model. It handles model loading, memory management (KV cache), tokenization, and generation (both base text completion and chat formatting). It includes advanced features like dynamic LoRA routing, dual-mode hybrid/recurrent checkpointing, speculative decoding, and context shifting.
## Constructor (`__init__`)
Initialize the model and context. Note that model loading will immediately alloc…

| Parameter | Type | Default | Description |
|---|---|---|---|
|`chat_format`|`str`|`None`| String specifying the chat template (e.g., `"llama-2"`, `"chatml"`). Guessed from GGUF if None. |
|`chat_handler`|`LlamaChatCompletionHandler`|`None`| Optional custom handler. See [[ChatHandlers]]. |
|`draft_model`|`LlamaDraftModel`|`None`| Optional draft model for speculative decoding. |
|`ctx_checkpoints`|`int`|`16`| Max hybrid/recurrent context checkpoints to keep. Set to `0` to disable checkpointing for single-turn fast paths. |
|`checkpoint_interval`|`int`|`4096`| Token interval for saving periodic Hybrid/Recurrent checkpoints during long prompt evaluation. |
|`checkpoint_on_device`|`bool`|`False`| Store Hybrid/Recurrent checkpoint tensor payloads in `llama_context`-owned device buffers via `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`. Reduces device-to-host copy overhead, but only one active checkpoint per `seq_id` is safe. |
*(Note: There are numerous additional RoPE/YaRN scaling parameters available for specialized context extension. Refer to the source code for the full list.)*
The `Llama` class allows you to load multiple LoRAs into VRAM and apply them dynamically.
5. **Hybrid & Recurrent Architectures**:
The class natively detects Hybrid/Recurrent models (for example LFM2VL/LFM2.5VL, Qwen3.5/3.6, Mamba, RWKV, or specialized SWA models such as Gemma3/4) and automatically enables the `HybridCheckpointCache`.
Unlike regular Transformer KV caches, Hybrid/Recurrent model memory cannot always be safely truncated token-by-token. The wrapper therefore saves periodic sequence-state checkpoints during long context prefill, allowing rollback to a verified prefix without corrupting recurrent state.
`HybridCheckpointCache` supports two checkpoint storage modes:
- **Host checkpoint mode** (`checkpoint_on_device=False`, default): checkpoint payloads are serialized into Python-owned bytes. This supports multiple historical checkpoints per `seq_id`, which is useful for multi-turn reuse and deeper rollback history.
- **Device checkpoint mode** (`checkpoint_on_device=True`): checkpoint tensor payloads are stored in `llama_context`-owned device buffers via `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`. Python only keeps the host-visible serialized portion. This reduces device-to-host tensor copy overhead, but only one active checkpoint per `seq_id` is safe because device payloads are keyed by `seq_id`.
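The retention difference between the two modes can be sketched with a toy cache. This is an illustrative model only, not the actual `HybridCheckpointCache` implementation; the class and method names here are hypothetical:

```python
from collections import deque

class ToyCheckpointCache:
    """Toy sketch of the two retention policies (NOT the real HybridCheckpointCache)."""

    def __init__(self, max_checkpoints=16, on_device=False):
        self.max_checkpoints = max_checkpoints
        self.on_device = on_device
        self.slots = {}  # seq_id -> deque of (n_tokens, payload)

    def save(self, seq_id, n_tokens, payload):
        if self.max_checkpoints == 0:
            return  # checkpointing disabled (single-turn fast path)
        q = self.slots.setdefault(seq_id, deque())
        if self.on_device:
            q.clear()  # device payloads are keyed by seq_id: only the newest survives
        q.append((n_tokens, payload))
        while len(q) > self.max_checkpoints:
            q.popleft()  # host mode: evict the oldest checkpoint once over the cap

    def rollback(self, seq_id, target_n_tokens):
        """Return the newest checkpoint at or before target_n_tokens, or None."""
        q = self.slots.get(seq_id, deque())
        candidates = [(n, p) for (n, p) in q if n <= target_n_tokens]
        return max(candidates, default=None)
```

With `on_device=True`, saving a new checkpoint evicts the previous one for that `seq_id`, so rollback can only ever reach the most recent save; host mode keeps a bounded history.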
*Tips*: If you are using a hybrid multimodal model for ComfyUI nodes or single-turn API wrappers where you do not need multi-turn state rollback, initialize your `Llama` instance with `ctx_checkpoints=0`:

```python
llm = Llama(
    model_path="./hybrid-model.gguf",  # placeholder path
    ctx_checkpoints=0  # Disable checkpoints for zero-latency single-turn fast paths
)
```
For long prompts on GPU-backed Hybrid/Recurrent models, you can enable device-backed checkpoints to reduce device-to-host copy overhead:

```python
llm = Llama(
    model_path="./Qwen3.6-27B.gguf",
    n_ctx=32768,
    n_gpu_layers=-1,
    ctx_checkpoints=16,
    checkpoint_interval=4096,
    checkpoint_on_device=True
)
```
Use `checkpoint_on_device=False` if you need multiple historical checkpoints for the same `seq_id`. Use `checkpoint_on_device=True` when fast rollback/checkpointing is more important than keeping many historical checkpoint payloads.
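As a rough sanity check on these settings, and assuming checkpoints are saved every `checkpoint_interval` tokens with the oldest evicted once `ctx_checkpoints` is exceeded (a sketch of the policy described above, not the library's exact scheduling), the surviving prefill checkpoint positions work out as:

```python
def retained_checkpoints(prompt_tokens, interval=4096, max_checkpoints=16):
    """Token positions of checkpoints kept after prefill, oldest evicted first."""
    positions = list(range(interval, prompt_tokens + 1, interval))
    return positions[-max_checkpoints:] if max_checkpoints > 0 else []

# A 32768-token prompt at the default interval of 4096 yields 8 checkpoints,
# comfortably under the default cap of 16.
```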
6. **Assistant Prefill**:
`llama-cpp-python` supports native **Assistant Prefill** for seamless message continuation. You can now simply use the `assistant_prefill=True` parameter in the `create_chat_completion` function.
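The message shape for a prefill request can be sketched as follows; the call itself is shown commented out since it requires a loaded model, and the exact continuation is model-dependent:

```python
# The chat ends with a partial assistant message that the model should extend.
messages = [
    {"role": "user", "content": "Write a haiku about autumn."},
    # Partial assistant turn: with assistant_prefill=True this is treated as a
    # prefix to continue, not a finished reply.
    {"role": "assistant", "content": "Golden leaves drift down"},
]

# With a loaded model (call shown for illustration):
# out = llm.create_chat_completion(messages=messages, assistant_prefill=True)
# The returned assistant content then continues the prefilled text.
```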