- fixes session title generation
- adds a `context_size` provider_opt for DMR instead of giving `max_tokens` double responsibility, to avoid confusion
- improves thinking-budget support and fixes `NoThinking()`
- improves how flags are sent to the DMR model/runtime configuration endpoint
- clarifies docs on sampling/runtime params
Signed-off-by: Christopher Petito <chrisjpetito@gmail.com>
File changed: `docs/providers/dmr/index.md` (94 additions, 12 deletions)
````diff
@@ -64,29 +64,111 @@ models:
     model: ai/qwen3
     max_tokens: 8192
     provider_opts:
-      runtime_flags: ["--ngl=33", "--top-p=0.9"]
+      runtime_flags: ["--threads", "8"]
 ```

 Runtime flags also accept a single string:

 ```yaml
 provider_opts:
-  runtime_flags: "--ngl=33 --top-p=0.9"
+  runtime_flags: "--threads 8"
 ```

````
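The list-or-string handling of `runtime_flags` described above can be sketched in Go (the language of the surrounding tooling). `normalizeRuntimeFlags` is a hypothetical helper for illustration, not part of docker-agent, and uses simple whitespace splitting for the string form.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRuntimeFlags accepts either accepted shape of runtime_flags —
// a list of strings or a single space-separated string — and returns a
// flat []string.
func normalizeRuntimeFlags(v any) ([]string, error) {
	switch t := v.(type) {
	case []string:
		return t, nil
	case []any: // what a YAML decoder typically yields for a list
		out := make([]string, 0, len(t))
		for _, e := range t {
			s, ok := e.(string)
			if !ok {
				return nil, fmt.Errorf("runtime_flags: non-string element %v", e)
			}
			out = append(out, s)
		}
		return out, nil
	case string:
		return strings.Fields(t), nil
	default:
		return nil, fmt.Errorf("runtime_flags: unsupported type %T", v)
	}
}

func main() {
	a, _ := normalizeRuntimeFlags([]any{"--threads", "8"})
	b, _ := normalizeRuntimeFlags("--threads 8")
	fmt.Println(a, b) // both normalize to [--threads 8]
}
```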
```diff
-## Parameter Mapping
+Use only flags your Model Runner backend allows (see `docker model configure --help` and backend docs). **Do not** put sampling parameters (`temperature`, `top_p`, penalties) in `runtime_flags` — set them on the model (`temperature`, `top_p`, etc.); they are sent **per request** via the OpenAI-compatible chat API.

-docker-agent model config fields map to llama.cpp flags automatically:
+## Context size

```
```diff
-| Config              | llama.cpp Flag        |
-| ------------------- | --------------------- |
-| `temperature`       | `--temp`              |
-| `top_p`             | `--top-p`             |
-| `frequency_penalty` | `--frequency-penalty` |
-| `presence_penalty`  | `--presence-penalty`  |
-| `max_tokens`        | `--context-size`      |
+`max_tokens` controls the **maximum output tokens** per chat completion request. To set the engine's **total context window**, use `provider_opts.context_size`:

-`runtime_flags` always take priority over derived flags on conflict.
```
````diff
+```yaml
+models:
+  local:
+    provider: dmr
+    model: ai/qwen3
+    max_tokens: 4096        # max output tokens (per-request)
+    provider_opts:
+      context_size: 32768   # total context window (sent via _configure)
+```
+
+If `context_size` is omitted, Model Runner uses its default. `max_tokens` is **not** used as the context window.
+
````
````diff
+## Thinking / reasoning budget
+
+When using the **llama.cpp** backend, `thinking_budget` is sent as structured `llamacpp.reasoning-budget` on `_configure` (maps to `--reasoning-budget`). String efforts use the same token mapping as other providers; `adaptive` maps to unlimited (`-1`).
+
+When using the **vLLM** backend, `thinking_budget` is sent as `thinking_token_budget` in each chat completion request. Effort levels map to token counts using the same scale as other providers; `adaptive` maps to unlimited (`-1`).
+
+```yaml
+models:
+  local:
+    provider: dmr
+    model: ai/qwen3
+    thinking_budget: medium   # llama.cpp: reasoning-budget=8192; vLLM: thinking_token_budget=8192
+```
+
+On **MLX** and **SGLang** backends, `thinking_budget` is silently ignored — those engines do not currently expose a per-request reasoning token budget knob.
+
````
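The effort-to-token mapping can be sketched as below. Only `medium` (8192) and `adaptive` (-1) are stated in the diff; the `low` and `high` values are placeholders, since the real numbers come from whatever "the same token mapping as other providers" resolves to.

```go
package main

import "fmt"

// effortToBudget maps a string thinking_budget effort to a token budget.
// medium=8192 and adaptive=-1 match the docs; low/high are placeholders.
func effortToBudget(effort string) (int, error) {
	switch effort {
	case "low":
		return 2048, nil // placeholder value, not from the docs
	case "medium":
		return 8192, nil
	case "high":
		return 24576, nil // placeholder value, not from the docs
	case "adaptive":
		return -1, nil // unlimited
	default:
		return 0, fmt.Errorf("unknown thinking_budget effort %q", effort)
	}
}

func main() {
	b, _ := effortToBudget("medium")
	// Sent as llamacpp.reasoning-budget (llama.cpp) or
	// thinking_token_budget (vLLM).
	fmt.Println(b) // 8192
}
```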
````diff
+## vLLM-specific configuration
+
+When running a model on the **vLLM** backend, additional engine-level settings can be passed via `provider_opts` and are forwarded to model-runner's `_configure` endpoint:
+
+- `gpu_memory_utilization` — fraction of GPU memory (0.0–1.0) vLLM may use. Values outside this range are rejected.
+- `hf_overrides` — map of Hugging Face config overrides applied when vLLM loads the model.
+
+```yaml
+models:
+  vllm-local:
+    provider: dmr
+    model: ai/some-model-safetensors
+    provider_opts:
+      gpu_memory_utilization: 0.9
+      hf_overrides:
+        max_model_len: 8192
+        dtype: bfloat16
+```
+
+`hf_overrides` keys (including nested ones) must match `^[a-zA-Z_][a-zA-Z0-9_]*$` — the same rule model-runner enforces server-side to block injection via flags. Invalid keys are rejected at client creation time so you fail fast instead of after a round-trip.
+
+These options are ignored on non-vLLM backends.
+
````
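The two client-side checks described above, the 0.0–1.0 range for `gpu_memory_utilization` and the key pattern for `hf_overrides` (including nested maps), can be sketched like this. The helper names are hypothetical; only the regex is quoted from the docs.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyPattern is the rule quoted in the docs for hf_overrides keys.
var keyPattern = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validateGPUMemUtil rejects fractions outside [0.0, 1.0].
func validateGPUMemUtil(f float64) error {
	if f < 0.0 || f > 1.0 {
		return fmt.Errorf("gpu_memory_utilization: %v outside [0.0, 1.0]", f)
	}
	return nil
}

// validateHFOverrides walks an hf_overrides map, recursing into nested
// maps, and rejects any key that does not match keyPattern.
func validateHFOverrides(m map[string]any) error {
	for k, v := range m {
		if !keyPattern.MatchString(k) {
			return fmt.Errorf("hf_overrides: invalid key %q", k)
		}
		if nested, ok := v.(map[string]any); ok {
			if err := validateHFOverrides(nested); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	ok := map[string]any{"max_model_len": 8192, "dtype": "bfloat16"}
	bad := map[string]any{"--injected-flag": true}
	fmt.Println(validateHFOverrides(ok) == nil)  // true
	fmt.Println(validateHFOverrides(bad) == nil) // false
}
```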
````diff
+## Keeping models resident in memory (`keep_alive`)
+
+By default model-runner unloads idle models after a few minutes. Override the idle timeout via `provider_opts.keep_alive`:
+
+```yaml
+models:
+  sticky:
+    provider: dmr
+    model: ai/qwen3
+    provider_opts:
+      keep_alive: "30m"   # duration string
+      # keep_alive: "0"   # unload immediately after each request
+      # keep_alive: "-1"  # keep loaded forever
+```
+
+Accepted values: any Go duration string (`"30s"`, `"5m"`, `"1h"`, `"2h30m"`), `"0"` (immediate unload), or `"-1"` (never unload). Invalid values are rejected before the configure request is sent.
+
````
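Since the accepted values are Go duration strings plus two literals, a client-side check can lean on `time.ParseDuration`, which already accepts a unitless `"0"`; only `"-1"` needs a special case. `validateKeepAlive` is a hypothetical helper sketching this, not docker-agent's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// validateKeepAlive mirrors the documented values: any Go duration
// string, "0" (immediate unload), or "-1" (never unload).
func validateKeepAlive(s string) error {
	if s == "-1" { // special case: keep loaded forever
		return nil
	}
	// time.ParseDuration accepts "0" without a unit, plus "30s",
	// "5m", "2h30m", and so on.
	if _, err := time.ParseDuration(s); err != nil {
		return fmt.Errorf("keep_alive: %w", err)
	}
	return nil
}

func main() {
	for _, v := range []string{"30m", "2h30m", "0", "-1", "forever"} {
		fmt.Printf("%q valid: %v\n", v, validateKeepAlive(v) == nil)
	}
}
```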
````diff
+## Operating mode (`mode`)
+
+Model-runner normally infers the backend mode from the request path. You can pin it explicitly via `provider_opts.mode`:
+
+```yaml
+provider_opts:
+  mode: embedding   # one of: completion, embedding, reranking, image-generation
+```
+
+Most agents don't need this — leave it unset unless you know you need it.
+
````
````diff
+## Raw runtime flags (`raw_runtime_flags`)
+
+`runtime_flags` (a list) is the preferred way to pass flags. If you have a pre-built command-line string you'd rather ship verbatim, use `raw_runtime_flags` instead:
+
+```yaml
+provider_opts:
+  raw_runtime_flags: "--threads 8 --batch-size 512"
+```
+
+Model-runner parses the string with shell-style word splitting. `runtime_flags` and `raw_runtime_flags` are mutually exclusive — setting both is an error.
````
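Shell-style word splitting can be approximated with a small quote-aware splitter like the one below. This is illustrative only: it honors single and double quotes but not escapes or expansion, and model-runner's actual parser may handle more (or fewer) cases.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// splitWords performs simplified shell-style word splitting: whitespace
// separates words, and single or double quotes group a span (including
// spaces) into one word.
func splitWords(s string) ([]string, error) {
	var words []string
	var cur strings.Builder
	var quote rune // 0 when not inside quotes
	inWord := false
	for _, r := range s {
		switch {
		case quote != 0: // inside quotes: copy until the closing quote
			if r == quote {
				quote = 0
			} else {
				cur.WriteRune(r)
			}
		case r == '\'' || r == '"':
			quote = r
			inWord = true
		case unicode.IsSpace(r): // word boundary
			if inWord {
				words = append(words, cur.String())
				cur.Reset()
				inWord = false
			}
		default:
			cur.WriteRune(r)
			inWord = true
		}
	}
	if quote != 0 {
		return nil, errors.New("unterminated quote")
	}
	if inWord {
		words = append(words, cur.String())
	}
	return words, nil
}

func main() {
	w, _ := splitWords(`--threads 8 --batch-size 512`)
	fmt.Println(w) // [--threads 8 --batch-size 512]
}
```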
0 commit comments