Improve accelerate() for large MoE models (Qwen3.5-35B-A3B)
- Skip cache for device_map='auto': HF's shard-by-shard loading is faster
  than our load-to-CPU-then-dispatch path (10s vs 25s); see the gating sketch below
- Add suffix-based tensor matching for MoE models where state_dict and
  safetensors use different key prefixes (684/693 tensors matched vs 1/693 before); see the matching sketch below
- Add match-ratio check: skip the cache save when fewer than 50% of tensors match (covered in the matching sketch)
- Fix meta tensor crash with a to_empty() fallback (sketch below)
- Add parallel shard loading via ThreadPoolExecutor (sketch below)
- Update benchmark script with proper HF cache cleanup (sketch below)
- Update README with Qwen3.5-35B-A3B benchmark results
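The device_map='auto' gate is, in outline, the following. This is a sketch only, not zerostart's actual code; `original_loader` and `load_via_cache` are hypothetical stand-ins for its internals:

```python
# Sketch: the loader callables are hypothetical stand-ins for zerostart's
# internals, passed in so the example is self-contained.
def patched_from_pretrained(original_loader, load_via_cache, model_id, **kwargs):
    kwargs.setdefault("low_cpu_mem_usage", True)  # skip random weight init
    if kwargs.get("device_map") == "auto":
        # Model is larger than VRAM: HF streams each shard straight to its
        # assigned device, so a load-to-CPU-then-dispatch detour only adds
        # copies (25s vs 10s in the benchmark). Bypass the mmap cache.
        return original_loader(model_id, **kwargs)
    return load_via_cache(original_loader, model_id, **kwargs)
```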
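A minimal sketch of the suffix matching plus the match-ratio guard, assuming both sides keep the same dot-delimited key tails even when their prefixes differ (the function name and list-based inputs are illustrative):

```python
def match_by_suffix(sd_keys, st_keys):
    """Map each state_dict key to the safetensors key that shares its
    longest unambiguous dot-delimited suffix."""
    # Index every dot-delimited suffix of every safetensors key.
    by_suffix = {}
    for key in st_keys:
        parts = key.split(".")
        for i in range(len(parts)):
            by_suffix.setdefault(".".join(parts[i:]), []).append(key)
    matches = {}
    for key in sd_keys:
        parts = key.split(".")
        for i in range(len(parts)):  # i=0 is the whole key, so longest first
            candidates = by_suffix.get(".".join(parts[i:]), [])
            if len(candidates) == 1:  # accept only unambiguous matches
                matches[key] = candidates[0]
                break
    return matches

# Prefixes differ ("model." vs none) but the tails line up:
sd = ["model.layers.0.mlp.experts.0.gate_proj.weight"]
st = ["layers.0.mlp.experts.0.gate_proj.weight"]
matches = match_by_suffix(sd, st)

# Match-ratio guard: at 1/693 matched, a saved cache would be mostly
# useless, so skip saving; 684/693 comfortably clears the 50% bar.
if len(matches) < 0.5 * len(sd):
    print("skipping cache save: match ratio too low")
```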
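The meta-tensor fallback plausibly has this shape (a sketch, not the verbatim patch). Parameters still on PyTorch's meta device hold no storage, so `.to()` raises NotImplementedError and `Module.to_empty()` must allocate real storage instead:

```python
import torch

def move_to_device(model: torch.nn.Module, device) -> torch.nn.Module:
    try:
        return model.to(device)
    except NotImplementedError:
        # "Cannot copy out of meta tensor; please use Tensor.to_empty()":
        # allocate uninitialized storage on the target device, then let the
        # caller load the real weights into it from the cache.
        return model.to_empty(device=device)
```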
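Parallel shard loading is presumably along these lines (a sketch; the glob pattern and worker count are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from safetensors.torch import load_file

def load_all_shards(model_dir, max_workers=8):
    shard_paths = sorted(Path(model_dir).glob("*.safetensors"))
    state_dict = {}
    # load_file is I/O-bound, so threads overlap shard reads despite the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for shard in pool.map(load_file, shard_paths):
            state_dict.update(shard)
    return state_dict
```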
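For the benchmark's cache cleanup, huggingface_hub's cache-scanning API can evict a model between cold-start runs; a sketch, assuming the script clears every cached revision of the repo:

```python
from huggingface_hub import scan_cache_dir

def evict_from_hf_cache(repo_id):
    """Delete every cached revision of repo_id so the next load is cold."""
    cache_info = scan_cache_dir()
    revisions = [
        rev.commit_hash
        for repo in cache_info.repos if repo.repo_id == repo_id
        for rev in repo.revisions
    ]
    if revisions:
        cache_info.delete_revisions(*revisions).execute()

# e.g. evict_from_hf_cache("Qwen/Qwen2.5-7B") between benchmark runs
```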
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md: 15 additions, 3 deletions
@@ -124,7 +124,7 @@ import torch
## Model Loading Acceleration
-`zerostart.accelerate()` patches `from_pretrained` to speed up model loading by skipping unnecessary work (random weight init, repeated downloads). Add one line:
+`zerostart.accelerate()` patches `from_pretrained` to speed up model loading. It sets `low_cpu_mem_usage=True` by default (skipping random weight initialization) and auto-caches models that fit in GPU memory for faster repeat loads.
```python
import zerostart
@@ -134,6 +134,14 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="cuda")
```
All measured on RTX A6000 (48GB). For models requiring `device_map='auto'` (model > VRAM), `accelerate()` matches baseline by eliminating random weight initialization. For models that fit entirely in GPU memory, the mmap cache provides additional speedup.
| Suffix tensor matching | Handles MoE models where state_dict and safetensors use different key prefixes |
| Network volume fix | Eager read instead of mmap on NFS/JuiceFS (cold reads only*) |
| .bin conversion | Converts legacy checkpoints to safetensors, mmaps on repeat |
*Network volume fix only helps on cold reads from network-backed filesystems where mmap page faults trigger network round-trips. On FUSE with warm page cache (most container providers), mmap is already fast.
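A minimal sketch of what such an eager-read path can look like (the network-volume check is an assumption; `safetensors.torch.load` parses in-memory bytes, while `load_file` mmaps):

```python
import safetensors.torch

def load_checkpoint(path, on_network_volume):
    if on_network_volume:
        # One sequential read pulls the whole file over the network up front,
        # instead of an mmap whose cold page faults each cost a round-trip.
        with open(path, "rb") as f:
            data = f.read()
        return safetensors.torch.load(data)
    return safetensors.torch.load_file(path)  # local disk: mmap is fine
```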
+For `device_map='auto'` (model larger than VRAM), caching is skipped: HF's shard-by-shard loading directly to the right device is faster than our load-to-CPU-then-dispatch path.