Parallel streaming wheel extraction for installing large Python packages on remote GPUs.

```bash
zerostart run serve.py
```

Auto-detects dependencies from PEP 723 inline metadata, `pyproject.toml`, or `requirements.txt`. Works on any container GPU provider — RunPod, Vast.ai, Lambda, etc.

## Benchmarks

### Cold Start (first run, empty cache)

Cold start speedup depends on pod network bandwidth. zerostart opens multiple parallel HTTP connections per wheel — this helps when a single connection can't saturate the link, but doesn't help when one connection already maxes out the pipe.

| Pod network | Workload | zerostart | uv | Speedup |
|-------------|----------|-----------|-----|---------|
| Moderate (~200 Mbps) | torch (6.8 GB) | 33s | 98s | 3x |
| Moderate (~200 Mbps) | triton (638 MB) | 3.4s | 1.0s | uv faster |
| Fast (~1 Gbps) | diffusers+torch (7 GB) | 57s | 57s | ~1x |

On bandwidth-constrained pods (common with cheaper providers), parallel Range requests download large wheels 3x faster. On fast-network pods, a single connection already saturates the link and both tools finish in about the same time. For small packages, zerostart's startup overhead makes uv faster — just use uv directly.

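The Range-request idea can be sketched in a few lines of Python. This is a standalone illustration, not zerostart's actual implementation; the part count and the reassembly strategy are assumptions:

```python
import concurrent.futures
import urllib.request


def split_ranges(size: int, n_parts: int) -> list[tuple[int, int]]:
    """Split `size` bytes into n_parts inclusive (start, end) byte ranges."""
    chunk = size // n_parts
    return [
        (i * chunk, size - 1 if i == n_parts - 1 else (i + 1) * chunk - 1)
        for i in range(n_parts)
    ]


def parallel_download(url: str, n_parts: int = 8) -> bytes:
    """Fetch one file over n_parts concurrent HTTP Range requests."""
    # Ask the server for the total size without downloading the body.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])

    def fetch(rng: tuple[int, int]) -> tuple[int, bytes]:
        start, end = rng
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return start, resp.read()

    # Each worker pulls its own byte range; slices reassemble the file in place.
    buf = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
        for start, data in pool.map(fetch, split_ranges(size, n_parts)):
            buf[start:start + len(data)] = data
    return bytes(buf)
```

This only wins when the server supports Range requests (PyPI's CDN does) and a single connection can't fill the pipe, which matches the benchmark pattern above.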
### Warm Start (cached environment)

Warm starts are where zerostart consistently wins regardless of network speed. uv re-resolves dependencies and rebuilds the environment on every invocation. zerostart checks a cache marker and exec's Python directly.

| Workload | zerostart | uv | Speedup |
|----------|-----------|-----|---------|
| torch | 1.8s | 13.2s | 7x |
| vllm | 3.3s | 14.5s | 4x |
| triton | 0.2s | 1.0s | 5x |

All measured on RunPod (RTX 4090 / A6000).

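The marker-then-exec warm path can be sketched like this. It is a simplified illustration, not zerostart's real cache layout; the cache directory, marker filename, `env_key` scheme, and `build_environment` helper are all hypothetical:

```python
import hashlib
import os
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "zerostart-demo"  # hypothetical location


def env_key(deps: list[str]) -> str:
    """Stable hash of the dependency set identifies a cached environment."""
    return hashlib.sha256("\n".join(sorted(deps)).encode()).hexdigest()[:16]


def run(deps: list[str], argv: list[str]) -> None:
    env_dir = CACHE_ROOT / env_key(deps)
    python = env_dir / "bin" / "python"
    marker = env_dir / ".ready"  # written only after a successful build
    if not marker.exists():
        # Cold path: resolve and install, then stamp the environment.
        build_environment(env_dir, deps)  # hypothetical builder
        marker.touch()
    # Warm path falls straight through: no resolution, no env setup.
    # Replace this process with the cached interpreter.
    os.execv(str(python), [str(python), *argv])
```

The key point is the warm path's cost: one `stat` on the marker plus an `execv`, which is why warm starts land in the hundreds of milliseconds rather than the seconds uv spends re-resolving.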
## How It Works

Requires Linux + Python 3.10+ + `uv` (pre-installed on most GPU containers).

## Quick Start

```bash
# Auto-detect deps from PEP 723 metadata, pyproject.toml, or requirements.txt
zerostart run serve.py

# Add extra packages on top of auto-detected deps
zerostart run -p torch serve.py

# Explicit requirements file
zerostart run -r requirements.txt serve.py

# Run a package directly
zerostart run torch -- -c "import torch; print(torch.cuda.is_available())"

# Pass args to your script
zerostart run serve.py -- --port 8000
```

### Dependency Detection

zerostart automatically finds dependencies — no flags needed:

1. **PEP 723 inline metadata** (checked first):
```python
# /// script
# dependencies = ["torch>=2.0", "transformers", "safetensors"]
# ///
import torch
```

2. **pyproject.toml** `[project.dependencies]` in the script's directory or parents

3. **requirements.txt** in the script's directory or parents

`-p` and `-r` flags add packages on top of whatever is auto-detected.

## Model Loading Acceleration

## When to Use It

**Good fit:**
- Repeated runs on the same pod — warm starts are 4-7x faster than uv
- Large GPU packages on bandwidth-constrained pods — parallel downloads help when a single connection is slow
- Spot instances, CI/CD, autoscaling where you restart often and warm cache pays off

**Not worth it:**
- One-off cold starts on fast-network pods — uv is just as fast
- Small packages — uv is faster, zerostart adds startup overhead
- Local NVMe with models in page cache

## Requirements