# autoresearch
Autonomous ML research agent that iteratively improves a GPT pretraining script to minimize validation bits-per-byte (val_bpb). Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).
The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses a single H200 GPU with **Qwen3-14B** (unquantized bf16, ~28GB): the model is small enough to share the GPU with training, and sampling at 0.7 temperature maximizes exploration throughput.
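As a rough sketch, loading the agent model this way with vLLM's offline Python API might look like the following; the model name is the public HF repo, but the exact flags and memory cap live in `algo.py` and may differ:

```python
from vllm import LLM, SamplingParams

# Illustrative settings only; see algo.py for the real configuration.
llm = LLM(
    model="Qwen/Qwen3-14B",        # unquantized bf16 weights, ~28GB
    dtype="bfloat16",
    gpu_memory_utilization=0.25,   # cap vLLM's VRAM so training fits on the same GPU
)

# Higher temperature trades a higher crash rate for bolder code mutations.
sampling = SamplingParams(temperature=0.7, max_tokens=8192)
```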
## From Karpathy's Experiment to Ocean
Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive the agent loop. We adapted it to run fully self-contained on Ocean Network:
1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3-14B** served via **vLLM** (unquantized bf16, ~28GB). No API keys, no per-token costs.
2. **Single GPU** — vLLM takes 25% of H200 VRAM (~35GB) for the agent LLM, and training uses the rest. Previous variants with Qwen3-32B-AWQ and Qwen3.5-27B are preserved as `algo_qwen3-32B.py` and `algo_qwen3.5-27B.py`.
3. **Single Docker container** — Everything packaged in one container: PyTorch, vLLM, Flash Attention 3, data pipeline. Ocean runs it on remote GPU nodes via a symlink to `/app/data/transformations/algorithm`.
4. **Self-bootstrapping data** — `prepare.py` downloads HuggingFace data shards and trains a BPE tokenizer at container startup, so nothing needs to persist between runs (see the sketch below).
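A minimal version of that bootstrap could be written with the HuggingFace `datasets` and `tokenizers` libraries roughly as below; the dataset name, sample count, and vocab size are placeholders, not what `prepare.py` actually uses:

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stream shards at container startup so nothing persists between runs.
# Dataset name and sizes are placeholders, not prepare.py's actual choices.
shards = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

def sample_texts(n=50_000):
    for i, row in enumerate(shards):
        if i >= n:
            break
        yield row["text"]

# Train a byte-level BPE tokenizer from scratch.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32_768, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(sample_texts(), trainer=trainer)
tokenizer.save("tokenizer.json")
```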
A few clicks give you an autonomous ML researcher that runs for hours on H200 GPUs.
## How It Works
1. **Data prep** — Downloads HuggingFace data shards, trains a BPE tokenizer (`prepare.py`)
2. **Load agent LLM** — Qwen3-14B via vLLM (~28GB VRAM, stays resident, shares GPU with training)
3. **Baseline run** — Runs the original `train.py` (5-min training budget), records val_bpb
4. **Agent loop** (up to 200 iterations):
- LLM reads experiment history + current best `train.py`
The user extracts `results["best"]["train_py"]` to get the winning code.
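Stripped of prompting and error-handling detail, the loop has roughly this shape; `build_prompt` and `run_training` are hypothetical stand-ins for the real helpers in `algo.py`, and `llm`/`sampling` reuse the vLLM objects sketched earlier:

```python
# Skeleton of the agent loop (hypothetical helper names, not algo.py verbatim).
best = {"val_bpb": baseline_bpb, "train_py": open("train.py").read()}  # baseline_bpb from step 3
history = []

for step in range(200):                           # up to 200 iterations
    prompt = build_prompt(history, best["train_py"])
    mutated = llm.generate([prompt], sampling)[0].outputs[0].text
    try:
        val_bpb = run_training(mutated)           # 5-minute training budget
    except Exception as err:
        history.append({"step": step, "error": str(err)})
        continue                                  # crashes are logged, not fatal
    history.append({"step": step, "val_bpb": val_bpb})
    if val_bpb < best["val_bpb"]:                 # lower bits-per-byte wins
        best = {"val_bpb": val_bpb, "train_py": mutated}

results = {"best": best, "history": history}
```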
| File | Description |
|------|-------------|
| `algo.py` | Core agent loop — Qwen3-14B, single H200, 0.7 temp, LLM and training share GPU |
| `algo_qwen3-32B.py` | Single-GPU variant using Qwen3-32B-AWQ |
| `algo_qwen3.5-27B.py` | 2×H200 variant using Qwen3.5-27B (one GPU per role) |
| `train.py` | GPT pretraining script (the file the agent modifies) |
| `prepare.py` | Data download, tokenizer, dataloader, evaluation (read-only) |
| `program.md` | Instructions for the agent LLM |
## Usage
1. Go to [dashboard.oncompute.ai](https://dashboard.oncompute.ai/)
2. Select a **single H200 GPU** environment (or 2×H200 with `algo_qwen3.5-27B.py`)
3. Configure the job and add payment
4. Open the **Ocean Orchestrator** in VS Code / your editor
5. Open this directory in the orchestrator and run the job — the container builds and executes `algo.py` autonomously
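When the job finishes, the winning script can be pulled out of the results, for example like this (assuming the run dumps `results` to JSON; the filename below is a placeholder):

```python
import json

# Placeholder path; check the job's output directory for the real artifact.
with open("results.json") as f:
    results = json.load(f)

# Write out the best-scoring train.py found during the run.
with open("train_best.py", "w") as f:
    f.write(results["best"]["train_py"])

print(f"best val_bpb: {results['best']['val_bpb']:.4f}")
```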
- Key improvements: model depth (more layers), concentrated in the second half of the run
### Takeaway
Lower temperature (0.5 vs 0.7) reduces the crash rate (62-74% vs 86%) but produces significantly worse results. The more "creative" 0.7 temperature generates more broken code, but the successful mutations are bolder and lead to real architectural improvements (e.g. deeper models). At 0.5 temp the agent plays it safe, converges early to ~1.007 val_bpb, and stalls — even with 12 hours of compute it can't match what 0.7 temp achieved in 5.5 hours.
Switching from the quantized Qwen3-32B-AWQ (single GPU) to the full Qwen3.5-27B (2×H200) didn't help — the larger model ran fewer experiments in the same time (77 vs 201), had a lower crash rate (51% vs 86%), but couldn't beat the 0.9818 val_bpb that Qwen3-32B at 0.7 temp reached. The reduced throughput likely offset any quality gains from the stronger model.