
Commit 09aa5c8

feat: add qwen3.5-27B results | switch to smaller model for final try
1 parent 0cb0f28 commit 09aa5c8

5 files changed

Lines changed: 1384 additions & 29 deletions


autoresearch/README.md

Lines changed: 20 additions & 8 deletions
@@ -2,14 +2,14 @@
 
 Autonomous ML research agent that iteratively improves a GPT pretraining script to minimize validation bits-per-byte (val_bpb). Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).
 
-The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses 2×H200 GPUs: one dedicated to the agent LLM, the other to training.
+The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses a single H200 GPU with **Qwen3-14B** (unquantized bf16, ~28GB) — small enough to share the GPU with training at 0.7 temperature for maximum exploration throughput.
 
 ## From Karpathy's Experiment to Ocean
 
 Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive the agent loop. We adapted it to run fully self-contained on Ocean Network:
 
-1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3.5-27B** served via **vLLM** (unquantized bf16, ~54GB). No API keys, no per-token costs.
-2. **Dedicated GPUs** — GPU 0 runs the agent LLM, GPU 1 runs training. Each gets the full 141GB — no memory-sharing complexity. (A single-GPU variant using Qwen3-32B-AWQ is available as `algo_qwen3-32B.py`.)
+1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3-14B** served via **vLLM** (unquantized bf16, ~28GB). No API keys, no per-token costs.
+2. **Single GPU** — vLLM takes 25% of H200 VRAM (~35GB) for the agent LLM, training uses the rest. Previous variants with Qwen3-32B-AWQ and Qwen3.5-27B are preserved as `algo_qwen3-32B.py` and `algo_qwen3.5-27B.py`.
 3. **Single Docker container** — Everything packaged in one container: PyTorch, vLLM, Flash Attention 3, data pipeline. Ocean runs it on remote GPU nodes via a symlink to `/app/data/transformations/algorithm`.
 4. **Self-bootstrapping data** — `prepare.py` downloads HuggingFace data shards and trains a BPE tokenizer at container startup, so nothing needs to persist between runs.

@@ -20,7 +20,7 @@ A few clicks give you an autonomous ML researcher that runs for hours on H200 GP
 ## How It Works
 
 1. **Data prep** — Downloads HuggingFace data shards, trains a BPE tokenizer (`prepare.py`)
-2. **Load agent LLM** — Qwen3.5-27B via vLLM on GPU 0 (~54GB VRAM, stays resident)
+2. **Load agent LLM** — Qwen3-14B via vLLM (~28GB VRAM, stays resident, shares GPU with training)
 3. **Baseline run** — Runs the original `train.py` on GPU 1 (5-min training budget), records val_bpb
 4. **Agent loop** (up to 200 iterations):
    - LLM reads experiment history + current best `train.py`
@@ -35,8 +35,9 @@ The user extracts `results["best"]["train_py"]` to get the winning code.
 
 | File | Description |
 |------|-------------|
-| `algo.py` | Core agent loop — orchestrates LLM inference (GPU 0) and training (GPU 1) |
-| `algo_qwen3-32B.py` | Previous single-GPU variant using Qwen3-32B-AWQ |
+| `algo.py` | Core agent loop — Qwen3-14B, single H200, 0.7 temp, LLM and training share GPU |
+| `algo_qwen3-32B.py` | Single-GPU variant using Qwen3-32B-AWQ |
+| `algo_qwen3.5-27B.py` | 2×H200 variant using Qwen3.5-27B (one GPU per role) |
 | `train.py` | GPT pretraining script (the file the agent modifies) |
 | `prepare.py` | Data download, tokenizer, dataloader, evaluation (read-only) |
 | `program.md` | Instructions for the agent LLM |
@@ -46,7 +47,7 @@ The user extracts `results["best"]["train_py"]` to get the winning code.
 ## Usage
 
 1. Go to [dashboard.oncompute.ai](https://dashboard.oncompute.ai/)
-2. Select a **H200 GPU** environment (or single H200 with `algo_qwen3-32B.py`)
+2. Select a **single H200 GPU** environment (or H200 with `algo_qwen3.5-27B.py`)
 3. Configure the job and add payment
 4. Open the **Ocean Orchestrator** in VS Code / your editor
 5. Open this directory in the orchestrator and run the job — the container builds and executes `algo.py` autonomously
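
A minimal sketch of the extraction step mentioned in the hunk context ("The user extracts `results["best"]["train_py"]` to get the winning code"); the results file name and the output path are assumptions:

```python
import json

# Load the results file written by algo.py (file name assumed here).
with open("results.json") as f:
    results = json.load(f)

# results["best"]["train_py"] holds the full source of the best-scoring train.py.
with open("best_train.py", "w") as f:
    f.write(results["best"]["train_py"])
```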
@@ -59,7 +60,7 @@ python plot_progress.py path/to/results.json progress.png
 
 ## Results
 
-All results below are from the single-GPU setup (Qwen3-32B-AWQ on one H200). Results with the 2×H200 / Qwen3.5-27B setup are pending.
+The first three runs used the single-GPU setup (Qwen3-32B-AWQ on one H200). The last run used the 2×H200 setup with Qwen3.5-27B.
 
 ### Qwen3-32B-AWQ — 0.7 Temperature (First Run)

@@ -88,6 +89,17 @@ All results below are from the single-GPU setup (Qwen3-32B-AWQ on one H200). Res
 - **201 iterations** over 12 hours, 52 successful runs (74% crash rate)
 - Double the runtime of the first run but worse results — the agent got stuck and couldn't escape the local minimum
 
+### Qwen3.5-27B — 2×H200, ~12 Hours
+
+![Qwen3.5-27B progress](assets/images/qwen3.5-27B_progress.png)
+
+- **Baseline**: 1.0251 val_bpb
+- **Best**: 0.9993 val_bpb (2.5% improvement)
+- **77 iterations** over ~12 hours, 38 successful runs (51% crash rate)
+- Key improvements: model depth (more layers), concentrated in the second half of the run
+
 ### Takeaway
 
 Lower temperature (0.5 vs 0.7) reduces the crash rate (62-74% vs 86%) but produces significantly worse results. The more "creative" 0.7 temperature generates more broken code, but the successful mutations are bolder and lead to real architectural improvements (e.g. deeper models). At 0.5 temp the agent plays it safe, converges early to ~1.007 val_bpb, and stalls — even with 12 hours of compute it can't match what 0.7 temp achieved in 5.5 hours.
+
+Switching from the quantized Qwen3-32B-AWQ (single GPU) to the full Qwen3.5-27B (2×H200) didn't help — the larger model ran fewer experiments in the same time (77 vs 201), had a lower crash rate (51% vs 86%), but couldn't beat the 0.9818 val_bpb that Qwen3-32B at 0.7 temp reached. The reduced throughput likely offset any quality gains from the stronger model.
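
For orientation, a hypothetical sketch of the agent loop the README describes above. The helpers `build_prompt` and `extract_train_py` are placeholder names and the shape of the `run_training` result is assumed; only `run_training` itself and the configuration constants appear in the `algo.py` diff below:

```python
# Pseudocode-level sketch of the loop in algo.py; helper names are placeholders.
best = {"val_bpb": baseline_bpb, "train_py": baseline_train_py}  # from the baseline run
history = []

for i in range(MAX_ITERATIONS):
    # Prompt = program.md instructions + recent experiment history + current best train.py
    prompt = build_prompt(history[-MAX_HISTORY_IN_PROMPT:], best["train_py"])
    reply = llm.generate([prompt], sampling_params)[0].outputs[0].text
    candidate = extract_train_py(reply)       # pull the rewritten train.py out of the reply

    result = run_training(candidate)          # writes train.py, runs it under a timeout
    history.append(result)

    # Keep the candidate only if it ran to completion and improved val_bpb
    if result.get("val_bpb") is not None and result["val_bpb"] < best["val_bpb"]:
        best = {"val_bpb": result["val_bpb"], "train_py": candidate}
```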

autoresearch/algo.py

Lines changed: 7 additions & 21 deletions
@@ -1,9 +1,9 @@
 """
 Autonomous autoresearch agent loop for ocean network.
 
-Runs inside a Docker container on a H200 GPU node.
-GPU 0: dedicated to the agent LLM (Qwen3.5-27B via vLLM, unquantized bf16)
-GPU 1: dedicated to training (full 141GB VRAM)
+Runs inside a Docker container on a single H200 GPU node.
+The agent LLM (Qwen3-14B via vLLM, unquantized bf16, ~28GB) and training
+share the single GPU — vLLM takes 25% of VRAM, training uses the rest.
 """
 
 import json
@@ -14,24 +14,16 @@
 import time
 from datetime import datetime, timezone
 
-# ---------------------------------------------------------------------------
-# GPU isolation — must be set before any CUDA imports
-# ---------------------------------------------------------------------------
-
-AGENT_GPU = "0"
-TRAINING_GPU = "1"
-os.environ["CUDA_VISIBLE_DEVICES"] = AGENT_GPU  # vLLM only sees GPU 0
-
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
 
-AGENT_MODEL = "Qwen/Qwen3.5-27B"
-GPU_MEMORY_UTILIZATION = 0.90  # dedicated GPU — use most of it
+AGENT_MODEL = "Qwen/Qwen3-14B"
+GPU_MEMORY_UTILIZATION = 0.25  # ~35GB for LLM weights+KV cache, rest for training
 MAX_ITERATIONS = 200
 TRAINING_TIMEOUT = 600  # 10 minutes
-MAX_MODEL_LEN = 65536  # larger context — dedicated GPU has plenty of room
-MAX_OUTPUT_TOKENS = 10000  # train.py is ~8K tokens; need enough room for the full file
+MAX_MODEL_LEN = 40960
+MAX_OUTPUT_TOKENS = 16384  # max tokens for LLM output (enough for full train.py)
 TEMPERATURE = 0.7
 STAGNATION_THRESHOLD = 3  # consecutive non-improvements before nudge
 MAX_HISTORY_IN_PROMPT = 20  # only show last N iterations in prompt
@@ -146,18 +138,13 @@ def run_training(train_py_content):
     """
     write_file(TRAIN_PY_PATH, train_py_content)
 
-    # Run training on the dedicated training GPU
-    train_env = os.environ.copy()
-    train_env["CUDA_VISIBLE_DEVICES"] = TRAINING_GPU
-
     try:
         result = subprocess.run(
             [sys.executable, TRAIN_PY_PATH],
             capture_output=True,
             text=True,
             timeout=TRAINING_TIMEOUT,
             cwd="/app",
-            env=train_env,
         )
     except subprocess.TimeoutExpired as e:
         stderr_text = e.stderr if isinstance(e.stderr, str) else (e.stderr.decode() if e.stderr else "")
@@ -432,7 +419,6 @@ def main():
         max_model_len=MAX_MODEL_LEN,
         dtype="auto",
         trust_remote_code=True,
-        enforce_eager=True,  # required: CUDA graphs fail for Qwen3.5 DeltaNet on this vLLM version
     )
     sampling_params = SamplingParams(
         max_tokens=MAX_OUTPUT_TOKENS,
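
Put together, a sketch of the single-GPU sharing scheme this commit switches to. The constants and both calls mirror the diff above; `TRAIN_PY_PATH` and the surrounding scaffolding are assumptions, and `enforce_eager=True` is dropped because, per the removed comment, it was only needed for the Qwen3.5 DeltaNet architecture:

```python
# Single-GPU layout: vLLM pins a fixed fraction of VRAM and stays resident;
# each training run is a plain subprocess on the same GPU, so no
# CUDA_VISIBLE_DEVICES juggling is needed.
import subprocess
import sys

from vllm import LLM, SamplingParams

AGENT_MODEL = "Qwen/Qwen3-14B"
GPU_MEMORY_UTILIZATION = 0.25    # ~35GB of the H200 for weights + KV cache
MAX_MODEL_LEN = 40960
MAX_OUTPUT_TOKENS = 16384
TEMPERATURE = 0.7
TRAINING_TIMEOUT = 600
TRAIN_PY_PATH = "/app/train.py"  # assumed value; the real path is defined elsewhere in algo.py

llm = LLM(
    model=AGENT_MODEL,
    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    max_model_len=MAX_MODEL_LEN,
    dtype="auto",
    trust_remote_code=True,
)
sampling_params = SamplingParams(max_tokens=MAX_OUTPUT_TOKENS, temperature=TEMPERATURE)

# Each candidate train.py runs as a subprocess that uses whatever VRAM vLLM left free.
result = subprocess.run(
    [sys.executable, TRAIN_PY_PATH],
    capture_output=True,
    text=True,
    timeout=TRAINING_TIMEOUT,
    cwd="/app",
)
```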
