Commit 2bd232e

Alexandru Dima authored and committed
Merge branch 'main' of github.com:oceanprotocol/oncompute-tutorials
2 parents 14612f8 + 7ad065e

5 files changed: 657 additions & 13 deletions

autoresearch/Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -19,7 +19,7 @@ RUN uv pip install --system --no-cache-dir \
     torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128

 RUN uv pip install --system --no-cache-dir \
-    vllm \
+    "vllm>=0.17.0" \
     kernels>=0.11.7 \
     rustbpe>=0.1.0 \
     tiktoken>=0.11.0 \
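
One detail worth calling out in this change: the added quotes are not cosmetic. A Dockerfile `RUN` line is executed by `/bin/sh -c`, so an unquoted `vllm>=0.17.0` is parsed as the word `vllm` followed by the redirection `>=0.17.0`. A minimal sketch demonstrating the parsing (using `printf` as a stand-in for `uv pip install` so it runs anywhere):

```shell
# Unquoted: the shell splits the word at '>', treating '=0.17.0' as a
# redirection target. "vllm" is written to a file named "=0.17.0" and the
# version constraint never reaches the command.
printf '%s\n' vllm>=0.17.0
cat '=0.17.0'                   # prints: vllm
rm '=0.17.0'

# Quoted: the full requirement specifier is passed as one argument.
printf '%s\n' 'vllm>=0.17.0'    # prints: vllm>=0.17.0
```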

autoresearch/README.md

Lines changed: 21 additions & 5 deletions

@@ -2,13 +2,26 @@

 Autonomous ML research agent that iteratively improves a GPT pretraining script to minimize validation bits-per-byte (val_bpb). Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).

-The key difference: everything runs **inside a single Docker container** on an [Ocean](https://dashboard.oncompute.ai/) GPU node (H200, 141GB VRAM) with a **local open-source LLM** — no API keys needed.
+The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses 2×H200 GPUs: one dedicated to the agent LLM, the other to training.
+
+## From Karpathy's Experiment to Ocean
+
+Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive the agent loop. We adapted it to run fully self-contained on Ocean Network:
+
+1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3.5-27B** served via **vLLM** (unquantized bf16, ~54GB). No API keys, no per-token costs.
+2. **Dedicated GPUs** — GPU 0 runs the agent LLM, GPU 1 runs training. Each gets the full 141GB — no memory-sharing complexity. (A single-GPU variant using Qwen3-32B-AWQ is available as `algo_qwen3-32B.py`.)
+3. **Single Docker container** — Everything packaged in one container: PyTorch, vLLM, Flash Attention 3, data pipeline. Ocean runs it on remote GPU nodes via a symlink to `/app/data/transformations/algorithm`.
+4. **Self-bootstrapping data** — `prepare.py` downloads HuggingFace data shards and trains a BPE tokenizer at container startup, so nothing needs to persist between runs.
+
+> **Alternative**: You can also use the Claude API (or any LLM API) from inside the container by passing an API key as an environment variable. Stronger model, but adds cost and network dependency.
+
+A few clicks give you an autonomous ML researcher that runs for hours on H200 GPUs, costs nothing beyond the compute rental, and produces a `results.json` with the full experiment history and winning code.

 ## How It Works

 1. **Data prep** — Downloads HuggingFace data shards, trains a BPE tokenizer (`prepare.py`)
-2. **Load agent LLM** — Qwen3-32B-AWQ via vLLM (~18GB VRAM, stays resident)
-3. **Baseline run** — Runs the original `train.py` (5-min training budget), records val_bpb
+2. **Load agent LLM** — Qwen3.5-27B via vLLM on GPU 0 (~54GB VRAM, stays resident)
+3. **Baseline run** — Runs the original `train.py` on GPU 1 (5-min training budget), records val_bpb
 4. **Agent loop** (up to 200 iterations):
    - LLM reads experiment history + current best `train.py`
    - Generates a hypothesis + complete new `train.py`
@@ -22,7 +35,8 @@ The user extracts `results["best"]["train_py"]` to get the winning code.

 | File | Description |
 |------|-------------|
-| `algo.py` | Core agent loop — orchestrates LLM inference and training |
+| `algo.py` | Core agent loop — orchestrates LLM inference (GPU 0) and training (GPU 1) |
+| `algo_qwen3-32B.py` | Previous single-GPU variant using Qwen3-32B-AWQ |
 | `train.py` | GPT pretraining script (the file the agent modifies) |
 | `prepare.py` | Data download, tokenizer, dataloader, evaluation (read-only) |
 | `program.md` | Instructions for the agent LLM |
@@ -32,7 +46,7 @@ The user extracts `results["best"]["train_py"]` to get the winning code.

 ## Usage

 1. Go to [dashboard.oncompute.ai](https://dashboard.oncompute.ai/)
-2. Select an **H200 GPU** environment
+2. Select a **2×H200 GPU** environment (or a single H200 with `algo_qwen3-32B.py`)
 3. Configure the job and add payment
 4. Open the **Ocean Orchestrator** in VS Code / your editor
 5. Open this directory in the orchestrator and run the job — the container builds and executes `algo.py` autonomously
@@ -45,6 +59,8 @@ python plot_progress.py path/to/results.json progress.png

 ## Results

+All results below are from the single-GPU setup (Qwen3-32B-AWQ on one H200). Results with the 2×H200 / Qwen3.5-27B setup are pending.
+
 ### Qwen3-32B-AWQ — 0.7 Temperature (First Run)

 ![Qwen3-32B first run](assets/images/qwen32B_first_run_progress.png)
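
The diff's hunk context notes that the user extracts `results["best"]["train_py"]` from the finished run's `results.json` to get the winning code. A minimal sketch of that step; apart from the `best.train_py` path and the `val_bpb` metric, the schema shown here is a hypothetical illustration:

```python
import json

def extract_best(results: dict) -> str:
    # Only the results["best"]["train_py"] path is stated in the README;
    # the rest of the structure below is assumed for illustration.
    return results["best"]["train_py"]

if __name__ == "__main__":
    # Hypothetical minimal results.json from a finished run:
    sample = {
        "best": {"train_py": "print('winning train.py')", "val_bpb": 0.95},
        "history": [],
    }
    with open("results.json", "w") as f:
        json.dump(sample, f)
    with open("results.json") as f:
        winning_code = extract_best(json.load(f))
    print(winning_code)  # → print('winning train.py')
```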

autoresearch/algo.py

Lines changed: 20 additions & 6 deletions

@@ -1,9 +1,9 @@
 """
 Autonomous autoresearch agent loop for ocean network.

-Runs inside a Docker container on a remote GPU node.
-Uses a local open-source LLM (Qwen3-32B-AWQ via vLLM) to iteratively
-improve train.py, measuring val_bpb as the optimization target.
+Runs inside a Docker container on a 2×H200 GPU node.
+GPU 0: dedicated to the agent LLM (Qwen3.5-27B via vLLM, unquantized bf16)
+GPU 1: dedicated to training (full 141GB VRAM)
 """

 import json
@@ -14,15 +14,23 @@
 import time
 from datetime import datetime, timezone

+# ---------------------------------------------------------------------------
+# GPU isolation — must be set before any CUDA imports
+# ---------------------------------------------------------------------------
+
+AGENT_GPU = "0"
+TRAINING_GPU = "1"
+os.environ["CUDA_VISIBLE_DEVICES"] = AGENT_GPU  # vLLM only sees GPU 0
+
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------

-AGENT_MODEL = "Qwen/Qwen3-32B-AWQ"
-GPU_MEMORY_UTILIZATION = 0.25  # ~35GB for weights+KV cache, rest for training
+AGENT_MODEL = "Qwen/Qwen3.5-27B"
+GPU_MEMORY_UTILIZATION = 0.90  # dedicated GPU — use most of it
 MAX_ITERATIONS = 200
 TRAINING_TIMEOUT = 600  # 10 minutes
-MAX_MODEL_LEN = 40960  # total context window (input + output)
+MAX_MODEL_LEN = 65536  # larger context — dedicated GPU has plenty of room
 MAX_OUTPUT_TOKENS = 16384  # max tokens for LLM output (enough for full train.py)
 TEMPERATURE = 0.7
 STAGNATION_THRESHOLD = 5  # consecutive non-improvements before nudge
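
The GPU-isolation comment in the hunk above says the variable must be set before any CUDA imports: torch and vLLM read `CUDA_VISIBLE_DEVICES` when CUDA initializes, so a later assignment is silently ignored. The same variable can then be overridden per child process to pin training to the other GPU. A runnable sketch of the mechanism (no GPU required; the helper name is ours, not part of algo.py):

```python
import os
import subprocess
import sys

# Pin the parent process (which loads vLLM) to GPU 0 *before* any CUDA
# import: CUDA snapshots CUDA_VISIBLE_DEVICES at initialization time.
AGENT_GPU = "0"
TRAINING_GPU = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = AGENT_GPU

def launch_on_training_gpu(code: str) -> subprocess.CompletedProcess:
    """Run a snippet in a child process that only sees the training GPU.

    algo.py does the equivalent inside run_training() with train.py.
    """
    env = os.environ.copy()  # inherit everything else from the parent
    env["CUDA_VISIBLE_DEVICES"] = TRAINING_GPU
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, env=env,
    )

if __name__ == "__main__":
    # The child reports the GPU set it can see ("1") while the parent
    # keeps "0" — no CUDA needed to demonstrate the mechanism.
    proc = launch_on_training_gpu(
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    )
    print(proc.stdout.strip())  # → 1
```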
@@ -138,13 +146,18 @@ def run_training(train_py_content):
     """
     write_file(TRAIN_PY_PATH, train_py_content)

+    # Run training on the dedicated training GPU
+    train_env = os.environ.copy()
+    train_env["CUDA_VISIBLE_DEVICES"] = TRAINING_GPU
+
     try:
         result = subprocess.run(
             [sys.executable, TRAIN_PY_PATH],
             capture_output=True,
             text=True,
             timeout=TRAINING_TIMEOUT,
             cwd="/app",
+            env=train_env,
         )
     except subprocess.TimeoutExpired as e:
         stderr_text = e.stderr if isinstance(e.stderr, str) else (e.stderr.decode() if e.stderr else "")
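
The `run_training` hunk shows two defensive details worth unpacking: the training subprocess is bounded by `TRAINING_TIMEOUT`, and on timeout `e.stderr` is normalized because `subprocess.TimeoutExpired` may carry it as `str`, `bytes`, or `None` depending on the Python version. A self-contained sketch of the same pattern, with a toy snippet standing in for `train.py`:

```python
import subprocess
import sys

def run_with_timeout(code: str, timeout: float):
    """Run a Python snippet, returning (finished, stderr_text).

    Mirrors the pattern in run_training(): on timeout, e.stderr may be
    str, bytes, or None, so normalize it before logging.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return True, result.stderr
    except subprocess.TimeoutExpired as e:
        stderr_text = e.stderr if isinstance(e.stderr, str) else (
            e.stderr.decode() if e.stderr else ""
        )
        return False, stderr_text

if __name__ == "__main__":
    print(run_with_timeout("print('done')", timeout=30)[0])               # → True
    print(run_with_timeout("import time; time.sleep(60)", timeout=1)[0])  # → False
```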
@@ -419,6 +432,7 @@ def main():
         max_model_len=MAX_MODEL_LEN,
         dtype="auto",
         trust_remote_code=True,
+        enforce_eager=True,  # avoid DeltaNet compilation issues with Qwen3.5
     )
     sampling_params = SamplingParams(
         max_tokens=MAX_OUTPUT_TOKENS,
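
algo.py's configuration sets `STAGNATION_THRESHOLD = 5`, the number of consecutive non-improvements before the agent gets a nudge toward a bolder change. A stripped-down sketch of how such a loop can track the best `val_bpb` and trigger the nudge; `propose` and `evaluate` are stand-ins for the LLM call and the training run, not the actual algo.py API:

```python
MAX_ITERATIONS = 200
STAGNATION_THRESHOLD = 5  # consecutive non-improvements before nudge

def agent_loop(propose, evaluate, max_iterations=MAX_ITERATIONS):
    """Skeleton of the improve/evaluate loop with a stagnation nudge.

    propose(best_code, nudge) and evaluate(code) are hypothetical stand-ins
    for the LLM call and the training run.
    """
    best_code = "baseline"
    best_bpb = evaluate(best_code)
    stagnant = 0
    for _ in range(max_iterations):
        nudge = stagnant >= STAGNATION_THRESHOLD  # ask for a bolder change
        candidate = propose(best_code, nudge)
        bpb = evaluate(candidate)
        if bpb is not None and bpb < best_bpb:  # lower val_bpb is better
            best_code, best_bpb, stagnant = candidate, bpb, 0
        else:
            stagnant += 1
    return best_code, best_bpb
```

With a stub `evaluate` that stops improving, `stagnant` crosses the threshold after five failed iterations and every later `propose` call receives `nudge=True`.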
