autoresearch/README.md
15 additions & 31 deletions

@@ -2,45 +2,26 @@
Autonomous ML research agent that iteratively improves a GPT pretraining script to minimize validation bits-per-byte (val_bpb). Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).

- The key difference: everything runs **inside a single Docker container** on an [Ocean](https://dashboard.oncompute.ai/) GPU node (H200, 141GB VRAM) with a **local open-source LLM** — no API keys needed.

+ The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses 2×H200 GPUs: one dedicated to the agent LLM, the other to training.
## From Karpathy's Experiment to Ocean

- Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive an agent loop that iteratively improves a GPT training script. It's a brilliant idea — let an LLM be the researcher — but it requires API keys, costs money per token, and runs on your own machine. Here's how we adapted it to run fully self-contained on Ocean Network:

+ Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive the agent loop. We adapted it to run fully self-contained on Ocean Network:

- ### 1. Replace the API with a local LLM

+ 1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3.5-27B** served via **vLLM** (unquantized bf16, ~54GB). No API keys, no per-token costs.

+ 2. **Dedicated GPUs** — GPU 0 runs the agent LLM, GPU 1 runs training. Each gets the full 141GB — no memory-sharing complexity; a minimal device-pinning sketch follows this list. (A single-GPU variant using Qwen3-32B-AWQ is available as `algo_qwen3-32B.py`.)

+ 3. **Single Docker container** — Everything packaged in one container: PyTorch, vLLM, Flash Attention 3, data pipeline. Ocean runs it on remote GPU nodes via a symlink to `/app/data/transformations/algorithm`.

+ 4. **Self-bootstrapping data** — `prepare.py` downloads HuggingFace data shards and trains a BPE tokenizer at container startup, so nothing needs to persist between runs.
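
A minimal sketch of the dedicated-GPU split from item 2, assuming each process is pinned with `CUDA_VISIBLE_DEVICES`; the model id and training script name below are placeholders, not values taken from the repo:

```python
import os
import subprocess

from vllm import LLM

# Agent model on GPU 0: this process only ever sees the first device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
agent = LLM(model="Qwen/Qwen3-32B-AWQ")  # placeholder id; the 2-GPU setup would load its bf16 agent model here

# Training on GPU 1: the subprocess gets its own device mask, so the two
# workloads never share a card or its memory.
train_env = dict(os.environ, CUDA_VISIBLE_DEVICES="1")
subprocess.run(["python", "train_gpt.py"], env=train_env, check=True)  # script name is illustrative
```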

- The original calls Claude via the Anthropic API. We replaced that with **Qwen3-32B-AWQ** served locally through **vLLM**. The AWQ 4-bit quantization brings the model down to ~18GB VRAM, leaving the rest of the H200's 141GB for training. vLLM loads once and stays resident for all 200 iterations — no network calls, no API keys, no per-token costs.
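
As a rough sketch of what "load once, stay resident" looks like with vLLM's offline API (the context length and sampling settings are assumptions, not taken from the repo):

```python
from vllm import LLM, SamplingParams

# Load the quantized agent model once; the engine stays resident in GPU memory
# for the whole job, so every iteration is a local call with no network I/O.
agent = LLM(
    model="Qwen/Qwen3-32B-AWQ",  # AWQ 4-bit weights, roughly 18GB of VRAM
    quantization="awq",
    max_model_len=16384,         # assumed context-window budget
)
params = SamplingParams(temperature=0.6, max_tokens=2048)

for step in range(200):          # one generation per research iteration
    prompt = f"Iteration {step}: propose one change to the training script ..."
    reply = agent.generate([prompt], params)[0].outputs[0].text
```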

+ > **Alternative**: You can also use the Claude API (or any LLM API) from inside the container by passing an API key as an environment variable. Stronger model, but adds cost and network dependency.

- ### 2. Share one GPU between the agent and training

- This is the core engineering challenge. The agent LLM and the training run need to coexist on the same GPU. We configure vLLM with `gpu_memory_utilization=0.25` (~35GB for weights + KV cache), leaving ~100GB for PyTorch training. The agent generates code, then training runs as a subprocess — they never compete for memory simultaneously because inference finishes before training starts.
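
A sketch of that budget and sequencing, under the assumption that training is launched with `subprocess` only after each generation returns; `train_gpt.py` is a placeholder name:

```python
import subprocess

from vllm import LLM, SamplingParams

# Cap the agent at ~25% of the H200 (~35GB for weights + KV cache),
# leaving roughly 100GB free for the training process.
agent = LLM(model="Qwen/Qwen3-32B-AWQ", quantization="awq",
            gpu_memory_utilization=0.25)

# Inference first: get the next proposed edit from the agent.
edit = agent.generate(["Propose the next change to the training script."],
                      SamplingParams(max_tokens=1024))[0].outputs[0].text

# Training second, in its own process, started only after generation has
# finished, so the two workloads never hit peak memory at the same time.
subprocess.run(["python", "train_gpt.py"], check=True)  # placeholder script name
```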

- ### 3. Package everything in a single Docker container

- Ocean's compute-to-data model runs a Docker container on a remote GPU node. We built a container on `nvidia/cuda:12.8.0-devel-ubuntu22.04` that includes PyTorch, vLLM, Flash Attention 3 (via `kernels`), and all dependencies. The entire pipeline — data download, tokenizer training, LLM loading, and the 200-iteration research loop — runs from a single entrypoint (`algo.py`).

- ### 4. Adapt the data pipeline for container execution

- Karpathy's setup assumes a persistent local environment. In a container, nothing persists between runs. `prepare.py` handles this by downloading HuggingFace data shards and training a BPE tokenizer from scratch at container startup, caching everything under `~/.cache/autoresearch/` for the duration of the job.
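
A hedged sketch of that bootstrap step; the dataset repo, shard path, vocab size, and the `iter_documents` helper are illustrative, not what `prepare.py` actually uses:

```python
from pathlib import Path

import pandas as pd
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

cache = Path.home() / ".cache" / "autoresearch"  # cache dir named in the README
cache.mkdir(parents=True, exist_ok=True)

# 1) Download a data shard from the Hub at container startup (placeholder repo/file).
shard = hf_hub_download(repo_id="HuggingFaceFW/fineweb-edu", repo_type="dataset",
                        filename="sample/10BT/000_00000.parquet", local_dir=cache)

def iter_documents(parquet_path: str):
    """Yield raw text documents from one shard (assumes a 'text' column)."""
    for text in pd.read_parquet(parquet_path)["text"]:
        yield text

# 2) Train a byte-level BPE tokenizer from scratch and cache it for the job.
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=["<|endoftext|>"])
tok.train_from_iterator(iter_documents(shard), trainer)
tok.save(str(cache / "tokenizer.json"))
```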

- ### 5. Wire into Ocean's orchestrator

- Ocean expects the algorithm at `/app/data/transformations/algorithm`. A symlink in the Dockerfile (`ln -sf /app/algo.py /app/data/transformations/algorithm`) bridges this. Results are written to `/data/outputs/results.json` so they're downloadable from the Ocean dashboard when the job completes.
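
The final write can be as simple as the following sketch; only the output path comes from the README, and the field names are illustrative rather than the exact schema:

```python
import json
from pathlib import Path

def write_results(history: list, best_bpb: float, best_code: str) -> None:
    """Persist the experiment record where Ocean makes it downloadable."""
    out_dir = Path("/data/outputs")          # path the Ocean dashboard reads from
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {                              # illustrative fields, not the exact schema
        "iterations": len(history),
        "history": history,                  # per-iteration records from the loop
        "best_val_bpb": best_bpb,            # lowest validation bits-per-byte seen
        "best_code": best_code,              # the winning training script
    }
    (out_dir / "results.json").write_text(json.dumps(payload, indent=2))
```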

- ### Alternative: use an API instead of a local LLM

- You could also use the Claude API (or any other LLM API) from inside the container — just pass the API key as an environment variable and swap the vLLM calls for Anthropic SDK calls. This frees up the ~35GB reserved for the agent model, giving training the full GPU, and a stronger model like Claude Sonnet would likely produce fewer crashes and smarter changes. The tradeoff is API costs and a dependency on network access.
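
That swap might look roughly like this; the model id is an example, and the key is assumed to arrive via the job's environment:

```python
import os

import anthropic

# No key is baked into the image; it is injected as an environment variable
# when the Ocean job is launched.
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

message = client.messages.create(
    model="claude-sonnet-4-20250514",   # example model id
    max_tokens=2048,
    messages=[{"role": "user",
               "content": "Propose one improvement to this GPT training script: ..."}],
)
reply = message.content[0].text
```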

- ### The result

- A few clicks give you an autonomous ML researcher that runs for hours on an H200 GPU, costs nothing beyond the compute rental, and produces a `results.json` with the full experiment history and winning code.

+ A few clicks give you an autonomous ML researcher that runs for hours on H200 GPUs, costs nothing beyond the compute rental, and produces a `results.json` with the full experiment history and winning code.
## How It Works
1. **Data prep** — Downloads HuggingFace data shards, trains a BPE tokenizer (`prepare.py`)