
Commit 09aa5c8

feat: add qwen3.5-27B results | switch to smaller model for final try
1 parent 0cb0f28 commit 09aa5c8

5 files changed

Lines changed: 1384 additions & 29 deletions


autoresearch/README.md

Lines changed: 20 additions & 8 deletions
@@ -2,14 +2,14 @@
 
 Autonomous ML research agent that iteratively improves a GPT pretraining script to minimize validation bits-per-byte (val_bpb). Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch).
 
-The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses 2×H200 GPUs: one dedicated to the agent LLM, the other to training.
+The key difference: everything runs **inside a single Docker container** on [Ocean](https://dashboard.oncompute.ai/) GPU nodes with a **local open-source LLM** — no API keys needed. The current setup uses a single H200 GPU with **Qwen3-14B** (unquantized bf16, ~28GB) — small enough to share the GPU with training at 0.7 temperature for maximum exploration throughput.
 
 ## From Karpathy's Experiment to Ocean
 
 Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) uses the Claude API to drive the agent loop. We adapted it to run fully self-contained on Ocean Network:
 
-1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3.5-27B** served via **vLLM** (unquantized bf16, ~54GB). No API keys, no per-token costs.
-2. **Dedicated GPUs** — GPU 0 runs the agent LLM, GPU 1 runs training. Each gets the full 141GB — no memory-sharing complexity. (A single-GPU variant using Qwen3-32B-AWQ is available as `algo_qwen3-32B.py`.)
+1. **Local LLM instead of API** — Replaced Claude API calls with **Qwen3-14B** served via **vLLM** (unquantized bf16, ~28GB). No API keys, no per-token costs.
+2. **Single GPU** — vLLM takes 25% of H200 VRAM (~35GB) for the agent LLM, training uses the rest. Previous variants with Qwen3-32B-AWQ and Qwen3.5-27B are preserved as `algo_qwen3-32B.py` and `algo_qwen3.5-27B.py`.
 3. **Single Docker container** — Everything packaged in one container: PyTorch, vLLM, Flash Attention 3, data pipeline. Ocean runs it on remote GPU nodes via a symlink to `/app/data/transformations/algorithm`.
 4. **Self-bootstrapping data** — `prepare.py` downloads HuggingFace data shards and trains a BPE tokenizer at container startup, so nothing needs to persist between runs.

@@ -20,7 +20,7 @@ A few clicks give you an autonomous ML researcher that runs for hours on H200 GP
 ## How It Works
 
 1. **Data prep** — Downloads HuggingFace data shards, trains a BPE tokenizer (`prepare.py`)
-2. **Load agent LLM** — Qwen3.5-27B via vLLM on GPU 0 (~54GB VRAM, stays resident)
+2. **Load agent LLM** — Qwen3-14B via vLLM (~28GB VRAM, stays resident, shares GPU with training)
 3. **Baseline run** — Runs the original `train.py` on GPU 1 (5-min training budget), records val_bpb
 4. **Agent loop** (up to 200 iterations):
    - LLM reads experiment history + current best `train.py`
@@ -35,8 +35,9 @@ The user extracts `results["best"]["train_py"]` to get the winning code.
 
 | File | Description |
 |------|-------------|
-| `algo.py` | Core agent loop — orchestrates LLM inference (GPU 0) and training (GPU 1) |
-| `algo_qwen3-32B.py` | Previous single-GPU variant using Qwen3-32B-AWQ |
+| `algo.py` | Core agent loop — Qwen3-14B, single H200, 0.7 temp, LLM and training share GPU |
+| `algo_qwen3-32B.py` | Single-GPU variant using Qwen3-32B-AWQ |
+| `algo_qwen3.5-27B.py` | 2×H200 variant using Qwen3.5-27B (one GPU per role) |
 | `train.py` | GPT pretraining script (the file the agent modifies) |
 | `prepare.py` | Data download, tokenizer, dataloader, evaluation (read-only) |
 | `program.md` | Instructions for the agent LLM |
@@ -46,7 +47,7 @@ The user extracts `results["best"]["train_py"]` to get the winning code.
 ## Usage
 
 1. Go to [dashboard.oncompute.ai](https://dashboard.oncompute.ai/)
-2. Select a **H200 GPU** environment (or single H200 with `algo_qwen3-32B.py`)
+2. Select a **single H200 GPU** environment (or H200 with `algo_qwen3.5-27B.py`)
 3. Configure the job and add payment
 4. Open the **Ocean Orchestrator** in VS Code / your editor
 5. Open this directory in the orchestrator and run the job — the container builds and executes `algo.py` autonomously
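
A minimal sketch of the extraction step mentioned in the hunk context ("The user extracts `results["best"]["train_py"]` to get the winning code"); the results file name and the output path are assumptions:

```python
import json

# Load the results file written by algo.py (file name assumed here).
with open("results.json") as f:
    results = json.load(f)

# results["best"]["train_py"] holds the full source of the best-scoring train.py.
with open("best_train.py", "w") as f:
    f.write(results["best"]["train_py"])
```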
@@ -59,7 +60,7 @@ python plot_progress.py path/to/results.json progress.png
 
 ## Results
 
-All results below are from the single-GPU setup (Qwen3-32B-AWQ on one H200). Results with the 2×H200 / Qwen3.5-27B setup are pending.
+The first three runs used the single-GPU setup (Qwen3-32B-AWQ on one H200). The last run used the 2×H200 setup with Qwen3.5-27B.
 
 ### Qwen3-32B-AWQ — 0.7 Temperature (First Run)

@@ -88,6 +89,17 @@ All results below are from the single-GPU setup (Qwen3-32B-AWQ on one H200). Res
 - **201 iterations** over 12 hours, 52 successful runs (74% crash rate)
 - Double the runtime of the first run but worse results — the agent got stuck and couldn't escape the local minimum
 
+### Qwen3.5-27B — 2×H200, ~12 Hours
+
+![Qwen3.5-27B progress](assets/images/qwen3.5-27B_progress.png)
+
+- **Baseline**: 1.0251 val_bpb
+- **Best**: 0.9993 val_bpb (2.5% improvement)
+- **77 iterations** over ~12 hours, 38 successful runs (51% crash rate)
+- Key improvements: model depth (more layers), concentrated in the second half of the run
+
 ### Takeaway
 
 Lower temperature (0.5 vs 0.7) reduces the crash rate (62-74% vs 86%) but produces significantly worse results. The more "creative" 0.7 temperature generates more broken code, but the successful mutations are bolder and lead to real architectural improvements (e.g. deeper models). At 0.5 temp the agent plays it safe, converges early to ~1.007 val_bpb, and stalls — even with 12 hours of compute it can't match what 0.7 temp achieved in 5.5 hours.
+
+Switching from the quantized Qwen3-32B-AWQ (single GPU) to the full Qwen3.5-27B (2×H200) didn't help — the larger model ran fewer experiments in the same time (77 vs 201), had a lower crash rate (51% vs 86%), but couldn't beat the 0.9818 val_bpb that Qwen3-32B at 0.7 temp reached. The reduced throughput likely offset any quality gains from the stronger model.
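
For orientation, a hypothetical sketch of the agent loop the README describes above. The helpers `build_prompt` and `extract_train_py` are placeholder names and the shape of the `run_training` result is assumed; only `run_training` itself and the configuration constants appear in the `algo.py` diff below:

```python
# Pseudocode-level sketch of the loop in algo.py; helper names are placeholders.
best = {"val_bpb": baseline_bpb, "train_py": baseline_train_py}  # from the baseline run
history = []

for i in range(MAX_ITERATIONS):
    # Prompt = program.md instructions + recent experiment history + current best train.py
    prompt = build_prompt(history[-MAX_HISTORY_IN_PROMPT:], best["train_py"])
    reply = llm.generate([prompt], sampling_params)[0].outputs[0].text
    candidate = extract_train_py(reply)       # pull the rewritten train.py out of the reply

    result = run_training(candidate)          # writes train.py, runs it under a timeout
    history.append(result)

    # Keep the candidate only if it ran to completion and improved val_bpb
    if result.get("val_bpb") is not None and result["val_bpb"] < best["val_bpb"]:
        best = {"val_bpb": result["val_bpb"], "train_py": candidate}
```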

autoresearch/algo.py

Lines changed: 7 additions & 21 deletions
@@ -1,9 +1,9 @@
 """
 Autonomous autoresearch agent loop for ocean network.
 
-Runs inside a Docker container on a H200 GPU node.
-GPU 0: dedicated to the agent LLM (Qwen3.5-27B via vLLM, unquantized bf16)
-GPU 1: dedicated to training (full 141GB VRAM)
+Runs inside a Docker container on a single H200 GPU node.
+The agent LLM (Qwen3-14B via vLLM, unquantized bf16, ~28GB) and training
+share the single GPU — vLLM takes 25% of VRAM, training uses the rest.
 """
 
 import json
@@ -14,24 +14,16 @@
 import time
 from datetime import datetime, timezone
 
-# ---------------------------------------------------------------------------
-# GPU isolation — must be set before any CUDA imports
-# ---------------------------------------------------------------------------
-
-AGENT_GPU = "0"
-TRAINING_GPU = "1"
-os.environ["CUDA_VISIBLE_DEVICES"] = AGENT_GPU  # vLLM only sees GPU 0
-
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
 
-AGENT_MODEL = "Qwen/Qwen3.5-27B"
-GPU_MEMORY_UTILIZATION = 0.90  # dedicated GPU — use most of it
+AGENT_MODEL = "Qwen/Qwen3-14B"
+GPU_MEMORY_UTILIZATION = 0.25  # ~35GB for LLM weights+KV cache, rest for training
 MAX_ITERATIONS = 200
 TRAINING_TIMEOUT = 600  # 10 minutes
-MAX_MODEL_LEN = 65536  # larger context — dedicated GPU has plenty of room
-MAX_OUTPUT_TOKENS = 10000  # train.py is ~8K tokens; need enough room for the full file
+MAX_MODEL_LEN = 40960
+MAX_OUTPUT_TOKENS = 16384  # max tokens for LLM output (enough for full train.py)
 TEMPERATURE = 0.7
 STAGNATION_THRESHOLD = 3  # consecutive non-improvements before nudge
 MAX_HISTORY_IN_PROMPT = 20  # only show last N iterations in prompt
@@ -146,18 +138,13 @@ def run_training(train_py_content):
     """
     write_file(TRAIN_PY_PATH, train_py_content)
 
-    # Run training on the dedicated training GPU
-    train_env = os.environ.copy()
-    train_env["CUDA_VISIBLE_DEVICES"] = TRAINING_GPU
-
     try:
         result = subprocess.run(
             [sys.executable, TRAIN_PY_PATH],
             capture_output=True,
             text=True,
             timeout=TRAINING_TIMEOUT,
             cwd="/app",
-            env=train_env,
         )
     except subprocess.TimeoutExpired as e:
         stderr_text = e.stderr if isinstance(e.stderr, str) else (e.stderr.decode() if e.stderr else "")
@@ -432,7 +419,6 @@ def main():
         max_model_len=MAX_MODEL_LEN,
         dtype="auto",
         trust_remote_code=True,
-        enforce_eager=True,  # required: CUDA graphs fail for Qwen3.5 DeltaNet on this vLLM version
     )
     sampling_params = SamplingParams(
         max_tokens=MAX_OUTPUT_TOKENS,
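
Put together, a sketch of the single-GPU sharing scheme this commit switches to. The constants and both calls mirror the diff above; `TRAIN_PY_PATH` and the surrounding scaffolding are assumptions, and `enforce_eager=True` is dropped because, per the removed comment, it was only needed for the Qwen3.5 DeltaNet architecture:

```python
# Single-GPU layout: vLLM pins a fixed fraction of VRAM and stays resident;
# each training run is a plain subprocess on the same GPU, so no
# CUDA_VISIBLE_DEVICES juggling is needed.
import subprocess
import sys

from vllm import LLM, SamplingParams

AGENT_MODEL = "Qwen/Qwen3-14B"
GPU_MEMORY_UTILIZATION = 0.25    # ~35GB of the H200 for weights + KV cache
MAX_MODEL_LEN = 40960
MAX_OUTPUT_TOKENS = 16384
TEMPERATURE = 0.7
TRAINING_TIMEOUT = 600
TRAIN_PY_PATH = "/app/train.py"  # assumed value; the real path is defined elsewhere in algo.py

llm = LLM(
    model=AGENT_MODEL,
    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    max_model_len=MAX_MODEL_LEN,
    dtype="auto",
    trust_remote_code=True,
)
sampling_params = SamplingParams(max_tokens=MAX_OUTPUT_TOKENS, temperature=TEMPERATURE)

# Each candidate train.py runs as a subprocess that uses whatever VRAM vLLM left free.
result = subprocess.run(
    [sys.executable, TRAIN_PY_PATH],
    capture_output=True,
    text=True,
    timeout=TRAINING_TIMEOUT,
    cwd="/app",
)
```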
