
Commit 39b6456 (parent: 0e9160f)

feat(autoresearch): add qwen3-14B experiment results

5 files changed: 7342 additions & 1 deletion

autoresearch/README.md

Lines changed: 10 additions & 1 deletion
@@ -98,8 +98,17 @@ The first three runs used the single-GPU setup (Qwen3-32B-AWQ on one H200). The
 - **77 iterations** over ~12 hours, 38 successful runs (51% crash rate)
 - Key improvements: model depth (more layers), concentrated in the second half of the run
 
+### Qwen3-14B - 1xH200, ~12 Hours
+
+![Qwen3-14B progress](assets/images/qwen3-14B_progress.png)
+
+- **Baseline**: 1.0268 val_bpb
+- **Best**: 0.9967 val_bpb (2.9% improvement)
+- **165 iterations** over ~12 hours, 72 successful runs (56% crash rate)
+- Key improvements: model depth
+
 ### Takeaway
 
 Lower temperature (0.5 vs 0.7) reduces the crash rate (62-74% vs 86%) but produces significantly worse results. The more "creative" 0.7 temperature generates more broken code, but the successful mutations are bolder and lead to real architectural improvements (e.g. deeper models). At 0.5 temp the agent plays it safe, converges early to ~1.007 val_bpb, and stalls — even with 12 hours of compute it can't match what 0.7 temp achieved in 5.5 hours.
 
-Switching from the quantized Qwen3-32B-AWQ (single GPU) to the full Qwen3.5-27B (2×H200) didn't help — the larger model ran fewer experiments in the same time (77 vs 201), had a lower crash rate (51% vs 86%), but couldn't beat the 0.9818 val_bpb that Qwen3-32B at 0.7 temp reached. The reduced throughput likely offset any quality gains from the stronger model.
+Switching from the quantized Qwen3-32B-AWQ (single GPU) to the full Qwen3.5-27B (2×H200) didn't help — the larger model ran fewer experiments in the same time (77 vs 201), had a lower crash rate (51% vs 86%), but couldn't beat the 0.9818 val_bpb that Qwen3-32B at 0.7 temp reached. The reduced throughput likely offset any quality gains from the stronger model. Percentage-wise, the smallest model (Qwen3-14B) performed best, thanks to its short iteration time.
Binary image file: 99.7 KB (not rendered)
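The headline numbers in the README diff above are easy to sanity-check; a quick sketch (plain Python, values taken from the Qwen3-14B bullets):

```python
# Relative val_bpb improvement for the Qwen3-14B run (numbers from the README diff).
baseline = 1.0268  # starting val_bpb
best = 0.9967      # best val_bpb the agent found

improvement = (baseline - best) / baseline * 100
print(f"improvement: {improvement:.1f}%")  # → improvement: 2.9%

# Crash rate: 165 iterations, 72 of which were successful runs.
iterations, successes = 165, 72
crash_rate = (iterations - successes) / iterations * 100
print(f"crash rate: {crash_rate:.0f}%")  # → crash rate: 56%
```

Both computed values match the figures stated in the bullets (2.9% improvement, 56% crash rate).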

autoresearch/pyproject.toml

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,7 @@ dependencies = [
     "torch==2.9.1",
     "vllm",
     "huggingface-hub",
+    "matplotlib"
 ]
 
 [tool.uv.sources]
@@ -24,3 +25,8 @@ torch = [
 name = "pytorch-cu128"
 url = "https://download.pytorch.org/whl/cu128"
 explicit = true
+
+[tool.uv.workspace]
+members = [
+    "autoresearch",
+]
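The new `matplotlib` dependency is presumably what renders progress charts like `qwen3-14B_progress.png`. A minimal, hypothetical sketch of such a plot (not the repo's actual plotting code; the intermediate val_bpb values below are placeholders, only the baseline and best values come from the README):

```python
# Hypothetical progress-plot sketch using the newly added matplotlib dependency.
import matplotlib
matplotlib.use("Agg")  # headless backend; renders to a file without a display
import matplotlib.pyplot as plt

# Placeholder val_bpb trajectory; the real per-run values live in
# autoresearch/results/qwen3-14B/results.json.
val_bpb = [1.0268, 1.0210, 1.0105, 1.0031, 0.9990, 0.9967]

plt.plot(range(len(val_bpb)), val_bpb, marker="o", label="best so far")
plt.axhline(1.0268, linestyle="--", color="gray", label="baseline")
plt.xlabel("successful run")
plt.ylabel("val_bpb")
plt.title("Qwen3-14B - 1xH200, ~12 hours")
plt.legend()
plt.savefig("qwen3-14B_progress.png")
```

The `[tool.uv.workspace]` addition registers `autoresearch` as a workspace member, so uv resolves its dependencies (including this matplotlib addition) from the repo root.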

autoresearch/results/qwen3-14B/results.json

Lines changed: 1520 additions & 0 deletions
Large diffs are not rendered by default.
