Commit 6cfc3eb

docs: add results from yesterday's runs
1 parent 4a37759 commit 6cfc3eb

7 files changed

Lines changed: 1855 additions & 2 deletions

autoresearch/README.md

Lines changed: 24 additions & 2 deletions
@@ -45,11 +45,33 @@ python plot_progress.py path/to/results.json progress.png
 
 ## Results
 
-### Qwen3-32B-AWQ — First Run
+### Qwen3-32B-AWQ — 0.7 Temperature (First Run)
 
 ![Qwen3-32B first run](assets/images/qwen32B_first_run_progress.png)
 
 - **Baseline**: 1.0077 val_bpb
 - **Best**: 0.9818 val_bpb (2.6% improvement)
-- **201 iterations** over 5.5 hours, 30 successful runs (85% crash rate)
+- **201 iterations** over 5.5 hours, 29 successful runs (86% crash rate)
 - Key improvements: increased model depth (8→10 layers), late-stage hyperparameter tuning
+
+### Qwen3-32B-AWQ — 0.5 Temperature, 6 Hours
+
+![Qwen3-32B 0.5temp 6h](assets/images/qwen32B_6h_0.5temp_progress.png)
+
+- **Baseline**: 1.0227 val_bpb
+- **Best**: 1.0072 val_bpb (1.5% improvement)
+- **94 iterations** over 6 hours, 36 successful runs (62% crash rate)
+- Lower crash rate than 0.7 temp, but much less improvement — the agent converged early and plateaued
+
+### Qwen3-32B-AWQ — 0.5 Temperature, 12 Hours
+
+![Qwen3-32B 0.5temp 12h](assets/images/qwen32B_12h_0.5temp_progress.png)
+
+- **Baseline**: 1.0215 val_bpb
+- **Best**: 1.0074 val_bpb (1.4% improvement)
+- **201 iterations** over 12 hours, 52 successful runs (74% crash rate)
+- Double the runtime of the first run but worse results — the agent got stuck and couldn't escape the local minimum
+
+### Takeaway
+
+Lower temperature (0.5 vs 0.7) reduces the crash rate (62-74% vs 86%) but produces significantly worse results. The more "creative" 0.7 temperature generates more broken code, but the successful mutations are bolder and lead to real architectural improvements (e.g. deeper models). At 0.5 temp the agent plays it safe, converges early to ~1.007 val_bpb, and stalls — even with 12 hours of compute it can't match what 0.7 temp achieved in 5.5 hours.
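
The percentages quoted in the README can be checked from the raw counts. The sketch below is not part of the commit. It assumes the crash rate is the share of iterations that did not end in a successful run, and that improvement is the relative drop in val_bpb from baseline to best (lower is better). The README states neither definition explicitly, and the helper names `crash_rate`, `improvement`, and `summarize` are hypothetical.

```python
def crash_rate(iterations: int, successful: int) -> float:
    """Assumed definition: percent of iterations without a successful run."""
    return 100.0 * (iterations - successful) / iterations


def improvement(baseline: float, best: float) -> float:
    """Assumed definition: relative val_bpb reduction in percent."""
    return 100.0 * (baseline - best) / baseline


# (label, iterations, successful runs, baseline val_bpb, best val_bpb),
# copied from the three result sections of the README.
RUNS = [
    ("0.7 temp, 5.5 h", 201, 29, 1.0077, 0.9818),
    ("0.5 temp, 6 h", 94, 36, 1.0227, 1.0072),
    ("0.5 temp, 12 h", 201, 52, 1.0215, 1.0074),
]


def summarize(runs):
    """Round to the precision the README uses (whole percent, one decimal)."""
    return [
        (label, round(crash_rate(n, ok)), round(improvement(base, best), 1))
        for label, n, ok, base, best in runs
    ]


for label, crash, gain in summarize(RUNS):
    print(f"{label}: {crash}% crash rate, {gain}% improvement")
```

Under these assumptions the computed values match the figures in the diff, including the corrected 29 successful runs and 86% crash rate for the first run.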

autoresearch/results/first_successful_run_qwen32B/outputs/results.json renamed to autoresearch/results/first_successful_run_qwen32B_0.7temp/results.json

File renamed without changes.

autoresearch/results/qwen32B_12hours_0.5temp/results.json

Lines changed: 1831 additions & 0 deletions
Large diffs are not rendered by default.

autoresearch/results/first_successful_run_qwen32B/outputs-84d8f380168440566c406ede5c6afc602883434b9c3b18b54d6140e52e278029.tar renamed to autoresearch/results/qwen32B_6hours_0.5temp/results.json

228 KB
Binary file not shown.
