Commit 73fc160 (parent: 32de383)

Update documentation and improve .gitignore for multi-node training support

23 files changed

Lines changed: 176 additions & 20 deletions

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -179,3 +179,4 @@ node_modules
 skills-lock.json
 blueprint*
 tmux_wait.py
+qwen2*

README.md

Lines changed: 1 addition & 1 deletion

@@ -169,7 +169,7 @@ When swarm training mode is enabled, an additional component will be activated:

 ### ✈️ Navigation

-* **Tutorials**: From [Installation](https://modelscope.github.io/AgentJet/en/installation) to [Tuning your first agent](https://modelscope.github.io/AgentJet/en/tune_your_first_agent) — the essential path for beginners.
+* **Tutorials**: From [Installation](https://modelscope.github.io/AgentJet/en/installation) to [Tuning your first agent](https://modelscope.github.io/AgentJet/en/tune_your_first_agent) to [Multi-Node Training](https://modelscope.github.io/AgentJet/en/multi_node_training) — the essential path for beginners.
 * **Core Components**: Define your [Trainable Workflow](https://modelscope.github.io/AgentJet/en/workflow) and manage [Data](https://modelscope.github.io/AgentJet/en/data_pipeline) and [Reward](https://modelscope.github.io/AgentJet/en/task_judger).
 * **Example**: Check the [Example Library](https://modelscope.github.io/AgentJet/#example-library) above for real-world cases like [Math](https://modelscope.github.io/AgentJet/en/example_math_agent), [Werewolves game](https://modelscope.github.io/AgentJet/en/example_werewolves) and [Learning to ask task](https://modelscope.github.io/AgentJet/en/example_learning_to_ask).
 * **Deep Dive**: Master advanced [Configuration](https://modelscope.github.io/AgentJet/en/configuration).

ajet/copilot/monitor-with-tmux/SKILL.md

Lines changed: 1 addition & 1 deletion

@@ -178,4 +178,4 @@ $ python3 /tmp/tmux_wait.py ajet_session 240 && tmux capture-pane -t ajet_sessio
 # Destroy tmux session
 tmux kill-session -t ajet_session
-```
+```

docs/en/blog_vibe_training_werewolves.md

Lines changed: 1 addition & 1 deletion

@@ -95,7 +95,7 @@ The code follows a simple three-step pattern that defines all Swarm Client progr

 ## Architecture: How It Works

-```
+```text
 Swarm Server (GPU Cluster, 8x GPUs)
 ├── Qwen2.5-7B-Instruct (trainable)
 ├── vLLM inference engine

docs/en/blog_vibe_training_werewolves.zh.md

Lines changed: 1 addition & 1 deletion

@@ -95,7 +95,7 @@ def execute_agent(task: Task, api_baseurl_key: OpenaiBaseUrlAndApiKey):

 ## Architecture: How It Works

-```
+```text
 Swarm Server (GPU cluster, 8 GPUs)
 ├── Qwen2.5-7B-Instruct (trainable)
 ├── vLLM inference engine

docs/en/lora_training.md

Lines changed: 1 addition & 1 deletion

@@ -132,4 +132,4 @@ merged_model = lora_model.merge_and_unload()
 <a href="../example_math_agent/" class="feature-card"><div class="card-header"><img src="https://api.iconify.design/mdi:calculator.svg" class="card-icon card-icon-general" alt=""><h3>Math Agent</h3></div><p class="card-desc">Train a tool-using math reasoning agent.</p></a>
 <a href="../tune_your_first_agent/" class="feature-card"><div class="card-header"><img src="https://api.iconify.design/mdi:rocket-launch.svg" class="card-icon card-icon-general" alt=""><h3>Tune First Agent</h3></div><p class="card-desc">Get started with AgentJet training.</p></a>
 <a href="../configuration/" class="feature-card"><div class="card-header"><img src="https://api.iconify.design/mdi:cog.svg" class="card-icon card-icon-general" alt=""><h3>Configuration</h3></div><p class="card-desc">Deep dive into config options.</p></a>
-</div>
+</div>

docs/en/multi_node_training.md (new file)

Lines changed: 154 additions & 0 deletions

@@ -0,0 +1,154 @@

# Multi-Node Multi-GPU Training

AgentJet supports scaling training across multiple machines and GPUs. This guide covers how to set up multi-node training in both **Classic Mode** and **Swarm Mode**.

---

## Prerequisites

Before starting multi-node training, ensure that:

1. All nodes have AgentJet installed and configured (see [Installation](../en/installation.md)).
2. All nodes can communicate with each other over the network.
3. A [Ray](https://docs.ray.io/) cluster is properly set up across all nodes.

### Setting Up the Ray Cluster

You have two options to set up Ray:

=== "Auto Configuration"

    Use the built-in helper to automatically configure Ray based on cluster environment variables:

    ```bash
    ajet --with-ray-cluster
    ```

    This command reads the following environment variables and initializes a Ray cluster automatically:

    | Environment Variable | Description |
    |---|---|
    | `MASTER_ADDR` | The hostname or IP address of the head node. AgentJet compares this with the current node's hostname (`os.uname().nodename`) to determine whether to start a Ray **head** node or a **worker** node. |
    | `MASTER_PORT` | The port used by the Ray head node for cluster communication. |

    **How it works:**

    - If the current node's hostname matches `MASTER_ADDR`, AgentJet starts a **Ray head** node: `ray start --head --node-ip-address=$MASTER_ADDR --port=$MASTER_PORT`
    - Otherwise, AgentJet starts a **Ray worker** node that connects to the head: `ray start --address=$MASTER_ADDR:$MASTER_PORT`

    !!! warning "Cluster Compatibility"
        Currently, `--with-ray-cluster` is designed for clusters that provide `MASTER_ADDR` and `MASTER_PORT` environment variables (e.g., Alibaba PAI DLC). For other cluster schedulers (SLURM, Kubernetes, etc.), you may need to set these environment variables manually or use the manual configuration method below.

=== "Manual Configuration"

    Set up Ray manually by starting a head node and connecting worker nodes:

    ```bash
    # On the head node
    ray start --head --port=6379

    # On each worker node
    ray start --address='<head-node-ip>:6379'
    ```

    Verify the cluster is running:

    ```bash
    ray status
    ```
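The head-vs-worker decision made by Auto Configuration can be sketched in a few lines of Python. This is an illustrative sketch of the documented behavior, not AgentJet's actual implementation; the `ray_start_command` helper and the `HOSTNAME` fallback are assumptions made for the example:

```python
import os

def ray_start_command(env: dict) -> str:
    """Return the `ray start` command a node should run, based on whether
    its hostname matches MASTER_ADDR (illustrative sketch, not AgentJet code)."""
    master_addr = env["MASTER_ADDR"]
    master_port = env["MASTER_PORT"]
    # The current node's hostname; the real flow uses os.uname().nodename.
    hostname = env.get("HOSTNAME", os.uname().nodename)
    if hostname == master_addr:
        # Head node: bind Ray to the master address and port.
        return f"ray start --head --node-ip-address={master_addr} --port={master_port}"
    # Any other node joins the existing head as a worker.
    return f"ray start --address={master_addr}:{master_port}"

# On the head node:
print(ray_start_command({"MASTER_ADDR": "node-0", "MASTER_PORT": "6379", "HOSTNAME": "node-0"}))
# On a worker node:
print(ray_start_command({"MASTER_ADDR": "node-0", "MASTER_PORT": "6379", "HOSTNAME": "node-1"}))
```

Running this on the head node prints the `--head` variant; every other node gets the join command.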
---

## Classic Mode

In Classic Mode, multi-node training is straightforward. After the Ray cluster is ready, simply update your YAML configuration to specify the number of nodes and GPUs per node.

### Step 1: Configure Ray Cluster

Set up the Ray cluster using either method above.

### Step 2: Update Training Configuration

Modify your YAML config to specify the multi-node topology:

```yaml
ajet:
  trainer_common:
    nnodes: 4            # number of machines
    n_gpus_per_node: 8   # number of GPUs per machine
```

### Step 3: Launch Training

Run training as usual:

```bash
ajet --conf your_config.yaml --backbone='verl'
```

AgentJet will automatically distribute the workload across all 4 nodes (32 GPUs total in this example).

!!! tip "Scaling Tips"
    - Set `nnodes` to the total number of machines in your Ray cluster.
    - Set `n_gpus_per_node` to match the number of GPUs available on each machine.
    - Ensure all nodes have identical GPU configurations for optimal performance.
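With `nnodes: 4` and `n_gpus_per_node: 8`, the run spans a flat space of 32 global ranks. A small sketch of the conventional row-major rank-to-device layout (illustrative only; this is how most distributed launchers number GPUs, not a dump of AgentJet internals):

```python
def rank_to_device(global_rank: int, n_gpus_per_node: int = 8) -> tuple[int, int]:
    """Map a global worker rank to (node index, local GPU index),
    assuming the row-major layout most distributed launchers use."""
    return divmod(global_rank, n_gpus_per_node)

print(rank_to_device(0))   # (0, 0): first GPU on the first node
print(rank_to_device(31))  # (3, 7): last GPU on the fourth node
```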
---

## Swarm Mode

In Swarm Mode, multi-node training allows you to launch a distributed swarm server that spans multiple GPU machines, enabling training of larger models.

### Step 1: Configure Ray Cluster

Set up the Ray cluster across all GPU nodes using either method above.

### Step 2: Start the Swarm Server

Launch the swarm server on the head node:

```bash
ajet-swarm start
```

The swarm server will leverage the entire Ray cluster for model hosting and training.

### Step 3: Submit a Multi-Node Job from Client

From any machine (including a GPU-less laptop), submit a training job with multi-node configuration:

```python
from ajet_swarm import AgentJetJob

ajet_job = AgentJetJob(
    base_yaml_config="tutorial/example_werewolves_swarm/werewolves.yaml",
    # the YAML config should also set nnodes > 1, e.g.:
    # ajet:
    #   trainer_common:
    #     nnodes: 4
    #     n_gpus_per_node: 8
)
```

The YAML referenced by `base_yaml_config` should contain the same multi-node settings:

```yaml
ajet:
  trainer_common:
    nnodes: 4            # number of machines
    n_gpus_per_node: 8   # number of GPUs per machine
```
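Since the server trusts the YAML for topology, a client-side pre-flight check can catch a forgotten `nnodes` override before submitting. A minimal sketch; the `check_topology` helper is hypothetical (not part of AgentJet) and assumes the YAML has already been loaded into a plain dict:

```python
def check_topology(config: dict) -> int:
    """Raise early if the config does not describe a multi-node run.
    Hypothetical pre-flight helper, not an AgentJet API."""
    tc = config.get("ajet", {}).get("trainer_common", {})
    nnodes = tc.get("nnodes", 1)
    n_gpus = tc.get("n_gpus_per_node", 1)
    if nnodes < 2:
        raise ValueError(f"expected a multi-node config, got nnodes={nnodes}")
    # Return the implied world size for logging purposes.
    return nnodes * n_gpus

# The example topology from the YAML above implies 32 GPUs:
print(check_topology({"ajet": {"trainer_common": {"nnodes": 4, "n_gpus_per_node": 8}}}))
```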
141+
142+
!!! info "Swarm Advantage"
143+
With Swarm Mode, you can submit multi-node training jobs remotely without direct access to the GPU cluster. The swarm server coordinates all distributed training internally.
144+
145+
---
146+
147+
## Summary
148+
149+
| | Classic Mode | Swarm Mode |
150+
|---|---|---|
151+
| **Ray Setup** | Required on all nodes | Required on all GPU nodes |
152+
| **Config** | Set `nnodes` and `n_gpus_per_node` in YAML | Same YAML config, submitted via `AgentJetJob` |
153+
| **Launch** | `ajet --conf ...` on head node | `ajet-swarm start` on head node, submit from client |
154+
| **Remote Submit** | Not supported | Supported (GPU-less laptop) |

mkdocs.yml

Lines changed: 1 addition & 0 deletions

@@ -29,6 +29,7 @@ nav:
 - Tune First Agent: en/tune_your_first_agent.md
 - Swarm Training Intro: en/swarm_intro_blog_en.md
 - Agentic Frameworks: en/agent_framework_support.md
+- Multi-Node Training: en/multi_node_training.md

 - • Classic Examples:
 - Math Agent: en/example_math_agent.md
tests/bench/benchmark_appworldlora/benchmark_appworldlora.py

Lines changed: 1 addition & 1 deletion

@@ -27,4 +27,4 @@ def __init__(self):
     self.probe_key = "reward_probe"

 def __call__(self, key, log_dict):
-    return super().__call__(key, log_dict)
+    return super().__call__(key, log_dict)

tests/bench/benchmark_appworldlora/benchmark_appworldlora.yaml

Lines changed: 1 addition & 1 deletion

@@ -81,4 +81,4 @@ defaults:
 - verl_default # verl inherit 1/1
 - trinity_default # trinity inherit 1/1
 - ajet_default
-- _self_
+- _self_
