# Multi-Node Multi-GPU Training

AgentJet supports scaling training across multiple machines and GPUs. This guide covers how to set up multi-node training in both **Classic Mode** and **Swarm Mode**.

---

## Prerequisites

Before starting multi-node training, ensure that:

1. All nodes have AgentJet installed and configured (see [Installation](../en/installation.md)).
2. All nodes can communicate with each other over the network.
3. A [Ray](https://docs.ray.io/) cluster is properly set up across all nodes.

### Setting Up the Ray Cluster

You have two options to set up Ray:

=== "Auto Configuration"

    Use the built-in helper to automatically configure Ray based on cluster environment variables:

    ```bash
    ajet --with-ray-cluster
    ```

    This command reads the following environment variables and initializes a Ray cluster automatically:

    | Environment Variable | Description |
    |---|---|
    | `MASTER_ADDR` | The hostname or IP address of the head node. AgentJet compares this with the current node's hostname (`os.uname().nodename`) to determine whether to start a Ray **head** node or a **worker** node. |
    | `MASTER_PORT` | The port used by the Ray head node for cluster communication. |

    **How it works:**

    - If the current node's hostname matches `MASTER_ADDR`, AgentJet starts a **Ray head** node: `ray start --head --node-ip-address=$MASTER_ADDR --port=$MASTER_PORT`
    - Otherwise, AgentJet starts a **Ray worker** node that connects to the head: `ray start --address=$MASTER_ADDR:$MASTER_PORT`

    !!! warning "Cluster Compatibility"
        Currently, `--with-ray-cluster` is designed for clusters that provide `MASTER_ADDR` and `MASTER_PORT` environment variables (e.g., Alibaba PAI DLC). For other cluster schedulers (SLURM, Kubernetes, etc.), you may need to set these environment variables manually or use the manual configuration method below.

=== "Manual Configuration"

    Set up Ray manually by starting a head node and connecting worker nodes:

    ```bash
    # On the head node
    ray start --head --port=6379

    # On each worker node
    ray start --address='<head-node-ip>:6379'
    ```

    Verify the cluster is running:

    ```bash
    ray status
    ```

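Under the hood, `--with-ray-cluster` chooses between the head and worker roles as described above. The snippet below is an illustrative Python sketch of that decision, not AgentJet's actual implementation; the function name is hypothetical:

```python
def build_ray_start_command(hostname: str, master_addr: str, master_port: str) -> list[str]:
    """Return the `ray start` command for this node (illustrative sketch)."""
    if hostname == master_addr:
        # This node is the head: bind Ray's head process to MASTER_ADDR:MASTER_PORT.
        return ["ray", "start", "--head",
                f"--node-ip-address={master_addr}", f"--port={master_port}"]
    # Any other node joins the cluster as a worker.
    return ["ray", "start", f"--address={master_addr}:{master_port}"]


# Example: on the head node, the hostname matches MASTER_ADDR.
print(build_ray_start_command("node-0", "node-0", "6379"))
# ['ray', 'start', '--head', '--node-ip-address=node-0', '--port=6379']
```

In a real run, `hostname` would come from `os.uname().nodename` and the other two arguments from the `MASTER_ADDR` / `MASTER_PORT` environment variables.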
---

## Classic Mode

In Classic Mode, multi-node training is straightforward. After the Ray cluster is ready, simply update your YAML configuration to specify the number of nodes and GPUs per node.

### Step 1: Configure Ray Cluster

Set up the Ray cluster using either method above.

### Step 2: Update Training Configuration

Modify your YAML config to specify the multi-node topology:

```yaml
ajet:
  trainer_common:
    nnodes: 4 # number of machines
    n_gpus_per_node: 8 # number of GPUs per machine
```

### Step 3: Launch Training

Run training as usual:

```bash
ajet --conf your_config.yaml --backbone='verl'
```

AgentJet will automatically distribute the workload across all 4 nodes (32 GPUs total in this example).

!!! tip "Scaling Tips"
    - Set `nnodes` to the total number of machines in your Ray cluster.
    - Set `n_gpus_per_node` to match the number of GPUs available on each machine.
    - Ensure all nodes have identical GPU configurations for optimal performance.

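Before launching, it can help to sanity-check that the configured topology matches what the Ray cluster actually reports. The helper below is an illustrative sketch, not part of the AgentJet API; in practice the `resources` dict would come from `ray.cluster_resources()`:

```python
def check_topology(nnodes: int, n_gpus_per_node: int, resources: dict) -> int:
    """Verify the cluster has enough GPUs for the configured topology.

    Returns the total number of GPU workers (world size) on success.
    """
    required = nnodes * n_gpus_per_node
    available = int(resources.get("GPU", 0))
    if available < required:
        raise RuntimeError(
            f"Config asks for {required} GPUs ({nnodes} nodes x "
            f"{n_gpus_per_node} GPUs each) but Ray reports only {available}."
        )
    return required


# Example with the 4-node, 8-GPU topology above:
world_size = check_topology(4, 8, {"CPU": 512.0, "GPU": 32.0})
print(world_size)  # 32
```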
---

## Swarm Mode

In Swarm Mode, multi-node training allows you to launch a distributed swarm server that spans multiple GPU machines, enabling training of larger models.

### Step 1: Configure Ray Cluster

Set up the Ray cluster across all GPU nodes using either method above.

### Step 2: Start the Swarm Server

Launch the swarm server on the head node:

```bash
ajet-swarm start
```

The swarm server will leverage the entire Ray cluster for model hosting and training.

### Step 3: Submit a Multi-Node Job from the Client

From any machine (including a GPU-less laptop), submit a training job with a multi-node configuration:

```python
from ajet_swarm import AgentJetJob

ajet_job = AgentJetJob(
    base_yaml_config="tutorial/example_werewolves_swarm/werewolves.yaml",
    # the YAML config should also set nnodes > 1, e.g.:
    # ajet:
    #   trainer_common:
    #     nnodes: 4
    #     n_gpus_per_node: 8
)
```

The YAML referenced by `base_yaml_config` should contain the same multi-node settings:

```yaml
ajet:
  trainer_common:
    nnodes: 4 # number of machines
    n_gpus_per_node: 8 # number of GPUs per machine
```

!!! info "Swarm Advantage"
    With Swarm Mode, you can submit multi-node training jobs remotely without direct access to the GPU cluster. The swarm server coordinates all distributed training internally.

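To build intuition for the topology above, the 32 global worker ranks can be pictured as (node index, local GPU index) pairs. The helper below is purely illustrative, assuming a simple row-major placement, and is not part of the AgentJet API:

```python
def rank_to_placement(global_rank: int, n_gpus_per_node: int) -> tuple[int, int]:
    """Map a global worker rank to a (node index, local GPU index) pair."""
    # divmod gives the node as the quotient and the local GPU as the remainder.
    return divmod(global_rank, n_gpus_per_node)


# With nnodes=4 and n_gpus_per_node=8, ranks 0..31 span 4 machines:
print(rank_to_placement(0, 8))   # (0, 0)  first GPU on the first node
print(rank_to_placement(11, 8))  # (1, 3)  fourth GPU on the second node
print(rank_to_placement(31, 8))  # (3, 7)  last GPU on the last node
```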
|
---

## Summary

| | Classic Mode | Swarm Mode |
|---|---|---|
| **Ray Setup** | Required on all nodes | Required on all GPU nodes |
| **Config** | Set `nnodes` and `n_gpus_per_node` in YAML | Same YAML config, submitted via `AgentJetJob` |
| **Launch** | `ajet --conf ...` on head node | `ajet-swarm start` on head node, submit from client |
| **Remote Submit** | Not supported | Supported (GPU-less laptop) |