# Multi-Node Multi-GPU Training

AgentJet supports scaling training across multiple machines and GPUs. This guide covers how to set up multi-node training in both **Classic Mode** and **Swarm Mode**.

---

## Prerequisites

Before starting multi-node training, ensure that:

1. All nodes have AgentJet installed and configured (see [Installation](../en/installation.md)).
2. All nodes can communicate with each other over the network.
3. A [Ray](https://docs.ray.io/) cluster is properly set up across all nodes.

### Setting Up the Ray Cluster

You have two options to set up Ray:

=== "Auto Configuration"

    Use the built-in helper to automatically configure Ray based on cluster environment variables:

    ```bash
    ajet --with-ray-cluster
    ```

    This command reads the following environment variables and initializes a Ray cluster automatically:

    | Environment Variable | Description |
    |---|---|
    | `MASTER_ADDR` | The hostname or IP address of the head node. AgentJet compares this with the current node's hostname (`os.uname().nodename`) to determine whether to start a Ray **head** node or a **worker** node. |
    | `MASTER_PORT` | The port used by the Ray head node for cluster communication. |

    **How it works:**

    - If the current node's hostname matches `MASTER_ADDR`, AgentJet starts a **Ray head** node: `ray start --head --node-ip-address=$MASTER_ADDR --port=$MASTER_PORT`
    - Otherwise, AgentJet starts a **Ray worker** node that connects to the head: `ray start --address=$MASTER_ADDR:$MASTER_PORT`

    !!! warning "Cluster Compatibility"
        Currently, `--with-ray-cluster` is designed for clusters that provide `MASTER_ADDR` and `MASTER_PORT` environment variables (e.g., Alibaba PAI DLC). For other cluster schedulers (SLURM, Kubernetes, etc.), you may need to set these environment variables manually or use the manual configuration method below.

=== "Manual Configuration"

    Set up Ray manually by starting a head node and connecting worker nodes:

    ```bash
    # On the head node
    ray start --head --port=6379

    # On each worker node
    ray start --address='<head-node-ip>:6379'
    ```

    Verify the cluster is running:

    ```bash
    ray status
    ```

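Under the hood, `--with-ray-cluster` chooses between the head and worker roles as described above. The snippet below is an illustrative Python sketch of that decision, not AgentJet's actual implementation; the function name is hypothetical:

```python
def build_ray_start_command(hostname: str, master_addr: str, master_port: str) -> list[str]:
    """Return the `ray start` command for this node (illustrative sketch)."""
    if hostname == master_addr:
        # This node is the head: bind Ray's head process to MASTER_ADDR:MASTER_PORT.
        return ["ray", "start", "--head",
                f"--node-ip-address={master_addr}", f"--port={master_port}"]
    # Any other node joins the cluster as a worker.
    return ["ray", "start", f"--address={master_addr}:{master_port}"]


# Example: on the head node, the hostname matches MASTER_ADDR.
print(build_ray_start_command("node-0", "node-0", "6379"))
# ['ray', 'start', '--head', '--node-ip-address=node-0', '--port=6379']
```

In a real run, `hostname` would come from `os.uname().nodename` and the other two arguments from the `MASTER_ADDR` / `MASTER_PORT` environment variables.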
---

## Classic Mode

In Classic Mode, multi-node training is straightforward. After the Ray cluster is ready, simply update your YAML configuration to specify the number of nodes and GPUs per node.

### Step 1: Configure Ray Cluster

Set up the Ray cluster using either method above.

### Step 2: Update Training Configuration

Modify your YAML config to specify the multi-node topology:

```yaml
ajet:
  trainer_common:
    nnodes: 4 # number of machines
    n_gpus_per_node: 8 # number of GPUs per machine
```

### Step 3: Launch Training

Run training as usual:

```bash
ajet --conf your_config.yaml --backbone='verl'
```

AgentJet will automatically distribute the workload across all 4 nodes (32 GPUs total in this example).

!!! tip "Scaling Tips"
    - Set `nnodes` to the total number of machines in your Ray cluster.
    - Set `n_gpus_per_node` to match the number of GPUs available on each machine.
    - Ensure all nodes have identical GPU configurations for optimal performance.

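Before launching, it can help to sanity-check that the configured topology matches what the Ray cluster actually reports. The helper below is an illustrative sketch, not part of the AgentJet API; in practice the `resources` dict would come from `ray.cluster_resources()`:

```python
def check_topology(nnodes: int, n_gpus_per_node: int, resources: dict) -> int:
    """Verify the cluster has enough GPUs for the configured topology.

    Returns the total number of GPU workers (world size) on success.
    """
    required = nnodes * n_gpus_per_node
    available = int(resources.get("GPU", 0))
    if available < required:
        raise RuntimeError(
            f"Config asks for {required} GPUs ({nnodes} nodes x "
            f"{n_gpus_per_node} GPUs each) but Ray reports only {available}."
        )
    return required


# Example with the 4-node, 8-GPU topology above:
world_size = check_topology(4, 8, {"CPU": 512.0, "GPU": 32.0})
print(world_size)  # 32
```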
---

## Swarm Mode

In Swarm Mode, multi-node training allows you to launch a distributed swarm server that spans multiple GPU machines, enabling training of larger models.

### Step 1: Configure Ray Cluster

Set up the Ray cluster across all GPU nodes using either method above.

### Step 2: Start the Swarm Server

Launch the swarm server on the head node:

```bash
ajet-swarm start
```

The swarm server will leverage the entire Ray cluster for model hosting and training.

### Step 3: Submit a Multi-Node Job from the Client

From any machine (including a GPU-less laptop), submit a training job with a multi-node configuration:

```python
from ajet_swarm import AgentJetJob

ajet_job = AgentJetJob(
    base_yaml_config="tutorial/example_werewolves_swarm/werewolves.yaml",
    # the YAML config should also set nnodes > 1, e.g.:
    # ajet:
    #   trainer_common:
    #     nnodes: 4
    #     n_gpus_per_node: 8
)
```

The YAML referenced by `base_yaml_config` should contain the same multi-node settings:

```yaml
ajet:
  trainer_common:
    nnodes: 4 # number of machines
    n_gpus_per_node: 8 # number of GPUs per machine
```

!!! info "Swarm Advantage"
    With Swarm Mode, you can submit multi-node training jobs remotely without direct access to the GPU cluster. The swarm server coordinates all distributed training internally.

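To build intuition for the topology above, the 32 global worker ranks can be pictured as (node index, local GPU index) pairs. The helper below is purely illustrative, assuming a simple row-major placement, and is not part of the AgentJet API:

```python
def rank_to_placement(global_rank: int, n_gpus_per_node: int) -> tuple[int, int]:
    """Map a global worker rank to a (node index, local GPU index) pair."""
    # divmod gives the node as the quotient and the local GPU as the remainder.
    return divmod(global_rank, n_gpus_per_node)


# With nnodes=4 and n_gpus_per_node=8, ranks 0..31 span 4 machines:
print(rank_to_placement(0, 8))   # (0, 0)  first GPU on the first node
print(rank_to_placement(11, 8))  # (1, 3)  fourth GPU on the second node
print(rank_to_placement(31, 8))  # (3, 7)  last GPU on the last node
```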
|
---

## Summary

| | Classic Mode | Swarm Mode |
|---|---|---|
| **Ray Setup** | Required on all nodes | Required on all GPU nodes |
| **Config** | Set `nnodes` and `n_gpus_per_node` in YAML | Same YAML config, submitted via `AgentJetJob` |
| **Launch** | `ajet --conf ...` on head node | `ajet-swarm start` on head node, submit from client |
| **Remote Submit** | Not supported | Supported (GPU-less laptop) |