Commit a5002b4

Merge pull request #3136 from odidev/ray_LP
Scale AI workloads with Ray on Google Cloud C4A Axion VM
2 parents 436fc99 + 9a80c85 commit a5002b4

17 files changed

Lines changed: 771 additions & 0 deletions
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
---
title: Scale AI workloads with Ray on Google Cloud C4A Axion VM

draft: true
cascade:
  draft: true

minutes_to_complete: 30

who_is_this_for: This is an introductory topic for DevOps engineers, ML engineers, and software developers who want to deploy and run distributed workloads using Ray on SUSE Linux Enterprise Server (SLES) for Arm64, execute parallel tasks, perform hyperparameter tuning, and serve models at scale.

learning_objectives:
  - Install and configure Ray on Google Cloud C4A Axion processors for Arm64
  - Run distributed tasks and parallel workloads using Ray Core
  - Perform distributed training and hyperparameter tuning using Ray Train and Ray Tune
  - Deploy scalable APIs using Ray Serve and validate end-to-end execution

prerequisites:
  - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled
  - Basic familiarity with Python and distributed systems concepts

author: Pareena Verma

##### Tags
skilllevels: Introductory
subjects: ML
cloud_service_providers:
  - Google Cloud

armips:
  - Neoverse

tools_software_languages:
  - Ray
  - Python
  - PyTorch

operatingsystems:
  - Linux

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================

further_reading:
  - resource:
      title: Ray official documentation
      link: https://docs.ray.io/
      type: documentation

  - resource:
      title: Ray GitHub repository
      link: https://github.com/ray-project/ray
      type: documentation

  - resource:
      title: PyTorch documentation
      link: https://pytorch.org/docs/stable/index.html
      type: documentation

weight: 1
layout: "learningpathall"
learning_path_main_page: yes
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
---
title: Get started with Ray on Google Axion C4A
weight: 2

layout: "learningpathall"
---

## Explore Axion C4A Arm instances in Google Cloud

Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.

The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.

To learn more, see the Google blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).

## Explore Ray on Google Axion C4A (Arm Neoverse V2)

Ray is an open-source distributed computing framework designed to scale Python applications across multiple cores and nodes. It is widely used for machine learning, data processing, hyperparameter tuning, and model serving.

Ray provides a unified platform with components such as:

* **Ray Core** for parallel and distributed execution
* **Ray Train** for distributed machine learning
* **Ray Tune** for hyperparameter optimization
* **Ray Serve** for scalable model deployment
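As a quick orientation, these components map to imports in the Ray Python package. This is a sketch only, assuming Ray is installed with the relevant extras (for example, `pip install "ray[train,tune,serve]"`); later sections of this Learning Path install and use these components properly:

```python
import ray                 # Ray Core: tasks, actors, and the cluster runtime
from ray import train     # Ray Train: distributed training loops
from ray import tune      # Ray Tune: hyperparameter search
from ray import serve     # Ray Serve: scalable model serving
```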
Running Ray on Google Axion C4A Arm-based infrastructure enables efficient parallel execution of workloads by leveraging multi-core CPUs and a shared-memory architecture. This results in improved performance per watt, reduced infrastructure costs, and better scalability for distributed applications.

Common use cases include distributed machine learning training, hyperparameter tuning, real-time inference serving, data processing pipelines, and building scalable backend services.

To learn more, visit the [Ray documentation](https://docs.ray.io/) and explore the [Ray GitHub repository](https://github.com/ray-project/ray).

## What you've accomplished and what's next

In this section, you:

* Explored Google Axion C4A Arm-based VMs and their performance advantages for distributed workloads
* Reviewed Ray components, including Ray Core, Ray Train, Ray Tune, and Ray Serve
* Learned how the Arm architecture enables efficient parallel execution and scalability

Next, you'll create a firewall rule to enable remote access to the Ray Dashboard and APIs used in this Learning Path.
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Run distributed workloads with Ray
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run distributed workloads with Ray

This section demonstrates how to execute parallel tasks and distributed training workloads using Ray on Arm. You will run simple distributed functions, then scale up to multi-worker training.

## Run distributed tasks

Create a Python script to execute parallel tasks:
```bash
vi ray_test.py
```

```python
import ray

# Connect to the running Ray cluster (starts a local instance if none is running)
ray.init()

# @ray.remote turns a regular function into a distributed task
@ray.remote
def square(x):
    return x * x

# Submit ten tasks asynchronously, then block until all results are ready
results = ray.get([square.remote(i) for i in range(10)])
print("Results:", results)
```
### Explanation

* `ray.init()` → connects to the running Ray cluster
* `@ray.remote` → converts a function into a distributed task
* `square.remote(i)` → submits a task asynchronously and returns an `ObjectRef` immediately
* `ray.get()` → collects results from all workers
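Because `square.remote(i)` returns an `ObjectRef` rather than a value, you can also pass references between tasks and let Ray schedule them as their inputs become ready. Here is a minimal sketch of that pattern; the `add` task is illustrative and not part of the script above:

```python
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
def add(a, b):
    # Ray resolves ObjectRef arguments to their values before the task runs
    return a + b

# Chaining tasks: add runs only after both square tasks complete
ref = add.remote(square.remote(3), square.remote(4))
print(ray.get(ref))  # 25
```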
### Execute the script

```bash
python3 ray_test.py
```

The output is similar to:

```output
Results: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

This confirms parallel execution across CPU cores.
## Run distributed training

Create a script for distributed model training:

```bash
vi ray_train.py
```
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig
import torch

def train_func():
    # Synthetic regression data: 100 samples, 10 features
    x = torch.randn(100, 10)
    y = torch.randn(100, 1)

    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # Each Ray worker runs this training loop in parallel
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        print(f"Loss: {loss.item()}")

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=2,   # run training on 2 workers
        use_gpu=False    # CPU-only on Axion C4A
    )
)

trainer.fit()
```
### Execute training

```bash
python3 ray_train.py
```

The output is similar to:

```output
(TrainController pid=5522) Attempting to start training worker group of size 2 with the following resources: [{'CPU': 1}] * 2
(TrainController pid=5522) Started training worker group of size 2:
(TrainController pid=5522) - (ip=10.0.0.19, pid=5563) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=5522) - (ip=10.0.0.19, pid=5564) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=5563) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=5563) Loss: 0.9711737036705017
(RayTrainWorker pid=5563) Loss: 0.9491967558860779
(RayTrainWorker pid=5563) Loss: 0.9295402765274048
(RayTrainWorker pid=5563) Loss: 0.911673903465271
(RayTrainWorker pid=5563) Loss: 0.895072340965271
(RayTrainWorker pid=5564) Loss: 1.635019063949585 [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```

This confirms distributed training across multiple workers.
### Explanation

* `TorchTrainer` → handles distributed training execution
* `ScalingConfig(num_workers=2)` → runs training on 2 workers
* Each worker executes the training function in parallel
* Logs may appear from multiple processes; Ray deduplicates repeated lines by default
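In a real training job you typically want metrics back from the workers rather than just printed logs. A minimal sketch of that pattern, assuming Ray 2.x, replaces the `print` call with `ray.train.report`, which surfaces per-epoch metrics in the `trainer.fit()` result and the dashboard:

```python
import torch
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    x = torch.randn(100, 10)
    y = torch.randn(100, 1)
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # Report metrics instead of printing; visible in the fit() result
        ray.train.report({"epoch": epoch, "loss": loss.item()})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)  # metrics from the last report call on rank 0
```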
## Ray Jobs view (tasks and training)

![Ray Dashboard Jobs tab showing successful execution of ray_test.py and ray_train.py#center](images/ray-jobs.png "Ray Jobs tab showing distributed tasks and training execution status")

* Each script execution appears as a job
* The status shows **SUCCEEDED**
* This confirms correct distributed execution
## What you've learned and what's next

You have successfully:

* Executed parallel tasks using Ray Core
* Converted functions into distributed workloads
* Performed distributed training using multiple workers
* Observed execution in the Ray Dashboard

Next, you will perform hyperparameter tuning, deploy models, and benchmark performance.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Create a firewall rule for Ray Dashboard and Serve
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Create a firewall rule in the Google Cloud Console to expose the ports required by the Ray Dashboard and the Ray Serve API.

{{% notice Note %}}
For help with GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).
{{% /notice %}}

## Configure the firewall rule

Navigate to the [Google Cloud Console](https://console.cloud.google.com/), go to **VPC Network > Firewall**, and select **Create firewall rule**.

![Google Cloud Console VPC Network Firewall page showing the Create firewall rule button in the top menu bar#center](images/firewall-rule.png "Create a firewall rule in Google Cloud Console")

Next, create the firewall rule that exposes the required ports for Ray.

Set the **Name** of the new rule to "allow-ray-ports" and select the network you intend to attach to your VM.

Set **Direction of traffic** to "Ingress", **Allow on match** to "Allow", and **Targets** to "Specified target tags". Enter "allow-ray-ports" in the **Target tags** text field, and set **Source IPv4 ranges** to "0.0.0.0/0".

![Google Cloud Console Create firewall rule form with Name set to allow-ray-ports and Direction of traffic set to Ingress#center](images/network-rule.png "Configuring the allow-ray-ports firewall rule")

Finally, select **Specified protocols and ports** under the **Protocols and ports** section. Select the **TCP** checkbox and enter:

```text
8265,8000,6379
```
* **8265** → Ray Dashboard
* **8000** → Ray Serve API
* **6379** → Ray head node

Then select **Create**.

![Google Cloud Console Protocols and ports section with TCP ports configured#center](images/network-port.png "Setting Ray ports in the firewall rule")
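If you prefer the command line, an equivalent rule can be created with the gcloud CLI. This is a sketch assuming the default VPC network; adjust `--network` to match the network you selected above:

```bash
gcloud compute firewall-rules create allow-ray-ports \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8265,tcp:8000,tcp:6379 \
  --target-tags=allow-ray-ports \
  --source-ranges=0.0.0.0/0
```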
## What you've accomplished and what's next

In this section, you:

* Created a firewall rule to expose the Ray Dashboard and Serve API
* Enabled external access to monitor jobs and reach deployed services

Next, you'll deploy and run Ray workloads on your Arm-based virtual machine.
