Commit a5002b4

Merge pull request #3136 from odidev/ray_LP
Scale AI workloads with Ray on Google Cloud C4A Axion VM
2 parents 436fc99 + 9a80c85 commit a5002b4

17 files changed

Lines changed: 771 additions & 0 deletions
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
---
title: Scale AI workloads with Ray on Google Cloud C4A Axion VM

draft: true
cascade:
  draft: true

minutes_to_complete: 30

who_is_this_for: This is an introductory topic for DevOps engineers, ML engineers, and software developers who want to deploy and run distributed workloads using Ray on SUSE Linux Enterprise Server (SLES) for Arm64, execute parallel tasks, perform hyperparameter tuning, and serve models at scale.

learning_objectives:
  - Install and configure Ray on Google Cloud C4A Axion processors for Arm64
  - Run distributed tasks and parallel workloads using Ray Core
  - Perform distributed training and hyperparameter tuning using Ray Train and Ray Tune
  - Deploy scalable APIs using Ray Serve and validate end-to-end execution

prerequisites:
  - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled
  - Basic familiarity with Python and distributed systems concepts

author: Pareena Verma

##### Tags
skilllevels: Introductory
subjects: ML
cloud_service_providers:
  - Google Cloud

armips:
  - Neoverse

tools_software_languages:
  - Ray
  - Python
  - PyTorch

operatingsystems:
  - Linux

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================

further_reading:
  - resource:
      title: Ray official documentation
      link: https://docs.ray.io/
      type: documentation

  - resource:
      title: Ray GitHub repository
      link: https://github.com/ray-project/ray
      type: documentation

  - resource:
      title: PyTorch documentation
      link: https://pytorch.org/docs/stable/index.html
      type: documentation

weight: 1
layout: "learningpathall"
learning_path_main_page: yes
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
---
title: Get started with Ray on Google Axion C4A
weight: 2

layout: "learningpathall"
---

## Explore Axion C4A Arm instances in Google Cloud

Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.

The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.

To learn more, see the Google blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).

## Explore Ray on Google Axion C4A (Arm Neoverse V2)

Ray is an open-source distributed computing framework designed to scale Python applications across multiple cores and nodes. It is widely used for machine learning, data processing, hyperparameter tuning, and model serving.

Ray provides a unified platform with components such as:

* **Ray Core** for parallel and distributed execution
* **Ray Train** for distributed machine learning
* **Ray Tune** for hyperparameter optimization
* **Ray Serve** for scalable model deployment
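As a quick orientation, these components map to imports in the Ray Python package. This is a sketch only, assuming Ray is installed with the relevant extras (for example, `pip install "ray[train,tune,serve]"`); later sections of this Learning Path install and use these components properly:

```python
import ray                 # Ray Core: tasks, actors, and the cluster runtime
from ray import train     # Ray Train: distributed training loops
from ray import tune      # Ray Tune: hyperparameter search
from ray import serve     # Ray Serve: scalable model serving
```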
Running Ray on Google Axion C4A Arm-based infrastructure enables efficient parallel execution of workloads by leveraging multi-core CPUs and a shared-memory architecture. This results in improved performance per watt, reduced infrastructure costs, and better scalability for distributed applications.

Common use cases include distributed machine learning training, hyperparameter tuning, real-time inference serving, data processing pipelines, and building scalable backend services.

To learn more, visit the [Ray documentation](https://docs.ray.io/) and explore the [Ray GitHub repository](https://github.com/ray-project/ray).

## What you've accomplished and what's next

In this section, you:

* Explored Google Axion C4A Arm-based VMs and their performance advantages for distributed workloads
* Reviewed Ray components, including Ray Core, Ray Train, Ray Tune, and Ray Serve
* Learned how the Arm architecture enables efficient parallel execution and scalability

Next, you'll create a firewall rule to enable remote access to the Ray Dashboard and APIs used in this Learning Path.
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Run distributed workloads with Ray
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run distributed workloads with Ray

This section demonstrates how to execute parallel tasks and distributed training workloads using Ray on Arm. You will run simple distributed functions, then scale up to multi-worker training.

## Run distributed tasks

Create a Python script to execute parallel tasks:
```bash
vi ray_test.py
```

```python
import ray

# Connect to the running Ray cluster (starts a local instance if none is running)
ray.init()

# @ray.remote turns a regular function into a distributed task
@ray.remote
def square(x):
    return x * x

# Submit ten tasks asynchronously, then block until all results are ready
results = ray.get([square.remote(i) for i in range(10)])
print("Results:", results)
```
### Explanation

* `ray.init()` → connects to the running Ray cluster
* `@ray.remote` → converts a function into a distributed task
* `square.remote(i)` → submits a task asynchronously and returns an `ObjectRef` immediately
* `ray.get()` → collects results from all workers
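Because `square.remote(i)` returns an `ObjectRef` rather than a value, you can also pass references between tasks and let Ray schedule them as their inputs become ready. Here is a minimal sketch of that pattern; the `add` task is illustrative and not part of the script above:

```python
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
def add(a, b):
    # Ray resolves ObjectRef arguments to their values before the task runs
    return a + b

# Chaining tasks: add runs only after both square tasks complete
ref = add.remote(square.remote(3), square.remote(4))
print(ray.get(ref))  # 25
```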
### Execute the script

```bash
python3 ray_test.py
```

The output is similar to:

```output
Results: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

This confirms parallel execution across CPU cores.
## Run distributed training

Create a script for distributed model training:

```bash
vi ray_train.py
```
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig
import torch

def train_func():
    # Synthetic regression data: 100 samples, 10 features
    x = torch.randn(100, 10)
    y = torch.randn(100, 1)

    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # Each Ray worker runs this training loop in parallel
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        print(f"Loss: {loss.item()}")

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=2,   # run training on 2 workers
        use_gpu=False    # CPU-only on Axion C4A
    )
)

trainer.fit()
```
### Execute training

```bash
python3 ray_train.py
```

The output is similar to:

```output
(TrainController pid=5522) Attempting to start training worker group of size 2 with the following resources: [{'CPU': 1}] * 2
(TrainController pid=5522) Started training worker group of size 2:
(TrainController pid=5522) - (ip=10.0.0.19, pid=5563) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=5522) - (ip=10.0.0.19, pid=5564) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=5563) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=5563) Loss: 0.9711737036705017
(RayTrainWorker pid=5563) Loss: 0.9491967558860779
(RayTrainWorker pid=5563) Loss: 0.9295402765274048
(RayTrainWorker pid=5563) Loss: 0.911673903465271
(RayTrainWorker pid=5563) Loss: 0.895072340965271
(RayTrainWorker pid=5564) Loss: 1.635019063949585 [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```

This confirms distributed training across multiple workers.
### Explanation

* `TorchTrainer` → handles distributed training execution
* `ScalingConfig(num_workers=2)` → runs training on 2 workers
* Each worker executes the training function in parallel
* Logs may appear from multiple processes; Ray deduplicates repeated lines by default
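In a real training job you typically want metrics back from the workers rather than just printed logs. A minimal sketch of that pattern, assuming Ray 2.x, replaces the `print` call with `ray.train.report`, which surfaces per-epoch metrics in the `trainer.fit()` result and the dashboard:

```python
import torch
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    x = torch.randn(100, 10)
    y = torch.randn(100, 1)
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # Report metrics instead of printing; visible in the fit() result
        ray.train.report({"epoch": epoch, "loss": loss.item()})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)  # metrics from the last report call on rank 0
```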
## Ray Jobs view (tasks and training)

![Ray Dashboard Jobs tab showing successful execution of ray_test.py and ray_train.py#center](images/ray-jobs.png "Ray Jobs tab showing distributed tasks and training execution status")

* Each script execution appears as a job
* The status shows **SUCCEEDED**
* This confirms correct distributed execution
## What you've learned and what's next

You have successfully:

* Executed parallel tasks using Ray Core
* Converted functions into distributed workloads
* Performed distributed training using multiple workers
* Observed execution in the Ray Dashboard

Next, you will perform hyperparameter tuning, deploy models, and benchmark performance.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Create a firewall rule for Ray Dashboard and Serve
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Create a firewall rule in the Google Cloud Console to expose the ports required by the Ray Dashboard and the Ray Serve API.

{{% notice Note %}}
For help with GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).
{{% /notice %}}

## Configure the firewall rule

Navigate to the [Google Cloud Console](https://console.cloud.google.com/), go to **VPC Network > Firewall**, and select **Create firewall rule**.

![Google Cloud Console VPC Network Firewall page showing the Create firewall rule button in the top menu bar#center](images/firewall-rule.png "Create a firewall rule in Google Cloud Console")

Next, create the firewall rule that exposes the required ports for Ray.

Set the **Name** of the new rule to "allow-ray-ports" and select the network you intend to attach to your VM.

Set **Direction of traffic** to "Ingress", **Allow on match** to "Allow", and **Targets** to "Specified target tags". Enter "allow-ray-ports" in the **Target tags** text field, and set **Source IPv4 ranges** to "0.0.0.0/0".

![Google Cloud Console Create firewall rule form with Name set to allow-ray-ports and Direction of traffic set to Ingress#center](images/network-rule.png "Configuring the allow-ray-ports firewall rule")

Finally, select **Specified protocols and ports** under the **Protocols and ports** section. Select the **TCP** checkbox and enter:

```text
8265,8000,6379
```
* **8265** → Ray Dashboard
* **8000** → Ray Serve API
* **6379** → Ray head node

Then select **Create**.

![Google Cloud Console Protocols and ports section with TCP ports configured#center](images/network-port.png "Setting Ray ports in the firewall rule")
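If you prefer the command line, an equivalent rule can be created with the gcloud CLI. This is a sketch assuming the default VPC network; adjust `--network` to match the network you selected above:

```bash
gcloud compute firewall-rules create allow-ray-ports \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8265,tcp:8000,tcp:6379 \
  --target-tags=allow-ray-ports \
  --source-ranges=0.0.0.0/0
```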
## What you've accomplished and what's next

In this section, you:

* Created a firewall rule to expose the Ray Dashboard and Serve API
* Enabled external access to monitor jobs and reach deployed services

Next, you'll deploy and run Ray workloads on your Arm-based virtual machine.
