Commit d4bd5c1

Update "Shared Pathways Service" README (#132)
Add elaborate instructions to validate that the service components are running.
1 parent fc736dd

2 files changed: 104 additions & 35 deletions

pathwaysutils/experimental/shared_pathways_service/README.md

Lines changed: 103 additions & 35 deletions
@@ -8,54 +8,122 @@ service that manages scheduling and error handling.
## Requirements

### 1. Create a GKE cluster with TPUs

Make sure you have a GKE cluster with at least one TPU slice (v5e, v5p, or
v6e).

<a name="pw-service-yaml"></a>

### 2. Deploy the Pathways head pod

Start the Shared Pathways Service by using
[pw-service-example.yaml](yamls/pw-service-example.yaml). Make sure to modify
the following values before deploying the Pathways pods:

- A unique Jobset name for the head pod
- GCS bucket path
- TPU type and topology
- Number of slices

These fields are highlighted in the YAML file with trailing comments.

### 3. Verify that the pods created in [Step#2](#2-deploy-the-pathways-head-pod) are running

Verify that the Shared Pathways Service components have started, specifically
the Pathways resource manager (RM) and the Pathways workers.

```shell
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region>  # e.g., us-central2

# Get credentials for your cluster and switch to the default namespace.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT
$ kubectl config view
$ kubectl config set-context --current --namespace=default
```

#### Option 1: List all pods

```shell
$ kubectl get pods

# Sample expected output (1 head pod and 1 or more worker pods)
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
```
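To check that listing programmatically rather than by eye, the expected counts can be derived from the `kubectl get pods` output. A small offline sketch using the sample pod names from this README (piping real `kubectl get pods --no-headers` output works the same way):

```shell
# Stand-in for `kubectl get pods --no-headers` output (sample names above).
SAMPLE='pathways-cluster-pathways-head-0-0-zzmn2   2/2   Running   0   3m49s
pathways-cluster-worker-0-0-bdzq4          1/1   Running   0   3m36s
pathways-cluster-worker-1-0-km2rf          1/1   Running   0   3m36s'

HEAD_COUNT=$(echo "$SAMPLE" | grep -c 'pathways-head')   # expect exactly 1
WORKER_COUNT=$(echo "$SAMPLE" | grep -c 'worker')        # expect 1 or more
NOT_RUNNING=$(echo "$SAMPLE" | grep -cv ' Running ')     # expect 0

echo "head=$HEAD_COUNT workers=$WORKER_COUNT not_running=$NOT_RUNNING"
# Prints: head=1 workers=2 not_running=0
```

If `not_running` is not zero or the head pod is missing, see the troubleshooting guide linked below.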

#### Option 2: Check the status of the specific pods that belong to your Pathways Service

```shell
# Same Jobset name as used in pw-service-example.yaml, e.g., pathways-cluster
$ JOBSET_NAME=<your-jobset-name>

# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)

# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
```
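The `jsonpath` expression prints the names of all Running pods separated by spaces; `sed` then splits them onto lines and `grep` selects the pod you want. A small offline sketch of that filtering step, using the sample pod names from this README in place of live `kubectl` output:

```shell
# Stand-in for the space-separated jsonpath output.
NAMES='pathways-cluster-pathways-head-0-0-zzmn2 pathways-cluster-worker-0-0-bdzq4 pathways-cluster-worker-1-0-km2rf'

# Split on spaces, then pick out the head pod and worker 0.
HEAD_POD_NAME=$(echo "$NAMES" | sed 's/ /\n/g' | grep head)
WORKER0_POD_NAME=$(echo "$NAMES" | sed 's/ /\n/g' | grep 'worker-0-0-')

echo "$HEAD_POD_NAME"     # pathways-cluster-pathways-head-0-0-zzmn2
echo "$WORKER0_POD_NAME"  # pathways-cluster-worker-0-0-bdzq4
```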

#### Option 3: Check project logs

Detailed instructions are available
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.

<a name="find-pw-service"></a>

### 4. Find the Pathways service address

Find the address of the Pathways service in the logs. The command below checks
the worker pod logs:

```shell
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"

I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```
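To capture that address in a variable instead of copying it by hand, the flag value can be cut out of the matched log line. A sketch using the sample log line above:

```shell
# Sample matched log line from the worker pod (as shown above).
LOG_LINE="I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"

# Capture everything between '--resource_manager_address=' and the closing quote.
PATHWAYS_SERVICE=$(echo "$LOG_LINE" | sed -n "s/.*--resource_manager_address=\([^']*\)'.*/\1/p")

echo "$PATHWAYS_SERVICE"  # pathways-cluster-pathways-head-0-0.pathways-cluster:29001
```

This value is what the client passes as `--pathways_service` in the instructions below.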

## Instructions

### 1. Clone `pathwaysutils`

```shell
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
```

### 2. Use the `isc_pathways` context manager

In your script,

1. Import `isc_pathways`.
2. Add a `with isc_pathways.connect(...)` statement. The function takes the
   following values:
   - Cluster name
   - Project name
   - Region
   - GCS bucket name
   - Pathways service address (see
     [Step 4](#4-find-the-pathways-service-address) for how to find the RM
     address)
   <a name="ml-code"></a>
3. Write your ML code under this context manager (the `with` block) to run
   your JAX code on the underlying TPUs.

See [run_connect_example.py](run_connect_example.py) for reference. Example
invocation:

```shell
python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
  --cluster="my-cluster" \
  --project="my-project" \
  --region="cluster-region" \
  --gcs_bucket="gs://user-bucket" \
  --pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
  --tpu_type="tpuv6e:2x2" \
  --tpu_count=1
```
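The `--pathways_service` value is the RM address found in [Step 4](#4-find-the-pathways-service-address): the head pod's in-cluster DNS name plus the RM port. A small sketch that splits it into its two parts, e.g. for logging or a connectivity check (the sample value comes from this README):

```shell
PATHWAYS_SERVICE="pathways-cluster-pathways-head-0-0.pathways-cluster:29001"

# Split host:port with shell parameter expansion.
HOST=${PATHWAYS_SERVICE%:*}   # everything before the last ':'
PORT=${PATHWAYS_SERVICE##*:}  # everything after the last ':'

echo "host=$HOST port=$PORT"
# Prints: host=pathways-cluster-pathways-head-0-0.pathways-cluster port=29001
```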

The connect block will deploy a proxy pod dedicated to your client and connect
your local runtime environment to the proxy pod via port-forwarding.

4. You can start another client that uses the same `pathways_service` (similar
   to [Step#3](#ml-code)). If the Shared Pathways Service finds available
   TPU(s) that match your request, your workload starts running on those
   resources. However, if all TPUs are occupied, expect your script to wait
   until TPUs become available again.

## Troubleshooting

- Refer to
  [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
  if your Pathways pods do not come up.
- Known issues: the service's cleanup process is not yet fully clean. You can
  safely ignore a `Segmentation fault` error, if you see one, after your ML
  job completes.
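If that segfault surfaces as a nonzero exit code in a wrapper script or CI job, it can be tolerated explicitly once the job's work is done. A sketch, where `sh -c` simulates a client that finishes its work and then crashes during cleanup (exit code 139 = 128 + SIGSEGV):

```shell
# Simulated client: work completes, then cleanup segfaults.
sh -c 'echo "job finished"; exit 139' && STATUS=0 || STATUS=$?

if [ "$STATUS" -eq 139 ]; then
  echo "Ignoring segmentation fault during cleanup (exit $STATUS)"
elif [ "$STATUS" -ne 0 ]; then
  echo "Real failure (exit $STATUS)"
fi
```

Only exit code 139 is ignored here; any other nonzero status is still reported as a real failure.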

pathwaysutils/sidecar/python/requirements.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -3,3 +3,4 @@ jax>=0.5.1
 tensorflow-datasets
 tiktoken
 grain-nightly>=0.0.1
+portpicker
```
