service that manages scheduling and error handling.

## Requirements

### 1. Create a GKE cluster with TPUs

Make sure you have a GKE cluster with at least one TPU slice (v5e, v5p, or v6e).

<a name="pw-service-yaml"></a>

### 2. Deploy the Pathways head pod

Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values before deploying the Pathways pods:

- A unique Jobset name for the head pod
- GCS bucket path
- TPU type and topology
- Number of slices

### 3. Verify that the pods created in [Step #2](#2-deploy-the-pathways-head-pod) are running

Verify that the Shared Pathways Service components, specifically the Pathways resource manager (RM) and the
Pathways workers, have started.

```shell
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region>  # e.g., us-central2

# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
```

#### Option 1: List all pods

```shell
$ kubectl get pods

# Sample expected output (1 head pod and 1 or more worker pods)
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
```
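If you want to perform this check programmatically (for example, in a setup script), the `kubectl get pods` output can be parsed. A minimal sketch in plain Python that parses the sample output shown above (no live `kubectl` call; the pod names are the illustrative ones from the sample):

```python
# Sample `kubectl get pods` output, as shown above.
sample_output = """\
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
"""

# Skip the header row and split each row into its columns.
rows = [line.split() for line in sample_output.splitlines()[1:]]

# Every pod should be Running with all of its containers ready (READY is "n/n").
all_healthy = all(
    status == "Running" and ready.split("/")[0] == ready.split("/")[1]
    for _name, ready, status, _restarts, _age in rows
)
print(all_healthy)  # True
```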

#### Option 2: Check the status of the specific pods that belong to your Pathways Service

```shell
# e.g., pathways-cluster; same Jobset name as you used in pw-service-example.yaml
$ JOBSET_NAME=<your-jobset-name>

# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)

# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
```
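The `jsonpath` query returns the names of all Running pods on one space-separated line; `sed` then splits it into one name per line so `grep` can pick out the head pod and worker 0. The same filtering, sketched in a few lines of plain Python (the pod names are the illustrative samples from above, not live cluster output):

```python
# Simulated output of the jsonpath query: space-separated names of Running pods.
running_pods = (
    "pathways-cluster-pathways-head-0-0-zzmn2 "
    "pathways-cluster-worker-0-0-bdzq4 "
    "pathways-cluster-worker-1-0-km2rf"
)

# Equivalent of `sed 's/ /\n/g'`: one pod name per entry.
names = running_pods.split()

# Equivalent of `grep head` and `grep 'worker-0-0-'`.
head_pod = next(n for n in names if "head" in n)
worker0_pod = next(n for n in names if "worker-0-0-" in n)

print(head_pod)     # pathways-cluster-pathways-head-0-0-zzmn2
print(worker0_pod)  # pathways-cluster-worker-0-0-bdzq4
```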

#### Option 3: Check project logs

Find the detailed instructions
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.

<a name="find-pw-service"></a>

### 4. Find the Pathways service address

Find the address of the Pathways service in the logs. The command below checks the worker pod's logs:

```shell
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"

I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```
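If you are scripting this step, the address can also be extracted from the log line with a regular expression. A small sketch operating on the sample log line above (pure Python, no cluster access):

```python
import re

# Sample worker-pod log line from the output above.
log_line = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)

# Capture everything after the flag's `=` up to the closing quote.
match = re.search(r"--resource_manager_address=([^']+)", log_line)
address = match.group(1) if match else None
print(address)  # pathways-cluster-pathways-head-0-0.pathways-cluster:29001
```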

## Instructions

### 1. Clone `pathwaysutils`

```shell
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
```

### 2. Use the `isc_pathways` context manager

In your script:

1. Import `isc_pathways`.
2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
   - Cluster name
   - Project name
   - Region
   - GCS bucket name
   - Pathways Service address (see [Step #4](#4-find-the-pathways-service-address) for how to find the RM address)
   <a name="ml-code"></a>
3. Write your ML code inside this context manager (the `with` block) to run your JAX code on the underlying TPUs.

See [run_connect_example.py](run_connect_example.py) for reference. Example invocation:

```shell
python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
  --cluster="my-cluster" \
  --project="my-project" \
  --region="cluster-region" \
  --gcs_bucket="gs://user-bucket" \
  --pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
  --tpu_type="tpuv6e:2x2" \
  --tpu_count=1
```

The connect block will deploy a proxy pod dedicated to your client and connect
your local runtime environment to the proxy pod via port-forwarding.
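In code, the context-manager pattern looks roughly like the sketch below. It is based on an earlier revision of the example script, so the exact argument names (in particular `expected_tpu_instances` versus the `--tpu_type`/`--tpu_count` flags above) may differ from the current API; it also requires the `pathwaysutils` package and a live cluster, so it is illustrative rather than standalone-runnable:

```python
from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="cluster-region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    expected_tpu_instances={"tpuv6e:2x2": 2},
):
    # Inside the `with` block, JAX code runs on the TPUs behind the
    # Shared Pathways Service.
    import jax.numpy as jnp
    import pathwaysutils

    pathwaysutils.initialize()
    print(jnp.zeros(5))
```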

4. You can start another client that uses the same `pathways_service` (similar to [Step #3](#ml-code)). If the Shared Pathways
   Service finds available TPU(s) that match your request, your workload will start running on those resources.
   However, if all TPUs are occupied, your script will block until TPUs become available again.

## Troubleshooting

- Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
  if your Pathways pods do not come up.

- Known issue: the service's cleanup process is not fully graceful yet. You can safely ignore a
  `Segmentation fault` error, if you see one, after your ML job completes.