service that manages scheduling and error handling.

## Requirements

### 1. Create a GKE cluster with TPUs

Make sure you have a GKE cluster with at least one TPU slice (v5e, v5p, or v6e).

<a name="pw-service-yaml"></a>

### 2. Deploy the Pathways head pod

Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values before deploying the Pathways pods:

- A unique Jobset name for the head pod
- GCS bucket path
- TPU type and topology
- Number of slices

### 3. Verify that the pods created in [Step #2](#2-deploy-the-pathways-head-pod) are running

Verify that the Shared Pathways Service components, specifically the Pathways resource manager (RM) and the
Pathways workers, have started.

```shell
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region>  # e.g., us-central2

# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
```

#### Option 1: List all pods

```shell
$ kubectl get pods

# Sample expected output (1 head pod and 1 or more worker pods)
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
```
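If you want to perform this check programmatically (for example, in a setup script), the `kubectl get pods` output can be parsed. A minimal sketch in plain Python that parses the sample output shown above (no live `kubectl` call; the pod names are the illustrative ones from the sample):

```python
# Sample `kubectl get pods` output, as shown above.
sample_output = """\
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s
"""

# Skip the header row and split each row into its columns.
rows = [line.split() for line in sample_output.splitlines()[1:]]

# Every pod should be Running with all of its containers ready (READY is "n/n").
all_healthy = all(
    status == "Running" and ready.split("/")[0] == ready.split("/")[1]
    for _name, ready, status, _restarts, _age in rows
)
print(all_healthy)  # True
```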

#### Option 2: Check the status of the specific pods that belong to your Pathways Service

```shell
# e.g., pathways-cluster; same Jobset name as you used in pw-service-example.yaml
$ JOBSET_NAME=<your-jobset-name>

# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)

# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
```
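The `jsonpath` query returns the names of all Running pods on one space-separated line; `sed` then splits it into one name per line so `grep` can pick out the head pod and worker 0. The same filtering, sketched in a few lines of plain Python (the pod names are the illustrative samples from above, not live cluster output):

```python
# Simulated output of the jsonpath query: space-separated names of Running pods.
running_pods = (
    "pathways-cluster-pathways-head-0-0-zzmn2 "
    "pathways-cluster-worker-0-0-bdzq4 "
    "pathways-cluster-worker-1-0-km2rf"
)

# Equivalent of `sed 's/ /\n/g'`: one pod name per entry.
names = running_pods.split()

# Equivalent of `grep head` and `grep 'worker-0-0-'`.
head_pod = next(n for n in names if "head" in n)
worker0_pod = next(n for n in names if "worker-0-0-" in n)

print(head_pod)     # pathways-cluster-pathways-head-0-0-zzmn2
print(worker0_pod)  # pathways-cluster-worker-0-0-bdzq4
```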

#### Option 3: Check project logs

Find the detailed instructions
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.

<a name="find-pw-service"></a>

### 4. Find the Pathways service address

Find the address of the Pathways service in the logs. The command below checks the worker pod's logs:

```shell
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"

I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```
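If you are scripting this step, the address can also be extracted from the log line with a regular expression. A small sketch operating on the sample log line above (pure Python, no cluster access):

```python
import re

# Sample worker-pod log line from the output above.
log_line = (
    "I1208 20:10:18.148825 ...] argv[2]: "
    "'--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"
)

# Capture everything after the flag's `=` up to the closing quote.
match = re.search(r"--resource_manager_address=([^']+)", log_line)
address = match.group(1) if match else None
print(address)  # pathways-cluster-pathways-head-0-0.pathways-cluster:29001
```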

## Instructions

### 1. Clone `pathwaysutils`

```shell
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
```

### 2. Use the `isc_pathways` context manager

In your script:

1. Import `isc_pathways`.
2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
   - Cluster name
   - Project name
   - Region
   - GCS bucket name
   - Pathways Service address (see [Step #4](#4-find-the-pathways-service-address) for how to find the RM address)
   <a name="ml-code"></a>
3. Write your ML code inside this context manager (the `with` block) to run your JAX code on the underlying TPUs.

See [run_connect_example.py](run_connect_example.py) for reference. Example invocation:

```shell
python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
  --cluster="my-cluster" \
  --project="my-project" \
  --region="cluster-region" \
  --gcs_bucket="gs://user-bucket" \
  --pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
  --tpu_type="tpuv6e:2x2" \
  --tpu_count=1
```

The connect block will deploy a proxy pod dedicated to your client and connect
your local runtime environment to the proxy pod via port-forwarding.
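In code, the context-manager pattern looks roughly like the sketch below. It is based on an earlier revision of the example script, so the exact argument names (in particular `expected_tpu_instances` versus the `--tpu_type`/`--tpu_count` flags above) may differ from the current API; it also requires the `pathwaysutils` package and a live cluster, so it is illustrative rather than standalone-runnable:

```python
from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="cluster-region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    expected_tpu_instances={"tpuv6e:2x2": 2},
):
    # Inside the `with` block, JAX code runs on the TPUs behind the
    # Shared Pathways Service.
    import jax.numpy as jnp
    import pathwaysutils

    pathwaysutils.initialize()
    print(jnp.zeros(5))
```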

4. You can start another client that uses the same `pathways_service` (similar to [Step #3](#ml-code)). If the Shared Pathways
   Service finds available TPU(s) that match your request, your workload will start running on those resources.
   However, if all TPUs are occupied, your script will block until TPUs become available again.

## Troubleshooting

- Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
  if your Pathways pods do not come up.

- Known issue: the service's cleanup process is not fully graceful yet. You can safely ignore a
  `Segmentation fault` error, if you see one, after your ML job completes.