Commit 0a5e4e2

Improve environment variable setup in MaxText docs.

1 parent 46dbaf5 commit 0a5e4e2

13 files changed: 322 additions & 155 deletions

docs/conf.py

Lines changed: 2 additions & 0 deletions

@@ -162,6 +162,8 @@
     r"https://cla\.developers\.google\.com/.*",
     # Ignore GitHub commit history links which frequently trigger rate limiting (429)
     r"https://github\.com/jax-ml/jax/commits/.*",
+    # Ignore Hugging Face settings links which require login
+    r"https://huggingface\.co/settings/tokens",
 ]

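These patterns feed Sphinx's `linkcheck_ignore` option, a list of regular expressions matched against each outgoing URL; a URL that matches any pattern is skipped by the `linkcheck` builder. A minimal sketch of how the new entry behaves, using plain `re` outside Sphinx:

```python
import re

# The linkcheck_ignore patterns from docs/conf.py after this commit.
linkcheck_ignore = [
    r"https://cla\.developers\.google\.com/.*",
    r"https://github\.com/jax-ml/jax/commits/.*",
    r"https://huggingface\.co/settings/tokens",
]

def is_ignored(url: str) -> bool:
    # linkcheck skips a URL when any pattern matches at its start.
    return any(re.match(pattern, url) for pattern in linkcheck_ignore)

print(is_ignored("https://huggingface.co/settings/tokens"))  # True
print(is_ignored("https://huggingface.co/docs"))             # False
```

Note the Hugging Face pattern is unanchored at the end, so it also matches URLs with query strings or fragments appended.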
docs/guides/monitoring_and_debugging/megascale_hang_playbook.md

Lines changed: 1 addition & 1 deletion

@@ -108,4 +108,4 @@ After [creating an HLO Dump](https://openxla.org/xla/hlo_dumps), you can share i
 gcloud storage cp -r /tmp/xla_dump gs://<bucket_location>
 ```
 
-When sharing the HLO dump, you will need to give Google permission to access the GCS bucket. A Google user can then download the HLO graph using `gsutil`.
+When sharing the HLO dump, you will need to give Google permission to access the GCS bucket. A Google user can then download the HLO graph using `gcloud storage`.

docs/run_maxtext/run_maxtext_single_host_gpu.md

Lines changed: 12 additions & 5 deletions

@@ -135,15 +135,22 @@ https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/gpu/a3
 echo "Running 1vm.sh"
 
 # Example command to invoke this script via XPK
-# python3 xpk/xpk.py workload create --cluster ${CLUSTER_NAME?} \
-# --workload ${WORKLOAD_NAME?} --docker-image=gcr.io/supercomputer-testing/${LOCAL_IMAGE_NAME?} \
+# python3 xpk/xpk.py workload create --cluster ${GKE_CLUSTER?} \
+# --workload ${RUN_NAME?} --docker-image=gcr.io/supercomputer-testing/${LOCAL_IMAGE_NAME?} \
 # --device-type ${DEVICE_TYPE?} --num-slices 1 \
 # --command "bash src/maxtext/configs/gpu/a3/llama_2_7b/1vm.sh"
 
 # Stop execution if any command exits with error
 set -e
 
-export OUTPUT_PATH="provide an output path"
+# Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
+# region as your GPUs to minimize latency and costs.
+# You can list your buckets and their locations in the
+# [Cloud Console](https://console.cloud.google.com/storage/browser).
+export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs
+
+# An arbitrary string to identify this specific run.
+# Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
 export RUN_NAME="llama-2-1vm-$(date +%Y-%m-%d-%H-%M)"
 
 # Set environment variables
@@ -152,7 +159,7 @@ for ARGUMENT in "$@"; do
 export "$KEY"="$VALUE"
 done
 
-export XLA_FLAGS="--xla_dump_to=${OUTPUT_PATH?}/${RUN_NAME?}/HLO_dumps/
+export XLA_FLAGS="--xla_dump_to=${BASE_OUTPUT_DIRECTORY?}/${RUN_NAME?}/HLO_dumps/
 --xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
 --xla_gpu_enable_command_buffer='' --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_all_reduce_combine_threshold_bytes=134217728 --xla_gpu_all_gather_combine_threshold_bytes=134217728
@@ -165,6 +172,6 @@ export XLA_FLAGS="--xla_dump_to=${OUTPUT_PATH?}/${RUN_NAME?}/HLO_dumps/
 
 # 1 node, DATA_DP=1, ICI_FSDP=8
 python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/gpu/models/llama2_7b.yml run_name=${RUN_NAME?} dcn_data_parallelism=1 \
-ici_fsdp_parallelism=8 base_output_directory=${OUTPUT_PATH?} attention=cudnn_flash_te scan_layers=False \
+ici_fsdp_parallelism=8 base_output_directory=${BASE_OUTPUT_DIRECTORY?} attention=cudnn_flash_te scan_layers=False \
 use_iota_embed=True hardware=gpu
 ```
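The DNS-label note this commit adds to `RUN_NAME` can be checked programmatically. A small sketch (the regex follows the RFC 1123 label format that Kubernetes enforces for workload names; the helper name is ours, not part of MaxText or XPK):

```python
import re
from datetime import datetime

# RFC 1123 DNS label as enforced by Kubernetes: 1-63 characters,
# lowercase alphanumerics and '-', starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_workload_name(name: str) -> bool:
    return bool(DNS_LABEL.match(name))

# Mirrors the script's pattern: llama-2-1vm-$(date +%Y-%m-%d-%H-%M)
run_name = "llama-2-1vm-" + datetime.now().strftime("%Y-%m-%d-%H-%M")
print(is_valid_workload_name(run_name))     # True
print(is_valid_workload_name("My_Run.v2"))  # False: uppercase, '_', '.'
```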

docs/run_maxtext/run_maxtext_via_pathways.md

Lines changed: 42 additions & 21 deletions

@@ -43,20 +43,41 @@ The following commands use placeholder variables. Before running them, set these
 
 ```bash
 # -- Google Cloud Configuration --
-export PROJECT="your-gcp-project-id"
-export ZONE="your-gcp-zone"
-export CLUSTER="your-gke-cluster-name"
+# Your GCP project ID. Find it on the [Cloud Console Dashboard](https://console.cloud.google.com/home/dashboard).
+export PROJECT_ID=<GCP project ID>
+
+# The GCP location (listed as "Location" in the UI) and name of your
+# TPU-enabled GKE cluster. Both can be found on the
+# [Cloud Console](https://console.cloud.google.com/kubernetes/list).
+export ZONE=<GCP location> # e.g., 'us-central1'
+export GKE_CLUSTER=<cluster name>
 
 # -- Workload Configuration --
-export WORKLOAD_NAME="maxtext-job-$(date +%Y%m%d-%H%M%S)"
+# An arbitrary string to identify this specific run.
+# Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
+export RUN_NAME="maxtext-run-$(date +%Y%m%d-%H%M%S)"
+
+# For a full list of MaxText-supported TPU types, see: `src/maxtext/utils/accelerator_to_spec_map.py`. To see the TPU type
+# of your cluster:
+
+# 1. Connect to the cluster (required for kubectl commands later):
+# gcloud container clusters get-credentials ${GKE_CLUSTER?} --location ${ZONE?} --project ${PROJECT_ID?}
+
+# 2. Find your TPU type (e.g., 'v5p-128') by checking the accelerator labels on your nodes:
+# kubectl get nodes -l cloud.google.com/gke-tpu-accelerator -o jsonpath='{.items[*].metadata.labels.cloud\.google\.com/gke-tpu-accelerator}' | tr ' ' '\n' | sort -u
 export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
-export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job
+export NUM_SLICES=1 # Number of TPU slices for your job
 
 # -- MaxText & Storage Configuration --
-export BUCKET_NAME="your-gcs-bucket-name"
-export RUN_NAME="maxtext-run-1"
+# Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
+# region as your TPUs to minimize latency and costs.
+# You can list your buckets and their locations in the
+# [Cloud Console](https://console.cloud.google.com/storage/browser).
+export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs
+
 # The Docker image you pushed in the prerequisite step
-export DOCKER_IMAGE="gcr.io/${PROJECT?}/${CLOUD_IMAGE_NAME}"
+export CLOUD_IMAGE_NAME=<image name>
+export DOCKER_IMAGE="gcr.io/${PROJECT_ID?}/${CLOUD_IMAGE_NAME?}"
 ```
 
 ## 3. Running a batch workload
@@ -69,15 +90,15 @@ Use the `xpk workload create-pathways` command to start the job.
 
 ```bash
 xpk workload create-pathways \
-  --workload=${WORKLOAD_NAME?} \
-  --cluster=${CLUSTER?} \
-  --num-slices=${WORKLOAD_NODEPOOL_COUNT?} \
+  --workload=${RUN_NAME?} \
+  --cluster=${GKE_CLUSTER?} \
+  --num-slices=${NUM_SLICES?} \
   --tpu-type=${TPU_TYPE?} \
-  --project=${PROJECT?} \
+  --project=${PROJECT_ID?} \
   --zone=${ZONE?} \
   --docker-image=${DOCKER_IMAGE?} \
   --command="python3 -m maxtext.trainers.pre_train.train \
-    base_output_directory=gs://${BUCKET_NAME?} \
+    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
     per_device_batch_size=1 \
     enable_checkpointing=false \
     dataset_type=synthetic \
@@ -90,7 +111,7 @@ xpk workload create-pathways \
 You can check the status of your running workloads with the `xpk workload list` command.
 
 ```bash
-xpk workload list --cluster=${CLUSTER?} --project=${PROJECT?} --zone=${ZONE?}
+xpk workload list --cluster=${GKE_CLUSTER?} --project=${PROJECT_ID?} --zone=${ZONE?}
 ```
 
 ## 4. Running a headless (interactive) workload
@@ -104,12 +125,12 @@ This command reserves the TPUs and starts the Pathways head service on the clust
 ```bash
 xpk workload create-pathways \
   --headless \
-  --workload=${WORKLOAD_NAME?} \
-  --num-slices=${WORKLOAD_NODEPOOL_COUNT?} \
+  --workload=${RUN_NAME?} \
+  --num-slices=${NUM_SLICES?} \
   --tpu-type=${TPU_TYPE?} \
-  --project=${PROJECT?} \
+  --project=${PROJECT_ID?} \
   --zone=${ZONE?} \
-  --cluster=${CLUSTER?}
+  --cluster=${GKE_CLUSTER?}
 ```
 
 ### Step 2: Connect to the cluster via port forwarding
@@ -120,7 +141,7 @@ This command forwards local port 29000 to the controller pod in the cluster. It
 
 ```bash
 kubectl port-forward \
-  "$(kubectl get pods -o name | grep ${WORKLOAD_NAME?}-pathways-head)" \
+  "$(kubectl get pods -o name | grep ${RUN_NAME?}-pathways-head)" \
   29000:29000 &> /dev/null &
 ```
 
@@ -135,7 +156,7 @@ export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000
 
 # Run the training script
 python3 -m maxtext.trainers.pre_train.train \
-  base_output_directory=gs://${BUCKET_NAME?} \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
   per_device_batch_size=1 \
   enable_checkpointing=false \
   dataset_type=synthetic \
@@ -153,7 +174,7 @@ The output streams directly to your terminal, just as if you were running on a l
 - Ensure you have successfully pushed the image to your project's Artifact Registry.
 - Check that your GKE cluster has permissions to pull from the registry.
 - **`kubectl port-forward` fails**:
-  - Confirm that the pod from Step 1 is running (`kubectl get pods`). The name should match `${WORKLOAD_NAME?}-pathways-head-0`.
+  - Confirm that the pod from Step 1 is running (`kubectl get pods`). The name should match `${RUN_NAME?}-pathways-head-0`.
   - Ensure you are authenticated with `kubectl` and have the correct context set for your GKE cluster.
 - Make sure you import `pathwaysutils` package and call `pathwaysutils.initialize()` in your script when running the workload.
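Several of the variables this commit renames (`BASE_OUTPUT_DIRECTORY` in particular) take a `gs://` bucket path. If you script around these docs, a quick sanity check on the value can catch typos before a job is submitted; a sketch, where the helper is ours and not a MaxText or XPK API:

```python
from urllib.parse import urlparse

def split_gcs_path(path: str):
    # Splits 'gs://my-bucket/maxtext-runs' into ('my-bucket', 'maxtext-runs'),
    # rejecting anything that is not a gs:// URL with a bucket component.
    parsed = urlparse(path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"expected a gs:// bucket path, got {path!r}")
    return parsed.netloc, parsed.path.lstrip("/")

print(split_gcs_path("gs://my-bucket/maxtext-runs"))
# ('my-bucket', 'maxtext-runs')
```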

docs/run_maxtext/run_maxtext_via_xpk.md

Lines changed: 55 additions & 30 deletions

@@ -115,51 +115,76 @@ This guide focuses on submitting workloads to an existing cluster. Cluster creat
 
 1. **Set your configuration**
 
-   ```
-   export PROJECT_ID="your-gcp-project-id"
-   export ZONE="your-gcp-zone" # e.g., us-central1-a
-   export CLUSTER_NAME="your-existing-cluster-name"
-   export BASE_OUTPUT_DIR="gs://your-output-bucket/"
+   Set up the following environment variables to configure your training run. Replace
+   placeholders with your actual values.
+
+   ```bash
+   # -- Google Cloud Configuration --
+   # Your GCP project ID. Find it on the [Cloud Console Dashboard](https://console.cloud.google.com/home/dashboard).
+   # If you've already set it in your local config, you can retrieve it via:
+   # gcloud config get-value project
+   export PROJECT_ID=<GCP project ID>
+
+   # The GCP location (listed as "Location" in the UI) and name of your
+   # TPU-enabled (or GPU-enabled) GKE cluster. Both can be found on the
+   # [Cloud Console](https://console.cloud.google.com/kubernetes/list).
+   export ZONE=<GCP location> # e.g., 'us-central1' or 'us-central1-a'
+   export GKE_CLUSTER=<cluster name>
+
+   # -- Workload Configuration --
+   # An arbitrary string to identify this specific run.
+   # Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
+   export RUN_NAME="maxtext-run-$(date +%Y%m%d-%H%M%S)"
+
+   # Number of TPU slices (for TPU clusters) or number of nodes (for GPU clusters)
+   export NUM_SLICES=1
+
+   # -- MaxText & Storage Configuration --
+   # Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
+   # region as your TPUs to minimize latency and costs.
+   # You can list your buckets and their locations in the
+   # [Cloud Console](https://console.cloud.google.com/storage/browser).
+   export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs
    export DATASET_PATH="gs://your-dataset-bucket/"
   ```
 
 2. **Configure gcloud CLI**
 
-   ```
+   ```bash
   gcloud config set project ${PROJECT_ID?}
   gcloud config set compute/zone ${ZONE?}
   ```
 
 ### A Note on multi-slice and multi-node runs
 
-The examples below run on a single TPU slice (`--num-slices=1`) or a small number of GPU nodes (`--num-nodes=2`). To scale your job to a larger, multi-host configuration, you simply increase these values.
+The examples below run on a single TPU slice (`--num-slices=1`) or a small number of GPU nodes (`--num-nodes=2`). To scale your job to a larger, multi-host configuration, you simply increase the `NUM_SLICES` value.
 
-For instance, to run a job across **four TPU slices**, you would change `--num-slices=1` to `--num-slices=4`. This tells XPK to allocate four `v5litepod-256` slices and orchestrate the training job across all of them as a single workload. Similarly, for GPUs, you would increase the `--num-nodes` value.
+For instance, to run a job across **four TPU slices**, you would change `export NUM_SLICES=1` to `export NUM_SLICES=4`. This tells XPK to allocate four `v5litepod-256` slices and orchestrate the training job across all of them as a single workload. Similarly, for GPUs, you would increase the same `NUM_SLICES` value.
 
 3. **Create the workload (run the job)**
 
   - **On your TPU cluster:**
 
-     ```
-     xpk workload create\
-       --cluster ${CLUSTER_NAME?}\
-       --workload ${USER}-tpu-job\
-       --base-docker-image maxtext_base_image\
-       --tpu-type v5litepod-256\
-       --num-slices 1\
-       --command "python3 -m maxtext.trainers.pre_train.train run_name=${USER}-tpu-job base_output_directory=${BASE_OUTPUT_DIR?} dataset_path=${DATASET_PATH?} steps=100"
+     ```bash
+     xpk workload create \
+       --cluster ${GKE_CLUSTER?} \
+       --workload ${RUN_NAME?} \
+       --base-docker-image maxtext_base_image \
+       --tpu-type v5litepod-256 \
+       --num-slices ${NUM_SLICES?} \
+       --command "python3 -m maxtext.trainers.pre_train.train run_name=${RUN_NAME?} base_output_directory=${BASE_OUTPUT_DIRECTORY?} dataset_path=${DATASET_PATH?} steps=100"
     ```
 
   - **On your GPU cluster:**
 
-     ```
-     xpk workload create\
-       --cluster ${CLUSTER_NAME?}\
-       --workload ${USER}-gpu-job\
-       --base-docker-image maxtext_base_image\
-       --device-type h100-80gb-8\
-       --num-nodes 2\
-       --command "python3 -m maxtext.trainers.pre_train.train run_name=${USER}-gpu-job base_output_directory=${BASE_OUTPUT_DIR?} dataset_path=${DATASET_PATH?} steps=100"
+     ```bash
+     xpk workload create \
+       --cluster ${GKE_CLUSTER?} \
+       --workload ${RUN_NAME?} \
+       --base-docker-image maxtext_base_image \
+       --device-type h100-80gb-8 \
+       --num-nodes ${NUM_SLICES?} \
+       --command "python3 -m maxtext.trainers.pre_train.train run_name=${RUN_NAME?} base_output_directory=${BASE_OUTPUT_DIRECTORY?} dataset_path=${DATASET_PATH?} steps=100"
     ```
 
 ______________________________________________________________________
@@ -172,20 +197,20 @@ ______________________________________________________________________
 
 2. Go to **Workloads**.
 
-3. Find your workload (e.g., `${USER}-tpu-job`) and click on it.
+3. Find your workload (e.g., `${RUN_NAME?}`) and click on it.
 
 4. Select the **Logs** tab to view the container logs.
 
 - **List your jobs:**
 
-  ```
-  xpk workload list --cluster ${CLUSTER_NAME?}
+  ```bash
+  xpk workload list --cluster ${GKE_CLUSTER?}
   ```
 
-- **Analyze output:** Checkpoints and other artifacts will be saved to the Google Cloud Storage bucket you specified in `BASE_OUTPUT_DIR`.
+- **Analyze output:** Checkpoints and other artifacts will be saved to the Google Cloud Storage bucket you specified in `BASE_OUTPUT_DIRECTORY`.
 
 - **Delete a job:**
 
-  ```
-  xpk workload delete --cluster ${CLUSTER_NAME?} --workload <your-workload-name>
+  ```bash
+  xpk workload delete --cluster ${GKE_CLUSTER?} --workload ${RUN_NAME?}
   ```
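As a rough aid when scaling `NUM_SLICES`: the numeric suffix of a TPU type (e.g. `v5litepod-256`) names the slice size, so the total accelerator count grows linearly with the slice count. A sketch; the helper and the chips-vs-cores reading of the suffix are our assumptions (the suffix counts cores rather than chips on some generations), so check the Cloud TPU docs for yours:

```python
def total_accelerators(tpu_type: str, num_slices: int) -> int:
    # Assumes the numeric suffix of the TPU type (e.g. 'v5litepod-256')
    # is the per-slice accelerator count.
    per_slice = int(tpu_type.rsplit("-", 1)[1])
    return per_slice * num_slices

print(total_accelerators("v5litepod-256", 4))  # 1024
```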

docs/tutorials/posttraining/full_finetuning.md

Lines changed: 19 additions & 6 deletions

@@ -34,13 +34,26 @@ Login to Hugging Face. Provide your access token when prompted:
 hf auth login
 ```
 
-```sh
+Set up the following environment variables to configure your training run. Replace
+placeholders with your actual values.
+
+```bash
 # -- Model configuration --
-export MODEL=<model name> # e.g., 'llama3.1-8b-Instruct'
+# The MaxText model name. See `ModelName` in `src/maxtext/configs/types.py` for a
+# full list of supported models.
+export MODEL=<MaxText Model> # e.g., 'llama3.1-8b-Instruct'
 
 # -- MaxText configuration --
-export BASE_OUTPUT_DIRECTORY=<output directory to store run logs> # e.g., gs://my-bucket/my-output-directory
-export RUN_NAME=<name for this run> # e.g., $(date +%Y-%m-%d-%H-%M-%S)
+# Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
+# region as your TPUs to minimize latency and costs.
+# You can list your buckets and their locations in the
+# [Cloud Console](https://console.cloud.google.com/storage/browser).
+export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs
+
+# An arbitrary string to identify this specific run.
+# We recommend including the model, user, and timestamp.
+# Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
+export RUN_NAME=<Name for this run>
 ```
 
 ## Hugging Face checkpoint to Maxtext checkpoint
@@ -77,10 +90,10 @@ Run these steps once per project prior to any local development or cluster exper
 MaxText assumes these GCS buckets are created in the same project and that it has permissions to read and write from them.
 
 ```sh
-export PROJECT=<Google Cloud Project ID>
+export PROJECT_ID=<Google Cloud Project ID>
 export DATASET_GCS_BUCKET=<GCS for dataset> # e.g., gs://my-bucket/my-dataset
 
-bash tools/data_generation/download_dataset.sh ${PROJECT?} ${DATASET_GCS_BUCKET?}
+bash tools/data_generation/download_dataset.sh ${PROJECT_ID?} ${DATASET_GCS_BUCKET?}
 ```
 
 The above will download the c4 dataset to the GCS BUCKET.
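The `RUN_NAME` recommendation this commit adds (include model, user, and timestamp, while staying a valid DNS label) can be automated. A sketch; `make_run_name` is a hypothetical helper, not part of MaxText:

```python
import getpass
import re
from datetime import datetime

def make_run_name(model: str) -> str:
    # Combine model, user, and timestamp, then normalize to a valid
    # Kubernetes DNS label: lowercase, with '.', '_' etc. mapped to '-',
    # capped at 63 characters.
    raw = f"{model}-{getpass.getuser()}-{datetime.now():%Y-%m-%d-%H-%M-%S}"
    name = re.sub(r"[^a-z0-9-]+", "-", raw.lower()).strip("-")
    return name[:63].rstrip("-")

print(make_run_name("llama3.1-8b-Instruct"))
# e.g. llama3-1-8b-instruct-<user>-2025-01-01-12-00-00
```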
