
Commit e28978e

Merge pull request #3545 from AI-Hypercomputer:bvandermoon-uxr
PiperOrigin-RevId: 893695294
2 parents c38fa86 + b9b708b commit e28978e

3 files changed: 53 additions & 31 deletions


docs/_static/js/editable_commands.js

Lines changed: 50 additions & 28 deletions
@@ -10,43 +10,65 @@ document.addEventListener('DOMContentLoaded', () => {
   const originalHTML = block.innerHTML;
 
   const placeholders = [
-    "<your virtual env name>",
-    "<model name>",
-    "<tokenizer path>",
-    "<Hugging Face access token>",
-    "<output directory to store run logs>",
-    "<name for this run>",
-    "<number of fine-tuning steps to run>",
     "<batch size per device>",
-    "<Hugging Face dataset name>",
-    "<data split for train>",
+    "<bucket>",
+    "<cluster name>",
     "<data columns to train on>",
+    "<Data Columns to Train on>",
+    "<data split for train>",
+    "<Data Split for Train>",
+    "<dataset path>",
+    "<Docker Image Name>",
+    "<Fine-Tuning Steps>",
+    "<Flag to lazy load>",
+    "<Flag to use ocdbt>",
+    "<Flag to use zarr3>",
+    "<folder>",
     "<gcs path for MaxText checkpoint>",
-    "<Google Cloud Project ID>",
-    "<Name of GKE Cluster>",
-    "<GKE Cluster Zone>",
-    "<Name of Workload>",
-    "<TPU Type>",
     "<GCS Path for Output/Logs>",
-    "<Fine-Tuning Steps>",
+    "<GCS for dataset>",
+    "<GCP project ID>",
+    "<GCP zone>",
+    "<gke version>",
+    "<GKE Cluster Zone>",
+    "<Google Cloud Project ID>",
     "<Hugging Face Access Token>",
-    "<Model Name>",
-    "<Model Tokenizer>",
+    "<Hugging Face access token>",
     "<Hugging Face Dataset Name>",
-    "<Data Split for Train>",
-    "<Data Columns to Train on>",
-    "<cluster name>",
-    "<GCP project ID>",
-    "<zone name>",
-    "<path/to/gcr.io>",
-    "<number of slices>",
-    "<Flag to use zarr3>",
-    "<Flag to use ocdbt>",
+    "<Hugging Face dataset name>",
     "<Hugging Face Model>",
+    "<Hugging Face Model to be converted to MaxText>",
     "<MaxText Model>",
-    "<Tokenizer>",
+    "<MaxText model name>",
+    "<Model Name>",
+    "<model name>",
+    "<Model Tokenizer>",
+    "<name for this run>",
     "<Name for this run>",
-    "<Docker Image Name>"
+    "<Name of GKE Cluster>",
+    "<Name of Workload>",
+    "<number of fine-tuning steps to run>",
+    "<number of slices>",
+    "<output directory to store Hugging Face checkpoint>",
+    "<output directory to store MaxText checkpoint>",
+    "<output directory to store run logs>",
+    "<path to Hugging Face checkpoint>",
+    "<path/to/gcr.io>",
+    "<project id>",
+    "<project ID>",
+    "<project>",
+    "<ramdisk size>",
+    "<steps>",
+    "<the number of chips per VM>",
+    "<Tokenizer>",
+    "<tokenizer path>",
+    "<TPU Type>",
+    "<virtual env name>",
+    "<your virtual env name>",
+    "<your zone>",
+    "<YOUR WORKLOAD NAME>",
+    "<zone>",
+    "<zone name>"
   ];
 
   let newHTML = originalHTML;
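
This change sorts the placeholder list case-insensitively and adds entries used by the newer guides. The substitution logic that consumes the list sits outside this hunk, so it is not shown on this page. As a hedged sketch of the general pattern such a script follows (the selector, CSS class name, and wrapping markup below are assumptions for illustration, not the actual contents of editable_commands.js):

```js
// Hypothetical sketch, NOT the actual editable_commands.js implementation:
// wrap each known placeholder found in a rendered code block in an editable
// <span> so the reader can type a real value in place.
document.addEventListener('DOMContentLoaded', () => {
  const placeholders = ["<model name>", "<cluster name>"]; // abbreviated list

  document.querySelectorAll('pre code').forEach((block) => {
    let newHTML = block.innerHTML;
    placeholders.forEach((placeholder) => {
      // innerHTML is entity-escaped, so match the escaped form of "<...>".
      const escaped = placeholder.replace(/</g, '&lt;').replace(/>/g, '&gt;');
      // Wrap every occurrence; the class name "editable-cmd" is made up.
      newHTML = newHTML.split(escaped).join(
        `<span class="editable-cmd" contenteditable="true">${escaped}</span>`
      );
    });
    block.innerHTML = newHTML;
  });
});
```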

docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 2 additions & 2 deletions
@@ -34,7 +34,7 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 
 1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random access through row groups), and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. To convert a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random-access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
    - **Community Resource**: The MaxText community has created [ArrayRecord documentation](https://array-record.readthedocs.io/). Note: we appreciate the community's contribution, but it has not yet been verified by the MaxText or ArrayRecord developers.
-2. If the dataset is hosted in a Cloud Storage bucket, the `gs://` path can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This significantly improves performance for the ArrayRecord format, since metadata caching speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path on each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.
+2. If the dataset is hosted in a Cloud Storage bucket, the `gs://` path can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This significantly improves performance for the ArrayRecord format, since metadata caching speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path on each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/4e44e065cc6379e76f9f1ac4785f81c05cafb58f/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
 
 ```sh
 bash tools/setup/setup_gcsfuse.sh \
@@ -47,7 +47,7 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr
 
 1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, and `grain_train_files` in `src/maxtext/configs/base.yml` or through command-line arguments to match the file pattern on the mounted local path.
 
-2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your gcsfuse config in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh) to avoid gcsfuse throttling.
+2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your gcsfuse config in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/4e44e065cc6379e76f9f1ac4785f81c05cafb58f/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling.
 
 3. ArrayRecord only: for multi-source blending, you can specify multiple data sources with their respective weights, using a semicolon (;) to separate sources and a comma (,) to attach each weight. The weights are automatically normalized to sum to 1.0. For example (see the sketch after this diff):
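
The example that item 3 introduces is cut off by the hunk boundary above. A hedged illustration of the blending syntax it describes, with made-up mount paths, file patterns, and weights; the `python3 -m MaxText.train` entry point is an assumption based on the config names quoted in item 1 and may differ in your checkout:

```sh
# Illustrative only: blend two ArrayRecord sources 30/70 (weights are
# normalized automatically). Sources are separated by ';', weights by ','.
python3 -m MaxText.train src/maxtext/configs/base.yml \
  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files='/tmp/gcsfuse/ds_a/*.array_record,0.3;/tmp/gcsfuse/ds_b/*.array_record,0.7' \
  grain_worker_count=4
```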

docs/tutorials/pretraining.md

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ eval metrics after step: 9, loss=9.420, total_weights=75264.0
 
 Grain is a library for reading data for training and evaluating JAX models. It is the recommended input pipeline for determinism and resilience! It supports data formats like ArrayRecord and Parquet. See the [Grain pipeline](../guides/data_input_pipeline/data_input_grain.md) guide for more details.
 
-**Data preparation**: You need to download data to a Cloud Storage bucket and read it via Cloud Storage FUSE with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/setup/setup_gcsfuse.sh).
+**Data preparation**: You need to download data to a Cloud Storage bucket and read it via Cloud Storage FUSE with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/4e44e065cc6379e76f9f1ac4785f81c05cafb58f/src/dependencies/scripts/setup_gcsfuse.sh).
 
 - For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training
 ```bash
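
The ```bash block above is likewise cut off by the hunk boundary. A plausible completion of the mount example it introduces, assuming the `DATASET_GCS_BUCKET` and `MOUNT_PATH` parameter names that the Grain guide's `FILE_PATH` note implies setup_gcsfuse.sh accepts:

```bash
# Sketch: mount gs://maxtext-dataset at /tmp/gcsfuse before training.
bash tools/setup/setup_gcsfuse.sh \
  DATASET_GCS_BUCKET=maxtext-dataset \
  MOUNT_PATH=/tmp/gcsfuse
```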

0 commit comments
