Commit 61dc465

Merge pull request #3504 from AI-Hypercomputer:link_checker
PiperOrigin-RevId: 890570501
2 parents 95fe19e + bd84236 commit 61dc465

19 files changed: 90 additions & 30 deletions

.github/workflows/check_docs_build.yml

Lines changed: 23 additions & 0 deletions
```diff
@@ -16,24 +16,47 @@ jobs:
       uses: actions/checkout@v5
       with:
         persist-credentials: false
+        fetch-depth: 0
+
+    - name: Check if only documentation changed
+      id: check
+      run: |
+        git fetch origin ${GITHUB_BASE_REF}
+        CHANGED_FILES=$(git diff --name-only origin/${GITHUB_BASE_REF}...HEAD)
+
+        # Check for documentation changes
+        if echo "$CHANGED_FILES" | grep -E '\.(md)$|^docs/' > /dev/null; then
+          echo "Documentation files changed, enabling docs build."
+          echo "build_docs=true" >> $GITHUB_OUTPUT
+        else
+          echo "No documentation changes, skipping docs build."
+          echo "build_docs=false" >> $GITHUB_OUTPUT
+        fi

     - name: Install uv and set the Python version
+      if: steps.check.outputs.build_docs == 'true'
       uses: astral-sh/setup-uv@eb1897b8dc4b5d5bfe39a428a8f2304605e0983c # v7.0.0
       with:
         python-version: '3.12'
         enable-cache: true

     - name: Set venv
+      if: steps.check.outputs.build_docs == 'true'
       run: uv venv --python 3.12 $GITHUB_WORKSPACE/venv

     - name: Install dependencies
+      if: steps.check.outputs.build_docs == 'true'
       run: . $GITHUB_WORKSPACE/venv/bin/activate && uv pip install -r src/dependencies/requirements/requirements_docs.txt

     - name: Build documentation
+      if: steps.check.outputs.build_docs == 'true'
       run: |
         . $GITHUB_WORKSPACE/venv/bin/activate
         uv pip install -e . --no-deps
         uv pip install torch
+        # verify links; the build fails if errors are found
+        sphinx-build -b linkcheck docs docs/_build/linkcheck -q --keep-going
         # generates the actual website
         sphinx-build -b html docs docs/_build/html
       env:
         JAX_PLATFORMS: cpu
```
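The new "docs only changed" gate can be tried outside CI. The sketch below stubs `CHANGED_FILES` (in the workflow it comes from `git diff --name-only origin/${GITHUB_BASE_REF}...HEAD`, which is why the checkout step now needs `fetch-depth: 0` to make the base branch available) and replays the same `grep -E` classification:

```shell
#!/usr/bin/env sh
# Local sketch of the workflow's docs-change check; CHANGED_FILES is stubbed.
CHANGED_FILES="docs/index.md
src/maxtext/train.py"

# Same pattern as the workflow: any *.md file, or anything under docs/.
if echo "$CHANGED_FILES" | grep -E '\.(md)$|^docs/' > /dev/null; then
  echo "build_docs=true"
else
  echo "build_docs=false"
fi
```

Note the pattern treats a `.md` file anywhere in the tree (not only under `docs/`) as a documentation change, so edits to files like `CONTRIBUTING.md` also trigger the docs build.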

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -29,5 +29,5 @@ This project follows

 All submissions, including submissions by project members, require review. We
 use GitHub pull requests for this purpose. Consult
-[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+[GitHub Help](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) for more
 information on using pull requests.
```

docs/conf.py

Lines changed: 37 additions & 0 deletions
```diff
@@ -124,6 +124,43 @@
     os.path.join("reference", "api_generated", "dependencies.github_deps.install_pre_train_deps.rst"),
 ]

+# -- Options for linkcheck ----------------------------------------------------
+# Report links that time out as broken (status 'broken') in the output
+linkcheck_report_timeouts_as_broken = True
+
+# Enable anchor checking so Sphinx looks for the #section-name
+linkcheck_anchors = True
+
+# Ignore dynamic anchors generated in the documentation, which can cause false positives in link checking
+linkcheck_anchors_ignore = [
+    r"^L\d+",
+    r"badput-breakdown-details",
+    r"online-inference",
+    r"install-the-seed-env-tool",
+    r"collect-stack-traces",
+    r"1-prerequisites",
+]
+
+# Disable reporting of allowed redirects to reduce noise in the output
+linkcheck_allowed_redirects = {
+    r"https://github\.com/google/maxtext": r"https://github\.com/AI-Hypercomputer/maxtext/.*",
+    r"https://cloud\.google\.com/.*": r"https://docs\.cloud\.google\.com/.*",
+    r"https://jax\.readthedocs\.io/.*": r"https://docs\.jax\.dev/.*",
+    r"https://twitter\.com/.*": r"https://x\.com/.*",
+    r"https://www\.sphinx-doc\.org": r"https://www\.sphinx-doc\.org/en/master/.*",
+    r"https://.*\.readthedocs\.io": r"https://.*\.readthedocs\.io/en/.*",
+}
+
+# Ignore specific links that are known to be inaccessible during the build process
+linkcheck_ignore = [
+    # Ignore Google Auth/Console redirects which require login
+    r"https://accounts\.google\.com/.*",
+    r"https://console\.cloud\.google\.com/.*",
+    r"https://cla\.developers\.google\.com/.*",
+    # Ignore GitHub commit history links which frequently trigger rate limiting (429)
+    r"https://github\.com/jax-ml/jax/commits/.*",
+]
+

 # -- Autogenerate API documentation ------------------------------------------
 def run_apidoc(_):
```
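The `linkcheck_anchors_ignore` entries are regular expressions matched against the fragment part of a URL (e.g. the `L42` in `.../file.py#L42`). A minimal sketch of that filtering logic, using the patterns from the commit (the `anchor_ignored` helper is illustrative, not a Sphinx API):

```python
import re

# Patterns copied from the new conf.py (subset shown).
linkcheck_anchors_ignore = [
    r"^L\d+",
    r"badput-breakdown-details",
    r"online-inference",
]

def anchor_ignored(anchor: str) -> bool:
    # Sketch of how such patterns are applied: an anchor is skipped
    # if any pattern matches at its start.
    return any(re.match(p, anchor) for p in linkcheck_anchors_ignore)

print(anchor_ignored("L42"))                       # True: GitHub line anchor
print(anchor_ignored("badput-breakdown-details"))  # True: listed anchor
print(anchor_ignored("installation"))              # False: checked normally
```

This is why the `r"^L\d+"` entry exists: GitHub renders `#L<n>` line anchors client-side, so linkcheck would otherwise report them as missing.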

docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -32,7 +32,7 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state

 ## Using Grain

-1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources.html) class.
+1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
    - **Community Resource**: The MaxText community has created an [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet.
 2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows metadata caching to speed up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.
@@ -43,11 +43,11 @@ MOUNT_PATH=${MOUNT_PATH?} \
 [FILE_PATH=${MOUNT_PATH?}/my_dataset]
 ```

-   Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pre-filling the metadata cache (see ["Performance tuning best practices" on the Google Cloud documentation](https://cloud.google.com/storage/docs/cloud-storage-fuse/performance#improve-first-time-reads)).
+   Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pre-filling the metadata cache (see ["Performance tuning best practices" on the Google Cloud documentation](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/performance)).

 1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path.

-2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html), [grain_pool.py](https://github.com/google/grain/blob/main/grain/_src/python/grain_pool.py)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh) to avoid gcsfuse throttling.
+2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh) to avoid gcsfuse throttling.

 3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using a semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
```
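The multi-source blending convention described in the context above (semicolon-separated sources, comma-separated weights, normalized to sum to 1.0) can be sketched as follows. This is an illustrative parser, not MaxText's actual implementation; the function name and bucket paths are hypothetical:

```python
def parse_blend(files: str, weights: str):
    """Hypothetical sketch: ';' separates data sources, ',' separates
    weights, and weights are normalized so they sum to 1.0."""
    sources = files.split(";")
    raw = [float(w) for w in weights.split(",")]
    total = sum(raw)  # normalization denominator
    return list(zip(sources, (w / total for w in raw)))

# 0.3 and 0.2 normalize to 0.6 and 0.4 respectively.
print(parse_blend("gs://bucket/ds_a;gs://bucket/ds_b", "0.3,0.2"))
```

Normalization means the weights need not be written as probabilities; only their ratio matters.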

docs/guides/model_bringup.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -93,7 +93,7 @@ For models with existing Hugging Face support, you can validate parity using the

 ### 5.2 Eval Benchmark

-MaxText integrates with benchmark libraries like [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy) to facilitate rapid verification of common inference scores ([guide](../../benchmarks/api_server)). This is particularly useful for validating decoding outputs or assessing model performance when logits deviate slightly from reference values.
+MaxText integrates with benchmark libraries like [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy) to facilitate rapid verification of common inference scores ([guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/api_server/README.md)). This is particularly useful for validating decoding outputs or assessing model performance when logits deviate slightly from reference values.

 ## 6. Completion Checklist
```

docs/guides/monitoring_and_debugging/megascale_hang_playbook.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -14,7 +14,7 @@ Much of this guide is geared towards providing Google with the right data to hel

 1. Use `JAX` 0.6 or up, and enable JAX distributed service. This version of JAX contains additional logging that can help identify which workers are experiencing issues.
 2. Generate an HLO dump using the `--xla_dump_to` flag when initializing your workload. This is discussed in the [XLA Documentation](https://openxla.org/xla/hlo_dumps).
-3. Run your workload with stack traces enabled. XPK users should follow the [XPK-specific instructions](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#collect-stack-traces). Note the `--deploy-stacktrace-sidecar` flag when running the XPK workload command.
+3. Run your workload with stack traces enabled. XPK users should follow the [XPK-specific instructions](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/troubleshooting.md#collect-stack-traces). Note the `--deploy-stacktrace-sidecar` flag when running the XPK workload command.
 4. Set `--vmodule=real_program_continuator=1` to enable verbose logging for the TPU program execution status.

 ## Locate the Megascale Hang Detected Error
@@ -70,7 +70,7 @@ If the TPU listed in the log shows a non-zero program counter, it is very likely

 If the logged TPU shows a program counter of 0, it is likely that the TPU is waiting on input. We can attempt to confirm the worker is hung during the input pipeline using the stack trace library found in the [cloud-tpu-diagnostics package](https://pypi.org/project/cloud-tpu-diagnostics/).

-XPK users should follow the [XPK-specific instructions](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#collect-stack-traces) to emit stack traces. Note the `--deploy-stacktrace-sidecar` flag when running the XPK workload command.
+XPK users should follow the [XPK-specific instructions](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/troubleshooting.md#collect-stack-traces) to emit stack traces. Note the `--deploy-stacktrace-sidecar` flag when running the XPK workload command.

 Customers can then query Cloud Logging for the stack trace logs from the outlier TPU. The stack trace log will help users determine where in the Python code the program was during the hang.
```

docs/guides/monitoring_and_debugging/monitor_goodput.md

Lines changed: 7 additions & 7 deletions
````diff
@@ -30,7 +30,7 @@ Goodput is the metric that measures the efficiency of model training jobs, i.e.

 Badput is the metric that measures time that a workload spent on anything that is not productive training proportional to the total time spent by the workload. For example, the time spent in accelerator initialization, training preparation, program startup, data loading, portions of checkpointing, disruptions and wasted progress since the last checkpoint etc. all contribute to Badput.

-The ML Goodput Measurement library exposes Badput Breakdown. Further details of each bucket can be found [here](https://github.com/AI-Hypercomputer/ml-goodput-measurement?tab=readme-ov-file#badput-breakdown-details)
+The ML Goodput Measurement library exposes Badput Breakdown. Further details of each bucket can be found [here](https://github.com/AI-Hypercomputer/ml-goodput-measurement/blob/main/README.md#badput-breakdown-details)

 ## What is Step Time Deviation
@@ -69,8 +69,8 @@ following access scope during node pool creation:
 XPK adds this access scope to the GPU, TPU and CPU node pools, so XPK is the recommended method to create clusters and node-pools if you intend to run your workloads on GKE.

 Instructions on how to create clusters using XPK can be
-found [here](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#cluster-create) and how to create workloads using XPK can be found
-[here](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#workload-create).
+found [here](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/usage/clusters.md) and how to create workloads using XPK can be found
+[here](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/usage/workloads.md).

 ```{note}
 Access Scopes are immutable and workloads can only be migrated to new node pools with required access scopes. Access scopes on already created clusters cannot be updated.
@@ -131,7 +131,7 @@ If checkpointing is enabled, please enable the `enable_checkpoint_cloud_logger`

 #### Visualize Goodput, Badput and step deviation on Google Cloud Monitoring

-By default, performance data ([goodput](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/goodput_time), [badput](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/badput_time), and [step deviation](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/performance)) is automatically sent to Google Cloud Monitoring, enabling visualization on dashboards.
+By default, performance data ([goodput](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c), [badput](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c), and [step deviation](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c)) is automatically sent to Google Cloud Monitoring, enabling visualization on dashboards.

 This feature leverages Google VM metadata (project ID, location, accelerator type)
 and supports replica IDs for uniquely identifying workloads in multi-replica
@@ -184,13 +184,13 @@ Goodput, Badput and Step Time Deviation metrics can be monitored using GCM Metri

 2. Navigate to [Metrics Explorer](https://console.cloud.google.com/monitoring/metrics-explorer). Initiate metric selection by clicking `Select a metric` then search for and select the `Workload` resource. Subsequently, choose the `Workload` metric category.

-   a. [**Productive Time:**](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/goodput_time)
+   a. [**Productive Time:**](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c)
    Represents the cumulative duration the workload spent on productive tasks,
    measured by `compute.googleapis.com/workload/goodput_time`.\
-   b. [**Non-Productive Time:**](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/badput_time)
+   b. [**Non-Productive Time:**](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c)
    Represents the cumulative duration the workload spent on non-productive tasks,
    measured by `compute.googleapis.com/workload/badput_time`.\
-   c. [**Performance:**](https://cloud.google.com/monitoring/api/metrics_gcp#:~:text=workload/performance)
+   c. [**Performance:**](https://docs.cloud.google.com/monitoring/api/metrics_gcp_c)
    Represents the workload's performance metric, specifically step deviation
    in this context, measured by `compute.googleapis.com/workload/performance`.
```
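The relationship between the two cumulative counters referenced in this diff (`workload/goodput_time` and `workload/badput_time`) can be illustrated with a small calculation. This is a hedged sketch of the definition in the prose, not the ml-goodput-measurement API; the bucket names and numbers are made up:

```python
def goodput_ratio(goodput_time_s: float, badput_time_s: dict) -> float:
    """Goodput as the share of total wall time spent on productive
    training; everything else (init, data loading, checkpointing,
    wasted progress after a disruption) counts as Badput."""
    total = goodput_time_s + sum(badput_time_s.values())
    return goodput_time_s / total

# Hypothetical one-hour window: 2700 s productive, 300 s of badput.
badput = {"tpu_init": 120.0, "data_loading": 60.0,
          "checkpointing": 30.0, "wasted_progress": 90.0}
print(round(goodput_ratio(2700.0, badput), 3))  # prints 0.9
```

Because both counters are cumulative, dashboards typically compute this ratio over a rolling window rather than since job start.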

docs/guides/monitoring_and_debugging/use_vertex_ai_tensorboard.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -28,7 +28,7 @@ You can use a single Vertex AI Tensorboard instance to track and compare metrics

 ## Prerequisites

-- Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console.
+- Enable [Vertex AI API](https://docs.cloud.google.com/vertex-ai/docs/start/cloud-environment#set_up_a_project) in your Google Cloud console.
 - Assign [Vertex AI User IAM role](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.user) to the service account used by the TPU VMs. This is required to create and access the Vertex AI Tensorboard in Google Cloud console. If you are using XPK for MaxText, the necessary Vertex AI User IAM role will be automatically assigned to your node pools by XPK – no need to assign it manually.

 ## Upload logs to Vertex AI Tensorboard
```

docs/guides/optimization/custom_model.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -254,7 +254,7 @@ Ironwood over ICI:
 - `3 * M * 8 / 2 > 12800`
 - `M > 1100`

-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/explanations/sharding.md#pp--fsdpdp) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html) for specific challenges regarding PP + FSDP/DP.

 ## Step 4. Analyze experiments
```

docs/guides/optimization/pallas_kernels_performance.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -69,8 +69,6 @@ To maximize performance, MaxText uses custom Pallas kernels for memory-bandwidth

 > This is an efficient computation method for Mixture-of-Experts (MoE) models like DeepSeek, Llama 4, Mixtral and Qwen-MoE. In MoE, each token is processed by only a few "experts," which is inefficient for standard matrix multiplication. Megablox solves this by having the CPU (**host**) first create a routing plan (**metadata**) that assigns tokens to experts. The accelerator (**device**) then uses this plan to perform many small, dense matrix multiplications in parallel (**Grouped Matrix Multiplication**), avoiding wasted work on unused experts.

-- [`src/maxtext/kernels/megablox/gmm.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/kernels/megablox/gmm.py)
-
 **Note:** Megablox accelerates the grouped **matmul**; **routing/gating** is separate code ([`src/maxtext/layers/moe.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/moe.py)).
```
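The grouped-matmul idea described in the Megablox context above can be sketched in a few lines of NumPy. This is an illustrative reference implementation only, not the Pallas kernel: the real kernel fuses the per-group matmuls on-device, while here the host-computed routing plan is just a list of group sizes:

```python
import numpy as np

def grouped_matmul(tokens, group_sizes, expert_weights):
    """tokens: [n, d] array sorted by expert assignment;
    group_sizes[i]: number of consecutive tokens routed to expert i;
    expert_weights[i]: [d, h] weight matrix for expert i."""
    outputs, start = [], 0
    for size, w in zip(group_sizes, expert_weights):
        # One small dense matmul per expert group; unused experts cost nothing.
        outputs.append(tokens[start:start + size] @ w)
        start += size
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))                        # 6 tokens, hidden dim 4
weights = [rng.normal(size=(4, 8)) for _ in range(3)]   # 3 experts
out = grouped_matmul(tokens, [3, 1, 2], weights)        # routing plan: 3/1/2 tokens
print(out.shape)  # (6, 8)
```

The efficiency argument is that this avoids the dense alternative of multiplying every token by every expert's weights (or padding groups to a fixed size) when only a few experts are active per token.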
