Commit 8ad33bb

Merge branch 'AI-Hypercomputer:main' into ai-gsutil-migration-4779320b60d44c658f872e9b583efb19
2 parents 7065d89 + c4b5e64 commit 8ad33bb

119 files changed

Lines changed: 2054 additions & 1036 deletions

.github/workflows/build_and_test_maxtext.yml

Lines changed: 2 additions & 2 deletions
@@ -262,7 +262,7 @@ jobs:
 tf_force_gpu_allow_growth: false
 container_resource_option: "--privileged"
 is_scheduled_run: ${{ github.event_name == 'schedule' }}
-extra_pip_deps_file: 'src/install_maxtext_extra_deps/extra_post_train_base_deps_from_github.txt'
+extra_pip_deps_file: 'src/dependencies/github_deps/post_train_base_deps.txt'
 maxtext_sha: ${{ needs.build_and_upload_maxtext_package.outputs.maxtext_sha }}

 maxtext_post_training_tpu_unit_tests:
@@ -284,7 +284,7 @@ jobs:
 tf_force_gpu_allow_growth: false
 container_resource_option: "--privileged"
 is_scheduled_run: ${{ github.event_name == 'schedule' }}
-extra_pip_deps_file: 'src/install_maxtext_extra_deps/extra_post_train_base_deps_from_github.txt'
+extra_pip_deps_file: 'src/dependencies/github_deps/post_train_base_deps.txt'
 maxtext_sha: ${{ needs.build_and_upload_maxtext_package.outputs.maxtext_sha }}

 maxtext_gpu_integration_tests:

.github/workflows/run_pathways_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ jobs:
 source .venv/bin/activate
 maxtext_wheel=$(ls maxtext-*-py3-none-any.whl 2>/dev/null)
 uv pip install ${maxtext_wheel}[tpu] --resolution=lowest
-uv pip install -r src/install_maxtext_extra_deps/extra_deps_from_github.txt
+uv pip install -r src/dependencies/github_deps/pre_train_deps.txt
 python3 --version
 python3 -m pip freeze
 - name: Copy test assets files

.github/workflows/run_tests_against_package.yml

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ jobs:
 source .venv/bin/activate
 maxtext_wheel=$(ls maxtext-*-py3-none-any.whl 2>/dev/null)
 uv pip install ${maxtext_wheel}[${MAXTEXT_PACKAGE_EXTRA}] --resolution=lowest
-uv pip install -r src/install_maxtext_extra_deps/extra_deps_from_github.txt
+uv pip install -r src/dependencies/github_deps/pre_train_deps.txt
 python3 --version
 python3 -m pip freeze
 uv pip install pytest-cov
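
The three workflow changes above all apply the same two requirements-file renames. As a rough illustration that is not part of this commit, the sketch below shows one way to rewrite or flag leftover references to the old paths in other scripts. The names `PATH_RENAMES`, `migrate_paths`, and `find_stale_paths` are hypothetical; only the path strings are taken from the diff.

```python
# Old -> new requirements-file paths, copied from the workflow diffs above.
# The helper names in this block are hypothetical, not part of MaxText.
PATH_RENAMES = {
    "src/install_maxtext_extra_deps/extra_post_train_base_deps_from_github.txt":
        "src/dependencies/github_deps/post_train_base_deps.txt",
    "src/install_maxtext_extra_deps/extra_deps_from_github.txt":
        "src/dependencies/github_deps/pre_train_deps.txt",
}


def migrate_paths(text: str) -> str:
    """Return `text` with every old requirements path replaced by its new path."""
    for old, new in PATH_RENAMES.items():
        text = text.replace(old, new)
    return text


def find_stale_paths(text: str) -> list[str]:
    """Return the old paths that still occur in `text`."""
    return [old for old in PATH_RENAMES if old in text]
```

Running `find_stale_paths` over the contents of a CI script before merging would surface any reference the rename missed; `migrate_paths` is safe to apply repeatedly because the new paths do not contain the old ones.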

PREFLIGHT.md

Lines changed: 7 additions & 7 deletions
@@ -1,35 +1,35 @@
 # Optimization 1: Multihost recommended network settings
-We included all the recommended network settings in [rto_setup.sh](https://github.com/google/maxtext/blob/main/rto_setup.sh).
+We included all the recommended network settings in [rto_setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/rto_setup.sh).

-[preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) will help you apply them based on GCE or GKE platform.
+[preflight.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/preflight.sh) will help you apply them based on GCE or GKE platform.

 Before you run ML workload on Multihost with GCE or GKE, simply apply `bash preflight.sh PLATFORM=[GCE or GKE]` to leverage the best DCN network performance.

 Here is an example for GCE:
 ```
-bash preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```

 Here is an example for GKE:
 ```
-bash preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```

 # Optimization 2: Numa binding (You can only apply this to v4 and v5p)
 NUMA binding is recommended for enhanced performance, as it reduces memory latency and maximizes data throughput, ensuring that your high-performance applications operate more efficiently and effectively.

 For GCE,
-[preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) will help you install `numactl` dependency, so you can use it directly, here is an example:
+[preflight.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/preflight.sh) will help you install `numactl` dependency, so you can use it directly, here is an example:

 ```
-bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```

 For GKE,
 `numactl` should be built into your docker image from [maxtext_tpu_dependencies.Dockerfile](https://github.com/google/maxtext/blob/main/src/dependencies/dockerfiles/maxtext_tpu_dependencies.Dockerfile), so you can use it directly if you built the maxtext docker image. Here is an example

 ```
-bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```

 1. `numactl`: This is the command-line tool used for controlling NUMA policy for processes or shared memory. It's particularly useful on multi-socket systems where memory locality can impact performance.

docs/build_maxtext.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
+<!--
+Copyright 2023-2026 Google LLC
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+https://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Build and Upload MaxText Docker Images
+
+This guide covers setting up a MaxText development environment and building container images for TPU and GPU workloads. These images can be used to run MaxText on GKE clusters with TPUs or GPUs, and are also required for running MaxText through XPK.
+
+## Prerequisites
+
+Before starting, ensure you have the following tools installed and configured:
+
+1. Environment Prep: Install and configure all [XPK prerequisites](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md#1-prerequisites).
+
+2. Docker Permissions: Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/) to run Docker without `sudo`.
+
+3. Artifact Registry Access: Authenticate with [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) for permission to push your images and other access.
+
+4. Authentication & Access: Run the following commands to authenticate your account and configure Docker:
+
+```bash
+# Authenticate your user account for gcloud CLI access
+gcloud auth login
+
+# Configure application default credentials for Docker and other tools
+gcloud auth application-default login
+
+# Configure Docker credentials and test your access
+gcloud auth configure-docker
+docker run hello-world
+```
+
+## Installation Modes
+
+We recommend building MaxText inside a Python virtual environment using `uv` for speed and dependency management.
+
+### Option 1: From PyPI (Recommended)
+
+This is the easiest way to get started with the latest stable version.
+
+```bash
+# Install uv, a fast Python package installer
+pip install uv
+
+# Create virtual environment
+export VENV_NAME=<your virtual env name> # e.g., docker_venv
+uv venv --python 3.12 --seed ${VENV_NAME?}
+source ${VENV_NAME?}/bin/activate
+
+# Install MaxText with the [runner] extra
+# This enables Docker image building and workload scheduling via XPK
+uv pip install maxtext[runner] --resolution=lowest
+```
+
+> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
+
+### Option 2: From Source
+
+If you plan to contribute to MaxText or need the latest unreleased features, install from source.
+
+```bash
+# Clone the repository
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+
+# Create virtual environment
+export VENV_NAME=<your virtual env name> # e.g., docker_venv
+uv venv --python 3.12 --seed ${VENV_NAME?}
+source ${VENV_NAME?}/bin/activate
+
+# Install MaxText with the [runner] extra in editable mode
+uv pip install .[runner] --resolution=lowest
+```
+
+> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
+
+## Build MaxText Docker Image
+
+Select the appropriate build commands based on your hardware (`TPU` or `GPU`) and your specific workflow (`pre-training` or `post-training`). Each of these commands will generate a local Docker image named `maxtext_base_image`.
+
+### TPU Pre-Training Docker Image
+
+```bash
+# Option 1: Build with the stable versions of dependencies (default)
+build_maxtext_docker_image
+
+# Option 2: Build with latest nightly versions of jax/jaxlib
+build_maxtext_docker_image MODE=nightly
+
+# Option 3: Build with the specified jax/jaxlib version
+build_maxtext_docker_image MODE=nightly JAX_VERSION=$JAX_VERSION
+```
+
+### GPU Pre-Training Docker Image
+
+```bash
+# Option 1: Build with the stable versions of dependencies (default)
+build_maxtext_docker_image DEVICE=gpu
+
+# Option 2: Build with latest nightly versions of jax/jaxlib
+build_maxtext_docker_image DEVICE=gpu MODE=nightly
+
+# Option 3: Build with base image as `ghcr.io/nvidia/jax:base-2024-12-04`
+build_maxtext_docker_image DEVICE=gpu MODE=pinned
+
+# Option 4: Build with the specified jax/jaxlib version
+build_maxtext_docker_image DEVICE=gpu MODE=nightly JAX_VERSION=$JAX_VERSION
+```
+
+### TPU Post-Training Docker Image
+
+```bash
+# This build process takes approximately 10 to 15 minutes.
+build_maxtext_docker_image WORKFLOW=post-training
+```
+
+## Upload MaxText Docker Image to Artifact Registry
+
+> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access".
+
+```bash
+# Make sure to replace <Docker Image Name> with your desired image name.
+export CLOUD_IMAGE_NAME=<Docker Image Name>
+upload_maxtext_docker_image CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME?}
+```
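
The new `docs/build_maxtext.md` documents `build_maxtext_docker_image` with `KEY=VALUE` arguments (`DEVICE`, `MODE`, `WORKFLOW`, `JAX_VERSION`). As an illustration only, and not part of this commit, the sketch below assembles such a command line from optional settings. The function `build_image_command` is hypothetical, and it assumes that omitting an argument leaves the tool on its documented default; it does not validate which combinations the tool accepts.

```python
import shlex


def build_image_command(device=None, mode=None, workflow=None, jax_version=None):
    """Compose a build_maxtext_docker_image command line as a single string.

    Hypothetical helper: only arguments that are explicitly set are emitted,
    so unset ones fall back to whatever the tool treats as its default.
    """
    pairs = [
        ("DEVICE", device),            # e.g. "gpu"
        ("MODE", mode),                # e.g. "nightly" or "pinned"
        ("WORKFLOW", workflow),        # e.g. "post-training"
        ("JAX_VERSION", jax_version),  # a version string chosen by the caller
    ]
    args = ["build_maxtext_docker_image"]
    args += [f"{key}={value}" for key, value in pairs if value is not None]
    # shlex.join quotes any argument that would be unsafe in a POSIX shell.
    return shlex.join(args)
```

For example, `build_image_command(device="gpu", mode="nightly")` yields the same command as Option 2 in the GPU section of the added document.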

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -120,8 +120,8 @@
 os.path.join("run_maxtext", "run_maxtext_via_multihost_runner.md"),
 os.path.join("reference", "core_concepts", "llm_calculator.ipynb"),
 os.path.join("reference", "api_generated", "modules.rst"),
-os.path.join("reference", "api_generated", "install_maxtext_extra_deps.rst"),
-os.path.join("reference", "api_generated", "install_maxtext_extra_deps.install_github_deps.rst"),
+os.path.join("reference", "api_generated", "dependencies.github_deps.rst"),
+os.path.join("reference", "api_generated", "dependencies.github_deps.install_pre_train_deps.rst"),
 ]
