@@ -7,7 +7,7 @@ This guide explains how to set up the environment and train the HSTU/DLRM models
 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.

 ### 1. Prerequisites
-Ensure you have **Python 3.11+** installed.
+Ensure you have **Python 3.12+** installed.
 ```bash
 python3 --version
 ```
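The `python3 --version` command above only prints the version. A script can also enforce the new 3.12 minimum. The following is a minimal sketch, assuming the usual `Python X.Y.Z` output format; `meets_min_python` is a hypothetical helper name, not something this repository provides.

```bash
# meets_min_python: hypothetical helper, succeeds when a "Python X.Y.Z"
# string reports version 3.12 or newer.
meets_min_python() (
  ver="${1#Python }"     # drop the leading "Python " -> "3.12.3"
  major="${ver%%.*}"     # text before the first dot  -> "3"
  rest="${ver#*.}"       # text after the first dot   -> "12.3"
  minor="${rest%%.*}"    # text before the next dot   -> "12"
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 12 ]; }
)

# Only query the interpreter when python3 is actually on PATH.
if command -v python3 >/dev/null 2>&1; then
  if meets_min_python "$(python3 --version 2>&1)"; then
    echo "Python version OK"
  else
    echo "Python 3.12 or newer is required" >&2
  fi
fi
```

The function body runs in a subshell (parentheses instead of braces), so its temporary variables do not leak into the calling shell.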
@@ -23,6 +23,11 @@ source venv/bin/activate
 ```

 ### 3. Install Dependencies
+
+Install the latest version of the jax-tpu-embedding library:
+```bash
+pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+```
 ```bash
 pip install -r requirements.txt
 ```
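The wheel filename above carries a `cp312-cp312` tag pair, which under the standard wheel naming convention indicates a build for CPython 3.12. It can therefore help to compare that tag with the active interpreter before running `pip install`. This is a sketch only: `wheel_python_tag` is a hypothetical helper, and it assumes the platform tag contains no dashes.

```bash
# wheel_python_tag: hypothetical helper that prints the Python tag of a wheel.
# Wheel names end in -{python tag}-{abi tag}-{platform tag}.whl.
wheel_python_tag() (
  base="${1##*/}"        # strip any directory prefix
  base="${base%.whl}"    # strip the extension
  rest="${base%-*-*}"    # drop the trailing abi and platform tags
  printf '%s\n' "${rest##*-}"
)

wheel="./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl"
wheel_tag="$(wheel_python_tag "$wheel")"
echo "wheel targets: ${wheel_tag}"

# Compare with the active interpreter only when python3 is available.
if command -v python3 >/dev/null 2>&1; then
  py_tag="$(python3 -c 'import sys; print("cp%d%d" % sys.version_info[:2])')"
  if [ "$py_tag" != "$wheel_tag" ]; then
    echo "interpreter is ${py_tag}, wheel expects ${wheel_tag}" >&2
  fi
fi
```

A mismatch is only reported here, not treated as fatal, since `pip` itself rejects a wheel whose tags do not match the interpreter.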
@@ -41,47 +46,17 @@ python dlrm_experiment_test.py

 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.

-### 1. Create a Dockerfile
-Create a file named `Dockerfile` in the root of the repository:
-
-```dockerfile
-# Use an official Python 3.11 runtime as a parent image
-FROM python:3.11-slim
-
-# Set the working directory
-WORKDIR /app
-
-# Copy the current directory contents into the container
-COPY . /app
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# Install system tools if needed (e.g., git)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
-# Install dependencies
-RUN pip install --upgrade pip
-RUN pip install -r requirements.txt
-
-# Force install the specific protobuf version
-RUN pip install "protobuf>=6.31.1" --no-deps
-
-# Default command to run the training script
-CMD ["python", "recml/examples/dlrm_experiment_test.py"]
-```
-
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
-
-### 2. Build the Image
+### 1. Build the Image

 Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.

 ```bash
 docker build -t recml-training .
 ```
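Because `docker build` reads the `Dockerfile` from the build context, the command above fails when it is run outside the repository root. A small guard can catch that early. This is a sketch only; `require_dockerfile` is a hypothetical helper name.

```bash
# require_dockerfile: hypothetical helper, succeeds only when the given
# directory (default: current directory) contains a Dockerfile.
require_dockerfile() {
  if [ ! -f "${1:-.}/Dockerfile" ]; then
    echo "no Dockerfile in ${1:-.}; run this from the repository root" >&2
    return 1
  fi
}

# Build only when the guard passes and docker is installed.
if require_dockerfile . && command -v docker >/dev/null 2>&1; then
  docker build -t recml-training .
fi
```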

-### 3. Run the Image
+### 2. Run the Image
+
+This runs the Docker image and executes the default command from the `Dockerfile`, which currently runs the DLRM experiment.

 ```bash
 docker run --rm --privileged \
@@ -90,9 +65,3 @@ docker run --rm --privileged \
   --name recml-experiment \
   recml-training
 ```
-
-### What is happening here?
-* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
-* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
-* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
-* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.