@@ -6,8 +6,8 @@ This guide explains how to set up the environment and train the HSTU/DLRM models

 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.

-### 1. Prerequisites
-Ensure you have **Python 3.11+** installed.
+#### 1. Prerequisites
+Ensure you have **Python 3.12+** installed.
 ```bash
 python3 --version
 ```
@@ -23,6 +23,11 @@ source venv/bin/activate
 ```

 ### 3. Install Dependencies
+
+Install the latest version of the jax-tpu-embedding library:
+```bash
+pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+```
 ```bash
 pip install -r requirements.txt
 ```
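The wheel added in this hunk carries the tags `cp312-cp312-manylinux_2_31_x86_64`. A `cp312` ABI tag is accepted by pip only on CPython 3.12, so an interpreter newer than 3.12 would likely reject this particular file even though the prerequisite reads "3.12+". The following is a minimal pre-install check using only the standard library. The helper names `wheel_python_tag` and `interpreter_tag` are hypothetical and are not part of the repository.

```python
import sys

# Wheel filename copied from the install step in this diff.
WHEEL = "jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl"


def wheel_python_tag(filename: str) -> str:
    """Return the python tag (e.g. 'cp312') from a wheel filename.

    Wheel names end in {python tag}-{abi tag}-{platform tag}.whl, so the
    python tag is the third hyphen-separated field from the right.
    """
    stem = filename.removesuffix(".whl")
    return stem.split("-")[-3]


def interpreter_tag(version_info=sys.version_info) -> str:
    """Build a CPython-style tag from a version tuple, e.g. (3, 12) -> 'cp312'.

    Assumption: the interpreter is CPython; other implementations use
    different prefixes.
    """
    return f"cp{version_info[0]}{version_info[1]}"


if __name__ == "__main__":
    wanted = wheel_python_tag(WHEEL)
    have = interpreter_tag()
    if wanted == have:
        print(f"OK: interpreter tag {have} matches the wheel")
    else:
        print(f"Mismatch: wheel needs {wanted}, interpreter is {have}")
```

Running the script inside the activated virtual environment before `pip install` shows whether the interpreter and the wheel agree.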
@@ -41,57 +46,17 @@ python dlrm_experiment_test.py

 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.

-### 1. Create a Dockerfile
-Create a file named `Dockerfile` in the root of the repository:
-
-```dockerfile
-# Use an official Python 3.12 runtime as a parent image
-FROM python:3.12-slim
-
-# Set the working directory
-WORKDIR /app
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# Install system tools if needed (e.g., git)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
-# Install the latest jax-tpu-embedding wheel
-COPY jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl ./
-RUN pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
-
-# Copy requirements.txt to current directory
-COPY requirements.txt ./
-
-# Install dependencies
-RUN pip install --upgrade pip
-RUN pip install -r ./requirements.txt
-
-# Force install the specific protobuf version
-RUN pip install "protobuf>=6.31.1" --no-deps
-
-# Copy the current directory contents into the container
-COPY . /app
-
-# Default command to run the training script
-CMD ["python", "recml/examples/dlrm_experiment_test.py"]
-```
-
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
-
-### 2. Build the Image
+### 1. Build the Image

 Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.

 ```bash
 docker build -t recml-training .
 ```

-### 3. Run the Image
+### 2. Run the Image
+
+This will run the docker image and execute the command specified, which is currently set to run DLRM.

 ```bash
 docker run --rm --privileged \
@@ -100,9 +65,3 @@ docker run --rm --privileged \
  --name recml-experiment \
  recml-training
 ```
-
-### What is happening here?
-* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
-* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
-* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
-* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
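
Since this change removes the flag explanations from the README, one way to keep the rationale next to the command is a small launcher script. This is a sketch only. The hunks above elide two lines of the `docker run` command, so it is an assumption that the full command consists of exactly the four flags from the removed bullets plus `--name` and the image name. `build_docker_run` and `TPU_RUN_FLAGS` are hypothetical names, not something in the repository.

```python
import shlex

# Flag -> reason, paraphrased from the removed "What is happening here?" bullets.
TPU_RUN_FLAGS = {
    "--rm": "delete the container after the script finishes",
    "--privileged": "give the container access to host devices (the TPU chips)",
    "--net=host": "let the script reach the TPU runtime on local ports",
    "--ipc=host": "share the host's IPC / shared memory with the container",
}


def build_docker_run(image: str, name: str) -> list[str]:
    """Assemble the argv for `docker run` with the TPU-related flags.

    Assumption: no other flags are needed; the diff does not show the
    complete command.
    """
    return ["docker", "run", *TPU_RUN_FLAGS, "--name", name, image]


if __name__ == "__main__":
    argv = build_docker_run("recml-training", "recml-experiment")
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(argv))
```

To actually launch the container, the same list can be passed to `subprocess.run(argv, check=True)`.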