Commit ab47436

updated training instructions
1 parent bccd9a3 commit ab47436

1 file changed

Lines changed: 40 additions & 6 deletions

training.md

@@ -45,27 +45,61 @@ If you prefer not to manage a virtual environment or want to deploy this as a co
 Create a file named `Dockerfile` in the root of the repository:

 ```dockerfile
-# Use an official Python 3.11 runtime as a parent image
-FROM python:3.11-slim
+# Use an official Python 3.12 runtime as a parent image
+FROM python:3.12-slim

 # Set the working directory
 WORKDIR /app

-# Copy the current directory contents into the container
-COPY . /app
+# This tells Python to look in /app for the 'recml' package
+ENV PYTHONPATH="${PYTHONPATH}:/app"

 # Install system tools if needed (e.g., git)
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

+# Install the latest jax-tpu-embedding wheel
+COPY jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl ./
+RUN pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+
+# Copy requirements.txt to current directory
+COPY requirements.txt ./
+
 # Install dependencies
 RUN pip install --upgrade pip
-RUN pip install -r requirements.txt
+RUN pip install -r ./requirements.txt

 # Force install the specific protobuf version
 RUN pip install "protobuf>=6.31.1" --no-deps

+# Copy the current directory contents into the container
+COPY . /app
+
 # Default command to run the training script
 CMD ["python", "recml/examples/dlrm_experiment_test.py"]
 ```

-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+
+### 2. Build the Image
+
+Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
+
+```bash
+docker build -t recml-training .
+```
+
+### 3. Run the Image
+
+```bash
+docker run --rm --privileged \
+--net=host \
+--ipc=host \
+--name recml-experiment \
+recml-training
+```
+
+### What is happening here?
+* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
+* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
+* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
+* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
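
Note on the build step: the Dockerfile in this commit copies `requirements.txt` and the `jax_tpu_embedding` wheel from the build context by name, so `docker build` will likely fail at the corresponding `COPY` step if either file is absent from the repository root. The sketch below is one possible pre-flight check, not part of the commit. The function names `check_build_context` and `print_run_cmd` are hypothetical. The file names, image name, and `docker run` flags are taken from the diff.

```shell
# Sketch only: helper names are hypothetical; file names and flags come from the diff.

# Wheel file that the Dockerfile copies from the build context.
WHEEL="jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl"

# Report every file the build needs by name that is missing from the
# build context directory "$1" (default: current directory).
# Returns non-zero if at least one file is missing.
check_build_context() {
  ctx="${1:-.}"
  missing=0
  for f in Dockerfile requirements.txt "$WHEEL"; do
    if [ ! -f "$ctx/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Print the step 3 run command for image "$1" (default: recml-training)
# without executing it, so the flags can be reviewed first.
print_run_cmd() {
  image="${1:-recml-training}"
  echo "docker run --rm --privileged --net=host --ipc=host --name recml-experiment $image"
}
```

A possible use, assuming the functions have been sourced into the current shell: `check_build_context . && docker build -t recml-training .`, followed by `print_run_cmd` to review the run command before launching it on the TPU host.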
