@@ -45,27 +45,61 @@ If you prefer not to manage a virtual environment or want to deploy this as a co
 Create a file named `Dockerfile` in the root of the repository:
 
 ```dockerfile
-# Use an official Python 3.11 runtime as a parent image
-FROM python:3.11-slim
+# Use an official Python 3.12 runtime as a parent image
+FROM python:3.12-slim
 
 # Set the working directory
 WORKDIR /app
 
-# Copy the current directory contents into the container
-COPY . /app
+# This tells Python to look in /app for the 'recml' package
+ENV PYTHONPATH="${PYTHONPATH}:/app"
 
 # Install system tools if needed (e.g., git)
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
+# Install the latest jax-tpu-embedding wheel
+COPY jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl ./
+RUN pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+
+# Copy requirements.txt to current directory
+COPY requirements.txt ./
+
 # Install dependencies
 RUN pip install --upgrade pip
-RUN pip install -r requirements.txt
+RUN pip install -r ./requirements.txt
 
 # Force install the specific protobuf version
 RUN pip install "protobuf>=6.31.1" --no-deps
 
+# Copy the current directory contents into the container
+COPY . /app
+
 # Default command to run the training script
 CMD ["python", "recml/examples/dlrm_experiment_test.py"]
 ```
 
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+
+### 2. Build the Image
+
+Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
+
+```bash
+docker build -t recml-training .
+```
+
+### 3. Run the Image
+
+```bash
+docker run --rm --privileged \
+  --net=host \
+  --ipc=host \
+  --name recml-experiment \
+  recml-training
+```
+
+### What is happening here?
+* **`--rm`**: Removes the container automatically once the script exits, so stopped containers do not accumulate on disk.
+* **`--privileged`**: Gives the container direct access to the host's hardware devices, which it needs in order to see the physical TPU chips.
+* **`--net=host`**: Shares the host's network stack instead of isolating the container, so the script can connect to the TPU runtime listening on local ports (e.g., 8353).
+* **`--ipc=host`**: Lets the container use the host's shared memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
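
A quick way to check that these flags actually expose the TPU is to override the image's default command with a one-line device listing. The sketch below is not part of this change. It assumes `jax` is importable inside the `recml-training` image (the diff does not show `requirements.txt`), and the `smoke_test` function and the `DOCKER`, `IMAGE`, `RUN_FLAGS`, and `PY_CHECK` variables are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical smoke test: list the devices JAX can see inside the container.
# Assumes `jax` is installed in the image; the diff does not confirm this.

IMAGE="${IMAGE:-recml-training}"                      # tag from the build step
RUN_FLAGS="--rm --privileged --net=host --ipc=host"   # same flags as the run step
PY_CHECK='import jax; print(jax.devices())'

smoke_test() {
  # RUN_FLAGS is left unquoted on purpose so it splits into separate flags.
  # --name is omitted so this check cannot clash with a running experiment.
  # Set DOCKER=echo to print the arguments instead of starting a container.
  ${DOCKER:-docker} run $RUN_FLAGS "$IMAGE" python -c "$PY_CHECK"
}

# Only start a container when the script is invoked with "run".
if [ "${1:-}" = "run" ]; then
  smoke_test
fi
```

Saved under a name such as `smoke_test.sh`, running `bash smoke_test.sh run` on a TPU host should print TPU devices if the container can see the hardware. A CPU-only list or an initialization error would suggest that the flags or the TPU runtime are not set up as expected.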