Commit b0aa1cc

Added training guidelines on how to run DLRM/HSTU on v6
1 parent ac003e6 commit b0aa1cc

1 file changed

Lines changed: 86 additions & 0 deletions

training.md

@@ -0,0 +1,86 @@
# Model Training Guide

This guide explains how to set up the environment and train the HSTU/DLRM models on Cloud TPU v6 (or other TPU generations).

## Option 1: Virtual Environment (Recommended for Dev)

If you are developing directly on a TPU VM, use a virtual environment to avoid conflicts with the system-level Python packages.

### 1. Prerequisites
Ensure you have **Python 3.11+** installed:
```bash
python3 --version
```
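
If you would rather have the shell enforce the version requirement than read it off by eye, a check like the following could be used. This is a minimal sketch: the helper name `python_version_ok` is hypothetical and not part of the repository.

```bash
# Hypothetical helper (not part of the repository): succeeds when the given
# "major.minor[.patch]" version string is 3.11 or newer.
python_version_ok() {
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # text before the next dot, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 11 ]; }
}

# Check the interpreter on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
  found="$(python3 -c 'import platform; print(platform.python_version())')"
  if python_version_ok "$found"; then
    echo "Python $found is new enough"
  else
    echo "Python $found is older than 3.11" >&2
  fi
fi
```

The helper only compares the major and minor components, which is all the 3.11+ requirement needs.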
### 2. Create and Activate Virtual Environment
Run the following from the root of the repository:
```bash
# Create the venv
python3 -m venv venv

# Activate it
source venv/bin/activate
```

### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
We need to force Protobuf to version 6.31.1 or newer to ensure compatibility with our TPU stack. Run this exactly as shown:
```bash
pip install "protobuf>=6.31.1" --no-deps
```
The `--no-deps` flag is required to prevent pip from downgrading Protobuf due to strict dependency pinning in other libraries.
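
To confirm that the override took effect, you can compare the installed Protobuf version against the 6.31.1 minimum. This is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils does); the helper name `version_ge` is hypothetical.

```bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read the version from the active environment, if protobuf is importable.
if installed="$(python3 -c 'import google.protobuf as pb; print(pb.__version__)' 2>/dev/null)"; then
  if version_ge "$installed" "6.31.1"; then
    echo "protobuf $installed satisfies >=6.31.1"
  else
    echo "protobuf $installed is older than 6.31.1" >&2
  fi
else
  echo "protobuf is not importable in this environment" >&2
fi
```

Running `pip check` afterwards lists installed packages whose declared requirements are not met, so any library that pins an older Protobuf should show up there.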

### 4. Run the Training for DLRM
```bash
# Path is relative to the repository root, matching the Dockerfile CMD in Option 2
python recml/examples/dlrm_experiment_test.py
```

## Option 2: Docker (Recommended for Production)

If you prefer not to manage a virtual environment, or want to deploy the training job as a container, you can build a Docker image.

### 1. Build the Image
Create a file named `Dockerfile` in the root of the repository:

```dockerfile
# Use an official Python 3.11 runtime as a parent image
FROM python:3.11-slim

# Set the working directory
WORKDIR /app

# Copy the current directory contents into the container
COPY . /app

# Install system tools if needed (e.g., git)
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Install dependencies
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Force-install Protobuf 6.31.1 or newer without touching other packages
RUN pip install "protobuf>=6.31.1" --no-deps

# Default command to run the training script
CMD ["python", "recml/examples/dlrm_experiment_test.py"]
```

You can use this Dockerfile to run the DLRM model experiment from this repo in your own environment.
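
The guide does not spell out the build and run commands, so the following is a minimal sketch. The image tag `recml-dlrm:latest` and the function names are hypothetical. The `--privileged` and `--net=host` flags are an assumption, based on the usual way containers are given access to TPU devices on a TPU VM; adjust or drop them to suit your environment.

```bash
# Hypothetical image tag; any name works.
IMAGE_TAG="recml-dlrm:latest"

build_image() {
  # Build from the Dockerfile in the current directory (the repository root).
  docker build -t "$IMAGE_TAG" .
}

run_image() {
  # Assumption: --privileged and --net=host expose the TPU devices to the
  # container; the image's default CMD then starts the DLRM experiment.
  docker run --rm --privileged --net=host "$IMAGE_TAG"
}
```

From the repository root, call `build_image` once and then `run_image` for each training run.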