
Commit 5bf590c

committed: added a check to continue sampling even if a kernel didn't execute due to branching; updated README with docker instructions

1 parent f6bb926

2 files changed: 80 additions & 5 deletions

File tree:
- README.md
- gatherData.py

README.md (74 additions & 3 deletions)
@@ -48,16 +48,76 @@ We also provide a `Dockerfile` so you can set up and run our code on your own sy

[![Build ALL CUDA/OMP Codes](https://github.com/gregbolet/gpu-flopbench/actions/workflows/buildAllCodesGithubAction.yml/badge.svg)](https://github.com/gregbolet/gpu-flopbench/actions/workflows/buildAllCodesGithubAction.yml)

## Docker Setup Instructions

For ease of reproducibility, we supply a `Dockerfile` with the necessary steps to recreate our environment and dataset on your own GPU hardware.
The following steps will help you get set up and into the main bash shell of the container.

‼️‼️
We note that the base container image will take up about 40 GB of storage space; once we start building codes and gathering profiling data, the disk usage will jump to about 50 GB.
Please ensure your system has enough storage space before continuing.
‼️‼️

```
git clone git@github.com:gregbolet/gpu-FLOPBench.git ./gpu-flopbench
cd ./gpu-flopbench
docker build --progress=plain -t 'gpu-flopbench' .
docker run -ti --gpus all --name gpu-flopbench-container --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all gpu-flopbench
docker exec -it gpu-flopbench-container /bin/bash
```

Note: if you're on a **Windows Docker Desktop** host, be sure to enable the following:
```
NVIDIA Control Panel >> Desktop Tab >> Enable Developer Settings (make sure it's enabled)
then
NVIDIA Control Panel >> Select a Task... Tree Pane >> Developer (expand section) >> Manage GPU Performance Counters >> Allow access to the GPU performance counters to all users (make sure this is enabled)
then
restart Docker Desktop
```

## Docker Data Collection Instructions (CUDA program building & profiling)

Once you're in the main bash shell of the container, you should, by default, be in the `/gpu-flopbench` directory with a conda environment called `gpu-flopbench` active.
We can now start building all the codes and collecting their performance counter data! 🌈😊

Run the following commands from the `gpu-flopbench` main project directory within the Docker container (they should work without issue):
```
source ./runBuild.sh
```
^ Depending on the number of cores on your CPU, this can take anywhere from 5-20 minutes.
It essentially builds all the codes using our `CMakeLists.txt` file.
Once this is done, we can start gathering CUDA kernel profiling data with the following command:
```
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=roofline-data.csv 2>&1 | tee runlog.txt
```
^ This process will take about 5-6 hours, so please have someone around to babysit it in case any unexpected issues arise.
We tested this on our own Docker container and had no issues.

# Solo (no Docker) Instructions

Below are instructions for reproducing what the above Docker container does, but on your own system.
This path is laden with more unexpected complications and potentially more debugging effort, so continue at your own risk.
Many of the CUDA codes we built had their compilation instructions tailored to our particular system, so you may end up doing more work to get everything built and running if you change compilers, compiler versions, or CUDA versions.
In future work we would like to make the build process system-agnostic, but for now this is what we have working.

## Building

Start by simply cloning our repo.
```
git clone git@github.com:gregbolet/gpu-FLOPBench.git ./gpu-flopbench
cd ./gpu-flopbench
```

Execute the following command to generate the Makefile and start the build process.
This will automatically `make` all the programs; **you'll NEED to edit the `runBuild.sh` script to properly set any compilers/options for the codes to build**.
By default, we have everything building with `clang++` and `clang`; this should mostly work out-of-the-box, but some include paths may need to be set/overridden (SEE BELOW).

```
source ./runBuild.sh
```
NOTE: If you're running this from a Docker container generated from our Dockerfile, it should work out-of-the-box.
We originally had the CUDA codes building with `nvcc`, but to be able to also build SYCL and OMP codes, we switched to just LLVM. You may still be able to build the codes with `nvcc`, but it may take some modifications to the build pipeline.
We have future plans to sample SYCL and OMP codes, but for now, this work focuses on CUDA codes.


## Common Build Issues

@@ -75,18 +135,26 @@ Here's a list of other common build issues that might help if you're encounterin
- missing libs to link
- putting some search/include dirs before others when compiling (duplicate filenames can cause header include mixups)

We note that our entire build process is captured in one `CMakeLists.txt` file.
This was done deliberately so we could build all the codes in a batch manner, as manually modifying individual HeCBench Makefiles was tiresome.

Essentially, our `CMakeLists.txt` file treats each `src/*-cuda` and `src/*-omp` directory as a single CMake/Makefile target, with the corresponding output executable having the same name as its `src` directory.
We automatically include many of the sub-directories for header files.
Our `CMakeLists.txt` file is so long because many of the codes needed their build steps manually modified to build correctly.
This took a while to do, but ultimately makes the build process much easier and more manageable.
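
To illustrate the idea, here is a minimal sketch of the per-directory target pattern (not the repo's actual `CMakeLists.txt`; the variable names and glob patterns are our own illustrative assumptions):

```
# sketch: one executable target per src/*-cuda directory, named after it
file(GLOB benchmark_dirs RELATIVE ${CMAKE_SOURCE_DIR}/src ${CMAKE_SOURCE_DIR}/src/*-cuda)
foreach(bench ${benchmark_dirs})
  # gather this benchmark's CUDA/C++ sources
  file(GLOB_RECURSE bench_srcs
       ${CMAKE_SOURCE_DIR}/src/${bench}/*.cu
       ${CMAKE_SOURCE_DIR}/src/${bench}/*.cpp)
  add_executable(${bench} ${bench_srcs})
  # expose the benchmark's own header sub-directories
  target_include_directories(${bench} PRIVATE ${CMAKE_SOURCE_DIR}/src/${bench})
endforeach()
```

In the real file, per-target tweaks (extra flags, libraries, include paths) follow a loop like this, which is what makes it so long.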

## Python Environment Setup

We used Python3 (v3.11.11) for executing our Python scripts.
The `requirements.txt` file contains all the necessary packages and their versions, which should be installed prior to using any of our Python scripts.
It is strongly advised to set up a new Conda environment so as not to disturb the base Python installation on your system.

```
conda create --name "gpu-flopbench" python=3.11.11
conda activate gpu-flopbench
pip install -r ./requirements.txt
```
NOTE: This is already done for you if you're using the supplied Dockerfile.

## Gathering Roofline Data

@@ -95,6 +163,7 @@ Once all the codes are built, we can start the data collection process. We have
```
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=roofline-data.csv 2>&1 | tee runlog.txt
```
NOTE: This command should work out-of-the-box if you built a container using our Dockerfile.

This will automatically invoke each of the built executables, using `ncu` (NVIDIA Nsight Compute) to profile each of the kernels in the executable. Some of the codes require input files to be downloaded beforehand; this script takes care of the downloading process and makes sure that all the requested files are in place.
The `DATAPATH` environment variable is only needed by `frna-cuda` and `prna-cuda`, so if you're not running those, you can drop it.
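
As a rough illustration of what `gatherData.py` does for each kernel (a minimal sketch, not the script's actual code; the `profile_kernel` helper and the exact `ncu` flags are assumptions for illustration):

```
import subprocess

def profile_kernel(exe: str, kernel: str) -> str:
    """Profile one kernel of one built executable with NVIDIA Nsight Compute."""
    cmd = ['ncu',
           '--kernel-name', kernel,  # restrict profiling to this kernel
           '--csv',                  # machine-readable report
           exe]
    result = subprocess.run(cmd, capture_output=True)
    stdout = result.stdout.decode('UTF-8')
    # mirror the script's error handling: fail loudly on a bad run
    assert result.returncode == 0, f'error in execution!\n {stdout}'
    return stdout
```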
@@ -114,6 +183,8 @@ The internal workflow at a high level looks like the following:

The `gatherData.py` script will emit a CSV file called `roofline-data.csv` containing all the benchmarking data. After each kernel is run, its results are appended as the last line of the CSV file. We encourage writing the results of the execution to a log file for later error/execution analysis.
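
The append-per-kernel pattern looks roughly like the following (a sketch with assumed column names, not the script's actual code):

```
import os
import pandas as pd

def append_result(row: dict, dfFilename: str = 'roofline-data.csv'):
    """Append one kernel's results to the CSV so progress survives a crash."""
    subset = pd.DataFrame({k: [v] for k, v in row.items()})
    writeHeader = not os.path.exists(dfFilename)  # header only on first write
    subset.to_csv(dfFilename, mode='a', header=writeHeader, index=False)
```

Because each row is flushed as soon as its kernel finishes, an interrupted run loses at most the in-flight kernel's data.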

‼️‼️ This process of profiling all the codes can take a while (roughly 6-7 hours); we suggest leaving the profiling running while someone babysits it in case of an unexpected error. ‼️‼️

## Scraping the CUDA kernels

gatherData.py (6 additions & 2 deletions)

@@ -1003,6 +1003,10 @@ def execute_targets(targets:list, dfFilename:str, skipRuns:bool=False):
    stdout = execResult.stdout.decode('UTF-8')
    assert execResult.returncode == 0, f'error in execution!\n {stdout}'

    # the kernel may never launch due to branching; ncu then warns that
    # nothing was profiled, so skip this sample instead of aborting the sweep
    if '==WARNING== No kernels were profiled.' in stdout:
        print(f'\tSKIPWARNING No kernels were profiled for {basename}-[{kernelName}] -- skipping!')
        continue

    rawDF = roofline_results_to_df(rooflineResult)
    roofDF = calc_roofline_data(rawDF)

@@ -1014,8 +1018,8 @@ def execute_targets(targets:list, dfFilename:str, skipRuns:bool=False):
        else:
            # doing this now to skip failing runs
            continue
        # building the result row here is now commented out in this commit
        #dataDict = {'targetName':[basename], 'exeArgs':[exeArgs], 'kernelName':[kernelName]}
        #subset = pd.DataFrame(dataDict)

    # if we skip the run, read the nsys-rep files
    else:
