Commit f8c1c36
1 parent f23807f
committed: reorganizing code, still need to get jupyter working

10 files changed: 5775 additions & 1290 deletions

Dockerfile

Lines changed: 9 additions & 2 deletions

@@ -12,7 +12,7 @@ SHELL ["/bin/bash", "-c"]
 RUN apt-get update && \
     apt-get upgrade -y && \
     apt-get install -y wget make git gfortran libomp-18-dev libboost-all-dev clang-18 clang-tools-18 unzip && \
-    apt-get install -y imagemagick && \
+    apt-get install -y imagemagick vim && \
     wget https://github.com/Kitware/CMake/releases/download/v3.28.0/cmake-3.28.0-linux-x86_64.sh && \
     chmod +x cmake-3.28.0-linux-x86_64.sh && \
     ./cmake-3.28.0-linux-x86_64.sh --skip-license --prefix=/usr/local && \
@@ -79,4 +79,11 @@ RUN if [ "$HOSTOS" = "windows" ]; then \
 RUN echo 'conda activate gpu-flopbench' >> ~/.bashrc

 # one of the issues with a windows host is that the execute permissions are not preserved when copying files into the container
-# this is okay, and it seems like everything works fine without needing to change it.
+# this is okay, and it seems like everything works fine without needing to change it.
+
+# set an environment variable for convenience
+ENV GPU_FLOPBENCH_ROOT=/gpu-flopbench
+
+# expose the Jupyter notebook port
+EXPOSE 8888

README.md

Lines changed: 79 additions & 33 deletions

@@ -96,19 +96,86 @@ source ./runBuild.sh
 It's essentially building all the codes using our `CMakeLists.txt` file.
 Once this is done, we can start gathering CUDA kernel profiling data with the following command:
 ```
-cd ./cuda-profiling
+cd $GPU_FLOPBENCH_ROOT/cuda-profiling

-LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee -a runlog.txt
+LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/../src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee -a runlog.txt
 ```
 ^ This process will take about 10 hours, so please have someone around to babysit in case any unexpected issues arise.
 We tested this on our own Docker container and had no issues.

-### Scraping Source Codes
+
+## Scraping Source Codes

 While you wait for the performance counter data to gather, you can start with a simple scrape of the CUDA codes.
 ```
-python3 simpleScrapeKernels.py --skipSASS --cudaOnly --outfile="simple-scraped-kernels.json"
+## Output file already provided -- no need to run these commands
+
+cd $GPU_FLOPBENCH_ROOT/source-scraping
+
+python3 simpleScrapeKernels.py --skipSASS --cudaOnly --outfile="simple-scraped-kernels-CUDA-only.json"
 ```
+This will generate a file called `simple-scraped-kernels-CUDA-only.json`, which is the full textual representation of each program with concatenated files.
+This means that CUDA kernels that come from the same parent program have the same source code that gets passed to an LLM.
+
+
+## Source Code Manual Categorization
+This part of the dataset is the human-annotated source codes (no need to run anything).
+We manually inspected each of the scraped kernels and flagged the codes for the following properties:
+
+1) Warp Divergence (i.e.: conditional statements)
+2) Calling a `__device__` function
+3) Having recursion
+4) Data-dependent Warp Divergence (i.e.: conditional statements that depend on data read in or calculated at runtime)
+5) FLOP Division
+6) Calls to external libraries or functions (e.g.: CCCL/CUB, cuSparse, cuFFT)
+7) Calling special math functions / intrinsics (e.g.: sin/cos, `__ddiv_rd`)
+8) Having common float subexpressions (i.e.: repeated calculations)
+
+If a code is flagged with any of the features 4-8, it is considered a *hard* code that can NOT be directly statically analyzed, and thus could host hidden FLOP operations.
+We provide this manually-annotated dataset in the file: `$GPU_FLOPBENCH_ROOT/manual-code-classification/manually_classified_CUDA_kernels.json`.
+The UI for performing this classification was served through Streamlit, which allowed us to quickly get through manual kernel categorization efforts.
+The file for launching the classification UI is: `$GPU_FLOPBENCH_ROOT/manual-code-classification/manualStreamlit_classify.py`, although we provide our already-made classifications in the `manually_classified_CUDA_kernels.json` file.
+
+### Automatic Scraping for Annotation
+In order to manually classify each of the kernels, we needed a way to automatically focus on the CUDA kernel of interest for each profiled kernel.
+We performed a step where we attempted to automatically scrape the CUDA kernel function definitions and their call-path code, but given the complexity of C++, it was challenging.
+The script below produces an output file called `extracted_CUDA_kernels.json`, which is the result of these efforts.
+```
+## Output file already provided -- no need to run these commands
+
+cd $GPU_FLOPBENCH_ROOT/source-scraping
+
+python3 extract_kernels.py
+```
+The main problem with this approach is that we used TreeSitter for static concrete syntax tree (CST) parsing, which made it challenging to handle all the edge cases of C++, so lots of codes were not fully scraped (e.g.: missing relevant callee code, inclusion of code from CUDA headers, inclusion of incorrect overloaded callee function definitions).
+We end up using these scrapes in the Streamlit UI for manual annotations, where an annotator would often have to revert back to the full source code of a kernel to properly flag its features.
+
+We hope to improve this process in the future, but for now it's mostly a human-based effort.
+Before this human-based approach, we did try to use LLMs, but they struggled to correctly extract code in many cases. Since we couldn't put much confidence in LLMs to reliably perform this task for us, we scrapped the fully LLM-automated scraping approach.
+
+# Dataset Creation
+Now that we have the performance counter profiling data, scraped kernels, and manual annotations, we can create the dataset.
+We break the process up into two steps: (1) creation of the *easy* subset and (2) creation of the *hard* subset, where we then join the two subsets into one dataset.
+
+## *Easy* Subset Creation
+
+## *Hard* Subset Creation
+
+## Dataset Amalgamation
+

 # Solo (no Docker) Instructions

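The manual-categorization scheme added in the hunk above (features 1-8, where any of features 4-8 marks a kernel as *hard*) can be sketched as a small helper. This is an illustrative sketch only, not code from the repo; the feature-ID representation and function names are assumptions:

```python
# Sketch: classify an annotated kernel as "easy" or "hard" from its manually
# flagged feature IDs (1-8). Any of features 4-8 (data-dependent divergence,
# FLOP division, external library calls, special math intrinsics, common float
# subexpressions) makes the kernel "hard", i.e. not directly statically
# analyzable. Names and data layout here are illustrative, not the repo's.
HARD_FEATURES = {4, 5, 6, 7, 8}

def is_hard_kernel(flagged_features: set[int]) -> bool:
    """Return True if the kernel could host hidden FLOP operations."""
    return bool(flagged_features & HARD_FEATURES)

def split_easy_hard(kernels: dict[str, set[int]]) -> tuple[list[str], list[str]]:
    """Partition kernel names into (easy, hard) lists."""
    easy, hard = [], []
    for name, feats in kernels.items():
        (hard if is_hard_kernel(feats) else easy).append(name)
    return easy, hard
```

For example, a kernel flagged only with warp divergence and a `__device__` call (features 1 and 2) stays in the *easy* subset, while one flagged with data-dependent divergence (feature 4) lands in *hard*.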
@@ -174,12 +241,12 @@ pip install -r ./requirements.txt
 ```
 NOTE: This is already done for you if you're using the supplied Dockerfile.

-## Gathering Roofline Data
+## Gathering Profiling Data

-Once all the codes are built, we can start the data collection process. We have our own script called `gatherData.py` which can be invoked to gather the roofline benchmarking data of each of the built programs.
+Once all the codes are built, we can start the data collection process. We have our own script called `gatherData.py` which can be invoked to gather the profiling data of each of the built programs.

 ```
-LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=roofline-data.csv 2>&1 | tee runlog.txt
+LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee runlog.txt
 ```
 NOTE: This command should work out-of-the-box if you built a container using our Dockerfile.

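The shell one-liner in the hunk above threads `LD_LIBRARY_PATH` and `DATAPATH` through the environment of `gatherData.py` via inline `VAR=value` prefixes. If you prefer launching from Python, the same effect can be sketched as below; this wrapper is illustrative and not part of the repo:

```python
import os
import subprocess
import sys

def run_with_env(cmd: list[str], extra_env: dict[str, str]) -> subprocess.CompletedProcess:
    """Run `cmd` with extra environment variables layered over the current
    environment, mirroring the inline VAR=value prefixes of the shell form."""
    env = os.environ.copy()
    # Prepend to LD_LIBRARY_PATH rather than clobbering it, like the shell form does.
    if 'LD_LIBRARY_PATH' in extra_env:
        existing = env.get('LD_LIBRARY_PATH', '')
        extra_env = {**extra_env,
                     'LD_LIBRARY_PATH': extra_env['LD_LIBRARY_PATH'] +
                     (':' + existing if existing else '')}
    env.update(extra_env)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Example: a child process that echoes back the injected DATAPATH.
result = run_with_env(
    [sys.executable, '-c', "import os; print(os.environ['DATAPATH'])"],
    {'DATAPATH': '/tmp/data_tables', 'LD_LIBRARY_PATH': '/usr/lib/llvm-18/lib'},
)
```

The paths above are placeholders; in the actual workflow they point at the LLVM 18 libraries and the `prna-cuda` data tables as shown in the README command.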
@@ -195,18 +262,18 @@ The internal workflow at a high level looks like the following:
 5. Correct the malformed execution arguments for some targets.
 6. Search for (and confirm) the existence of required input files for some programs. Unzip and extract any files that are zipped and came with HeCBench.
 7. Use `cuobjdump` and `cu++filt` to extract kernel names from each executable. These are used when invoking `ncu` to profile a particular kernel.
-8. Run each of the executables and gather their roofline performance data with `ncu`.
-9. Write gathered data to output `roofline-data.csv` file.
+8. Run each of the executables and gather their profiling performance data with `ncu`.
+9. Write gathered data to output `profiling-data.csv` file.


-The `gatherData.py` script will emit a CSV file called `roofline-data.csv` containing all the benchmarking data. After each kernel is run, the data is written out to the last line of the CSV file. We encourage writing the results of the execution to a log file for later error/execution analysis.
+The `gatherData.py` script will emit a CSV file called `profiling-data.csv` containing all the benchmarking data. After each kernel is run, the data is written out to the last line of the CSV file. We encourage writing the results of the execution to a log file for later error/execution analysis.

 ‼️‼️ This process of profiling all the codes can take a while (roughly 10 hours); we suggest leaving the profiling running while someone babysits in case of an unexpected error. ‼️‼️


 ## Scraping the CUDA kernels

-Once all the roofline benchmarking data has been collected, we can go ahead and scrape the sampled targets for their CUDA/OMP kernels. We do this with a script called `analysis/simpleScrapeKernels.py`, which will do the following:
+Once all the benchmarking data has been collected, we can go ahead and scrape the sampled targets for their CUDA/OMP kernels. We do this with a script called `analysis/simpleScrapeKernels.py`, which will do the following:

 1. Go through all the executables in the `build` dir and extract their kernel names via `cuobjdump` or `objdump`
 2. Create a dictionary assigning to each kernel the `cat` contents of all the source files used by the target
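Step 9 above, together with the note that data is written after each kernel is run, amounts to appending one CSV row per profiled kernel so that a crash loses at most the current kernel's result. A minimal sketch of that append-per-row pattern follows; the column names are made up for illustration and are not the repo's schema:

```python
import csv
import os

# Illustrative column names -- the real profiling CSV has many more fields.
CSV_COLUMNS = ['target', 'kernel', 'flops', 'bytes']

def append_result_row(csv_path: str, row: dict) -> None:
    """Append one profiling result to the CSV. The header is written only
    when the file is first created, and the row is flushed immediately so a
    partially completed run still leaves usable data behind."""
    write_header = not os.path.exists(csv_path)
    with open(csv_path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
        f.flush()
```

Appending (rather than buffering everything and writing once at the end) is what makes a 10-hour profiling run restartable after an unexpected error.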
@@ -235,28 +302,7 @@ Once we have the scraped source code, we can run the `analysis/vizAndPruneScrape

 Given that some codes have very long input contexts, we drop these codes from inference/testing to save on inference/training costs.
 The cap we set is at 8k tokens for now, based on an initial token count analysis done to check the max number of programs we could keep without the codes being too costly to query or verbose in tokenage.
-We essentially get to keep 50% of all the CUDA and OMP codes whose roofline values were sampled.
-
-## Creating the Roofline Dataset for Querying LLMs
-
-Once we have the `simple-scraped-kernels-CUDA-pruned-with-sass.json`, `simple-scraped-kernels-OMP-pruned-with-sass.json`, and `roofline-data.csv` files generated, we can create the Roofline Dataset by invoking the `dataset-gen/createTrainingDatset.ipynb` notebook to emit two files:
-
-- `train-dataset-balanced.csv` -- 80% of the sampled codes
-- `validation-dataset-balanced.csv` -- 20% of the sampled codes
-
-Each row of these CSV files contains one kernel and its scraped source code along with SASS code.
-For now we don't use the SASS code, as this is for a future effort.
-
-The notebook will go through the scraped codes and roofline performance data, performing the following actions:
-- Dropping any codes that were not scraped due to the token cutoff
-- Creating a Roofline plot of the collected data -- it's also the plot we use in the paper
-- Limits the number of times a code can appear in the dataset to once (some codes had multiple kernels that got scraped, so we only query about 1 kernel from the program)
-- Balances the dataframe to have an equal number of Bandwidth-bound (BB) and Compute-bound (CB) codes (most are BB)
-- Does an 80/20 train/validation split of the full dataset
-
-<br>
-<br>
+We essentially get to keep 50% of all the CUDA and OMP codes whose profiling values were sampled.

 NOTE: depending on your GPU version/capabilities, you'll need to edit the following values in the `dataset-gen/roofline_utils.py` script:
 - `gpuName = 'NVIDIA RTX 3080'` -- name of the GPU
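The 8k-token cap described in the hunk above can be approximated with a simple filter. The sketch below uses a whitespace word count as a stand-in tokenizer; that heuristic, and the function and variable names, are assumptions for illustration, since the repo's actual token counting may use a real model tokenizer:

```python
TOKEN_CAP = 8192  # the README's 8k-token context cap

def approx_token_count(source: str) -> int:
    """Crude stand-in tokenizer: whitespace-separated words. A real tokenizer
    (e.g. a model's BPE vocabulary) will count differently."""
    return len(source.split())

def prune_long_codes(codes: dict[str, str], cap: int = TOKEN_CAP) -> dict[str, str]:
    """Keep only the scraped programs whose source fits under the cap."""
    return {name: src for name, src in codes.items()
            if approx_token_count(src) <= cap}
```

Dropping over-long programs this way is what leaves roughly 50% of the profiled CUDA and OMP codes in the dataset.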

cuda-profiling/gatherData.py

Lines changed: 8 additions & 12 deletions

@@ -29,31 +29,29 @@
 # these will be used globally in this program
 # mainly for consistency. They are absolute (full) paths
 DOWNLOAD_DIR = ''
-ROOT_DIR = ''
+THIS_DIR = ''
 SRC_DIR = ''
 BUILD_DIR = ''

 def setup_dirs(buildDir, srcDir):
     global DOWNLOAD_DIR
-    global ROOT_DIR
+    global THIS_DIR
     global SRC_DIR
     global BUILD_DIR

-    ROOT_DIR = os.path.abspath(f'{srcDir}/../cuda-profiling')
-    assert os.path.exists(ROOT_DIR)
+    THIS_DIR = os.path.abspath(os.path.dirname(__file__))
+    assert os.path.exists(THIS_DIR)

-    DOWNLOAD_DIR = os.path.abspath(f'{ROOT_DIR}/downloads')
+    DOWNLOAD_DIR = os.path.abspath(f'{THIS_DIR}/downloads')

     if not os.path.exists(DOWNLOAD_DIR):
         os.mkdir(DOWNLOAD_DIR)

-    THIS_DIR = os.path.abspath(os.path.dirname(__file__))
-
     SRC_DIR = os.path.abspath(os.path.join(THIS_DIR, f'{srcDir}'))
     BUILD_DIR = os.path.abspath(os.path.join(THIS_DIR, f'{buildDir}'))

     print('Using the following directories:', flush=True)
-    print(f'ROOT_DIR = [{ROOT_DIR}]', flush=True)
+    print(f'THIS_DIR = [{THIS_DIR}]', flush=True)
     print(f'DOWNLOAD_DIR = [{DOWNLOAD_DIR}]', flush=True)
     print(f'SRC_DIR = [{SRC_DIR}]', flush=True)
     print(f'BUILD_DIR = [{BUILD_DIR}]', flush=True)
@@ -486,8 +484,8 @@ def modify_exe_args_for_some_targets(targets:list):
         target['exeArgs'] = '../face-cuda/Face.pgm ../face-cuda/info.txt ../face-cuda/class.txt Output-gpu.pgm'
     elif 'srad-' in basename:
         target['exeArgs'] = '1000 0.5 502 458'
-    elif 'snicit-' in basename:
-        target['exeArgs'] = '-k C'
+    elif (basename == 'snicit-cuda'):
+        target['exeArgs'] = '-k C -p ./dataset'
     elif 'grep-' in basename:
         target['exeArgs'] = '-f ../grep-cuda/testcases/lua.lines.js.txt "\."'
     elif (basename == 'che-cuda') or (basename == 'che-omp'):
@@ -502,8 +500,6 @@ def modify_exe_args_for_some_targets(targets:list):
         target['exeArgs'] = 'graph.csv 10000 output'
     elif (basename == 'atomicCost-cuda'):
         target['exeArgs'] = '16 10'
-    elif (basename == 'snicit-cuda'):
-        target['exeArgs'] = '-k C -p ./dataset'


     return targets
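The second hunk above fixes a shadowing bug in the `elif` chain: the broad substring test `'snicit-' in basename` sat earlier in the chain, so the later exact `basename == 'snicit-cuda'` branch (which supplies the needed `-p ./dataset` argument) was unreachable. A distilled sketch of that pitfall, simplified from the repo's function:

```python
# Distilled from the gatherData.py change: in an if/elif chain, a broad
# substring match placed before a more specific exact match shadows it.
def exe_args_old(basename: str) -> str:
    # Buggy ordering: the substring branch catches 'snicit-cuda' first,
    # so the exact branch below is dead code.
    if 'snicit-' in basename:
        return '-k C'
    elif basename == 'snicit-cuda':
        return '-k C -p ./dataset'  # unreachable
    return ''

def exe_args_new(basename: str) -> str:
    # Fixed: test the specific target exactly, where the broad branch used to be.
    if basename == 'snicit-cuda':
        return '-k C -p ./dataset'
    return ''
```

With the old ordering, `snicit-cuda` silently ran without its `-p ./dataset` argument; the commit resolves this by keeping only the exact-match branch.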
