You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+79-33Lines changed: 79 additions & 33 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -96,19 +96,86 @@ source ./runBuild.sh
96
96
It's essentially building all the codes using our `CMakeLists.txt` file.
97
97
Once this is done, we can start gathering CUDA kernel profiling data with the following command:
98
98
```
99
-
cd ./cuda-profiling
99
+
cd $GPU_FLOPBENCH_ROOT/cuda-profiling
100
100
101
-
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee -a runlog.txt
101
+
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/../src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee -a runlog.txt
102
102
```
103
103
^ This process will take about 10 hours, so please have someone around to babysit in case any unexpected issues arise.
104
104
We tested this on our own Docker container and had no issues.
105
105
106
-
### Scraping Source Codes
106
+
107
+
108
+
109
+
## Scraping Source Codes
107
110
108
111
While you wait for the performance counter data to gather, you can start with a simple scrape of the CUDA codes.
This will generate a file called `simple-scraped-kernels-CUDA-only.json`, which is the full textual representation of each program with concatenated files.
120
+
This means that CUDA kernels that come from the same parent program have the same source code that gets passed to an LLM.
121
+
122
+
123
+
124
+
125
+
126
+
## Source Code Manual Categorization
127
+
This part of the dataset is the human-annotated source codes (no need to run anything).
128
+
We manually inspected each of the scraped kernels and flagged the codes for the following properties:
129
+
130
+
1) Warp Divergence (i.e: conditional statements)
131
+
2) Calling a `__device__` function
132
+
3) Having recursion
133
+
4) Data-dependent Warp Divergence (i.e: conditional statements that depend on data read-in or calculated at runtime)
134
+
5) FLOP Division
135
+
6) Calls to external libraries or functions (e.g: CCCL/CUB, cuSparse, cuFFT)
136
+
7) Calling special math functions / intrinsics (e.g: sin/cos, `__ddiv_rd`)
137
+
8) Having common float subexpressions (i.e: repeated calculations)
138
+
139
+
If a code is flagged with any of the features 4-8, it is considered a *hard* code, that can NOT be directly statically analyzed, and thus could host hidden FLOP operations.
140
+
We provide this manually-annotated dataset in the file: `$GPU_FLOPBENCH_ROOT/manual-code-classification/manually_classified_CUDA_kernels.json`.
141
+
The UI for performing this classification was served through Streamlit, which allowed us to quickly get through manual kernel categorization efforts.
142
+
The file for launching the classification UI is: `$GPU_FLOPBENCH_ROOT/manual-code-classification/manualStreamlit_classify.py`, although we prodive our already-made classifications in the `manually_classified_CUDA_kernels.json` file.
143
+
144
+
### Automatic Scraping for Annotation
145
+
In order to manually classify each of the kernels, we needed a way to automatically focus on the CUDA kernel of interest for each profiled kernel.
146
+
We performed a step where we attempted to automatically scrape the CUDA kernel function definitions and their call path code, but given the complexity of C++, it was challenging.
147
+
The script below produces an output file called `extracted_CUDA_kernels.json` which is the result of these efforts.
148
+
```
149
+
## Output file already provided -- no need to run these commands
150
+
151
+
cd $GPU_FLOPBENCH_ROOT/source-scraping
152
+
153
+
python3 extract_kernels.py
154
+
```
155
+
The main problem with this approach is that we used TreeSitter for static concrete syntax tree (CST) parsing, which made it challenging to handle all the edge cases of C++, so lots of codes were not fully scraped (e.g: missing relevant callee code, includes code from CUDA headers, includes incorrect overloaded callee function definitions).
156
+
We end up using these scrapes in the Streamlit UI for manual annotations, where an annotator would often have to revert back to the full source code of a kernel to properly flag its features.
157
+
158
+
We hope to improve this process in the future, but for now it's mostly a human-based effort.
159
+
Before this human-based approach, we did try to use LLMs, but they struggled to correctly extract code in many cases. Since we couldn't put much confiedence in LLMs to reliably perform this task for us, we scrapped the fully LLM-automated scraping approach.
160
+
161
+
# Dataset Creation
162
+
Now that we have the performance counter profiling data, scraped kernels, and manual annotations, we can create the dataset.
163
+
We break the process up into two steps: (1) creation of the *easy* subset and (2) creation of the *hard* subset, where we then join the two subsets into one dataset.
NOTE: This is already done for you if you're using the supplied Dockerfile.
176
243
177
-
## Gathering Roofline Data
244
+
## Gathering Profiling Data
178
245
179
-
Once all the codes are built, we can start the data collection process. We have our own script called `gatherData.py` which can be invoked to gather the roofline benchmarking data of each of the built programs.
246
+
Once all the codes are built, we can start the data collection process. We have our own script called `gatherData.py` which can be invoked to gather the profiling data of each of the built programs.
180
247
181
248
```
182
-
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=roofline-data.csv 2>&1 | tee runlog.txt
249
+
LD_LIBRARY_PATH=/usr/lib/llvm-18/lib:$LD_LIBRARY_PATH DATAPATH=$PWD/src/prna-cuda/data_tables python3 ./gatherData.py --outfile=profiling-data.csv 2>&1 | tee runlog.txt
183
250
```
184
251
NOTE: This command should work out-of-the-box if you built a container using our Dockerfile.
185
252
@@ -195,18 +262,18 @@ The internal workflow at a high level looks like the following:
195
262
5. Correct the malformed execution arguments for some targets.
196
263
6. Search for (and confirm) the existence of required input files for some programs. Unzip and extract any files that are zipped and came with HeCBench.
197
264
7. Use `cuobjdump` and `cu++filt` to extract kernel names from each executable. These are used when invoking `ncu` to profile a particular kernel.
198
-
8. Run each of the executables and gather their roofline performance data with `ncu`.
199
-
9. Write gathered data to output `roofline-data.csv` file.
265
+
8. Run each of the executables and gather their profiling performance data with `ncu`.
266
+
9. Write gathered data to output `profiling-data.csv` file.
200
267
201
268
202
-
The `gatherData.py` script will emit a CSV file called `roofline-data.csv` containing all the benchmarking data. After each kernel is run, the data is written out to the last line of the CSV file. We encourage writing the results of the execution to a log file for later error/execution analysis.
269
+
The `gatherData.py` script will emit a CSV file called `profiling-data.csv` containing all the benchmarking data. After each kernel is run, the data is written out to the last line of the CSV file. We encourage writing the results of the execution to a log file for later error/execution analysis.
203
270
204
271
‼️‼️This process of profiling all the codes can take a while (roughly 10 hours), we suggest leaving the profiling running while someone babysits in case of an unexpected error. ‼️‼️
205
272
206
273
207
274
## Scraping the CUDA kernels
208
275
209
-
Once all the roofline benchmarking data has been collected, we can go ahead and scrape the sampled targets for their CUDA/OMP kernels. We do this with a script called `analysis/simpleScrapeKernels.py`, which will do the following:
276
+
Once all the benchmarking data has been collected, we can go ahead and scrape the sampled targets for their CUDA/OMP kernels. We do this with a script called `analysis/simpleScrapeKernels.py`, which will do the following:
210
277
211
278
1. Go through all the executables in the `build` dir and extract their kernel names via `cuobjdump` or `objdump`
212
279
2. Create a dictionary assiging to each kernel the `cat` contents of all the source files used by the target
@@ -235,28 +302,7 @@ Once we have the scraped source code, we can run the `analysis/vizAndPruneScrape
235
302
236
303
Given that some codes have very long input contexts, we drop these codes from inference/testing to save on inference/training costs.
237
304
The cap we set is at 8k tokens for now, based on an initial token count analysis done to check the max number of programs we could keep without the codes being too costly to query or verbose in tokenage.
238
-
We essentially get to keep 50% of all the CUDA and OMP codes whose roofline values were sampled.
239
-
240
-
241
-
## Creating the Roofline Dataset for Querying LLMs
242
-
243
-
Once we have the `simple-scraped-kernels-CUDA-pruned-with-sass.json`, `simple-scraped-kernels-OMP-pruned-with-sass.json`, and `roofline-data.csv` files generated, we can create the Roofline Dataset by invoking the `dataset-gen/createTrainingDatset.ipynb` notebook to emit two files:
244
-
245
-
-`train-dataset-balanced.csv` -- 80% of the sampled codes
246
-
-`validation-dataset-balanced.csv` -- 20% of the sampled codes
247
-
248
-
Each row of these CSV files contains one kernel and its scraped source code along with SASS code.
249
-
For now we don't use the SASS code, as this is for a future effort.
250
-
251
-
The notebook will go through the scraped codes and roofline performance data, performing the following actions:
252
-
- Dropping any codes that were not scraped due to the token cutoff
253
-
- Creating a Roofline plot of the collected data -- it's also the plot we use in the paper
254
-
- Limits the number of times a code can appear in the dataset to once (some codes had multiple kernels that got scraped, so we only query about 1 kernel from the program)
255
-
- Balances the dataframe to have an equal number of Bandwidth-bound (BB) and Compute-bound (CB) codes (most are BB)
256
-
- Does an 80/20 train/validation split of the full dataset
257
-
258
-
<br>
259
-
<br>
305
+
We essentially get to keep 50% of all the CUDA and OMP codes whose profiling values were sampled.
260
306
261
307
We NOTE: depending on your GPU version/capabilities, you'll need to edit the following values in the `dataset-gen/roofline_utils.py` script:
0 commit comments