Commit 1538657

Refactor SIMD on Rust learning path content for clarity and consistency
1 parent 436fc99 commit 1538657

7 files changed

Lines changed: 59 additions & 58 deletions

content/learning-paths/cross-platform/simd-on-rust/_index.md

Lines changed: 7 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,17 +1,20 @@
11
---
2-
title: Learn how to write SIMD code on Arm using Rust
2+
title: Write SIMD code on Arm using Rust
33

44
minutes_to_complete: 30
55

66
description: Learn how to write SIMD code in Rust on Arm platforms using Neon intrinsics, portable SIMD abstractions, and optimize performance with architecture-specific instructions.
77

8-
who_is_this_for: This is an advanced topic for software developers who want take advantage of SIMD code on Arm systems using Rust.
8+
who_is_this_for: This is an advanced topic for software developers who want to take advantage of SIMD code on Arm systems using Rust.
99

1010
learning_objectives:
11-
- Learn how to write SIMD code with Rust on Arm.
11+
- Write SIMD code with Rust using std::arch and Neon intrinsics on Arm
12+
- Use portable SIMD abstractions with std::simd for cross-platform code
13+
- Apply feature detection and target attributes for architecture-specific optimizations
14+
- Compare C and Rust SIMD implementations and disassembly output
1215

1316
prerequisites:
14-
- An Arm-based computer with recent versions of a C compiler (Clang or GCC) and a Rust compiler installed.
17+
- An Arm-based computer with recent versions of a C compiler (Clang or GCC) and a Rust compiler installed
1518

1619
author: Konstantinos Margaritis
1720

content/learning-paths/cross-platform/simd-on-rust/conclusion.md

Lines changed: 1 addition & 1 deletion
@@ -8,5 +8,5 @@ layout: learningpathall
88

99
You have now seen a few examples of writing SIMD code on Arm with Rust.
1010

11-
Performance-wise, there is little difference between C and Rust as Rust is perfectly capable of generating the same assembly code as C in most cases. That said, if you want to program optimal SIMD code using the Arm ASIMD/Neon intrinsics, `std::arch` is the most obvious choice. If, however, your approach needs to be as portable as possible and you don't want to spend time providing multiple implementations for each architecture then `std::simd` is a very viable alternative (even though it's not part of the stable compiler yet).
11+
Performance-wise, there's little difference between C and Rust as Rust is perfectly capable of generating the same assembly code as C in most cases. That said, if you want to program optimal SIMD code using the Arm ASIMD/Neon intrinsics, `std::arch` is the most obvious choice. If, however, your approach needs to be as portable as possible and you don't want to spend time providing multiple implementations for each architecture then `std::simd` is a very viable alternative (even though it's not part of the stable compiler yet).
1212

content/learning-paths/cross-platform/simd-on-rust/intro-to-rust.md

Lines changed: 11 additions & 11 deletions
@@ -12,10 +12,10 @@ In this Learning Path, you will learn the basics of how to program SIMD code on
1212

1313
Rust is a safe programming language with some key advantages:
1414

15-
* It is a modern, strong-typed language.
16-
* Rust is memory safe by design: it is very difficult to introduce a bug like buffer overflow with Rust.
17-
* Strict language: the Rust compiler is very strict and does not let you make easy mistakes as you might with C.
18-
* The usage and support for Rust is expanding to many architectures and operating systems.
15+
* It's a modern, strong-typed language
16+
* Rust is memory safe by design: it's very difficult to introduce a bug like buffer overflow with Rust
17+
* Strict language: the Rust compiler is very strict and doesn't let you make easy mistakes as you might with C
18+
* The usage and support for Rust is expanding to many architectures and operating systems
1919

2020
## SIMD with Rust
2121

@@ -24,19 +24,19 @@ Support for intrinsics in languages such as C and C++ is generally added by the
2424
Rust is a little different in that regard. While vendors are still very involved in providing the support for SIMD intrinsics in the compiler, there are other alternatives and approaches used to provide SIMD abstraction.
2525

2626
Currently there are 2 SIMD programming interfaces in Rust:
27-
* One under `std::arch` which follows the C intrinsics as much as possible.
28-
* Another, `std::simd`, which provides a portable abstraction to SIMD programming so that code can just be recompiled across different architectures with more or less the same results. While there are similar libraries for C and C++, this is different in that the intent is for it to be merged as an official extension to the Rust standard library under `std::simd`.
27+
* One under `std::arch` which follows the C intrinsics as much as possible
28+
* Another, `std::simd`, which provides a portable abstraction to SIMD programming so that code can be recompiled across different architectures with more or less the same results. While there are similar libraries for C and C++, this is different in that the intent is for it to be merged as an official extension to the Rust standard library under `std::simd`
2929

3030
You will learn how to use both of these interfaces to write code that uses Advanced SIMD/Neon instructions on an Arm CPU.
3131

3232
Before you start, make sure you have the [Rust compiler installed](/install-guides/rust).
3333

34-
You can check if you have a working `rustc` compiler installed by running the following command:
34+
To check if you have a working `rustc` compiler installed, run the following command:
3535

3636
```bash
3737
rustc --version
3838
```
39-
Your output should look similar to the following:
39+
The output should look similar to:
4040

4141
```bash
4242
rustc 1.79.0 (129f3b996 2024-06-10)
@@ -50,15 +50,15 @@ Switch to the `nightly` version to `rustc` by running the following:
5050
rustup default nightly
5151
```
5252

53-
Now run the version command again to check if you have the right version:
53+
To check the version again, run:
5454

5555
```bash
5656
rustc --version
5757
```
58-
Your output should now look similar to the following:
58+
The output should now look similar to:
5959

6060
```bash
6161
rustc 1.82.0-nightly (92c6c0380 2024-07-21)
6262
```
6363

64-
Now that you have a working Rust compiler with the features supported in the nightly version, you can continue with building and running the examples included in this learning path. Please note that the code examples in this learning path are not optimally written for Rust (to do that you would have to use `cargo`, find the proper `crates` to do specific tasks, for example for 2D arrays, which would increase the complexity of this learning path significantly).
64+
Now that you have a working Rust compiler with the features supported in the nightly version, you can continue with building and running the examples included in this Learning Path. The code examples in this Learning Path aren't optimally written for Rust (to do that you would have to use `cargo`, find the proper `crates` to do specific tasks, for example for 2D arrays, which would increase the complexity of this Learning Path significantly).

content/learning-paths/cross-platform/simd-on-rust/simd-on-rust-part1.md

Lines changed: 15 additions & 17 deletions
@@ -8,9 +8,9 @@ layout: learningpathall
88

99
## Differences with programming with intrinsics in C and Rust
1010

11-
As per the Arm Community blog post about [Neon Intrinsics in Rust](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/rust-neon-intrinsics), there are some differences between C and Rust when programming with intrinsics which are listed in the blog and which will be expanded on in this Learning Path with code examples.
11+
As per the Arm Community blog post about [Neon Intrinsics in Rust](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/rust-neon-intrinsics), there are some differences between C and Rust when programming with intrinsics which are listed in the blog and which you'll expand on in this Learning Path with code examples.
1212

13-
We start with an example that uses Arm Advanced SIMD (Neon) intrinsics in C. Create a file named `average_neon.c` with the contents shown below. This program computes the average value of every pair of elements in 2 arrays:
13+
You'll start with an example that uses Arm Advanced SIMD (Neon) intrinsics in C. Create a file named `average_neon.c` with the contents shown below. This program computes the average value of every pair of elements in 2 arrays:
1414

1515
```C
1616
#include <stdio.h>
@@ -58,12 +58,12 @@ Compile the code as follows:
5858
```bash
5959
gcc -O3 -fno-inline average_neon.c -o average_neon
6060
```
61-
Now run the program as follows:
61+
Run the program:
6262

6363
```bash
6464
./average_neon
6565
```
66-
The output should look like the following:
66+
The output should look similar to:
6767

6868
```output
6969
A[0] = 2.00, B[0] = -9.00 -> C[0] = -3.50
@@ -77,7 +77,7 @@ A[7] = 16.00, B[7] = -30.00 -> C[7] = -7.00
7777
...
7878
```
7979

80-
Note that the `-fno-inline` option was passed to the compiler. Use this option to prevent the C compiler from inlining the `average_vec` function. This is needed to compare the disassembly output of the `average_vec` function from the C version against the disassembly output from the Rust version.
80+
Note that the `-fno-inline` option was passed to the compiler. Use this option to prevent the C compiler from inlining the `average_vec` function. You'll need this to compare the disassembly output of the `average_vec` function from the C version against the disassembly output from the Rust version.
8181

8282
Generate the disassembly output from the C version as shown below:
8383

@@ -153,8 +153,7 @@ Run the program as follows:
153153
```bash
154154
./average1
155155
```
156-
157-
The output should look like the following:
156+
The output should look similar to:
158157

159158
```output
160159
A[0] = 2, B[0] = -9 -> C[0] = -3.5
@@ -170,15 +169,15 @@ A[7] = 16, B[7] = -30 -> C[7] = -7
170169

171170
The outputs shown from these 2 versions are the same apart from the formatting.
172171

173-
This particular example is not very complicated but you will notice some key differences between C and Rust already:
172+
This particular example isn't very complicated but you'll notice some key differences between C and Rust already:
174173

175-
* Uninitialized variables - mutable/immutable arguments passed to the functions are not a concern for a C developer creating a proof of concept program. This is not the case with Rust programming, which forces the developer to think about these things right from the start. This usually means that it takes longer to write a simple program in Rust but you can be certain that this program will not suffer from trivial bugs such as buffer overflows, out of bounds, illegal conversions etc.
176-
* Conversions/Castings need to be explicit, e.g., `2.0_f32 * ((i+1) as f32)`.
177-
* There is no need to pass size as a parameter as Rust includes size information in its arrays.
174+
* Uninitialized variables - mutable/immutable arguments passed to the functions aren't a concern for a C developer creating a proof of concept program. This isn't the case with Rust programming, which forces the developer to think about these things right from the start. This usually means that it takes longer to write a program in Rust but you can be certain that this program won't suffer from trivial bugs such as buffer overflows, out of bounds, illegal conversions etc.
175+
* Conversions/Castings need to be explicit, for example `2.0_f32 * ((i+1) as f32)`
176+
* There's no need to pass size as a parameter as Rust includes size information in its arrays
178177

179-
Note that this program is not written in the most optimal way for Rust. It is just a 'port' of the C program into Rust with the minimal changes needed to compile and run.
178+
Note that this program isn't written in the most optimal way for Rust. It's a 'port' of the C program into Rust with the minimal changes needed to compile and run.
180179

181-
The next step is to use SIMD intrinsics in your Rust program for the averaging loop. Replace the previous `average_vec` function with the function shown below and save the updated contents in a file named `average2.rs` as shown below:
180+
The next step is to use SIMD intrinsics in your Rust program for the averaging loop. Replace the previous `average_vec` function with the function shown below and save the updated contents in a file named `average2.rs`:
182181

183182
```Rust
184183
#[inline(never)]
@@ -234,12 +233,11 @@ A[6] = 14, B[6] = -27 -> C[6] = -6.5
234233
A[7] = 16, B[7] = -30 -> C[7] = -7
235234
...
236235
```
237-
238236
The results are the same but let's look at some of the differences:
239237

240-
* You need to use `target_arch` and `target_feature` to use specific hardware extensions. This is Rust's feature detection which is explained in more detail in the next section.
241-
* All definitions and functions need to be enabled with `use`, either selectively, for example `use std::arch::aarch64::float32x4_t` or with a wildcard `use std::arch::aarch64::*`. If in doubt, use the latter.
242-
* You will notice `#[inline(never)]` in the definition of `average_vec`. This is to let the compiler know that it should not inline this function because you will compare the disassembly against the C version.
238+
* You need to use `target_arch` and `target_feature` to use specific hardware extensions. This is Rust's feature detection which is explained in more detail in the next section
239+
* All definitions and functions need to be enabled with `use`, either selectively, for example `use std::arch::aarch64::float32x4_t` or with a wildcard `use std::arch::aarch64::*`. If in doubt, use the latter
240+
* You'll notice `#[inline(never)]` in the definition of `average_vec`. This is to let the compiler know that it shouldn't inline this function because you'll compare the disassembly against the C version
243241

244242
Now generate the disassembly output for `average2` as follows:
245243
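
The points listed in this file (gating on `target_arch`, importing `std::arch::aarch64::*`, marking `average_vec` as `#[inline(never)]`, explicit casts, and slices carrying their own length) can be sketched roughly as follows. This is not the Learning Path's `average2.rs`, whose body is elided in this diff. The helper names `demo_inputs`, `average_scalar` and `average_neon` are hypothetical, and the input formulas are inferred from the sample output shown above. The scalar fallback is an assumption added so the sketch also builds on non-Arm machines.

```rust
// Sketch only: helper names are hypothetical, not taken from average2.rs.
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

/// Inputs inferred from the sample output: A[i] = 2*(i+1), B[i] = -3*(i+3).
/// Casts from usize to f32 have to be written out explicitly.
fn demo_inputs(n: usize) -> (Vec<f32>, Vec<f32>) {
    let a: Vec<f32> = (0..n).map(|i| 2.0_f32 * ((i + 1) as f32)).collect();
    let b: Vec<f32> = (0..n).map(|i| -3.0_f32 * ((i + 3) as f32)).collect();
    (a, b)
}

/// Portable fallback: C[i] = (A[i] + B[i]) / 2.
fn average_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
    for i in 0..c.len() {
        c[i] = 0.5 * (a[i] + b[i]);
    }
}

/// Neon path: processes 4 floats per iteration, then a scalar tail.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn average_neon(a: &[f32], b: &[f32], c: &mut [f32]) {
    let n = c.len();
    let mut i = 0;
    while i + 4 <= n {
        unsafe {
            let va = vld1q_f32(a.as_ptr().add(i));
            let vb = vld1q_f32(b.as_ptr().add(i));
            let vc = vmulq_n_f32(vaddq_f32(va, vb), 0.5);
            vst1q_f32(c.as_mut_ptr().add(i), vc);
        }
        i += 4;
    }
    while i < n {
        c[i] = 0.5 * (a[i] + b[i]);
        i += 1;
    }
}

/// No size parameter is needed: the slices know their own length.
#[inline(never)]
fn average_vec(a: &[f32], b: &[f32], c: &mut [f32]) {
    assert!(a.len() == c.len() && b.len() == c.len());
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // Safe to call: lengths were checked and Neon was detected.
            unsafe { average_neon(a, b, c) };
            return;
        }
    }
    average_scalar(a, b, c);
}

fn main() {
    let (a, b) = demo_inputs(8);
    let mut c = vec![0.0_f32; a.len()];
    average_vec(&a, &b, &mut c);
    for i in 0..c.len() {
        println!("A[{}] = {}, B[{}] = {} -> C[{}] = {}", i, a[i], i, b[i], i, c[i]);
    }
}
```

The length check in `average_vec` is what makes the raw pointer loads in `average_neon` sound, so it should stay in place if the sketch is adapted.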

content/learning-paths/cross-platform/simd-on-rust/simd-on-rust-part2.md

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ layout: learningpathall
88

99
## An example with dot product instructions
1010

11-
You can now continue with an example around `dotprod` intrinsics. Shown below is a program that calculates the sum of absolute differences (SAD) of a 32x32 array of 8-bit unsigned integers (`uint8_t`) using the `vdotq_u32` intrinsic. Save the contents in a file named `dotprod1.c` as shown below:
11+
You can now continue with an example around `dotprod` intrinsics. Shown below is a program that calculates the sum of absolute differences (SAD) of a 32x32 array of 8-bit unsigned integers (`uint8_t`) using the `vdotq_u32` intrinsic. Save the contents in a file named `dotprod1.c`:
1212

1313
```C
1414
#include <stdio.h>
@@ -59,17 +59,17 @@ int main() {
5959
}
6060
```
6161
62-
Now compile the program as follows:
62+
Compile the program as follows:
6363
6464
```bash
6565
gcc -O3 -march=armv8.2-a+dotprod dotprod1.c -o dotprod1
6666
```
6767

68-
And run the program as per below:
68+
Run the program:
6969
```bash
7070
./dotprod1
7171
```
72-
The output should look like the following:
72+
The output should look similar to:
7373
```output
7474
A[0] = [ 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f ]
7575
B[0] = [ 00 00 00 00 04 04 04 04 00 00 00 00 04 04 04 04 00 00 00 00 04 04 04 04 00 00 00 00 04 04 04 04 ]
@@ -198,12 +198,12 @@ Compile the program as follows:
198198
rustc -O dotprod2.rs
199199
```
200200

201-
Run the program as per below:
201+
Run the program:
202202
```bash
203203
./dotprod2
204204
```
205205

206-
The output should look like the following:
206+
The output should look similar to:
207207

208208
```output
209209
A[0] = [ 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f ]
@@ -220,11 +220,11 @@ sad = 7400
220220

221221
As you can see both executables produce the same output.
222222

223-
Now generate the disassembly output as shown below:
223+
Generate the disassembly output:
224224
```bash
225225
objdump -S dotprod2
226226
```
227-
The output should look like the following:
227+
The output should look similar to:
228228
```asm
229229
0000000000006394 <_ZN4core9core_arch10arm_shared4neon9generated9vdotq_u3217h5c7bc8d63e4a993fE>:
230230
6394: 3dc00000 ldr q0, [x0]
@@ -275,11 +275,11 @@ The output should look like the following:
275275
6724: d65f03c0 ret
276276
```
277277

278-
Note that where you might expect to see a `udot` instruction, there is a `bl` instruction which indicates a branch. The `udot` instruction is instead called in another function, which carries out the loads again.
278+
Note that where you might expect to see a `udot` instruction, there's a `bl` instruction which indicates a branch. The `udot` instruction is instead called in another function, which carries out the loads again.
279279

280280
This seems counter-intuitive but the reason is that, unlike C, Rust treats the intrinsics like normal functions.
281281

282-
Like functions, inlining them is not always guaranteed. If it is possible to inline the intrinsic, code generation and performance would be almost as that with C. If it is not possible, you might find that the same code in Rust performs worse than in C.
282+
Like functions, inlining them isn't always guaranteed. If it's possible to inline the intrinsic, code generation and performance would be almost as that with C. If it's not possible, you might find that the same code in Rust performs worse than in C.
283283

284284
Because of this, you have to look carefully at the disassembly generated from your SIMD Rust code. So, how can you fix this behavior and get the expected generated code?
285285

@@ -337,5 +337,5 @@ Now look at the changed disassembly output as follows:
337337
66bc: d65f03c0 ret
338338
```
339339

340-
This disassembly output is now as you would expect it to be as well as being better performant. You will see that the compiler automatically unrolled the loop twice because it was able to figure out that the number of iterations was small. Increasing the iterations will probably disable aggressive unrolling but it will at least inline the intrinsics properly.
340+
This disassembly output is now what you would expect, and it also performs better. You'll see that the compiler automatically unrolled the loop twice because it was able to figure out that the number of iterations was small. Increasing the iterations will probably disable aggressive unrolling but it will at least inline the intrinsics properly.
341341
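
Because the text above recommends checking SIMD output carefully, a scalar reference for the sum of absolute differences can be a useful cross-check. The sketch below is an illustration under stated assumptions, not the code from `dotprod2.rs`, which is elided in this diff. It deliberately swaps in a different technique: the stable `vabdq_u8` and `vaddlvq_u8` Neon intrinsics instead of `vdotq_u32`. The function names `sad_scalar`, `sad_neon` and `sad`, and the test data in `main`, are hypothetical.

```rust
// Sketch only: not the dotprod-based code from the Learning Path.
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

const N: usize = 32;
type Block = [[u8; N]; N];

/// Scalar reference: sum of |a - b| over all N x N elements.
fn sad_scalar(a: &Block, b: &Block) -> u32 {
    let mut sum = 0u32;
    for (ra, rb) in a.iter().zip(b.iter()) {
        for (&x, &y) in ra.iter().zip(rb.iter()) {
            sum += x.abs_diff(y) as u32;
        }
    }
    sum
}

/// Neon variant: absolute difference of 16 bytes, then a widening
/// horizontal add. Enabling the feature on the caller lets the
/// intrinsics be inlined into it.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn sad_neon(a: &Block, b: &Block) -> u32 {
    let mut sum = 0u32;
    for (ra, rb) in a.iter().zip(b.iter()) {
        let mut off = 0;
        // N is a multiple of 16, so every load stays inside the row.
        while off < N {
            unsafe {
                let va = vld1q_u8(ra.as_ptr().add(off));
                let vb = vld1q_u8(rb.as_ptr().add(off));
                sum += vaddlvq_u8(vabdq_u8(va, vb)) as u32;
            }
            off += 16;
        }
    }
    sum
}

/// Dispatches to Neon when it is available, else to the scalar version.
fn sad(a: &Block, b: &Block) -> u32 {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return unsafe { sad_neon(a, b) };
        }
    }
    sad_scalar(a, b)
}

fn main() {
    // Arbitrary test data, not the arrays used in dotprod1.c.
    let mut a = [[0u8; N]; N];
    let mut b = [[0u8; N]; N];
    for i in 0..N {
        for j in 0..N {
            a[i][j] = (i + j) as u8;
            b[i][j] = (j & 4) as u8;
        }
    }
    let result = sad(&a, &b);
    assert_eq!(result, sad_scalar(&a, &b));
    println!("sad = {}", result);
}
```

Comparing the SIMD result against `sad_scalar` on the same inputs is a cheap way to catch mistakes before looking at the disassembly.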
