content/learning-paths/cross-platform/simd-loops/1-about.md (4 additions, 4 deletions)

@@ -6,11 +6,11 @@ weight: 2
layout: learningpathall
---
-## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+## Introduction to SIMD on Arm
-Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
+Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, the SIMD capabilities available to you also expand.
-This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. These are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions aren't just wider; they're fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
## What is the SIMD Loops project?
@@ -30,6 +30,6 @@ The project includes:
- A simple command-line runner to execute any loop interactively
- Optional standalone binaries for bare-metal and simulator use

-You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
+You don't need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect—this is the core learning loop.
content/learning-paths/cross-platform/simd-loops/2-using.md (12 additions, 10 deletions)

@@ -27,7 +27,7 @@ Expected output on Linux:
aarch64
```

-Expected output on macOS:
+On macOS, the expected output is:

```output
arm64
```
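Editor's note: the two values above are what `uname -m` typically reports on these platforms (assuming that is the command used earlier in this Learning Path; the command itself is not visible in this diff):

```shell
# Print the machine hardware name.
# 64-bit Arm systems report "aarch64" on Linux and "arm64" on macOS.
uname -m
```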
@@ -86,45 +86,45 @@ Each loop is implemented in several SIMD extension variants. Conditional compila
The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.
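Editor's note: a hypothetical sketch of this compile-time selection follows. The macro names `HAVE_NATIVE` and `HAVE_AUTOVEC` come from the text above; the function names and the surrounding layout are illustrative, not the project's actual sources:

```c
#include <stddef.h>

/* Reference C loop for an FP32 inner product. With -DHAVE_NATIVE it
 * runs as-is; with -DHAVE_AUTOVEC the compiler is free to
 * auto-vectorize it. */
static float kernel_f32(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Compile-time dispatch: each build target defines one macro, so
 * exactly one variant is compiled into the binary. */
float run_kernel(const float *a, const float *b, size_t n) {
#if defined(HAVE_NATIVE)
    return kernel_f32(a, b, n);   /* plain native build */
#elif defined(HAVE_AUTOVEC)
    return kernel_f32(a, b, n);   /* same source, vectorizer enabled */
#else
    return kernel_f32(a, b, n);   /* intrinsics or inline-assembly
                                     variants would be selected here */
#endif
}
```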
-When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support isn't available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.
-At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
+At compile time, you can select which loop optimization to compile, whether it's based on SME or SVE intrinsics, or one of the available inline assembly variants.
+
To compile the project, run `make` in the project directory:
```console
make
```
-With no target specified, the list of targets is printed:
+With no target specified, the output shows the list of available targets:
```output
all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
```
-Build all loops for all targets:
+To build all loops for all targets, run:
```console
make all
```
-Build all loops for a single target, such as Neon:
+To build all loops for a single target, such as Neon, run:
```console
make neon
```
As a result of the build, two types of binaries are generated.

-The first is a single executable named `simd_loops`, which includes all loop implementations.
-
-Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:
+To select a specific loop, pass parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:
```console
build/neon/bin/simd_loops -k 1 -n 5
```
-Example output:
+The expected output is:
```output
Loop 001 - FP32 inner product
```
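Editor's note: loop 001 shown above is an FP32 inner product. An illustrative sketch of such a kernel using Neon intrinsics (not the project's actual code; the function name is hypothetical) might look like this, with a scalar tail that also serves as a portable fallback:

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* FP32 inner product: processes four lanes at a time with Neon,
 * then finishes any remaining elements with scalar code. */
float inner_product_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float r = 0.0f;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    r = vaddvq_f32(acc);   /* horizontal add of the four lanes */
#endif
    for (; i < n; i++)     /* scalar tail (and non-Neon fallback) */
        r += a[i] * b[i];
    return r;
}
```

Note the explicit tail loop: unlike the predicated SVE style, fixed-width Neon code must handle the leftover elements separately.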
@@ -140,6 +140,8 @@ To run loop 1 as a standalone binary:
content/learning-paths/cross-platform/simd-loops/4-conclusion.md (3 additions, 3 deletions)

@@ -1,16 +1,16 @@
---
-title: How to learn with SIMD Loops
+title: Learning with SIMD Loops
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---
-## Bridging the gap between specs and real code
+## Bridging the gap between specifications and real code
SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.
-Whether you are moving from Neon or starting directly with SVE and SME, the project offers:
+Whether you're moving from Neon or starting directly with SVE and SME, the project offers:
- A broad catalog of kernels that highlight specific features (predication, VLA programming, gather/scatter, streaming mode, ZA tiles)
- Clear, readable implementations in C, ACLE intrinsics, and selected inline assembly
- Flexible build targets and a simple runner to execute and validate loops
content/learning-paths/cross-platform/simd-loops/_index.md (14 additions, 5 deletions)

@@ -1,20 +1,21 @@
---
-title: "Code kata: perfect your SVE and SME skills with SIMD Loops"
+title: Learn SVE and SME programming with SIMD Loops
+
+description: Learn how to write high-performance SIMD code using the SIMD Loops project, with hands-on examples demonstrating SVE, SVE2, and SME2 features on Arm processors.
minutes_to_complete: 30

who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

learning_objectives:
- Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
-- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE,SVE2, and SME2 variants
+- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE, SVE2, and SME2 variants
- Build and run a selected kernel with the provided runner and validate correctness against the C reference
- Choose the appropriate build target to compare Neon, SVE/SVE2, and SME2 implementations

prerequisites:
-- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
-- Some familiarity with SIMD programming and Neon intrinsics.
+- An AArch64 computer running Linux or macOS. You can use cloud instances; refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers
+- Some familiarity with SIMD programming and Neon intrinsics
- Recent toolchains that support SVE/SME (GCC 13+ or Clang 16+ recommended)