Commit 8c295ac

Merge pull request #3154 from madeline-underwood/simdloop2
Refine documentation for SIMD Loops: update titles, improve clarity, …
2 parents a5002b4 + ea03086

5 files changed

Lines changed: 36 additions & 27 deletions

content/learning-paths/cross-platform/simd-loops/1-about.md

Lines changed: 4 additions & 4 deletions

@@ -6,11 +6,11 @@ weight: 2
 layout: learningpathall
 ---

-## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+## Introduction to SIMD on Arm

-Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
+Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, the SIMD capabilities available to you also expand.

-This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. These are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions aren't just wider; they're fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.

 ## What is the SIMD Loops project?

@@ -30,6 +30,6 @@ The project includes:
 - A simple command-line runner to execute any loop interactively
 - Optional standalone binaries for bare-metal and simulator use

-You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
+You don't need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
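The predication and vector-length-agnostic (VLA) ideas mentioned in this page's introduction can be sketched in plain C. The sketch below is illustrative only (it is not code from the SIMD Loops repository): the function name `add_arrays_vla` and the `vl` parameter are hypothetical stand-ins for what real SVE code expresses with predicate registers and a hardware-determined vector length.

```c
#include <stddef.h>

/* Illustrative scalar model of a VLA loop with predication.
 * `vl` stands in for the hardware vector length (in elements),
 * which real SVE code would obtain at run time; the `active`
 * test models a per-lane predicate bit, so no scalar tail loop
 * is needed even when n is not a multiple of vl. */
static void add_arrays_vla(const float *a, const float *b,
                           float *c, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {          /* one "vector" per step */
        for (size_t lane = 0; lane < vl; lane++) {
            int active = (i + lane) < n;          /* predicate per lane    */
            if (active)
                c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

The same source works unchanged for any `vl`, which is the point of VLA programming: the binary does not encode the vector width.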

content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 12 additions & 10 deletions

@@ -27,7 +27,7 @@ Expected output on Linux:
 aarch64
 ```

-Expected output on macOS:
+On macOS, the expected output is:

 ```output
 arm64
 ```

@@ -86,45 +86,45 @@ Each loop is implemented in several SIMD extension variants. Conditional compila

 The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.

-When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support isn't available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.

 The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.

-At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
+At compile time, you can select which loop optimization to compile, whether it's based on SME or SVE intrinsics, or one of the available inline assembly variants.
+
+To compile the project, run `make` in the project directory:

 ```console
 make
 ```

-With no target specified, the list of targets is printed:
+With no target specified, the output shows the list of available targets:

 ```output
 all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
 ```

-Build all loops for all targets:
+To build all loops for all targets, run:

 ```console
 make all
 ```

-Build all loops for a single target, such as Neon:
+To build all loops for a single target, such as Neon, run:

 ```console
 make neon
 ```

 As a result of the build, two types of binaries are generated.

-The first is a single executable named `simd_loops`, which includes all loop implementations.
-
-Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:
+To select a specific loop, pass parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:

 ```console
 build/neon/bin/simd_loops -k 1 -n 5
 ```

-Example output:
+The expected output is:

 ```output
 Loop 001 - FP32 inner product
@@ -140,6 +140,8 @@ To run loop 1 as a standalone binary:
 build/neon/standalone/bin/loop_001.elf
 ```

+The expected output is:
+
 Example output:

 ```output

content/learning-paths/cross-platform/simd-loops/3-example.md

Lines changed: 3 additions & 5 deletions

@@ -229,8 +229,6 @@ For instruction semantics and SME/SME2 optimization guidance, see the [SME Progr

 Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

-- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
-
-- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers
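For readers unfamiliar with the indexed `fmla` mentioned in the Neon bullet above: the by-element form of `fmla` multiplies every element of one vector by a single selected lane of another vector and accumulates. A scalar C model of that operation (an illustration, not the repository's loop code; the name `fmla_indexed_model` is invented) looks like this:

```c
#include <stddef.h>

/* Scalar model of an indexed (by-element) fused multiply-add:
 * every element of `acc` accumulates a[i] multiplied by ONE
 * broadcast lane of `coeff` - the work an indexed fmla performs
 * across a whole vector in a single instruction. */
static void fmla_indexed_model(float *acc, const float *a,
                               const float coeff[4], int lane, size_t n) {
    for (size_t i = 0; i < n; i++)
        acc[i] += a[i] * coeff[lane];   /* lane selects the multiplier */
}
```

Combined with structure loads that de-interleave input data into separate registers, this pattern lets a kernel reuse one loaded coefficient vector across many multiply-accumulate steps.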

content/learning-paths/cross-platform/simd-loops/4-conclusion.md

Lines changed: 3 additions & 3 deletions

@@ -1,16 +1,16 @@
 ---
-title: How to learn with SIMD Loops
+title: Learning with SIMD Loops
 weight: 5

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Bridging the gap between specs and real code
+## Bridging the gap between specifications and real code

 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.

-Whether you are moving from Neon or starting directly with SVE and SME, the project offers:
+Whether you're moving from Neon or starting directly with SVE and SME, the project offers:
 - A broad catalog of kernels that highlight specific features (predication, VLA programming, gather/scatter, streaming mode, ZA tiles)
 - Clear, readable implementations in C, ACLE intrinsics, and selected inline assembly
 - Flexible build targets and a simple runner to execute and validate loops

content/learning-paths/cross-platform/simd-loops/_index.md

Lines changed: 14 additions & 5 deletions

@@ -1,20 +1,21 @@
 ---
-title: "Code kata: perfect your SVE and SME skills with SIMD Loops"
+title: Learn SVE and SME programming with SIMD Loops
+
+description: Learn how to write high-performance SIMD code using the SIMD Loops project, with hands-on examples demonstrating SVE, SVE2, and SME2 features on Arm processors.

 minutes_to_complete: 30

 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

 learning_objectives:
 - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
-- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE,SVE2, and SME2 variants
+- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE, SVE2, and SME2 variants
 - Build and run a selected kernel with the provided runner and validate correctness against the C reference
 - Choose the appropriate build target to compare Neon, SVE/SVE2, and SME2 implementations

 prerequisites:
-- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
-- Some familiarity with SIMD programming and Neon intrinsics.
+- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers
+- Some familiarity with SIMD programming and Neon intrinsics
 - Recent toolchains that support SVE/SME (GCC 13+ or Clang 16+ recommended)

 author:
@@ -48,6 +49,14 @@ further_reading:
       title: SVE Programming Examples
       link: https://developer.arm.com/documentation/dai0548/latest
       type: documentation
+  - resource:
+      title: SIMD Loops Repository
+      link: https://gitlab.arm.com/architecture/simd-loops
+      type: documentation
+  - resource:
+      title: Scalable Vector Extensions Resources
+      link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
+      type: documentation
   - resource:
       title: Port Code to Arm Scalable Vector Extension (SVE)
       link: /learning-paths/servers-and-cloud-computing/sve
