Commit 8c295ac

Merge pull request #3154 from madeline-underwood/simdloop2
Refine documentation for SIMD Loops: update titles, improve clarity, …
2 parents a5002b4 + ea03086

5 files changed

Lines changed: 36 additions & 27 deletions

content/learning-paths/cross-platform/simd-loops/1-about.md

Lines changed: 4 additions & 4 deletions

@@ -6,11 +6,11 @@ weight: 2
 layout: learningpathall
 ---

-## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+## Introduction to SIMD on Arm

-Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
+Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, the SIMD capabilities available to you also expand.

-This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. These are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions aren't just wider; they're fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.

 ## What is the SIMD Loops project?

@@ -30,6 +30,6 @@ The project includes:
 - A simple command-line runner to execute any loop interactively
 - Optional standalone binaries for bare-metal and simulator use

-You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
+You don't need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
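The predication and vector-length-agnostic (VLA) ideas mentioned in this page's introduction can be sketched in plain C. The sketch below is illustrative only (it is not code from the SIMD Loops repository): the function name `add_arrays_vla` and the `vl` parameter are hypothetical stand-ins for what real SVE code expresses with predicate registers and a hardware-determined vector length.

```c
#include <stddef.h>

/* Illustrative scalar model of a VLA loop with predication.
 * `vl` stands in for the hardware vector length (in elements),
 * which real SVE code would obtain at run time; the `active`
 * test models a per-lane predicate bit, so no scalar tail loop
 * is needed even when n is not a multiple of vl. */
static void add_arrays_vla(const float *a, const float *b,
                           float *c, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {          /* one "vector" per step */
        for (size_t lane = 0; lane < vl; lane++) {
            int active = (i + lane) < n;          /* predicate per lane    */
            if (active)
                c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

The same source works unchanged for any `vl`, which is the point of VLA programming: the binary does not encode the vector width.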

content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 12 additions & 10 deletions

@@ -27,7 +27,7 @@ Expected output on Linux:
 aarch64
 ```

-Expected output on macOS:
+On macOS, the expected output is:

 ```output
 arm64
 ```

@@ -86,45 +86,45 @@ Each loop is implemented in several SIMD extension variants. Conditional compila

 The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.

-When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support isn't available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.

 The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.

-At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
+At compile time, you can select which loop optimization to compile, whether it's based on SME or SVE intrinsics, or one of the available inline assembly variants.
+
+To compile the project, run `make` in the project directory:

 ```console
 make
 ```

-With no target specified, the list of targets is printed:
+With no target specified, the output shows the list of available targets:

 ```output
 all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
 ```

-Build all loops for all targets:
+To build all loops for all targets, run:

 ```console
 make all
 ```

-Build all loops for a single target, such as Neon:
+To build all loops for a single target, such as Neon, run:

 ```console
 make neon
 ```

 As a result of the build, two types of binaries are generated.

-The first is a single executable named `simd_loops`, which includes all loop implementations.
-
-Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:
+To select a specific loop, pass parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:

 ```console
 build/neon/bin/simd_loops -k 1 -n 5
 ```

-Example output:
+The expected output is:

 ```output
 Loop 001 - FP32 inner product
@@ -140,6 +140,8 @@ To run loop 1 as a standalone binary:
 build/neon/standalone/bin/loop_001.elf
 ```

+The expected output is:
+
 Example output:

 ```output

content/learning-paths/cross-platform/simd-loops/3-example.md

Lines changed: 3 additions & 5 deletions

@@ -229,8 +229,6 @@ For instruction semantics and SME/SME2 optimization guidance, see the [SME Progr

 Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

-- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
-
-- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers
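For readers unfamiliar with the indexed `fmla` mentioned in the Neon bullet above: the by-element form of `fmla` multiplies every element of one vector by a single selected lane of another vector and accumulates. A scalar C model of that operation (an illustration, not the repository's loop code; the name `fmla_indexed_model` is invented) looks like this:

```c
#include <stddef.h>

/* Scalar model of an indexed (by-element) fused multiply-add:
 * every element of `acc` accumulates a[i] multiplied by ONE
 * broadcast lane of `coeff` - the work an indexed fmla performs
 * across a whole vector in a single instruction. */
static void fmla_indexed_model(float *acc, const float *a,
                               const float coeff[4], int lane, size_t n) {
    for (size_t i = 0; i < n; i++)
        acc[i] += a[i] * coeff[lane];   /* lane selects the multiplier */
}
```

Combined with structure loads that de-interleave input data into separate registers, this pattern lets a kernel reuse one loaded coefficient vector across many multiply-accumulate steps.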

content/learning-paths/cross-platform/simd-loops/4-conclusion.md

Lines changed: 3 additions & 3 deletions

@@ -1,16 +1,16 @@
 ---
-title: How to learn with SIMD Loops
+title: Learning with SIMD Loops
 weight: 5

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Bridging the gap between specs and real code
+## Bridging the gap between specifications and real code

 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.

-Whether you are moving from Neon or starting directly with SVE and SME, the project offers:
+Whether you're moving from Neon or starting directly with SVE and SME, the project offers:
 - A broad catalog of kernels that highlight specific features (predication, VLA programming, gather/scatter, streaming mode, ZA tiles)
 - Clear, readable implementations in C, ACLE intrinsics, and selected inline assembly
 - Flexible build targets and a simple runner to execute and validate loops

content/learning-paths/cross-platform/simd-loops/_index.md

Lines changed: 14 additions & 5 deletions

@@ -1,20 +1,21 @@
 ---
-title: "Code kata: perfect your SVE and SME skills with SIMD Loops"
+title: Learn SVE and SME programming with SIMD Loops
+
+description: Learn how to write high-performance SIMD code using the SIMD Loops project, with hands-on examples demonstrating SVE, SVE2, and SME2 features on Arm processors.

 minutes_to_complete: 30

 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

 learning_objectives:
 - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
-- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE,SVE2, and SME2 variants
+- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE, SVE2, and SME2 variants
 - Build and run a selected kernel with the provided runner and validate correctness against the C reference
 - Choose the appropriate build target to compare Neon, SVE/SVE2, and SME2 implementations

 prerequisites:
-- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
-- Some familiarity with SIMD programming and Neon intrinsics.
+- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers
+- Some familiarity with SIMD programming and Neon intrinsics
 - Recent toolchains that support SVE/SME (GCC 13+ or Clang 16+ recommended)

 author:
@@ -48,6 +49,14 @@ further_reading:
       title: SVE Programming Examples
       link: https://developer.arm.com/documentation/dai0548/latest
       type: documentation
+  - resource:
+      title: SIMD Loops Repository
+      link: https://gitlab.arm.com/architecture/simd-loops
+      type: documentation
+  - resource:
+      title: Scalable Vector Extensions Resources
+      link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
+      type: documentation
   - resource:
       title: Port Code to Arm Scalable Vector Extension (SVE)
       link: /learning-paths/servers-and-cloud-computing/sve
