Merge pull request #3141 from anupras-mohapatra-arm/servers-and-cloud-computing

jasonrandrews · web-flow · commit 6469e45727e0 · 2026-04-09T14:45:49.000-05:00
Documentation updates/refresh for zlib-ng Learning Path
diff --git a/content/learning-paths/servers-and-cloud-computing/zlib/_index.md b/content/learning-paths/servers-and-cloud-computing/zlib/_index.md
@@ -1,5 +1,5 @@
 ---
-title: Learn how to build and use zlib-ng on Arm servers
+title: Improve data compression performance on Arm servers with zlib-ng
 
 description: Learn how to build and use zlib-ng on Arm servers, using its Neon SIMD and ARMv8 CRC32 optimizations to improve compression performance compared to the system default zlib.
 
diff --git a/content/learning-paths/servers-and-cloud-computing/zlib/perf.md b/content/learning-paths/servers-and-cloud-computing/zlib/perf.md
@@ -4,6 +4,12 @@ title: Use perf to analyze zlib-ng performance
 weight: 4
 ---
 
+## Analyze performance improvements with perf
+
+In the previous section, you learned how to use `zlib-ng` to improve the performance of a Python application for compressing large files.
+
+In this section, you will use `perf` to analyze the performance of the application.
+
 ## Install necessary software packages
 
 Install Linux `perf`:
@@ -12,9 +18,9 @@ Install Linux `perf`:
 sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r` -y
 ```
 
-For more information about installing `perf` review [Perf for Linux on Arm](/install-guides/perf/).
+For more information about installing `perf`, review [Perf for Linux on Arm](/install-guides/perf/).
 
-Allow user access to PMU (Performance Monitoring Unit) registers and kernel symbol addresses:
+Allow user access to Performance Monitoring Unit (PMU) registers and kernel symbol addresses:
 
 ```console
 sudo sh -c "echo '1' > /proc/sys/kernel/perf_event_paranoid"
@@ -23,19 +29,17 @@ sudo sh -c "echo '0' > /proc/sys/kernel/kptr_restrict"
 
 The first setting allows unprivileged access to PMU counters. The second allows `perf` to read kernel symbol addresses from `/proc/kallsyms`, which is needed for complete call graph resolution in flame graphs.
 
-For more information refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).
+For more information, refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).
 
 ## Profile the default zlib with perf
 
-The previous section explained how to run a Python program to compress large files and measure performance with `zlib-ng`. Now use `perf` to look at the performance.
-
-Continue with the same `zip.py` program as the previous section. Make sure to start with `zip.py` and `largefile` available. Confirm the application is working and `largefile.gz` is created when it is run.
+Make sure that the `zip.py` program and `largefile` from the [previous section](/learning-paths/servers-and-cloud-computing/zlib/py-zlib/) are available. Confirm the application is working and `largefile.gz` is created when it is run.
 
 ```console
 python zip.py
 ```
 
-## Run the example with perf using the default zlib
+## Run the example with perf stat using the default zlib
 
 Run `perf stat` with a set of basic hardware events that are reliably available on virtualized Arm instances:
 
@@ -65,7 +69,7 @@ The output is similar to:
        0.126034000 seconds sys
 ```
 
-## Use perf record and generate the flame graph
+## Use perf record and generate a flame graph for zlib
 
 You can also record the application activity with `perf record`. `-F` specifies the sampling frequency and `--call-graph dwarf` collects call graphs using DWARF debug info, which works more reliably than frame pointers for Python on Arm cloud VMs:
 
@@ -74,7 +78,7 @@ perf record -F 99 --call-graph dwarf python ./zip.py
 ```
 
 {{% notice Note %}}
-On cloud VMs you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal — `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
+On cloud VMs, you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal — `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
 {{% /notice %}}
 
 To visualize the results, install `FlameGraph`:
@@ -99,7 +103,7 @@ The line count should be greater than 0. Then generate the flame graph:
 
 Copy the file `flamegraph1.svg` to your computer and open it in a browser or another image application.
 
-## Look at the report
+## Inspect the perf report
 
 As an alternative, use `perf report` to inspect the profiling data:
 
@@ -117,11 +121,11 @@ The output is similar to:
 
 The key observation is that `libz.so.1.3` accounts for a large share of the self time — roughly 43% in the primary deflate path and another 8% in `crc32_z`. This confirms that `zlib` compression is the dominant hotspot in the workload, and that the CRC32 calculation alone is a measurable fraction of total runtime. The many `[unknown]` frames are Python interpreter internals that `perf` cannot resolve without debug symbols, but they do not affect the conclusion.
 
-In the next section you will run the same workload with `zlib-ng` and compare the hotspot profile.
+In the next step, you will run the same workload with `zlib-ng` and compare the hotspot profile.
 
 ## Run the example again with perf stat and zlib-ng
 
-This time use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.
+This time, use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.
 
 ```console
 LD_PRELOAD=/usr/local/lib/libz.so.1 perf stat -e cycles,instructions,cache-references,cache-misses python ./zip.py
@@ -149,7 +153,7 @@ Comparing against the default `zlib` run:
 |---|---|---|---|
 | Cycles | 12,984,652,076 | 4,897,544,240 | -62% |
 | Instructions | 45,063,797,078 | 20,992,552,372 | -53% |
-| IPC | 3.47 | 4.29 | +24% |
+| Instructions per cycle (IPC) | 3.47 | 4.29 | +24% |
 | Cache references | 20,161,747,365 | 7,522,040,735 | -63% |
 | Elapsed time | 4.85s | 1.83s | -62% |
 
@@ -176,10 +180,11 @@ The output is similar to:
 There are two significant changes compared to the default `zlib` report:
 
 - **`libz.so.1.3.1.zlib-ng` is now named in the report**, confirming that `LD_PRELOAD` loaded `zlib-ng` correctly. The default `zlib` report showed an unresolved address at 43% self time; here the same hotspot is identified as `insert_string_roll` — `zlib-ng`'s Neon-accelerated hash chain insertion function.
-
 - **`crc32_z` has disappeared from the top entries entirely.** In the default `zlib` run, `crc32_z` accounted for 8.37% of samples. With `zlib-ng`, ARMv8 hardware CRC32 instructions execute fast enough that CRC32 no longer appears as a measurable hotspot.
 
-The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The zlib-ng run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`. The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration — approximately 181 samples for zlib-ng versus 480 for the default `zlib`. The absolute sample counts tell a different story:
+The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The `zlib-ng` run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`.
+
+The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration: approximately 181 samples for `zlib-ng` versus 480 for the default `zlib`. The absolute sample counts tell a different story:
 
 | Function | Default zlib | zlib-ng |
 |---|---|---|
@@ -188,11 +193,14 @@ The 64.40% figure for `insert_string_roll` looks higher than the 43% from the de
 
 The function received *fewer* absolute samples with `zlib-ng`, meaning less real time was spent there. It appears as a higher percentage only because the total run time — and therefore total samples — shrank so much.
 
-## Generate the new flame graph
+## Generate a new flame graph for zlib-ng
+
+Run `perf record` again with `zlib-ng`:
 
 ```console
 LD_PRELOAD=/usr/local/lib/libz.so.1 perf record -F 99 --call-graph dwarf python ./zip.py
 ```
+Convert the recorded data to folded stacks and verify the output is non-empty before generating the SVG:
 
 ```console
 perf script > out.perf-script
@@ -205,10 +213,10 @@ Copy the file `flamegraph2.svg` to your computer. Open it in a browser or other
 Flame graphs have no time axis — frame width represents the proportion of total samples, and each SVG scales to fill its full width regardless of how long the run took. This means you cannot compare absolute widths across the two graphs. What you can compare is the *relative proportion* that `zlib` occupies within each graph:
 
 - In `flamegraph1.svg`, look at what fraction of the total width is occupied by `libz` frames. Then check the same in `flamegraph2.svg`. The `zlib-ng` run should show a similar or slightly larger fraction — because the run is 2.6x shorter but the library is still doing the same work. The meaningful comparison is the `perf stat` cycle and time data from the previous section, not the flame graph widths.
-- **What the flame graph is useful for here** is identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg` you should see `crc32_z` as a visible frame. In `flamegraph2.svg` it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame — confirming the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.
+- The flame graph is useful for identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg`, you should see `crc32_z` as a visible frame. In `flamegraph2.svg`, it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame. This confirms that the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.
 
-## Summary
+## What you've learned
 
-In this Learning Path you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.
+In this Learning Path, you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.
 
 `zlib-ng` is built for modern Arm platforms. Its Neon SIMD and ARMv8 CRC32 acceleration deliver significantly faster compression than the default system library, without requiring any changes to your application code. Using `LD_PRELOAD` makes it straightforward to test and adopt `zlib-ng` for any dynamically linked application. Python, nginx, PostgreSQL, and many others all benefit from the same approach.
diff --git a/content/learning-paths/servers-and-cloud-computing/zlib/py-zlib.md b/content/learning-paths/servers-and-cloud-computing/zlib/py-zlib.md
@@ -4,27 +4,36 @@ title: Improve Python application performance using zlib-ng
 weight: 3
 ---
 
+## Reduce the time a Python application takes to compress data
+
+In the previous section, you learned how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
+
+In this section, you will speed up an example Python application that compresses a large file, and measure the performance difference when using `zlib-ng`.
+
 ## Install necessary software packages
 
-Make sure `python3` is available when `python` is run:
+Ensure that running `python` invokes `python3`:
 
 ```bash
 sudo apt install python-is-python3 -y
 ```
 
-## Compress files with Python and zlib-ng
-
-The previous section explained how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
-
-Use a Python example and measure the performance difference with `zlib-ng`.
+## Create a large file to compress
 
 Navigate to your home directory before creating the example files:
 
 ```bash
 cd $HOME
 ```
+To create a 1 GiB input file of zeros called `largefile`, use the `dd` command:
+
+```bash
+dd if=/dev/zero of=largefile count=1M bs=1024
+```
 
-Use a text editor to copy and save the code below in a file named `zip.py`.
+## Create an example Python file compression application
+
+To create the Python application for compressing `largefile`, use a text editor to copy and save the following code in a file named `zip.py`.
 
 ```python { file_name="zip.py" }
 import gzip
@@ -38,20 +47,11 @@ with open('largefile', 'rb') as f_in:
 
 f_out.close()
 ```
+The Python file compression application will read `largefile` as input and write a compressed version as `largefile.gz`.
 
-## Create a large file to compress
-
-The above Python code will read a file named `largefile` and write a compressed version as `largefile.gz`.
-
-To create the input file, use the `dd` command.
-
-```bash
-dd if=/dev/zero of=largefile count=1M bs=1024
-```
-
-## Run the example using the default zlib
+## Compress the file using the Python application and default zlib
 
-Run with the default `zlib` and time the execution.
+Run `zip.py` with the default `zlib` and time the execution.
 
 ```bash
 time python zip.py
@@ -67,7 +67,7 @@ sys     0m0.117s
 
 Make a note of the `real` time.
 
-## Run the example again with zlib-ng
+## Compress the file again using the Python application and zlib-ng
 
 This time, use `LD_PRELOAD` to switch to `zlib-ng` and measure the performance difference.
 
@@ -84,7 +84,10 @@ real    0m1.759s
 user    0m1.654s
 sys     0m0.105s
 ```
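+In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement. Because gzip output uses deflate compression with a CRC32 checksum, the gain most likely comes from `zlib-ng`'s faster deflate path and ARMv8 CRC32 instructions rather than from its adler32 or inflate routines.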
+
+## What you've learned and what's next
 
-Compare the `real` time against the default `zlib` run. In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement — driven by the Neon-accelerated adler32 and inflate chunk copy routines.
+In this section, you used `zlib-ng` to speed up an example Python file compression application and compared its run time against the default `zlib`.
 
-The next section introduces how to use Linux `perf` to profile applications and look for `zlib` activity.
+In the next section, you will learn how to use Linux `perf` to profile applications and look for `zlib` activity.
diff --git a/content/learning-paths/servers-and-cloud-computing/zlib/setup.md b/content/learning-paths/servers-and-cloud-computing/zlib/setup.md
@@ -7,9 +7,16 @@ layout: "learningpathall"
 
 ## Overview
 
-Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon SIMD and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.
+Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon Single Instruction Multiple Data (SIMD) and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.
 
-`zlib-ng` is an actively maintained fork of zlib designed for modern systems. It includes Neon SIMD acceleration for adler32, inflate chunk copying, and hash operations, plus ARMv8 CRC32 and PMULL acceleration. It also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.
+Designed for modern systems, `zlib-ng` is an actively maintained fork of `zlib` that includes the following enhancements:
+
+- Neon SIMD acceleration for adler32
+- Neon SIMD acceleration for inflate chunk copying
+- Neon SIMD acceleration for hash operations
+- ARMv8 CRC32 and PMULL acceleration
+
+`zlib-ng` also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.
 
 ## Confirm Arm SIMD capabilities in the processor
 
@@ -23,7 +30,7 @@ lscpu | grep -E "asimd|crc32"
 
 The `asimd` flag indicates Neon (Advanced SIMD) support. The `crc32` flag confirms hardware-accelerated CRC32. Both are present on all modern Arm server platforms, including AWS Graviton, Azure Cobalt 100, and Google Axion.
 
-## Check what the default zlib contains
+## Install prerequisite packages and inspect the default zlib library
 
 Install the packages you need for this Learning Path:
 
@@ -40,7 +47,7 @@ You can inspect the default library with `objdump` to see whether it contains an
 objdump -d /usr/lib/aarch64-linux-gnu/libz.so.1 | grep -cE "crc32|pmull|v[0-9]+\.(16b|8b|8h|4h|4s|2s|2d|1d)"
 ```
 
-Recent Ubuntu releases on aarch64 typically return a non-zero value here — the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. Installing `zlib-ng` is worthwhile on any Arm server regardless of what the system `zlib` already contains.
+Recent Ubuntu releases on aarch64 typically return a non-zero value here — the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. This makes installing `zlib-ng` worthwhile on any Arm server regardless of what the system `zlib` already contains.
 
 ## Build and install zlib-ng
 
@@ -93,7 +100,7 @@ The output will show both the system `zlib` and the newly installed `zlib-ng`:
 
 ## Configure and test zlib-ng
 
-Since `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.
+Because `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.
 
 Navigate back to your home directory before creating the test files:
 
@@ -137,7 +144,7 @@ Run the program to confirm the system `zlib` version:
 ./test
 ```
 
-The output will be the version of the system library, for example:
+The output will be the version of the system library. For example:
 
 ```output
 1.3
@@ -168,10 +175,14 @@ LD_PRELOAD=/usr/local/lib/libz.so.1 ./test
 
 The `LD_PRELOAD` environment variable tells the dynamic linker to load this library before the system default.
 
-The `zlib-ng` version identifier will print. The version string includes the `.zlib-ng` suffix to distinguish it from the upstream library:
+The output shows the `zlib-ng` version:
 
 ```output
 1.3.1.zlib-ng
 ```
+The version identifier includes the `.zlib-ng` suffix to distinguish it from the upstream library.
 
-In the next section you will use `zlib-ng` to accelerate a Python application doing data compression.
+## What you've learned and what's next
+
+In this section, you built, installed, and tested `zlib-ng`.
+In the next section, you will use `zlib-ng` to accelerate a Python application that does data compression.