Commit 6469e45

Merge pull request #3141 from anupras-mohapatra-arm/servers-and-cloud-computing
Documentation updates/refresh for zlib-ng Learning Path
2 parents 6e14150 + 2cc93eb commit 6469e45

4 files changed

Lines changed: 72 additions & 50 deletions

content/learning-paths/servers-and-cloud-computing/zlib/_index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
2-
title: Learn how to build and use zlib-ng on Arm servers
2+
title: Improve data compression performance on Arm servers with zlib-ng
33

44
description: Learn how to build and use zlib-ng on Arm servers, using its Neon SIMD and ARMv8 CRC32 optimizations to improve compression performance compared to the system default zlib.
55

content/learning-paths/servers-and-cloud-computing/zlib/perf.md

Lines changed: 27 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,12 @@ title: Use perf to analyze zlib-ng performance
44
weight: 4
55
---
66

7+
## Analyze performance improvements with perf
8+
9+
In the previous section, you learned how to use `zlib-ng` to improve the performance of a Python application for compressing large files.
10+
11+
In this section, you will use `perf` to analyze the performance of the application.
12+
713
## Install necessary software packages
814

915
Install Linux `perf`:
@@ -12,9 +18,9 @@ Install Linux `perf`:
1218
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r` -y
1319
```
1420

15-
For more information about installing `perf` review [Perf for Linux on Arm](/install-guides/perf/).
21+
For more information about installing `perf`, review [Perf for Linux on Arm](/install-guides/perf/).
1622

17-
Allow user access to PMU (Performance Monitoring Unit) registers and kernel symbol addresses:
23+
Allow user access to Performance Monitoring Unit (PMU) registers and kernel symbol addresses:
1824

1925
```console
2026
sudo sh -c "echo '1' > /proc/sys/kernel/perf_event_paranoid"
@@ -23,19 +29,17 @@ sudo sh -c "echo '0' > /proc/sys/kernel/kptr_restrict"
2329

2430
The first setting allows unprivileged access to PMU counters. The second allows `perf` to read kernel symbol addresses from `/proc/kallsyms`, which is needed for complete call graph resolution in flame graphs.
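Before profiling, it can help to confirm that both kernel settings took effect. The following is a minimal sketch that reads the two `/proc/sys/kernel` files written by the commands above; the function name `read_perf_settings` is a hypothetical helper introduced here for illustration, not part of the Learning Path.

```python
from pathlib import Path

# Paths taken from the two sysctl-style writes shown above.
SETTINGS = {
    "perf_event_paranoid": Path("/proc/sys/kernel/perf_event_paranoid"),
    "kptr_restrict": Path("/proc/sys/kernel/kptr_restrict"),
}


def read_perf_settings():
    """Return each setting's current integer value, or None if it cannot be read."""
    values = {}
    for name, path in SETTINGS.items():
        try:
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing (non-Linux host) or unexpected content.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_perf_settings().items():
        print(f"{name} = {value}")
```

After running the two `sudo sh -c` commands, you would expect `perf_event_paranoid` to report 1 and `kptr_restrict` to report 0. Both values reset on reboot unless you persist them separately.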
2531

26-
For more information refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).
32+
For more information, refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).
2733

2834
## Profile the default zlib with perf
2935

30-
The previous section explained how to run a Python program to compress large files and measure performance with `zlib-ng`. Now use `perf` to look at the performance.
31-
32-
Continue with the same `zip.py` program as the previous section. Make sure to start with `zip.py` and `largefile` available. Confirm the application is working and `largefile.gz` is created when it is run.
36+
Make sure that the `zip.py` program and `largefile` from the [previous section](/learning-paths/servers-and-cloud-computing/zlib/py-zlib/) are available. Confirm the application is working and `largefile.gz` is created when it is run.
3337

3438
```console
3539
python zip.py
3640
```
3741

38-
## Run the example with perf using the default zlib
42+
## Run the example with perf stat using the default zlib
3943

4044
Run `perf stat` with a set of basic hardware events that are reliably available on virtualized Arm instances:
4145

@@ -65,7 +69,7 @@ The output is similar to:
6569
0.126034000 seconds sys
6670
```
6771

68-
## Use perf record and generate the flame graph
72+
## Use perf record and generate a flame graph for zlib
6973

7074
You can also record the application activity with `perf record`. `-F` specifies the sampling frequency and `--call-graph dwarf` collects call graphs using DWARF debug info, which works more reliably than frame pointers for Python on Arm cloud VMs:
7175

@@ -74,7 +78,7 @@ perf record -F 99 --call-graph dwarf python ./zip.py
7478
```
7579

7680
{{% notice Note %}}
77-
On cloud VMs you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal — `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
81+
On cloud VMs, you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal — `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
7882
{{% /notice %}}
7983

8084
To visualize the results, install `FlameGraph`:
@@ -99,7 +103,7 @@ The line count should be greater than 0. Then generate the flame graph:
99103

100104
Copy the file `flamegraph1.svg` to your computer and open it in a browser or another image application.
101105

102-
## Look at the report
106+
## Inspect the perf report
103107

104108
As an alternative, use `perf report` to inspect the profiling data:
105109

@@ -117,11 +121,11 @@ The output is similar to:
117121

118122
The key observation is that `libz.so.1.3` accounts for a large share of the self time — roughly 43% in the primary deflate path and another 8% in `crc32_z`. This confirms that `zlib` compression is the dominant hotspot in the workload, and that the CRC32 calculation alone is a measurable fraction of total runtime. The many `[unknown]` frames are Python interpreter internals that `perf` cannot resolve without debug symbols, but they do not affect the conclusion.
119123

120-
In the next section you will run the same workload with `zlib-ng` and compare the hotspot profile.
124+
In the next step, you will run the same workload with `zlib-ng` and compare the hotspot profile.
121125

122126
## Run the example again with perf stat and zlib-ng
123127

124-
This time use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.
128+
This time, use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.
125129

126130
```console
127131
LD_PRELOAD=/usr/local/lib/libz.so.1 perf stat -e cycles,instructions,cache-references,cache-misses python ./zip.py
@@ -149,7 +153,7 @@ Comparing against the default `zlib` run:
149153
|---|---|---|---|
150154
| Cycles | 12,984,652,076 | 4,897,544,240 | -62% |
151155
| Instructions | 45,063,797,078 | 20,992,552,372 | -53% |
152-
| IPC | 3.47 | 4.29 | +24% |
156+
| Instructions Per Cycle (IPC) | 3.47 | 4.29 | +24% |
153157
| Cache references | 20,161,747,365 | 7,522,040,735 | -63% |
154158
| Elapsed time | 4.85s | 1.83s | -62% |
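The derived rows in this table follow directly from the raw counters. As a small worked example, the sketch below recomputes IPC and the relative change in cycles and instructions from the values shown in the table; the helper names `ipc` and `percent_change` are illustrative only.

```python
# Counter values copied from the two perf stat runs in the table above.
default_zlib = {"cycles": 12_984_652_076, "instructions": 45_063_797_078}
zlib_ng = {"cycles": 4_897_544_240, "instructions": 20_992_552_372}


def ipc(counters):
    """Instructions per cycle for one perf stat run."""
    return counters["instructions"] / counters["cycles"]


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


cycles_change = percent_change(default_zlib["cycles"], zlib_ng["cycles"])
instr_change = percent_change(default_zlib["instructions"], zlib_ng["instructions"])

print(f"IPC, default zlib:   {ipc(default_zlib):.2f}")
print(f"IPC, zlib-ng:        {ipc(zlib_ng):.2f}")
print(f"Cycles change:       {cycles_change:.0f}%")
print(f"Instructions change: {instr_change:.0f}%")
```

Because both cycles and instructions drop, but cycles drop further, IPC rises even though the total amount of work shrinks.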
155159

@@ -176,10 +180,11 @@ The output is similar to:
176180
There are two significant changes compared to the default `zlib` report:
177181

178182
- **`libz.so.1.3.1.zlib-ng` is now named in the report**, confirming that `LD_PRELOAD` loaded `zlib-ng` correctly. The default `zlib` report showed an unresolved address at 43% self time; here the same hotspot is identified as `insert_string_roll`, `zlib-ng`'s Neon-accelerated hash chain insertion function.
179-
180183
- **`crc32_z` has disappeared from the top entries entirely.** In the default `zlib` run, `crc32_z` accounted for 8.37% of samples. With `zlib-ng`, ARMv8 hardware CRC32 instructions execute fast enough that CRC32 no longer appears as a measurable hotspot.
181184

182-
The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The zlib-ng run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`. The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration — approximately 181 samples for zlib-ng versus 480 for the default `zlib`. The absolute sample counts tell a different story:
185+
The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The `zlib-ng` run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`.
186+
187+
The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration — approximately 181 samples for zlib-ng versus 480 for the default `zlib`. The absolute sample counts tell a different story:
183188

184189
| Function | Default zlib | zlib-ng |
185190
|---|---|---|
@@ -188,11 +193,14 @@ The 64.40% figure for `insert_string_roll` looks higher than the 43% from the de
188193

189194
The function received *fewer* absolute samples with `zlib-ng`, meaning less real time was spent there. It appears as a higher percentage only because the total run time — and therefore total samples — shrank so much.
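The arithmetic behind this comparison can be sketched as follows. The example assumes a single-threaded workload sampled at a steady 99 Hz, so total samples are approximately frequency multiplied by elapsed time. The per-function figures are estimates derived from the percentages quoted above, not values read from `perf` itself.

```python
SAMPLING_HZ = 99  # from the -F 99 flag passed to perf record


def expected_samples(elapsed_seconds, hz=SAMPLING_HZ):
    """Approximate number of samples collected during one run."""
    return elapsed_seconds * hz


def absolute_samples(share_percent, total_samples):
    """Convert a perf report percentage into an estimated sample count."""
    return share_percent / 100 * total_samples


default_total = expected_samples(4.85)  # default zlib run time
ng_total = expected_samples(1.83)       # zlib-ng run time

# Hotspot share in each run, as quoted in the text above.
default_hotspot = absolute_samples(43.0, default_total)
ng_hotspot = absolute_samples(64.40, ng_total)

print(f"Total samples:   default={default_total:.0f}  zlib-ng={ng_total:.0f}")
print(f"Hotspot samples: default={default_hotspot:.0f}  zlib-ng={ng_hotspot:.0f}")
```

A larger percentage of a much smaller total still yields fewer samples, which is why percentages from separate runs should not be compared directly.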
190195

191-
## Generate the new flame graph
196+
## Generate a new flame graph for zlib-ng
197+
198+
Run `perf record` again with `zlib-ng`:
192199

193200
```console
194201
LD_PRELOAD=/usr/local/lib/libz.so.1 perf record -F 99 --call-graph dwarf python ./zip.py
195202
```
203+
Convert the recorded data to folded stacks and verify the output is non-empty before generating the SVG:
196204

197205
```console
198206
perf script > out.perf-script
@@ -205,10 +213,10 @@ Copy the file `flamegraph2.svg` to your computer. Open it in a browser or other
205213
Flame graphs have no time axis — frame width represents the proportion of total samples, and each SVG scales to fill its full width regardless of how long the run took. This means you cannot compare absolute widths across the two graphs. What you can compare is the *relative proportion* that `zlib` occupies within each graph:
206214

207215
- In `flamegraph1.svg`, look at what fraction of the total width is occupied by `libz` frames. Then check the same in `flamegraph2.svg`. The `zlib-ng` run should show a similar or slightly larger fraction — because the run is 2.6x shorter but the library is still doing the same work. The meaningful comparison is the `perf stat` cycle and time data from the previous section, not the flame graph widths.
208-
- **What the flame graph is useful for here** is identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg` you should see `crc32_z` as a visible frame. In `flamegraph2.svg` it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame — confirming the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.
216+
- What the flame graph is useful for is identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg` you should see `crc32_z` as a visible frame. In `flamegraph2.svg` it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame — confirming the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.
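If you prefer a number to eyeballing frame widths, you can compute the same proportion from the folded stack data that feeds the flame graph. The sketch below assumes the folded format produced by the FlameGraph `stackcollapse-perf.pl` script, where each line is a semicolon-separated stack followed by a space and a sample count. The function name `library_share`, the sample stacks, and the `libz` search string are all illustrative assumptions; the string that identifies `zlib` frames in your data depends on how `perf` resolved the symbols.

```python
def library_share(folded_lines, needle):
    """Fraction of samples whose stack contains a frame matching needle."""
    total = 0
    matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # Folded format: "frame1;frame2;...;frameN <count>"
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        total += count
        if needle in stack:
            matched += count
    return matched / total if total else 0.0


# Hypothetical folded stacks for illustration, not real profiling data.
example = [
    "python;gzip_write;libz.so.1.3;deflate 60",
    "python;gzip_write;libz.so.1.3;crc32_z 10",
    "python;read_file 30",
]

print(f"{library_share(example, 'libz'):.0%}")
```

To apply this to a real profile, pass the lines of your folded stack file instead of the `example` list. The result is still a within-run proportion, so the same caveat about comparing across runs applies.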
209217

210-
## Summary
218+
## What you've learned
211219

212-
In this Learning Path you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.
220+
In this Learning Path, you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.
213221

214222
`zlib-ng` is built for modern Arm platforms. Its Neon SIMD and ARMv8 CRC32 acceleration deliver significantly faster compression than the default system library, without requiring any changes to your application code. Using `LD_PRELOAD` makes it straightforward to test and adopt `zlib-ng` for any dynamically linked application. Python, nginx, PostgreSQL, and many others all benefit from the same approach.

content/learning-paths/servers-and-cloud-computing/zlib/py-zlib.md

Lines changed: 25 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -4,27 +4,36 @@ title: Improve Python application performance using zlib-ng
44
weight: 3
55
---
66

7+
## Reduce time taken by a Python application to compress data
8+
9+
In the previous section, you learned how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
10+
11+
In this section, you will accelerate the performance of an example Python application that compresses a large file and measure the performance difference when using `zlib-ng`.
12+
713
## Install necessary software packages
814

9-
Make sure `python3` is available when `python` is run:
15+
Ensure that `python3` is available when you run `python`:
1016

1117
```bash
1218
sudo apt install python-is-python3 -y
1319
```
1420

15-
## Compress files with Python and zlib-ng
16-
17-
The previous section explained how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
18-
19-
Use a Python example and measure the performance difference with `zlib-ng`.
21+
## Create a large file to compress
2022

2123
Navigate to your home directory before creating the example files:
2224

2325
```bash
2426
cd $HOME
2527
```
28+
To create an input file called `largefile`, use the `dd` command.
29+
30+
```bash
31+
dd if=/dev/zero of=largefile count=1M bs=1024
32+
```
2633

27-
Use a text editor to copy and save the code below in a file named `zip.py`.
34+
## Create an example Python file compression application
35+
36+
To create the Python application for compressing `largefile`, use a text editor to copy and save the following code in a file named `zip.py`.
2837

2938
```python { file_name="zip.py" }
3039
import gzip
@@ -38,20 +47,11 @@ with open('largefile', 'rb') as f_in:
3847

3948
f_out.close()
4049
```
50+
The Python file compression application will read `largefile` as input and write a compressed version as `largefile.gz`.
4151

42-
## Create a large file to compress
43-
44-
The above Python code will read a file named `largefile` and write a compressed version as `largefile.gz`.
45-
46-
To create the input file, use the `dd` command.
47-
48-
```bash
49-
dd if=/dev/zero of=largefile count=1M bs=1024
50-
```
51-
52-
## Run the example using the default zlib
52+
## Compress the file using the Python application and default zlib
5353

54-
Run with the default `zlib` and time the execution.
54+
Run `zip.py` with the default `zlib` and time the execution.
5555

5656
```bash
5757
time python zip.py
@@ -67,7 +67,7 @@ sys 0m0.117s
6767

6868
Make a note of the `real` time.
6969

70-
## Run the example again with zlib-ng
70+
## Compress the file again using the Python application and zlib-ng
7171

7272
This time, use `LD_PRELOAD` to switch to `zlib-ng` and measure the performance difference.
7373

@@ -84,7 +84,10 @@ real 0m1.759s
8484
user 0m1.654s
8585
sys 0m0.105s
8686
```
87+
In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement — driven by the Neon-accelerated adler32 and inflate chunk copy routines.
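To confirm which library Python actually loaded, you can compare the compile-time and runtime version strings that the standard `zlib` module exposes. This is a minimal sketch that assumes your Python build links `zlib` dynamically, so that `LD_PRELOAD` can take effect; the helper name `zlib_versions` is hypothetical.

```python
import zlib


def zlib_versions():
    """Return the zlib version Python was built against and the one loaded now."""
    return {
        "compiled": zlib.ZLIB_VERSION,
        "runtime": zlib.ZLIB_RUNTIME_VERSION,
    }


info = zlib_versions()
print(f"compiled against:  {info['compiled']}")
print(f"loaded at runtime: {info['runtime']}")
# zlib-ng in compatibility mode reports a version string with a zlib-ng suffix.
print("zlib-ng active:", "zlib-ng" in info["runtime"])
```

If you save this to a file and run it with and without the `LD_PRELOAD` prefix used above, the runtime version should change only when the preload is in effect.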
88+
89+
## What you've learned and what's next
8790

88-
Compare the `real` time against the default `zlib` run. In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement — driven by the Neon-accelerated adler32 and inflate chunk copy routines.
91+
In this section, you used `zlib-ng` to accelerate the performance of an example Python file compression application. You compared the difference in performance between `zlib` and `zlib-ng`.
8992

90-
The next section introduces how to use Linux `perf` to profile applications and look for `zlib` activity.
93+
In the next section, you will learn how to use Linux `perf` to profile applications and look for `zlib` activity.

content/learning-paths/servers-and-cloud-computing/zlib/setup.md

Lines changed: 19 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,16 @@ layout: "learningpathall"
77

88
## Overview
99

10-
Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon SIMD and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.
10+
Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon Single Instruction Multiple Data (SIMD) and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.
1111

12-
`zlib-ng` is an actively maintained fork of zlib designed for modern systems. It includes Neon SIMD acceleration for adler32, inflate chunk copying, and hash operations, plus ARMv8 CRC32 and PMULL acceleration. It also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.
12+
Designed for modern systems, `zlib-ng` is an actively maintained fork of `zlib` that includes the following enhancements:
13+
14+
- Neon SIMD acceleration for adler32
15+
- Inflate chunk copying
16+
- Hash operations
17+
- ARMv8 CRC32 and PMULL acceleration
18+
19+
`zlib-ng` also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.
1320

1421
## Confirm Arm SIMD capabilities in the processor
1522

@@ -23,7 +30,7 @@ lscpu | grep -E "asimd|crc32"
2330

2431
The `asimd` flag indicates Neon (Advanced SIMD) support. The `crc32` flag confirms hardware-accelerated CRC32. Both are present on all modern Arm server platforms, including AWS Graviton, Azure Cobalt 100, and Google Axion.
2532

26-
## Check what the default zlib contains
33+
## Install prerequisite packages and inspect default zlib library
2734

2835
Install the packages you need for this Learning Path:
2936

@@ -40,7 +47,7 @@ You can inspect the default library with `objdump` to see whether it contains an
4047
objdump -d /usr/lib/aarch64-linux-gnu/libz.so.1 | grep -cE "crc32|pmull|v[0-9]+\.(16b|8b|8h|4h|4s|2s|2d|1d)"
4148
```
4249

43-
Recent Ubuntu releases on aarch64 typically return a non-zero value here — the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. Installing `zlib-ng` is worthwhile on any Arm server regardless of what the system `zlib` already contains.
50+
Recent Ubuntu releases on aarch64 typically return a non-zero value here — the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. This makes installing `zlib-ng` worthwhile on any Arm server regardless of what the system `zlib` already contains.
4451

4552
## Build and install zlib-ng
4653

@@ -93,7 +100,7 @@ The output will show both the system `zlib` and the newly installed `zlib-ng`:
93100

94101
## Configure and test zlib-ng
95102

96-
Since `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.
103+
Because `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.
97104

98105
Navigate back to your home directory before creating the test files:
99106

@@ -137,7 +144,7 @@ Run the program to confirm the system `zlib` version:
137144
./test
138145
```
139146

140-
The output will be the version of the system library, for example:
147+
The output will be the version of the system library. For example:
141148

142149
```output
143150
1.3
@@ -168,10 +175,14 @@ LD_PRELOAD=/usr/local/lib/libz.so.1 ./test
168175

169176
The `LD_PRELOAD` environment variable tells the dynamic linker to load this library before the system default.
170177

171-
The `zlib-ng` version identifier will print. The version string includes the `.zlib-ng` suffix to distinguish it from the upstream library:
178+
The output shows the `zlib-ng` version:
172179

173180
```output
174181
1.3.1.zlib-ng
175182
```
183+
The version identifier includes the `.zlib-ng` suffix to distinguish it from the upstream library.
176184

177-
In the next section you will use `zlib-ng` to accelerate a Python application doing data compression.
185+
## What you've learned and what's next
186+
187+
In this section, you built, installed, and tested `zlib-ng`.
188+
In the next section you will use `zlib-ng` to accelerate a Python application that does data compression.
