content/learning-paths/servers-and-cloud-computing/zlib/_index.md
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Learn how to build and use zlib-ng on Arm servers
+title: Improve data compression performance on Arm servers with zlib-ng

 description: Learn how to build and use zlib-ng on Arm servers, using its Neon SIMD and ARMv8 CRC32 optimizations to improve compression performance compared to the system default zlib.
 The first setting allows unprivileged access to PMU counters. The second allows `perf` to read kernel symbol addresses from `/proc/kallsyms`, which is needed for complete call graph resolution in flame graphs.

-For more information refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).
+For more information, refer to the [Linux kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#perf-event-paranoid).

 ## Profile the default zlib with perf

-The previous section explained how to run a Python program to compress large files and measure performance with `zlib-ng`. Now use `perf` to look at the performance.
-
-Continue with the same `zip.py` program as the previous section. Make sure to start with `zip.py` and `largefile` available. Confirm the application is working and `largefile.gz` is created when it is run.
+Make sure that the `zip.py` program and `largefile` from the [previous section](/learning-paths/servers-and-cloud-computing/zlib/py-zlib/) are available. Confirm the application is working and `largefile.gz` is created when it is run.

 ```console
 python zip.py
 ```

-## Run the example with perf using the default zlib
+## Run the example with perf stat using the default zlib

 Run `perf stat` with a set of basic hardware events that are reliably available on virtualized Arm instances:
@@ -65,7 +69,7 @@ The output is similar to:
 0.126034000 seconds sys
 ```

-## Use perf record and generate the flame graph
+## Use perf record and generate a flame graph for zlib

 You can also record the application activity with `perf record`. `-F` specifies the sampling frequency and `--call-graph dwarf` collects call graphs using DWARF debug info, which works more reliably than frame pointers for Python on Arm cloud VMs:
-On cloud VMs you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal: `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
+On cloud VMs, you may see warnings about kernel address maps and `kptr_restrict` even after setting it to 0. These are non-fatal: `perf record` still captures userspace samples successfully. Kernel frames in the flame graph may be unresolved, but the `zlib` and Python frames that matter for this analysis will be present.
 {{% /notice %}}

 To visualize the results, install `FlameGraph`:
@@ -99,7 +103,7 @@ The line count should be greater than 0. Then generate the flame graph:
 Copy the file `flamegraph1.svg` to your computer and open it in a browser or another image application.

-## Look at the report
+## Inspect the perf report

 As an alternative, use `perf report` to inspect the profiling data:
@@ -117,11 +121,11 @@ The output is similar to:
 The key observation is that `libz.so.1.3` accounts for a large share of the self time: roughly 43% in the primary deflate path and another 8% in `crc32_z`. This confirms that `zlib` compression is the dominant hotspot in the workload, and that the CRC32 calculation alone is a measurable fraction of total runtime. The many `[unknown]` frames are Python interpreter internals that `perf` cannot resolve without debug symbols, but they do not affect the conclusion.

-In the next section you will run the same workload with `zlib-ng` and compare the hotspot profile.
+In the next step, you will run the same workload with `zlib-ng` and compare the hotspot profile.

 ## Run the example again with perf stat and zlib-ng

-This time use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.
+This time, use `LD_PRELOAD` to switch to `zlib-ng` and check the performance difference.

 ```console
 LD_PRELOAD=/usr/local/lib/libz.so.1 perf stat -e cycles,instructions,cache-references,cache-misses python ./zip.py
@@ -149,7 +153,7 @@ Comparing against the default `zlib` run:
 There are two significant changes compared to the default `zlib` report:

 - **`libz.so.1.3.1.zlib-ng` is now named in the report**, confirming that `LD_PRELOAD` loaded `zlib-ng` correctly. The default `zlib` report showed an unresolved address at 43% self time; here the same hotspot is identified as `insert_string_roll`, `zlib-ng`'s Neon-accelerated hash chain insertion function.
-
 - **`crc32_z` has disappeared from the top entries entirely.** In the default `zlib` run, `crc32_z` accounted for 8.37% of samples. With `zlib-ng`, ARMv8 hardware CRC32 instructions execute fast enough that CRC32 no longer appears as a measurable hotspot.

-The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The zlib-ng run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`. The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration: approximately 181 samples for zlib-ng versus 480 for the default `zlib`. The absolute sample counts tell a different story:
+The 64.40% figure for `insert_string_roll` looks higher than the 43% from the default `zlib` run, but `perf report` percentages are relative to samples collected *within that run*, not across runs. The `zlib-ng` run completed in 1.83 seconds versus 4.85 seconds for the default `zlib`.
+
+The `-F 99` flag used in the `perf record` command sets a sampling frequency of 99 Hz, so the number of samples collected is roughly proportional to the run duration: approximately 181 samples for zlib-ng versus 480 for the default `zlib`. The absolute sample counts tell a different story:

 | Function | Default zlib | zlib-ng |
 |---|---|---|
@@ -188,11 +193,14 @@ The 64.40% figure for `insert_string_roll` looks higher than the 43% from the de
 The function received *fewer* absolute samples with `zlib-ng`, meaning less real time was spent there. It appears as a higher percentage only because the total run time (and therefore total samples) shrank so much.
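
The sample-count reasoning above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the figures quoted in the text (99 Hz sampling, run times of 4.85 s and 1.83 s, and self-time shares of about 43% and 64.40%), and the function names are hypothetical. The results are estimates derived from those figures, not the measured counts from the table.

```python
def expected_samples(frequency_hz: float, runtime_s: float) -> int:
    """Estimate how many samples perf collects at a fixed sampling frequency."""
    return round(frequency_hz * runtime_s)


def absolute_samples(total_samples: int, share_percent: float) -> float:
    """Convert a perf report percentage back to an approximate sample count."""
    return total_samples * share_percent / 100.0


# Figures quoted in the text; treat them as assumptions, not new measurements.
FREQUENCY_HZ = 99
default_total = expected_samples(FREQUENCY_HZ, 4.85)
zlib_ng_total = expected_samples(FREQUENCY_HZ, 1.83)

default_hotspot = absolute_samples(default_total, 43.0)
zlib_ng_hotspot = absolute_samples(zlib_ng_total, 64.40)

print(f"default zlib: {default_total} samples, about {default_hotspot:.0f} in the hotspot")
print(f"zlib-ng:      {zlib_ng_total} samples, about {zlib_ng_hotspot:.0f} in the hotspot")
```

A larger percentage of a smaller total can still be a smaller absolute count, which is the point the comparison makes.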
-## Generate the new flame graph
+## Generate a new flame graph for zlib-ng
+
+Run `perf record` again with `zlib-ng`:

 ```console
 LD_PRELOAD=/usr/local/lib/libz.so.1 perf record -F 99 --call-graph dwarf python ./zip.py
 ```
+Convert the recorded data to folded stacks and verify the output is non-empty before generating the SVG:

 ```console
 perf script > out.perf-script
@@ -205,10 +213,10 @@ Copy the file `flamegraph2.svg` to your computer. Open it in a browser or other
 Flame graphs have no time axis: frame width represents the proportion of total samples, and each SVG scales to fill its full width regardless of how long the run took. This means you cannot compare absolute widths across the two graphs. What you can compare is the *relative proportion* that `zlib` occupies within each graph:

 - In `flamegraph1.svg`, look at what fraction of the total width is occupied by `libz` frames. Then check the same in `flamegraph2.svg`. The `zlib-ng` run should show a similar or slightly larger fraction, because the run is 2.6x shorter but the library is still doing the same work. The meaningful comparison is the `perf stat` cycle and time data from the previous section, not the flame graph widths.
-- **What the flame graph is useful for here** is identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg` you should see `crc32_z` as a visible frame. In `flamegraph2.svg` it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame, confirming the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.
+- What the flame graph is useful for is identifying which functions dominate within the `zlib` stack. In `flamegraph1.svg` you should see `crc32_z` as a visible frame. In `flamegraph2.svg` it should be absent or too narrow to label, replaced by `insert_string_roll` as the top frame, confirming the hotspot has shifted from CRC32 to hash insertion after `zlib-ng`'s ARMv8 CRC32 acceleration removes the previous bottleneck.

-## Summary
+## What you've learned

-In this Learning Path you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.
+In this Learning Path, you replaced the system `zlib` with `zlib-ng` on an Arm server and measured the performance improvement.

 `zlib-ng` is built for modern Arm platforms. Its Neon SIMD and ARMv8 CRC32 acceleration deliver significantly faster compression than the default system library, without requiring any changes to your application code. Using `LD_PRELOAD` makes it straightforward to test and adopt `zlib-ng` for any dynamically linked application. Python, nginx, PostgreSQL, and many others all benefit from the same approach.
+## Reduce time taken by a Python application to compress data
+
+In the previous section, you learned how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
+
+In this section, you will accelerate the performance of an example Python application that compresses a large file and measure the performance difference when using `zlib-ng`.
+
 ## Install necessary software packages

-Make sure `python3` is available when `python` is run:
+Ensure that `python3` is available when you run `python`:

 ```bash
 sudo apt install python-is-python3 -y
 ```

-## Compress files with Python and zlib-ng
-
-The previous section explained how to build `zlib-ng` with Neon SIMD and ARMv8 CRC32 optimizations enabled.
-
-Use a Python example and measure the performance difference with `zlib-ng`.
+## Create a large file to compress

 Navigate to your home directory before creating the example files:

 ```bash
 cd $HOME
 ```
+To create an input file called `largefile`, use the `dd` command.
+
+```bash
+dd if=/dev/zero of=largefile count=1M bs=1024
+```

-Use a text editor to copy and save the code below in a file named `zip.py`.
+## Create an example Python file compression application
+
+To create the Python application for compressing `largefile`, use a text editor to copy and save the following code in a file named `zip.py`.

 ```python { file_name="zip.py" }
 import gzip
@@ -38,20 +47,11 @@ with open('largefile', 'rb') as f_in:
 f_out.close()
 ```
+The Python file compression application will read `largefile` as input and write a compressed version as `largefile.gz`.

-## Create a large file to compress
-
-The above Python code will read a file named `largefile` and write a compressed version as `largefile.gz`.
-
-To create the input file, use the `dd` command.
-
-```bash
-dd if=/dev/zero of=largefile count=1M bs=1024
-```
-
-## Run the example using the default zlib
+## Compress the file using the Python application and default zlib

-Run with the default `zlib` and time the execution.
+Run `zip.py` with the default `zlib` and time the execution.

 ```bash
 time python zip.py
@@ -67,7 +67,7 @@ sys 0m0.117s
 Make a note of the `real` time.

-## Run the example again with zlib-ng
+## Compress the file again using the Python application and zlib-ng

 This time, use `LD_PRELOAD` to switch to `zlib-ng` and measure the performance difference.
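
The comparison this file walks through (time `zip.py` with the default `zlib`, then again with `LD_PRELOAD`) could be scripted roughly as follows. This is a minimal sketch, not part of the Learning Path: `time_command` and `speedup` are hypothetical helper names, and the library path in the usage comment assumes `zlib-ng` was installed to `/usr/local/lib` as in the commands shown in this pull request.

```python
import os
import subprocess
import sys
import time
from typing import Optional, Sequence


def time_command(cmd: Sequence[str], preload: Optional[str] = None) -> float:
    """Run cmd once and return its wall-clock time in seconds.

    If preload is given, it is passed to the child process as LD_PRELOAD.
    """
    env = dict(os.environ)
    if preload is not None:
        env["LD_PRELOAD"] = preload
    start = time.perf_counter()
    subprocess.run(list(cmd), env=env, check=True)
    return time.perf_counter() - start


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate run is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    # Smoke check with a trivial command that exists wherever this script runs.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"trivial command took {elapsed:.3f} s")

    # Times quoted in this pull request: 4.6 s with zlib, 1.8 s with zlib-ng.
    print(f"quoted speedup: {speedup(4.6, 1.8):.2f}x")

    # To measure on your own system (requires zip.py and largefile):
    #   base = time_command(["python", "zip.py"])
    #   fast = time_command(["python", "zip.py"], "/usr/local/lib/libz.so.1")
    #   print(f"measured speedup: {speedup(base, fast):.2f}x")
```

A single run is noisy, so repeating each measurement a few times and comparing the best or median times gives a steadier ratio than one pair of runs.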
@@ -84,7 +84,10 @@ real 0m1.759s
 user 0m1.654s
 sys 0m0.105s
 ```
+In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement, driven by the Neon-accelerated adler32 and inflate chunk copy routines.
+
+## What you've learned and what's next

-Compare the `real` time against the default `zlib` run. In this example, `zlib-ng` reduces compression time from 4.6 seconds to 1.8 seconds, roughly a 2.6x improvement, driven by the Neon-accelerated adler32 and inflate chunk copy routines.
+In this section, you used `zlib-ng` to accelerate the performance of an example Python file compression application. You compared the difference in performance between `zlib` and `zlib-ng`.

-The next section introduces how to use Linux `perf` to profile applications and look for `zlib` activity.
+In the next section, you will learn how to use Linux `perf` to profile applications and look for `zlib` activity.
content/learning-paths/servers-and-cloud-computing/zlib/setup.md
Lines changed: 19 additions & 8 deletions
@@ -7,9 +7,16 @@ layout: "learningpathall"
 ## Overview

-Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon SIMD and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.
+Most Linux distributions ship `zlib` without Arm-specific optimizations. This means instruction extensions such as Neon Single Instruction Multiple Data (SIMD) and ARMv8 CRC32 are not used, leaving significant performance on the table for compression-heavy workloads.

-`zlib-ng` is an actively maintained fork of zlib designed for modern systems. It includes Neon SIMD acceleration for adler32, inflate chunk copying, and hash operations, plus ARMv8 CRC32 and PMULL acceleration. It also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.
+Designed for modern systems, `zlib-ng` is an actively maintained fork of `zlib` that includes the following enhancements:
+
+- Neon SIMD acceleration for adler32
+- Inflate chunk copying
+- Hash operations
+- ARMv8 CRC32 and PMULL acceleration
+
+`zlib-ng` also supports a zlib-compatible API mode, allowing you to use it as a drop-in replacement without recompiling your applications.

 ## Confirm Arm SIMD capabilities in the processor
@@ -23,7 +30,7 @@ lscpu | grep -E "asimd|crc32"
 The `asimd` flag indicates Neon (Advanced SIMD) support. The `crc32` flag confirms hardware-accelerated CRC32. Both are present on all modern Arm server platforms, including AWS Graviton, Azure Cobalt 100, and Google Axion.

-## Check what the default zlib contains
+## Install prerequisite packages and inspect default zlib library

 Install the packages you need for this Learning Path:
@@ -40,7 +47,7 @@ You can inspect the default library with `objdump` to see whether it contains an
-Recent Ubuntu releases on aarch64 typically return a non-zero value here: the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. Installing `zlib-ng` is worthwhile on any Arm server regardless of what the system `zlib` already contains.
+Recent Ubuntu releases on aarch64 typically return a non-zero value here: the system `zlib` has *some* Neon code, but the coverage is partial. `zlib-ng` provides dedicated Neon paths for adler32, inflate chunk copying, slide hash, and compare256, with significantly more instruction-level parallelism than the default library. This makes installing `zlib-ng` worthwhile on any Arm server regardless of what the system `zlib` already contains.

 ## Build and install zlib-ng
@@ -93,7 +100,7 @@ The output will show both the system `zlib` and the newly installed `zlib-ng`:
 ## Configure and test zlib-ng

-Since `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.
+Because `zlib-ng` is a shared library, you can configure which version an application uses without relinking it.

 Navigate back to your home directory before creating the test files:
@@ -137,7 +144,7 @@ Run the program to confirm the system `zlib` version:
 ./test
 ```

-The output will be the version of the system library, for example:
+The output will be the version of the system library. For example:
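
A similar version check can be made from Python, which is relevant because the workload in this pull request is a Python program. This is a sketch under assumptions: it relies on the standard library's `zlib.ZLIB_RUNTIME_VERSION` attribute, and it assumes that a `zlib-ng` build in zlib-compatible mode reports a version string containing `zlib-ng` (as the `libz.so.1.3.1.zlib-ng` name seen in the perf report suggests). `is_zlib_ng` is a hypothetical helper name.

```python
import zlib


def is_zlib_ng(version: str) -> bool:
    """Return True if a zlib version string looks like it came from zlib-ng."""
    return "zlib-ng" in version


# ZLIB_VERSION is the version Python was compiled against;
# ZLIB_RUNTIME_VERSION is the version of the library actually loaded.
print("compiled against:", zlib.ZLIB_VERSION)
print("loaded at runtime:", zlib.ZLIB_RUNTIME_VERSION)
print("zlib-ng in use:", is_zlib_ng(zlib.ZLIB_RUNTIME_VERSION))
```

Running it once normally and once with `LD_PRELOAD=/usr/local/lib/libz.so.1` should show whether the preload took effect, provided the Python build links `zlib` dynamically.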