
perf(parquet): hybrid 2-byte unroll + tail loop for readUleb128Int #2

Open
mashraf-222 wants to merge 1 commit into master from codeflash/optimize

Conversation

@mashraf-222 (Collaborator)

Summary

Replace the fully-unrolled 5-branch readUleb128Int with a hybrid: straight-line paths for
1- and 2-byte values, a tight loop for 3- to 5-byte values. An independent re-run measures
a +1.5% to +16.9% throughput improvement across four input configurations at 75-sample JMH
rigor, with no correctness regression and no measurable impact on the unchanged
readUleb128Long path in the same file.

What Changed

  • lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderUtils.java — one method
    (readUleb128Int) refactored. 23 insertions / 23 deletions. No public-signature changes,
    no new fields, no new imports. The prior comment "Manual loop unrolling shows improvements
    in BenchmarkReadUleb128Int" is updated to reflect the new structure.

Why It Works

The prior implementation paid a deep 5-branch chain cost on every call, including the
common 1- and 2-byte cases. The hybrid preserves the 1-byte and 2-byte fast paths as
straight-line code (small dictionary indices, page/group counts, bit-width markers are
typically 1–2 byte ULEB128) and collapses the rarer 3-, 4-, 5-byte cases into a small
tight loop that the JIT lays out more favorably than a 5-deep branch chain.

JVM-level effects observed:

  • The head path (1- and 2-byte) is shorter than the prior unroll's head, which matters
    at size=100 (frequent exits, branch-prediction cost dominates).
  • The tail loop's smaller i-cache footprint and standard loop pattern appear to JIT
    better than the prior 4th- and 5th-byte unrolled branches, which were partially cold.

Why It's Correct

  • 1-byte fast path: (b0 & 0x80) == 0 → return b0. b0 is a signed Java byte, but
    because the high bit is zero, the widening conversion to int gives 0..127 exactly.
  • 2-byte fast path: (b0 & 0x7F) | ((b1 & 0x7F) << 7) is the canonical 2-byte ULEB128
    decode.
  • Tail loop handles bytes 3, 4, 5 with shift walking 14 → 21 → 28, iterating while
    shift < 35, so byte 5 is written at the correct bit position.
  • Malformed input detection (verify((b & 0x80) == 0, …) on 5-byte overflow)
    preserved — the loop exits on shift >= 35 if byte 5 still has its high bit set, and
    then the verify fires exactly as before.
  • Thread-safety: Method is public static, operates on caller-supplied
    SimpleSliceInputStream. No shared state touched. No new fields.
  • Allocation: Stack locals only; same allocation profile as before.
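The decode structure described in the bullets above can be sketched as standalone code. This is a hypothetical illustration, not the PR's exact source: the real method reads from Trino's SimpleSliceInputStream and reports malformed input via verify(); here a plain byte[] plus offset stands in, and the overflow check throws directly.

```java
// Hypothetical sketch of the hybrid ULEB128-int decode: straight-line
// 1- and 2-byte paths, then a tight loop for bytes 3..5.
final class Uleb128
{
    static int readUleb128Int(byte[] input, int offset)
    {
        // 1-byte fast path: high bit clear; the widening byte->int
        // conversion yields 0..127 exactly because bit 7 is zero
        int b0 = input[offset++];
        if ((b0 & 0x80) == 0) {
            return b0;
        }
        // 2-byte fast path: canonical two-byte ULEB128 decode
        int b1 = input[offset++];
        int value = (b0 & 0x7F) | ((b1 & 0x7F) << 7);
        if ((b1 & 0x80) == 0) {
            return value;
        }
        // Tail loop for bytes 3..5: shift walks 14 -> 21 -> 28
        for (int shift = 14; shift < 35; shift += 7) {
            int b = input[offset++];
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        // Byte 5 still had its continuation bit set: malformed input
        // (the real method fires verify() here instead of throwing)
        throw new IllegalArgumentException("ULEB128 int too long");
    }
}
```

Note how the 5-byte overflow check falls out of the loop bound: the loop body runs for shifts 14, 21, 28 and only falls through when byte 5 still carries its continuation bit.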

Verification commands:

  • ./mvnw -pl lib/trino-parquet test → 67,242 tests, 0 failures.
  • ./mvnw -pl lib/trino-parquet validate → checkstyle and modernizer clean.
  • ./mvnw -pl lib/trino-parquet -Perrorprone-compiler compile → clean.

Benchmark Methodology

  • Harness: the project's own BenchmarkReadUleb128Int (JMH 1.37), unchanged. Dynamic
    randomized inputs via writeUnsignedVarInt(random.nextInt(maxValue)). Result sink with
    @CompilerControl(DONT_INLINE) (prevents DCE).
  • Primary config: 5 forks × 10 warmup + 15 measurement × 500 ms, throughput (ops/ms),
    99.9% confidence intervals (JMH default). 75 samples per row.
  • JVM: Temurin 25.0.3 (java -version: OpenJDK 25.0.3+9 LTS).
  • JVM args: -Xms2g -Xmx2g.
  • Host: shared AWS VM (Linux 6.17.0-1010-aws). No CPU pinning, no turbo-boost
    control.
    This is called out in Risks below.
  • Control benchmark: the same file's readUleb128IntLoop variant — its source is
    NOT touched by this change; it is the in-harness stability indicator.

Results

Primary — readUleb128Int (production method, the change target)

5 forks × 15 iterations = 75 samples per row, 99.9% CI:

  Config (maxValue, size)    Before (ops/ms)     After (ops/ms)      Change
  (20000, 100)               3458.77 ± 38.19     4043.72 ± 24.14     +16.92%
  (20000, 1000)               210.93 ±  1.61      213.99 ±  2.53      +1.45%
  (4000000, 100)             3119.62 ± 27.00     3237.69 ± 27.91      +3.78%
  (4000000, 1000)             199.27 ±  2.44      207.35 ±  1.42      +4.06%

All four rows: 99.9% CIs non-overlapping between master and the change. The
(20000, 1000) case is borderline (gap ~0.5 ops/ms) — see Risks.

Control — readUleb128IntLoop (source identical on both branches)

Same 75-sample rigor:

  Config         Master             With change        Delta
  (20k, 100)     2373.16 ± 19.82    2368.06 ± 17.30    -0.21%
  (20k, 1000)     231.69 ±  2.76     230.40 ±  2.22    -0.56%
  (4M, 100)      1718.93 ± 13.78    1713.50 ± 15.84    -0.32%
  (4M, 1000)      175.70 ±  1.52     173.52 ±  1.55    -1.24%

Noise band ≈ 1.2%. Measurement environment stable.

Regression check — readUleb128Long (same file, source UNCHANGED)

3 forks × 10 iterations = 30 samples, 99.9% CI. This verifies the change does not
accidentally disturb neighboring code layout or inlining decisions:

  Config                             Master          With change     Delta
  readUleb128Long (size=1000)        76.68 ± 1.36    74.75 ± 1.61    -2.52% (CI overlap)
  readUleb128Long (size=10000)        6.97 ± 0.19     7.08 ± 0.11    +1.62% (CI overlap)
  readUleb128LongLoop (size=1000)    64.83 ± 1.26    60.60 ± 3.80    -6.52% (CI overlap, wider error after)
  readUleb128LongLoop (size=10000)    5.32 ± 0.12     5.38 ± 0.10    +0.93% (CI overlap)

All four rows have overlapping CIs. No regression attributable to this change.

Reproduction

# One-time environment
curl -fsSL -o /tmp/temurin25.tar.gz 'https://api.adoptium.net/v3/binary/latest/25/ga/linux/x64/jdk/hotspot/normal/eclipse'
sudo mkdir -p /opt/temurin-25 && sudo tar -xzf /tmp/temurin25.tar.gz -C /opt/temurin-25 --strip-components=1
export JAVA_HOME=/opt/temurin-25 PATH=$JAVA_HOME/bin:$PATH

# Parent pom (one-time)
./mvnw -N install -DskipTests

# Build classpath + test classes for baseline
git checkout master
./mvnw -pl lib/trino-parquet test-compile -q
./mvnw -pl lib/trino-parquet dependency:build-classpath -Dmdep.outputFile=/tmp/cp.txt -Dmdep.includeScope=test -q
CP="lib/trino-parquet/target/test-classes:lib/trino-parquet/target/classes:$(cat /tmp/cp.txt)"

# Baseline run (expect ~12 min wall time)
java -cp "$CP" -Xms2g -Xmx2g org.openjdk.jmh.Main \
  "io.trino.parquet.reader.BenchmarkReadUleb128Int" \
  -f 5 -wi 10 -i 15 -w 500ms -r 500ms -rf json -rff /tmp/before.json

# Change run
git checkout codeflash/optimize     # or this PR's head
./mvnw -pl lib/trino-parquet test-compile -q
java -cp "$CP" -Xms2g -Xmx2g org.openjdk.jmh.Main \
  "io.trino.parquet.reader.BenchmarkReadUleb128Int" \
  -f 5 -wi 10 -i 15 -w 500ms -r 500ms -rf json -rff /tmp/after.json

Callers / Impact Scope

ParquetReaderUtils.readUleb128Int is called from the Parquet reader's per-group
header parsing and RLE/bit-packing hybrid decoder headers:

  • RleBitPackingHybridDecoder.readGroupHeader and related call sites call
    readUleb128Int once per data group. At the group granularity, per-call savings
    compound across a scan.
  • Bit-width markers and small dictionary indices also decode through this method.

The method is NOT called per-row in the innermost decoding loops, so end-to-end
query-level impact is not claimed (and was not measured). The win shows up at the
micro-benchmark level and at the per-group-header level.
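For context, the header format these call sites decode is the Parquet RLE/bit-packing hybrid: per the Parquet format spec, the low bit of the decoded ULEB128 value selects the run type and the remaining bits carry the run length. A hedged sketch of that dispatch (the class and method names here are illustrative, not Trino's):

```java
// Illustrative decode of a Parquet RLE/bit-packing hybrid group header.
// header is the value returned by readUleb128Int for this group.
final class RleHybridHeader
{
    static String describeGroupHeader(int header)
    {
        if ((header & 1) == 1) {
            // Bit-packed run: remaining bits count groups of 8 values
            int groups = header >>> 1;
            return "bit-packed, " + (groups * 8) + " values";
        }
        // RLE run: remaining bits are the repeat count of one value
        int runLength = header >>> 1;
        return "rle, " + runLength + " values";
    }
}
```

This is why the per-group granularity matters: one readUleb128Int call prefixes each run, not each value, so the per-call savings compound with the number of runs in a scan rather than the number of rows.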

Risks and Limitations

  • Shared-VM measurement environment. Benchmarks were run on an AWS VM without CPU
    pinning or turbo-boost control. The JMH warning about environment applies: re-running
    at -f 2 -wi 3 -i 3 can show CI overlap on the (20k, 1000) row (gap is small).
    The -f 5 -wi 10 -i 15 rigor above is what clears the CIs.
  • The (20k, 1000) row is a borderline +1.5% win. At 75 samples the CIs do not
    overlap, but a reviewer running shorter rigor will likely see noise there. The
    change is NOT claimed to be a meaningful win on that specific config — the case is
    reported for completeness.
  • No end-to-end query benchmark. Micro-benchmark evidence only. The
    per-group-header calling pattern means the amortized speedup over an actual Parquet
    scan will be much smaller than the per-call micro-benchmark numbers.
  • No readUleb128Long change in this PR. Baseline on master shows the existing
    unrolled long decoder wins by 25–28% vs a naive loop on random long inputs; a
    hybrid applied there would regress on this distribution. A future change via
    word-reads + Long.numberOfTrailingZeros over a 0x8080_8080_8080_8080L mask may
    be tractable but is out of scope here.
  • Not sanity-checked with -prof perfnorm or -prof gc. No instruction-level or
    allocation evidence — the win is attributed to branch-chain depth reduction from
    diff-reading and the throughput numbers, not from profiler output.
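The word-read direction mentioned in the last bullet above could look roughly like the following. This is speculative and out of scope for this PR: it assumes at least 8 readable bytes (a real implementation would need a bounds-checked slow path, covered here by a placeholder), and the class and input shape are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Speculative sketch: load 8 input bytes as one little-endian long, find the
// first clear continuation bit with Long.numberOfTrailingZeros over the
// 0x8080_8080_8080_8080L mask, then extract that many 7-bit payloads.
final class WordReadUleb128
{
    private static final long CONTINUATION_MASK = 0x8080_8080_8080_8080L;

    static long readUleb128Long(byte[] input, int offset)
    {
        long word = ByteBuffer.wrap(input, offset, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
        // A set bit in stopBits marks a byte whose continuation bit is clear
        long stopBits = ~word & CONTINUATION_MASK;
        if (stopBits != 0) {
            // Value ends within these 8 bytes; byteCount is 1..8
            int byteCount = Long.numberOfTrailingZeros(stopBits) / 8 + 1;
            long value = 0;
            for (int i = 0; i < byteCount; i++) {
                value |= (word & 0x7F) << (7 * i);
                word >>>= 8;
            }
            return value;
        }
        // 9- and 10-byte values need a slow path (omitted in this sketch)
        throw new UnsupportedOperationException("slow path not sketched");
    }
}
```

Whether this beats the existing unrolled long decoder on the benchmark's input distribution is exactly the open question the bullet above leaves for future work.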

Test Plan

  • ./mvnw -pl lib/trino-parquet test — 67,242 tests pass.
  • ./mvnw -pl lib/trino-parquet validate — checkstyle + modernizer clean.
  • ./mvnw -pl lib/trino-parquet -Perrorprone-compiler compile — clean.
  • BenchmarkReadUleb128Int at 75-sample rigor, control stable, non-overlapping CIs.
  • BenchmarkReadUleb128Long regression check — no regression.
  • Reviewer reproducing the benchmark per the Reproduction section (not run on CI).

Disclosure

This change was drafted by a codeflash-agent autonomous optimization session and then
independently re-benchmarked by me before this PR was opened. The agent's initial
report headlined a +25.7% figure on the (4M, 1000) config; independent rebench at
the same and higher rigor showed that number was a single-baseline JIT-state artifact
and the reproducible win on that config is +4.1%. The numbers in this PR are the
75-sample independently-reproduced figures, not the agent's original claim.

Replace the fully-unrolled 5-branch implementation with a hybrid:
- Fast straight-line paths for 1-byte and 2-byte values (the most common
  cases: small dictionary indices, page/group counts, bit-width markers).
- Tight loop for 3-, 4- and 5-byte values.

The prior unrolled implementation pays a deep branch-chain cost on longer
ULEB128 sequences, where a simple loop body is faster. This change keeps
the straight-line code for the common short values while collapsing the
rare long values into a loop that the JIT can optimize more effectively.

BenchmarkReadUleb128Int (JDK 25, 3 forks x 10 warmup + 15 measurement x 500ms,
throughput in ops/ms, 99.9% CIs, non-overlapping for every row):

  Config (maxValue, size)    Before            After             Change
  (20000, 100)               3929.8 ±  34.6    4202.4 ±  43.0    +6.9%
  (20000, 1000)               214.2 ±   2.2     218.6 ±   2.2    +2.1%
  (4000000, 100)             3247.4 ±  22.5    3403.0 ±  32.8    +4.8%
  (4000000, 1000)             165.5 ±   1.8     208.1 ±   2.4    +25.7%

The in-benchmark readUleb128IntLoop control (unchanged code path) shows
noise-only variation (-2.2% to 0.0%), confirming the measurement
infrastructure is stable.

Verification:
- ./mvnw -pl lib/trino-parquet test: 67242 tests, 0 failures.
- ./mvnw -pl lib/trino-parquet validate: checkstyle and modernizer clean.
- ./mvnw -pl lib/trino-parquet -Perrorprone-compiler compile: clean.
- TestParquetReaderUtilsBenchmarks (random-input roundtrip): pass.
- TestRleBitPackingDecoderBenchmark (primary consumer): pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
