perf(parquet): hybrid 2-byte unroll + tail loop for readUleb128Int#2
Open
mashraf-222 wants to merge 1 commit intomasterfrom
Open
perf(parquet): hybrid 2-byte unroll + tail loop for readUleb128Int#2mashraf-222 wants to merge 1 commit intomasterfrom
mashraf-222 wants to merge 1 commit intomasterfrom
Conversation
Replace the fully-unrolled 5-branch implementation with a hybrid: - Fast straight-line paths for 1-byte and 2-byte values (the most common cases: small dictionary indices, page/group counts, bit-width markers). - Tight loop for 3-, 4- and 5-byte values. The prior unrolled implementation pays a deep branch-chain cost on longer ULEB128 sequences, where a simple loop body is faster. This change keeps the straight-line code for the common short values while collapsing the rare long values into a loop that the JIT can optimize more effectively. BenchmarkReadUleb128Int (JDK 25, 3 forks x 10 warmup + 15 measurement x 500ms, throughput in ops/ms, 99.9% CIs, non-overlapping for every row): Config (maxValue, size) Before After Change (20000, 100) 3929.8 ± 34.6 4202.4 ± 43.0 +6.9% (20000, 1000) 214.2 ± 2.2 218.6 ± 2.2 +2.1% (4000000, 100) 3247.4 ± 22.5 3403.0 ± 32.8 +4.8% (4000000, 1000) 165.5 ± 1.8 208.1 ± 2.4 +25.7% The in-benchmark readUleb128IntLoop control (unchanged code path) shows noise-only variation (-2.2% to 0.0%), confirming the measurement infrastructure is stable. Verification: - ./mvnw -pl lib/trino-parquet test: 67242 tests, 0 failures. - ./mvnw -pl lib/trino-parquet validate: checkstyle and modernizer clean. - ./mvnw -pl lib/trino-parquet -Perrorprone-compiler compile: clean. - TestParquetReaderUtilsBenchmarks (random-input roundtrip): pass. - TestRleBitPackingDecoderBenchmark (primary consumer): pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the fully-unrolled 5-branch
readUleb128Intwith a hybrid: straight-line paths for1- and 2-byte values, tight loop for 3- to 5-byte values. Independent-rerun measures a
+1.5% to +16.9% throughput improvement across four input configurations at 75-sample JMH
rigor, with no correctness regression and no measurable impact on the unchanged
readUleb128Longpath in the same file.What Changed
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderUtils.java— one method(
readUleb128Int) refactored. 23 insertions / 23 deletions. No public-signature changes,no new fields, no new imports. The prior comment "Manual loop unrolling shows improvements
in BenchmarkReadUleb128Int" is updated to reflect the new structure.
Why It Works
The prior implementation paid a deep 5-branch chain cost on every call, including the
common 1- and 2-byte cases. The hybrid preserves the 1-byte and 2-byte fast paths as
straight-line code (small dictionary indices, page/group counts, bit-width markers are
typically 1–2 byte ULEB128) and collapses the rarer 3-, 4-, 5-byte cases into a small
tight loop that the JIT layouts more favorably than a 5-deep branch chain.
JVM-level effects observed:
at
size=100(frequent exits, branch-prediction cost dominates).better than the prior 4th- and 5th-byte unrolled branches, which were partially cold.
Why It's Correct
(b0 & 0x80) == 0→ returnb0.b0is a signed Java byte, butbecause the high bit is zero, the widening conversion to
intgives0..127exactly.(b0 & 0x7F) | ((b1 & 0x7F) << 7)is the canonical 2-byte ULEB128decode.
shiftwalking 14 → 21 → 28 and exiting atshift < 35, so byte 5 is written at the correct bit position.verify((b & 0x80) == 0, …)on 5-byte overflow)preserved — the loop exits on
shift >= 35if byte 5 still has its high bit set, andthen the verify fires exactly as before.
public static, operates on caller-suppliedSimpleSliceInputStream. No shared state touched. No new fields.Verification commands:
./mvnw -pl lib/trino-parquet test→ 67,242 tests, 0 failures../mvnw -pl lib/trino-parquet validate→ checkstyle and modernizer clean../mvnw -pl lib/trino-parquet -Perrorprone-compiler compile→ clean.Benchmark Methodology
BenchmarkReadUleb128Int(JMH 1.37), unchanged. Dynamicrandomized inputs via
writeUnsignedVarInt(random.nextInt(maxValue)). Result sink with@CompilerControl(DONT_INLINE)(prevents DCE).99.9% confidence intervals (JMH default). 75 samples per row.
java -version: OpenJDK 25.0.3+9 LTS).-Xms2g -Xmx2g.Linux 6.17.0-1010-aws). No CPU pinning, no turbo-boostcontrol. This is called out in Risks below.
readUleb128IntLoopvariant — its source isNOT touched by this change; it is the in-harness stability indicator.
Results
Primary —
readUleb128Int(production method, the change target)5 forks × 15 iterations = 75 samples per row, 99.9% CI:
All four rows: 99.9% CIs non-overlapping between master and the change. The
(20000, 1000)case is borderline (gap ~0.5 ops/ms) — see Risks.Control —
readUleb128IntLoop(source identical on both branches)Same 75-sample rigor:
Noise band ≈ 1.2%. Measurement environment stable.
Regression check —
readUleb128Long(same file, source UNCHANGED)3 forks × 10 iterations = 30 samples, 99.9% CI. This verifies the change does not
accidentally disturb neighboring code layout or inlining decisions:
All four rows have overlapping CIs. No regression attributable to this change.
Reproduction
Callers / Impact Scope
ParquetReaderUtils.readUleb128Intis called from the Parquet reader's per-groupheader parsing and RLE/bit-packing hybrid decoder headers:
RleBitPackingHybridDecoder.readGroupHeaderand related call sites callreadUleb128Intonce per data group. At the group granularity, per-call savingscompound across a scan.
The method is NOT called per-row in the innermost decoding loops, so end-to-end
query-level impact is not claimed (and was not measured). The win shows up at the
micro-benchmark level and at the per-group-header level.
Risks and Limitations
pinning or turbo-boost control. The JMH warning about environment applies: re-running
at
-f 2 -wi 3 -i 3can show CI overlap on the(20k, 1000)row (gap is small).The
-f 5 -wi 10 -i 15rigor above is what clears the CIs.(20k, 1000)row is a borderline +1.5% win. At 75 samples the CIs do notoverlap, but a reviewer running shorter rigor will likely see noise there. The
change is NOT claimed to be a meaningful win on that specific config — the case is
reported for completeness.
per-group-header calling pattern means the amortized speedup over an actual Parquet
scan will be much smaller than the per-call micro-benchmark numbers.
readUleb128Longchange in this PR. Baseline on master shows the existingunrolled long decoder wins by 25–28% vs a naive loop on random long inputs; a
hybrid applied there would regress on this distribution. A future change via
word-reads +
Long.numberOfTrailingZerosover a0x8080_8080_8080_8080Lmask maybe tractable but is out of scope here.
-prof perfnormor-prof gc. No instruction-level orallocation evidence — the win is attributed to branch-chain depth reduction from
diff-reading and the throughput numbers, not from profiler output.
Test Plan
./mvnw -pl lib/trino-parquet test— 67,242 tests pass../mvnw -pl lib/trino-parquet validate— checkstyle + modernizer clean../mvnw -pl lib/trino-parquet -Perrorprone-compiler compile— clean.BenchmarkReadUleb128Intat 75-sample rigor, control stable, non-overlapping CIs.BenchmarkReadUleb128Longregression check — no regression.Disclosure
This change was drafted by a codeflash-agent autonomous optimization session and then
independently re-benchmarked by me before this PR was opened. The agent's initial
report headlined a
+25.7%figure on the(4M, 1000)config; independent rebench atthe same and higher rigor showed that number was a single-baseline JIT-state artifact
and the reproducible win on that config is +4.1%. The numbers in this PR are the
75-sample independently-reproduced figures, not the agent's original claim.