perf(operator): lookup table for Math.pow(10, n) in MathFunctions.round#3

Open
mashraf-222 wants to merge 1 commit into master from perf/mathfunctions-round-pow10-lookup

@mashraf-222
Collaborator

Summary

Replace the per-row Math.pow(10, decimals) call in MathFunctions.round(double, long) and
roundReal(long, long) with a 19-entry bit-exact lookup table (POWERS_OF_TEN_DOUBLE).
An independent rerun measures +26% to +387% throughput across 10 BenchmarkRoundFunction
configurations at 30-sample JMH rigor. All 99.9% CIs are non-overlapping. 51/51
TestMathFunctions tests pass. No regression in an unrelated scalar benchmark
(BenchmarkBigIntOperators, same module).

What Changed

  • core/trino-main/src/main/java/io/trino/operator/scalar/MathFunctions.java — one file,
    21 lines (19+ / 2−). No public signatures changed, no new imports.
    • New private static final double[] POWERS_OF_TEN_DOUBLE (19 entries: 10^0 .. 10^18).
    • New private static double powerOfTen(long decimals) helper: bounds-checked lookup
      with a Math.pow(10, decimals) fallback for out-of-range inputs.
    • Two call sites in round(double, long) and roundReal(long, long) swap
      Math.pow(10, decimals) → powerOfTen(decimals).
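
For readers without the diff open, the shape of the change can be sketched roughly as follows. This is a minimal standalone sketch, not the actual patch: the class name PowerOfTenSketch is hypothetical, and the body of round is an assumed, simplified stand-in for the real call site. Only the POWERS_OF_TEN_DOUBLE and powerOfTen names come from the description above.

```java
final class PowerOfTenSketch
{
    // 10^0 .. 10^18 (19 entries); each value is exactly representable as a double.
    private static final double[] POWERS_OF_TEN_DOUBLE = {
            1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
            1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18,
    };

    private PowerOfTenSketch() {}

    // Bounds-checked lookup; anything outside [0, 18] takes the old Math.pow path.
    static double powerOfTen(long decimals)
    {
        if (decimals >= 0 && decimals < POWERS_OF_TEN_DOUBLE.length) {
            return POWERS_OF_TEN_DOUBLE[(int) decimals];
        }
        return Math.pow(10, decimals);
    }

    // Assumed, simplified call site: the only difference from a Math.pow-based
    // version is where the scaling factor comes from.
    static double round(double num, long decimals)
    {
        if (Double.isNaN(num) || Double.isInfinite(num)) {
            return num;
        }
        double factor = powerOfTen(decimals);
        if (num < 0) {
            return -(Math.round(-num * factor) / factor);
        }
        return Math.round(num * factor) / factor;
    }

    public static void main(String[] args)
    {
        System.out.println(powerOfTen(3));       // lookup path
        System.out.println(powerOfTen(25));      // fallback path
        System.out.println(round(3.14159, 2));
    }
}
```

The range check runs on the long before the cast to int, so a very large or negative decimals value can never be truncated into a valid index.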

Why It Works

Math.pow(double, double) has IEEE-754 semantics and cannot be constant-folded by the
JIT when the exponent is a method parameter the compiler cannot prove is a compile-time
constant. In the SqlFunction dispatch path used by round(x, decimals), decimals is
supplied per call-site and the JIT therefore emits a real Math.pow call (an intrinsic
that routes through a slow path for non-special arguments). Replacing the call with a
19-entry bounds-checked array load removes that per-row cost entirely.

JVM-level effects observed:

  • The *Actual benchmarks (which go through MathFunctions.round) drop from
    11–46M ops/s to ~54–55M ops/s — flat across decimals 0..4 on the After side, which
    is the signature of the bottleneck having been fully removed (it no longer matters
    which decimal is used).
  • The *Baseline benchmarks (which just call Math.round directly, source unchanged)
    are stable within ±1.6% across branches — measurement environment is not drifting.

Why It's Correct

  • Bit-exact lookup values. Each literal POWERS_OF_TEN_DOUBLE[n] equals
    Math.pow(10, n) bit-for-bit via Double.doubleToRawLongBits for n in [0, 18].
    No precision change for any decimals in the lookup range.
  • Bounds-checked fallback. For decimals < 0 or decimals >= 19, control falls
    through to Math.pow(10, decimals) — preserving the exact prior behavior (including
    Double.POSITIVE_INFINITY / NaN / 0.0 edge cases).
  • Thread-safety. POWERS_OF_TEN_DOUBLE is static final and initialized at class
    load with literal values. powerOfTen is a pure static method with no shared state.
    Safe for concurrent invocation from multiple worker threads.
  • Allocation. Zero new allocations in the hot path (one array load replaces one
    intrinsic call).
  • Tests. ./mvnw -pl core/trino-main test -Dtest='TestMathFunctions' →
    51/51 pass, 0 failures. Covers integer/decimal/double/real rounding across
    positive, negative, zero, large, and out-of-range decimal arguments (i.e., the
    fallback path is exercised).
  • Style/static analysis. ./mvnw -pl core/trino-main validate → clean
    (checkstyle + modernizer). No wildcard imports, braces on single-statement
    conditionals, no @author.
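
The bit-exactness claim above can be re-checked with a few lines of standalone Java. The sketch below is illustrative and is not part of the PR's test suite; the class and method names are hypothetical, and the table is re-declared locally with decimal literals as an assumption about how the real one is written. The Math.pow Javadoc states that integer arguments give an exact result whenever that result is representable as a double, which is why no mismatch is expected for n in [0, 18].

```java
final class PowerOfTenBitExactCheck
{
    // Local copy of the table for the check (assumed to mirror the one in the PR).
    private static final double[] POWERS_OF_TEN_DOUBLE = {
            1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
            1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18,
    };

    private PowerOfTenBitExactCheck() {}

    // Returns the first index whose entry differs bitwise from Math.pow(10, n), or -1 if none does.
    static int firstMismatch()
    {
        for (int n = 0; n < POWERS_OF_TEN_DOUBLE.length; n++) {
            long expected = Double.doubleToRawLongBits(Math.pow(10, n));
            long actual = Double.doubleToRawLongBits(POWERS_OF_TEN_DOUBLE[n]);
            if (expected != actual) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args)
    {
        System.out.println("first mismatch index: " + firstMismatch());
        // Out-of-range exponents still go through Math.pow, so overflow and
        // underflow behave exactly as they did before the change.
        System.out.println(Math.pow(10, 400));   // Infinity
        System.out.println(Math.pow(10, -400));  // 0.0
    }
}
```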

Benchmark Methodology

  • Harness: project's own BenchmarkRoundFunction (JMH 1.37), unchanged. Inputs set per
    @Setup via Math.random(); each benchmark method uses the JMH Blackhole convention
    to prevent DCE.
  • Primary config: 3 forks × 8 warmup + 10 measurement × 500 ms, throughput (ops/s),
    99.9% confidence intervals (JMH default). 30 samples per row.
  • JVM: Temurin 25.0.3 (OpenJDK 64-Bit Server VM, 25.0.3+9-LTS).
  • JVM args: -Xms2g -Xmx2g.
  • Host: shared AWS VM (Linux 6.17.0-1010-aws). No CPU pinning, no turbo-boost
    control. See Risks.
  • Control: benchmark's own {double,float}Baseline variants — source NOT touched by
    this change; they are the in-harness stability indicator.
  • Regression check harness: BenchmarkBigIntOperators (same module, unrelated scalar
    ops) at @Fork(2) -wi 5 -i 8 -w 500ms -r 500ms.

Results

Primary — BenchmarkRoundFunction.{double,float}Actual (calls the change target)

Config (numberOfDecimals) Before (ops/s) After (ops/s) Change
doubleActual 0 18,022,279 ± 483,750 54,830,598 ± 1,547,356 +204%
doubleActual 1 11,345,867 ± 218,445 55,088,195 ± 855,819 +386%
doubleActual 2 43,587,610 ± 759,410 54,873,971 ± 917,210 +26%
doubleActual 3 11,335,843 ± 199,692 54,899,798 ± 844,525 +384%
doubleActual 4 11,391,682 ± 210,533 55,461,154 ± 864,121 +387%
floatActual 0 17,639,185 ± 388,151 53,292,111 ± 912,942 +202%
floatActual 1 11,070,728 ± 210,508 53,125,045 ± 1,322,295 +380%
floatActual 2 42,601,056 ± 1,193,689 54,797,856 ± 1,024,909 +29%
floatActual 3 11,215,648 ± 215,000 54,488,734 ± 926,626 +386%
floatActual 4 11,131,120 ± 215,785 54,183,499 ± 870,480 +387%

All 10 rows: 99.9% CIs non-overlapping. Worst-case speedup +26%; best-case +387%.
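
For anyone re-deriving the Change column or the non-overlap claim from their own JMH output, the arithmetic can be sketched as below. The class and method names are hypothetical; the sample figures are the doubleActual 0 and doubleActual 2 rows from the table above, with the ± value taken as the half-width of the 99.9% interval.

```java
final class ThroughputComparison
{
    private ThroughputComparison() {}

    // Relative throughput change in whole percent, e.g. 204 for a +204% gain.
    static long percentChange(double before, double after)
    {
        return Math.round((after / before - 1.0) * 100.0);
    }

    // True when [meanA - errA, meanA + errA] and [meanB - errB, meanB + errB] intersect.
    static boolean intervalsOverlap(double meanA, double errA, double meanB, double errB)
    {
        return meanA - errA <= meanB + errB && meanB - errB <= meanA + errA;
    }

    public static void main(String[] args)
    {
        // doubleActual, numberOfDecimals = 0
        System.out.println(percentChange(18_022_279, 54_830_598));                        // 204
        System.out.println(intervalsOverlap(18_022_279, 483_750, 54_830_598, 1_547_356)); // false

        // doubleActual, numberOfDecimals = 2 (the smallest gain)
        System.out.println(percentChange(43_587_610, 54_873_971));                        // 26
        System.out.println(intervalsOverlap(43_587_610, 759_410, 54_873_971, 917_210));   // false
    }
}
```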

Control — BenchmarkRoundFunction.{double,float}Baseline (source identical on both branches)

Config Master With change Delta
doubleBaseline 0 20,851,999 ± 469,610 21,052,936 ± 700,633 +0.96%
doubleBaseline 1 13,101,400 ± 334,404 13,178,739 ± 212,654 +0.59%
doubleBaseline 2 53,772,971 ± 906,966 52,998,318 ± 942,592 −1.44%
doubleBaseline 3 13,107,369 ± 317,916 13,197,799 ± 283,491 +0.69%
doubleBaseline 4 13,058,875 ± 311,891 13,266,783 ± 273,155 +1.59%
floatBaseline 0 20,732,754 ± 434,061 20,912,909 ± 535,604 +0.87%
floatBaseline 1 13,048,808 ± 227,922 12,895,425 ± 323,583 −1.18%

Noise band ≈ ±1.6%. Measurement environment stable.

Regression check — BenchmarkBigIntOperators (unrelated scalar, same module)

16 samples/row, 99.9% CI:

Benchmark Master With change Delta CI overlap?
baseLineAdd 193,988,819 ± 12,334,681 194,253,152 ± 9,231,953 +0.14% Yes
baseLineDivide 13,070,107 ± 656,337 13,181,312 ± 558,824 +0.85% Yes
baseLineMultiply 128,693,285 ± 3,132,121 129,809,987 ± 3,397,608 +0.87% Yes
baseLineNegate 350,956,869 ± 9,629,233 347,716,681 ± 13,075,593 −0.92% Yes
baseLineSubtract 188,562,963 ± 6,570,374 188,298,235 ± 6,216,503 −0.14% Yes
overflowChecksAdd 112,315,423 ± 5,760,297 113,182,751 ± 4,222,779 +0.77% Yes
overflowChecksDivide 12,987,295 ± 528,168 13,100,536 ± 629,216 +0.87% Yes
overflowChecksMultiply 70,903,932 ± 807,701 71,498,219 ± 1,267,001 +0.84% Yes
overflowChecksNegate 219,961,306 ± 5,327,069 227,940,001 ± 5,262,004 +3.63% Borderline (still within error bars)

8/9 within noise; for the one +3.63% outlier (an increase, not a slowdown) the 99.9% CIs
still overlap. No regression attributable to this change.

Reproduction

# One-time environment (Trino requires Temurin/Oracle JDK 25; Ubuntu OpenJDK is
# rejected by the airbase enforcer)
curl -fsSL -o /tmp/temurin25.tar.gz 'https://api.adoptium.net/v3/binary/latest/25/ga/linux/x64/jdk/hotspot/normal/eclipse'
sudo mkdir -p /opt/temurin-25 && sudo tar -xzf /tmp/temurin25.tar.gz -C /opt/temurin-25 --strip-components=1
export JAVA_HOME=/opt/temurin-25
export PATH="$JAVA_HOME/bin:$PATH"

# Parent pom (one-time)
./mvnw -N install -DskipTests

# Build classpath + test classes for baseline
git checkout master
./mvnw -pl core/trino-main test-compile -q
./mvnw -pl core/trino-main dependency:build-classpath -Dmdep.outputFile=/tmp/cp.txt -Dmdep.includeScope=test -q
CP="core/trino-main/target/test-classes:core/trino-main/target/classes:$(cat /tmp/cp.txt)"

# Baseline run (expect ~12 min wall time at this rigor)
java -cp "$CP" -Xms2g -Xmx2g org.openjdk.jmh.Main \
  "io.trino.operator.scalar.BenchmarkRoundFunction" \
  -f 3 -wi 8 -i 10 -w 500ms -r 500ms -rf json -rff /tmp/before.json

# With change
git checkout perf/mathfunctions-round-pow10-lookup     # or this PR's head
./mvnw -pl core/trino-main test-compile -q
java -cp "$CP" -Xms2g -Xmx2g org.openjdk.jmh.Main \
  "io.trino.operator.scalar.BenchmarkRoundFunction" \
  -f 3 -wi 8 -i 10 -w 500ms -r 500ms -rf json -rff /tmp/after.json

Callers / Impact Scope

MathFunctions.round(double, long) and roundReal(long, long) are the SQL
round(x, decimals) implementations for double and real types. They are called
once per row when a query uses round(col, n) with a runtime-resolved (or column-valued)
decimals argument. The speedup applies to every such row in a scan.

When decimals is a literal that the planner can bind at compile time, the SQL engine
may short-circuit to a different code path — not measured here. This PR's win is
concretely on the general-purpose round(x, decimals) evaluator; end-to-end query-level
impact on a specific workload would need its own measurement.

Risks and Limitations

  • Shared-VM measurement environment. Benchmarks were run on an AWS VM without CPU
    pinning or turbo-boost control. The per-config magnitude of the speedup (+26% to
    +387%) dwarfs the control noise (±1.6%), so the direction and tier are robust, but a
    reviewer running on a different host should expect the exact ratios to shift.
  • decimals 2 sees a smaller gain (+26% / +29%) because the master path for
    decimals == 2 is already fast: Math.pow(10, 2) == 100.0 appears to hit a cheap
    special case in HotSpot (not confirmed with a profiler). The lookup still wins.
  • decimals < 0 or ≥ 19 hit the fallback (same Math.pow as before). This is
    documented in the helper but not separately benchmarked; the fallback preserves prior
    behavior.
  • No end-to-end SQL query benchmark. Micro-benchmark evidence only.
  • No -prof perfnorm or -prof gc. Attribution is from the diff ("remove a
    Math.pow call") and throughput numbers, not from instruction-level profiler output.

Test Plan

  • ./mvnw -pl core/trino-main test -Dtest='TestMathFunctions' → 51/51 pass, 0
    failures.
  • ./mvnw -pl core/trino-main validate → checkstyle + modernizer clean.
  • BenchmarkRoundFunction at 30-sample rigor, control stable, non-overlapping CIs.
  • BenchmarkBigIntOperators unrelated-scalar regression check — no regression.
  • Reviewer reproducing the benchmark per the Reproduction section (not run on CI).

Disclosure

This change was drafted by a codeflash-agent autonomous optimization session and then
independently re-benchmarked before this PR was opened. The agent's reported speedups
(1.24×–5.03×) match the reviewer's reproduction (1.26×–4.87×) within 5% row-by-row, so
the numbers in this PR are the reviewer's 30-sample figures presented directly —
consistent with the agent's own report.

…athFunctions.round

For double/real round(num, decimals), decimals is typically in [0, 18] but Math.pow(10, decimals)
must be called per row because JIT cannot prove the argument is a compile-time constant.

Precompute 10^n for n in [0, 18] as a static double[] and read via bounds-checked index. The lookup
values are bit-exact matches of Math.pow(10, n) (verified via doubleToRawLongBits), so the behavior
is unchanged. Negative or out-of-range decimals fall through to Math.pow(10, decimals).

JMH BenchmarkRoundFunction (2 forks x 5 warmup x 10 measurement iterations, 500ms each,
two independent baseline and optimized runs to rule out JIT artifact):

  decimals  baseline       optimized      speedup
  double 0  18.8M ops/s    57.7M ops/s    3.07x
  double 1  11.5M ops/s    57.8M ops/s    5.03x
  double 2  46.0M ops/s    58.2M ops/s    1.26x
  double 3  11.7M ops/s    57.0M ops/s    4.87x
  double 4  11.8M ops/s    58.3M ops/s    4.94x
  float  0  18.6M ops/s    57.2M ops/s    3.07x
  float  1  11.8M ops/s    57.5M ops/s    4.87x
  float  2  44.9M ops/s    57.1M ops/s    1.27x
  float  3  11.8M ops/s    57.2M ops/s    4.85x
  float  4  11.7M ops/s    56.8M ops/s    4.85x

99% confidence intervals do not overlap. All 51 TestMathFunctions tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>