perf(clients): dense array lookup for Errors.forCode #2

Open
mashraf-222 wants to merge 6 commits into trunk from
codeflash/errors-forcode-dense-array-20260507-pr

Conversation

@mashraf-222
Collaborator

perf(clients): dense array lookup for Errors.forCode

Summary: Replace HashMap<Short, Errors> with dense Errors[]
for Errors.forCode. Per-call cost −70% to −80% on 4/4 JMH
configurations with non-overlapping 95% CIs. 279 non-test callers.
End-to-end impact not claimed (Amdahl-bounded below single-box noise floor).

Replaces the HashMap<Short, Errors> lookup table inside
org.apache.kafka.common.protocol.Errors.forCode(short) with a
dense Errors[] indexed by (code − MIN_CODE). The change is
backed by 3 equivalence tests added in this PR plus the 7 existing
ErrorsTest cases (10 total, all pass), a JMH benchmark
(ErrorsForCodeBenchmark) whose four input-distribution
configurations each show a −70% to −80% per-call reduction with
non-overlapping 95% CIs, and a post-refactor profile showing the
method drops out of broker and consumer CPU samples. This PR is
scoped as LIBRARY PRIMITIVE: 279 non-test callers in the
codebase. End-to-end throughput impact was measured at N=5 on a
single-box lz4 workload and is inconclusive due to CI overlap,
as the Amdahl ceiling (~0.32% consumer / ~0.11% broker) sits well
below this hardware's single-box noise floor.

Tier and claim

LIBRARY PRIMITIVE, Errors.forCode(short): per-call cost
reduced by −70% to −80% (non-overlap 95% CIs, JMH with 2 forks ×
10 measurement × 5 s + 5 × 5 s warmup per fork) across 4 input
distributions. 279 non-test downstream callers in the codebase.
End-to-end measured inconclusive at N=5 on lz4 1024-byte
reference workload; CIs overlap (see "End-to-end" section).

What changed

  • clients/src/main/java/org/apache/kafka/common/protocol/Errors.java:425-439, 526-539
    — replaces Map<Short, Errors> CODE_TO_ERROR = new HashMap<>()
    with dense static final Errors[] CODE_TO_ERROR_ARR plus
    MIN_CODE/MAX_CODE short fields; rewrites forCode(short) to
    bounds-check then single array load, fallback unchanged.
  • clients/src/test/java/org/apache/kafka/common/protocol/ErrorsTest.java:91-138
    — adds 3 new tests covering round-trip identity, out-of-range
    fallback, and forCode(forCode(c).code()) == forCode(c) idempotence.
  • jmh-benchmarks/src/main/java/org/apache/kafka/jmh/common/ErrorsForCodeBenchmark.java
    — new JMH benchmark, four @Param distributions (allZero,
    small, mixed, miss), unrolled 4× per op, Blackhole-
    consumed.
  • codeflash-evidence/20260507-cand3-errors-forcode-delta.md,
    codeflash-evidence/20260507-cand3-errors-forcode-e2e.md:
    per-candidate evidence files committed alongside the code (an
    additional correction commit fixes an arithmetic error in the
    initial evidence: N=5 half-widths now use the correct t=2.776
    (df=4), raw data rows match the session's txt files, and caller
    count is 279 as re-verified on this tree).

Public API: unchanged. Errors.forCode(short) signature,
return type, and observable behaviour on every valid and invalid
input are preserved.

On-disk / on-wire formats: unchanged.

JVM floor: unchanged (clients module continues to target Java 11).

Why it works

Commit f8cba68 (refactor: Errors.forCode dense array lookup)

Mechanism. The old forCode path does
CODE_TO_ERROR.get(Short.valueOf(code)). This executes:

  1. Box the primitive short to a Short via Short.valueOf. Values in
    −128..127 come from the Short cache, and most Kafka codes fall in
    this range, so this step usually allocates nothing, but the cache
    lookup still costs a load.
  2. HashMap.hash(key): a hashCode() call on the Short key.
  3. HashMap.getNode(hash, key): table indexing, bucket probe,
    key.equals(probe.key) on each probe.
  4. Short.equals dispatch through the key object's vtable
    (Short is final so this is effectively monomorphic, but the
    JVM still emits the check).

The new path is a primitive bounds check plus one array load:

if (code >= MIN_CODE && code <= MAX_CODE) {
    Errors error = CODE_TO_ERROR_ARR[code - MIN_CODE];
    if (error != null) return error;
}
// fallback log.warn + UNKNOWN_SERVER_ERROR

The JMH numbers show allZero 11.854 ns/op → 2.941 ns/op per 4-call
unroll, i.e. ~2.96 ns/call → ~0.74 ns/call. Assuming a 3 GHz clock,
those figures correspond to roughly 9 cycles of chained loads and
branches for the baseline and roughly 2 cycles for the feature's
masked-compare bounds check plus an L1-resident array load. The cycle
estimates are derived from the measured ns/op and are not an
independent measurement.

Post-refactor profile (profiles/post-refactor/cand3/DELTA.md):
on a 30-second CPU sample of both broker and consumer under the
reference workload, Errors.forCode went from 5/1159 = 0.43%
consumer samples and 2/3295 = 0.06% broker samples down to 0
samples on both JVMs. At the default 9.9 ms sampling interval, a
≤1 ns call site has a vanishing probability of being sampled at
all — consistent with the observed JMH reduction.

Why it's correct.

  • Public API: Errors.forCode(short) signature and return type unchanged.
  • Output for every valid input: the static initializer iterates
    Errors.values() and stores each enum at index code − MIN_CODE,
    so for any enum e, forCode(e.code()) returns e. This is a
    bijection on the defined range.
  • Output for every invalid input: codes outside [MIN_CODE, MAX_CODE] short-circuit to the fallback path (log.warn("Unexpected error code: {}.", code) + return UNKNOWN_SERVER_ERROR), exactly
    matching the prior behaviour. Codes inside the range that don't
    map to an enum leave a null slot; forCode detects the null
    and falls through to the same fallback.
  • Duplicate-code guard: the static initializer still throws
    ExceptionInInitializerError if two enum entries would land at
    the same array index, preserving the pre-existing guard.
  • Log behaviour: the fallback log.warn(...) message and
    returned UNKNOWN_SERVER_ERROR sentinel are byte-identical to
    before.
  • Exception identity / null behaviour / thread safety / determinism: unchanged.
    forCode is a pure read from a static final reference; no
    synchronisation, no ordering observable to callers.
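
The invariants listed above can be illustrated with a toy enum. This is a minimal sketch under stated assumptions, not the code in Errors.java: DemoError, its four constants, and the gap at code 2 are hypothetical, and the fallback here simply returns UNKNOWN without the log.warn call.

```java
// Toy stand-in for Errors; names and codes are hypothetical.
enum DemoError {
    UNKNOWN((short) -1),
    NONE((short) 0),
    OFFSET_OUT_OF_RANGE((short) 1),
    // Code 2 is deliberately unused so the table contains a null gap slot.
    TIMEOUT((short) 3);

    private static final short MIN_CODE;
    private static final short MAX_CODE;
    private static final DemoError[] CODE_TO_ERROR_ARR;

    static {
        // Pass 1: find the smallest and largest defined code.
        short min = Short.MAX_VALUE;
        short max = Short.MIN_VALUE;
        for (DemoError e : values()) {
            if (e.code < min) min = e.code;
            if (e.code > max) max = e.code;
        }
        MIN_CODE = min;
        MAX_CODE = max;

        // Pass 2: place each constant at index (code - MIN_CODE),
        // rejecting two constants that would share a slot.
        DemoError[] table = new DemoError[max - min + 1];
        for (DemoError e : values()) {
            int idx = e.code - min;
            if (table[idx] != null) {
                throw new ExceptionInInitializerError(
                    "Code " + e.code + " is used by both " + table[idx] + " and " + e);
            }
            table[idx] = e;
        }
        CODE_TO_ERROR_ARR = table;
    }

    private final short code;

    DemoError(short code) {
        this.code = code;
    }

    short code() {
        return code;
    }

    static DemoError forCode(short code) {
        // Bounds check, then a single array load.
        if (code >= MIN_CODE && code <= MAX_CODE) {
            DemoError error = CODE_TO_ERROR_ARR[code - MIN_CODE];
            if (error != null) return error;
        }
        // Out-of-range codes and gap slots share one fallback.
        return UNKNOWN;
    }
}
```

In this sketch forCode((short) 2) hits a null slot and forCode(Short.MAX_VALUE) fails the bounds check; both end at the same fallback, which mirrors the two invalid-input cases described in the list.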

Tests: clients/src/test/java/org/apache/kafka/common/protocol/ErrorsTest.java
now has 10 @Test methods (7 prior + 3 new). The three new tests
cover the contract this refactor touches:

  • testForCodeReturnsMatchingForEveryDefinedCode: iterates every
    Errors.values() and asserts forCode(e.code()) == e.
  • testForCodeReturnsUnknownServerErrorForOutOfRangeCodes:
    probes Short.MIN_VALUE, Short.MAX_VALUE, and just-past-edge
    values, asserting fallback identity.
  • testForCodeIsIdentityInverseOfCode: asserts
    forCode(forCode(c).code()) == forCode(c).

All 10 tests pass on the refactor commit (output at
/home/ubuntu/code/kafka/clients/build/test-results/test/TEST-org.apache.kafka.common.protocol.ErrorsTest.xml).
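
Because the input domain is only 65,536 short values, a map-based and an array-based lookup can also be compared exhaustively. The sketch below is a complementary check, not one of the three tests in this PR; the code table, names, and class are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LookupEquivalenceDemo {
    // Hypothetical code table with gaps at 2 and 4..6.
    static final short[] CODES = {-1, 0, 1, 3, 7};
    static final String[] NAMES = {"UNKNOWN", "NONE", "A", "B", "C"};
    static final String FALLBACK = "UNKNOWN";

    static final short MIN = -1;
    static final short MAX = 7;
    static final Map<Short, String> BY_MAP = new HashMap<>();
    static final String[] BY_ARRAY = new String[MAX - MIN + 1];

    static {
        for (int i = 0; i < CODES.length; i++) {
            BY_MAP.put(CODES[i], NAMES[i]);
            BY_ARRAY[CODES[i] - MIN] = NAMES[i];
        }
    }

    // Old-style lookup: boxes the short and probes the HashMap.
    static String viaMap(short code) {
        String s = BY_MAP.get(code);
        return s != null ? s : FALLBACK;
    }

    // New-style lookup: bounds check plus array load.
    static String viaArray(short code) {
        if (code >= MIN && code <= MAX) {
            String s = BY_ARRAY[code - MIN];
            if (s != null) return s;
        }
        return FALLBACK;
    }

    // Walks every possible short and counts disagreements.
    static int countMismatches() {
        int mismatches = 0;
        for (int c = Short.MIN_VALUE; c <= Short.MAX_VALUE; c++) {
            short code = (short) c;
            if (!viaMap(code).equals(viaArray(code))) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

The loop variable is an int so that the iteration terminates at Short.MAX_VALUE instead of wrapping around.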

Benchmark

JMH methodology

  • JMH version: 1.37 (inside kafka-jmh-benchmarks-4.4.0-SNAPSHOT shadow jar)
  • JVM: OpenJDK 25.0.2 (Ubuntu build 25.0.2+10-Ubuntu-124.04)
  • Host: 4 vCPU AWS (x86_64), Ubuntu 24.04
  • Config flags: -f 2 -wi 5 -w 5 -i 10 -r 5 — 2 forks × (5 × 5 s warmup + 10 × 5 s measurement)
  • Benchmark: org.apache.kafka.jmh.common.ErrorsForCodeBenchmark.forCode
  • Harness specifics:
    • Four Errors.forCode(short) results per measured op (unrolled 4×), consumed via Blackhole to defeat dead-code elimination.
    • Codes drawn from a 4096-entry preloaded shuffled buffer populated at @Setup (seeded Random(0xC0DEC0DEL)) so per-call cost is dominated by the lookup itself and no loop-inside-loop constant folding is possible.
    • Four input distributions via @Param:
      • allZero — every call is NONE (code 0); the common happy path.
      • small — uniformly random over Errors.values().
      • mixed — 80% NONE / 10% UNKNOWN_SERVER_ERROR / 10% random other (realistic broker shape).
      • miss — codes outside the defined range; worst-case fallback.
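
One way the four input distributions could be materialised into a preloaded buffer is sketched below in plain Java, without the JMH annotations. The buffer size and seed come from the description above; everything else is an assumption made for illustration: the class and method names, treating [−1, 133] as fully populated, and drawing miss codes from just above the range.

```java
import java.util.Random;

final class CodeBufferDemo {
    static final int BUFFER_SIZE = 4096;
    static final long SEED = 0xC0DEC0DEL;
    // Assumed code range; the real enum may have gaps inside it.
    static final int MIN_CODE = -1;
    static final int MAX_CODE = 133;

    // Uniform draw over the assumed defined range.
    static short randomDefined(Random rnd) {
        return (short) (MIN_CODE + rnd.nextInt(MAX_CODE - MIN_CODE + 1));
    }

    // Produces one code for the named distribution.
    static short next(String distribution, Random rnd) {
        switch (distribution) {
            case "allZero":
                return 0;
            case "small":
                return randomDefined(rnd);
            case "mixed": {
                int roll = rnd.nextInt(10);
                if (roll < 8) return 0;    // about 80% NONE
                if (roll == 8) return -1;  // about 10% UNKNOWN_SERVER_ERROR
                return randomDefined(rnd); // about 10% random other
            }
            case "miss":
                // Hypothetical choice: codes strictly above MAX_CODE.
                return (short) (MAX_CODE + 1 + rnd.nextInt(1000));
            default:
                throw new IllegalArgumentException("unknown distribution: " + distribution);
        }
    }

    // Fills the buffer once, as a setup method would, from a fixed seed.
    static short[] build(String distribution) {
        Random rnd = new Random(SEED);
        short[] buf = new short[BUFFER_SIZE];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = next(distribution, rnd);
        }
        return buf;
    }
}
```

Seeding the generator makes every fork see the same code sequence, so run-to-run differences come from the lookup and not from the inputs.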

JMH results

| distribution | baseline ns/op (4 calls) | feature ns/op (4 calls) | Δ% | CI overlap |
|---|---|---|---|---|
| allZero | 11.854 ± 0.299 | 2.941 ± 0.059 | −75.19% | NO |
| small | 9.947 ± 0.358 | 2.993 ± 0.057 | −69.91% | NO |
| mixed | 11.647 ± 0.162 | 3.005 ± 0.038 | −74.20% | NO |
| miss | 9.722 ± 0.153 | 1.959 ± 0.025 | −79.85% | NO |

Source: codeflash-evidence/20260507-cand3-errors-forcode-delta.md
(committed in this PR).

All four configurations have non-overlapping 95% CIs: for every row,
base_mean − base_err > feat_mean + feat_err or feat_mean − feat_err > base_mean + base_err
holds, so the overlap predicate (the negation of that expression) is false.
All scoreError/score ratios are below 5% (max = 3.60% on baseline small).
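
The Δ% column and the overlap verdicts can be recomputed from the table values with a few lines of Java. This is a small sketch for checking the arithmetic; the class and method names are hypothetical, and the numbers are copied from the results table.

```java
final class JmhDeltaCheck {
    // True when the intervals [mean - err, mean + err] intersect.
    static boolean overlaps(double baseMean, double baseErr, double featMean, double featErr) {
        boolean baseAbove = baseMean - baseErr > featMean + featErr;
        boolean featAbove = featMean - featErr > baseMean + baseErr;
        return !(baseAbove || featAbove);
    }

    // Relative change of the feature score against the baseline, in percent.
    static double deltaPercent(double base, double feat) {
        return (feat - base) / base * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"allZero", "small", "mixed", "miss"};
        // Columns: baseline mean, baseline error, feature mean, feature error.
        double[][] rows = {
            {11.854, 0.299, 2.941, 0.059},
            {9.947, 0.358, 2.993, 0.057},
            {11.647, 0.162, 3.005, 0.038},
            {9.722, 0.153, 1.959, 0.025},
        };
        for (int i = 0; i < rows.length; i++) {
            double[] r = rows[i];
            System.out.printf("%-8s delta=%.2f%% overlap=%b%n",
                names[i], deltaPercent(r[0], r[2]), overlaps(r[0], r[1], r[2], r[3]));
        }
    }
}
```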

End-to-end

E2E was measured at N=5 interleaved on a single-box KRaft broker
(4 vCPU, 1 GB heap, G1GC default, broker + producer + consumer on
the same host — CPU-contended), lz4 compression, 5 M × 1024 B
records. Full protocol and raw runs are in
codeflash-evidence/20260507-cand3-errors-forcode-e2e.md (committed
in this PR) and the raw per-run txt files under the session directory.

Half-widths below are 95% CIs via t-distribution with df=4 (t=2.776)
at N=5, recomputed directly from the raw records/sec values.

| metric | trunk (N=5) mean ± hw | feature (N=5) mean ± hw | Δ% | CI overlap |
|---|---|---|---|---|
| Producer records/sec | 188,153 ± 18,254 | 190,046 ± 12,252 | +1.01% | YES |
| Consumer records/sec | 267,182 ± 57,828 | 240,174 ± 62,266 | −10.11% | YES |

Source for all E2E numbers above: raw per-run records/sec files cited in
codeflash-evidence/20260507-cand3-errors-forcode-e2e.md. Means and
half-widths were recomputed from the raw data with t=2.776 (df=4, N=5)
and verified in this PR's integrity check.

E2E measurement inconclusive at N=5; 95% CIs overlap on both
producer and consumer.
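
For reference, a 95% half-width at N=5 is t × s / √N with t = 2.776 (df = 4), where s is the sample standard deviation. The sketch below shows that calculation; the class name is hypothetical and the five sample values are made up for illustration, not taken from the raw run files.

```java
final class HalfWidthDemo {
    // Two-sided 95% critical value of Student's t for df = 4 (N = 5).
    static final double T_95_DF4 = 2.776;

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by N - 1).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // Half-width of the confidence interval around the mean.
    static double halfWidth(double[] xs, double t) {
        return t * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Hypothetical records/sec values, NOT the PR's raw data.
        double[] runs = {100.0, 102.0, 98.0, 101.0, 99.0};
        System.out.printf("mean=%.3f hw=%.3f%n", mean(runs), halfWidth(runs, T_95_DF4));
    }
}
```

Using the df = 3 value (3.182) in place of 2.776 inflates every half-width by the same factor, 3.182 / 2.776 ≈ 1.146, which is why a t-value mix-up changes the interval widths but leaves the means untouched.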

This PR is not claiming an E2E throughput improvement. By Amdahl,
the measured Errors.forCode share of CPU on this workload was
~0.43% cumulative consumer and ~0.15% broker (reported in
codeflash-evidence/20260507-cand3-errors-forcode-delta.md under the
Amdahl section); a 75% reduction there caps the theoretical E2E at
~0.32% consumer / ~0.11% broker. N=5 on this single-box setup has
half-widths of 6.5–25.9%, so the Amdahl-predicted signal is one to two
orders of magnitude below the noise floor and cannot be distinguished.
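
The ceiling arithmetic can be written out as a short sketch. The profile shares (0.43% and 0.15%) and the 75% reduction are the figures quoted above; the class and method names are hypothetical.

```java
final class AmdahlCeiling {
    // Fraction of total CPU time removed when a method holding share p
    // of the time gets cheaper by reduction r (both given as fractions).
    static double timeSavedFraction(double p, double r) {
        return p * r;
    }

    // Matching best-case throughput gain if all saved time became throughput.
    static double throughputGain(double p, double r) {
        return 1.0 / (1.0 - p * r) - 1.0;
    }

    public static void main(String[] args) {
        double reduction = 0.75;
        System.out.printf("consumer: saved=%.4f%% gain=%.4f%%%n",
            100 * timeSavedFraction(0.0043, reduction), 100 * throughputGain(0.0043, reduction));
        System.out.printf("broker:   saved=%.4f%% gain=%.4f%%%n",
            100 * timeSavedFraction(0.0015, reduction), 100 * throughputGain(0.0015, reduction));
    }
}
```

At shares this small the saved-time fraction and the throughput gain agree to within a few thousandths of a percentage point, so quoting either as the ceiling gives the same conclusion.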

The consumer point estimate of −10.11% is directional-negative
inside a wide noise band. No mechanism for a real consumer
regression is plausible (the refactor is strictly a same-inputs
same-outputs faster lookup), and the trunk/feature raw sequences
show the variance is driven by single-box CPU-contention jitter,
not by the refactor. The evidence file
codeflash-evidence/20260507-cand3-errors-forcode-e2e.md discusses
this in more detail under "Consumer (−10.11%, CI overlap)".

Callers and impact scope

Errors.forCode(short)

Verified non-test caller count: 279 Java call sites outside
/test/ and outside /jmh-benchmarks/.

rg -n "Errors\.forCode" --type java | grep -v '/test/' | grep -v '/jmh-benchmarks/' | wc -l
# → 279

The method sits on virtually every response-parsing path across
clients, broker, KRaft, and coordinators. Hot-path examples
verified by direct rg -n:

  • raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java:953, 973, 1170, 1182, 1346, 1358 (KRaft vote/fetch response parsing; 14 total in this file)
  • clients/src/main/java/org/apache/kafka/clients/NetworkClient.java:1029 (ApiVersions response parsing)
  • clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractFetch.java:208 (per-partition fetch error translation)
  • clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java:94, 134 (fetch response top-level + partition error codes)
  • clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java:1373, 1523, 1554 (offset-commit + group join error paths)
  • clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerHeartbeatRequestManager.java:188
  • core/src/main/scala/kafka/server/KafkaApis.scala:1575, 4162, 4228 (broker API dispatch)
  • transaction-coordinator/src/main/java/org/apache/kafka/coordinator/transaction/RPCProducerIdManager.java:172
  • group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java:846, 1601, 2026

Reproduction

Each command is copy-paste runnable on a machine with JDK 25 and
Gradle installed. Wall-time estimates assume a 4 vCPU host; the JMH
run time is fixed by the fork and iteration settings, while the Gradle
steps speed up sub-linearly with more cores.

# Clone and prepare (~1 minute).
git clone https://github.com/codeflash-ai/kafka.git
cd kafka
git fetch origin codeflash/errors-forcode-dense-array-20260507-pr
BASE=94b6886b12bf050ee4075584293355923e0223ac   # trunk
HEAD=67dde51                                     # PR tip

# JMH baseline (~11 minutes: 4 configs × 2 forks × (5+10)×5 s).
git checkout $BASE
# The benchmark class is new on this PR branch. To measure baseline,
# copy ONLY the benchmark file from HEAD while keeping trunk's Errors.java:
git checkout $HEAD -- jmh-benchmarks/src/main/java/org/apache/kafka/jmh/common/ErrorsForCodeBenchmark.java
./gradlew :jmh-benchmarks:clean :jmh-benchmarks:shadowJar
java -jar jmh-benchmarks/build/libs/kafka-jmh-benchmarks-*-all.jar \
  ".*ErrorsForCodeBenchmark.*" \
  -f 2 -wi 5 -w 5 -i 10 -r 5 \
  -rf json -rff /tmp/baseline.json
git checkout -- .  # restore clean trunk

# JMH feature (~11 minutes).
git checkout $HEAD
./gradlew :jmh-benchmarks:clean :jmh-benchmarks:shadowJar
java -jar jmh-benchmarks/build/libs/kafka-jmh-benchmarks-*-all.jar \
  ".*ErrorsForCodeBenchmark.*" \
  -f 2 -wi 5 -w 5 -i 10 -r 5 \
  -rf json -rff /tmp/feature.json

# Tests (~1 minute).
./gradlew :clients:checkstyleMain :clients:checkstyleTest :clients:spotbugsMain \
          :clients:test --tests "*ErrorsTest*"

E2E reproduction (optional; results will be noise-dominated at N=5
on a single-box setup, as the evidence above documents):

# Fresh KRaft broker on the trunk jar (baseline) or the feature jar:
git checkout $BASE   # or $HEAD
./gradlew :clients:jar :core:jar
rm -rf /tmp/kraft-combined-logs
UUID=$(bin/kafka-storage.sh random-uuid)
bin/kafka-storage.sh format -t $UUID -c config/server.properties --standalone
bin/kafka-server-start.sh config/server.properties &

bin/kafka-topics.sh --create --topic t --partitions 8 --replication-factor 1 \
    --bootstrap-server localhost:9092

# 500k warm-up producer run
bin/kafka-producer-perf-test.sh --topic t --num-records 500000 --record-size 1024 \
    --throughput -1 \
    --producer-props acks=1 linger.ms=5 batch.size=65536 \
      compression.type=lz4 buffer.memory=134217728 \
      bootstrap.servers=localhost:9092

# 5M-record measurement run
bin/kafka-producer-perf-test.sh --topic t --num-records 5000000 --record-size 1024 \
    --throughput -1 \
    --producer-props acks=1 linger.ms=5 batch.size=65536 \
      compression.type=lz4 buffer.memory=134217728 \
      bootstrap.servers=localhost:9092

# Consumer read (fresh group per run)
bin/kafka-consumer-perf-test.sh --topic t --messages 5000000 \
    --bootstrap-server localhost:9092 \
    --group e2e-$(date +%s%N) --timeout 600000

Repeat N=5 times per side, interleaved A-B-A-B-A-B-A-B-A-B.

Risks and limitations

  • Static initialiser work at class-load. The initialiser now
    iterates Errors.values() twice (once to compute min/max, once
    to populate the array) instead of once. This is ~270 array loads
    at class-load, run exactly once per Errors classloader — an
    irrelevant one-off cost.
  • Memory. Adds one Errors[] of size MAX_CODE − MIN_CODE + 1
    to the Errors class's static state. At the current range
    [−1, 133] this is 135 reference slots, roughly 0.5 KB with
    4-byte compressed references (135 × 4 B = 540 B plus the
    array header). Retained for the life of the JVM. The prior
    HashMap it replaces held the same entries with larger
    per-entry overhead (HashMap.Node + autoboxed Short per
    key), so this is a small net reduction.
  • Future code range. If a future Kafka release adds an error
    code outside the current [−1, 133] range, the array grows to
    accommodate it, with any gap slots left null and resolved via
    the out-of-range fallback path (same behaviour as today). There
    is no hard upper bound enforced at code time, so in the worst
    case a code like Short.MAX_VALUE would allocate a 32 k-slot
    array (~128 KB). This is an eventuality worth noting; if it
    becomes a concern, the fallback could be reinstated as a
    HashMap lookup for codes beyond a hard cap, preserving the
    dense array for the common range.
  • Thread safety. The array reference is static final,
    populated in the class initialiser, read without
    synchronisation. This is the same pattern as the prior
    HashMap field (static final Map) and is safe for concurrent
    reads.
  • Measurement caveats. Measured on OpenJDK 25.0.2 on a 4-vCPU
    AWS host. E2E was measured on the same host with broker +
    producer + consumer sharing CPU (see the End-to-end section);
    N=5 is insufficient to detect a sub-1% Amdahl-predicted E2E
    signal, and the half-widths in the E2E table reflect this. JMH
    results on non-HotSpot JVMs have not been measured.
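
The mitigation mentioned under "Future code range" (a dense array up to a hard cap, with a HashMap beyond it) could look roughly like the sketch below. It is not part of this PR; the class, the cap value of 255, and the use of String values in place of the enum are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CappedLookupDemo {
    static final int MIN_CODE = -1;
    // Hypothetical hard cap on the densely indexed range.
    static final int DENSE_CAP = 255;

    private final String[] dense = new String[DENSE_CAP - MIN_CODE + 1];
    private final Map<Short, String> sparse = new HashMap<>();

    // Registers a code, rejecting duplicates in either structure.
    void register(short code, String name) {
        if (code < MIN_CODE) {
            throw new IllegalArgumentException("code below MIN_CODE: " + code);
        }
        String previous;
        if (code <= DENSE_CAP) {
            previous = dense[code - MIN_CODE];
            dense[code - MIN_CODE] = name;
        } else {
            previous = sparse.put(code, name);
        }
        if (previous != null) {
            throw new IllegalStateException("duplicate code: " + code);
        }
    }

    // Fast path for the common range, map lookup only for far-out codes.
    String lookup(short code, String fallback) {
        if (code >= MIN_CODE && code <= DENSE_CAP) {
            String s = dense[code - MIN_CODE];
            return s != null ? s : fallback;
        }
        String s = sparse.get(code);
        return s != null ? s : fallback;
    }
}
```

With this shape the array size is bounded by the cap regardless of how large a future code is, and only lookups above the cap pay the boxing and hashing cost.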

Test plan

  • ./gradlew :clients:test --tests "*ErrorsTest*" — 10 pass, 0 fail (10 total: 7 prior + 3 new in this PR)
  • ./gradlew :clients:checkstyleMain :clients:checkstyleTest — pass
  • ./gradlew :clients:spotbugsMain — pass
  • JMH microbench — raw JSON in codeflash-evidence/20260507-cand3-errors-forcode-delta.md with backing JSON files retained in the session directory
  • E2E — raw runs in codeflash-evidence/20260507-cand3-errors-forcode-e2e.md (N=5 single-box; result inconclusive as documented)

Target repo

This PR targets the fork codeflash-ai/kafka on the trunk branch,
NOT upstream apache/kafka. Opening an upstream PR would be an
explicit separate decision by the user.

Adds three tests that define the behavioral contract of Errors.forCode:

- testForCodeReturnsMatchingForEveryDefinedCode asserts round-trip identity
  for every code present in Errors.values(): forCode(e.code()) == e.

- testForCodeReturnsUnknownServerErrorForOutOfRangeCodes probes codes
  outside the defined range (including Short.MIN/MAX and just-past-edge
  values) and asserts the fallback to UNKNOWN_SERVER_ERROR. Codes that
  happen to be defined are skipped so the test stays robust as new
  errors are added.

- testForCodeIsIdentityInverseOfCode asserts forCode composed with
  Errors#code is idempotent.

These are behavior-preserving tests against the existing HashMap-based
implementation; they lock the contract before any refactor of the
lookup path.

Replaces the HashMap<Short, Errors> CODE_TO_ERROR with a dense Errors[]
indexed by (code - MIN_CODE). Error codes are a near-contiguous short
range (currently -1 to 133); MIN_CODE=-1 so UNKNOWN_SERVER_ERROR lands at
index 0. Any gaps in the range remain null in the array, and forCode
bounds-checks before indexing, then falls back to UNKNOWN_SERVER_ERROR
with the same log.warn behaviour as the prior implementation.

Why it works
- Removes the HashMap.get() + HashMap.getNode() + Short.equals() chain
  observed in consumer profiles (cumulative ~0.4% of consumer CPU and
  ~0.15% of broker CPU on an lz4 reference workload).
- Removes the per-call Short boxing on the HashMap.get path (a
  Short.valueOf cache lookup for codes in -128..127, and a fresh
  allocation for codes outside it unless the JIT eliminates it); the
  array path is allocation-free.
- The array entries live together in a small contiguous region of memory
  (roughly 0.5 KB at current MAX_CODE=133 with 4-byte compressed
  references), so hot calls hit L1.

Why it's correct
- For every Errors enum value e, the new forCode(e.code()) returns e
  (indexing by (code - MIN_CODE) is a bijection on the defined range
  and the static initializer places each enum at its own index, retaining
  the original duplicate-code check).
- For codes outside [MIN_CODE, MAX_CODE] the code short-circuits to the
  log.warn + UNKNOWN_SERVER_ERROR fallback, preserving the prior
  behaviour for unknown codes.
- For codes inside the range that do not correspond to any defined enum,
  the array slot is null and the same fallback runs.
- ExceptionInInitializerError still fires on duplicate codes (Errors
  loaded under the same ClassLoader) because we now guard against a
  pre-existing non-null slot at the same index.

Contract verified by the equivalence tests committed in the previous
commit: testForCodeReturnsMatchingForEveryDefinedCode,
testForCodeReturnsUnknownServerErrorForOutOfRangeCodes, and
testForCodeIsIdentityInverseOfCode.

No behaviour change beyond the lookup mechanism. Public API is unchanged.
No new allocations introduced; one Errors[] reference is retained at
class-init for the life of the JVM.

Four distributions approximate real-world usage:
- allZero: every call is NONE (code 0); the common happy path.
- small:   uniformly random over Errors.values(); exercises the table.
- mixed:   80/10/10 NONE / UNKNOWN_SERVER_ERROR / random; realistic.
- miss:    codes outside the defined range; worst-case fallback.

The harness preloads a shuffled code buffer at @Setup and unrolls the
measured loop x4 so the per-call cost dominates loop overhead. All
results are consumed via Blackhole to prevent dead-code elimination.

Config: @Fork(2) @Warmup(5x5s) @Measurement(10x5s), the default
"full-confidence" Kafka profile used by prior candidate benchmarks in
this repo.
…(279)

The e2e evidence file committed in the previous commit used t=3.182
(df=3) when computing N=5 half-widths; the correct t-value for N=5 is
2.776 (df=4). The CI-overlap verdict is unchanged by this correction
(both producer and consumer 95% CIs still overlap with either t-value),
but the exact half-width numbers in the table were over-estimated by
a factor of ~1.146. This commit replaces those numbers with
half-widths recomputed directly from the raw records/sec txt files in
the session directory.

This commit also:
- Corrects the raw data rows for the trunk and feature consumer
  sequences so they match the actual per-run txt files (prior values
  were misattributed).
- Updates the caller-count claim (274 -> 279) in both evidence files
  to match a fresh grep of the current tree:
  `rg -n "Errors\.forCode" --type java | grep -v '/test/' | grep -v '/jmh-benchmarks/' | wc -l`
  -> 279.

No code or test changes — only evidence-file corrections.