Skip to content

[fix][bk] Fix NPE in IsolatedBookieEnsemblePlacementPolicy when policy class does not match#25825

Open
Shawyeok wants to merge 1 commit into
apache:masterfrom
Shawyeok:hotfix/fix-npe-in-isolated-bookie-ensemble-placement-policy
Open

[fix][bk] Fix NPE in IsolatedBookieEnsemblePlacementPolicy when policy class does not match#25825
Shawyeok wants to merge 1 commit into
apache:masterfrom
Shawyeok:hotfix/fix-npe-in-isolated-bookie-ensemble-placement-policy

Conversation

@Shawyeok
Copy link
Copy Markdown
Contributor

Fixes NPE in IsolatedBookieEnsemblePlacementPolicy

Motivation

A NullPointerException occurs in getExcludedBookiesWithIsolationGroups when getIsolationGroup() returns a Pair whose left value is null. This happens when the EnsemblePlacementPolicyConfig carries a policy class name that does not equal IsolatedBookieEnsemblePlacementPolicy — in that case the if block that calls pair.setLeft() / pair.setRight() is skipped entirely, leaving both sides of the MutablePair as null.

This bug was observed during the upgrade from the 2.8 branch to the 3.0 branch. The same defect may also exist in master.

Full stack trace from production:

2026-05-18T13:28:40.756Z [BookKeeperClientWorker-OrderedExecutor-35-0] ERROR org.apache.bookkeeper.common.util.SingleThreadExecutor - Error while running task: Cannot invoke "java.util.Set.contains(Object)" because the return value of "org.apache.commons.lang3.tuple.Pair.getLeft()" is null
java.lang.NullPointerException: Cannot invoke "java.util.Set.contains(Object)" because the return value of "org.apache.commons.lang3.tuple.Pair.getLeft()" is null
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.getExcludedBookiesWithIsolationGroups(IsolatedBookieEnsemblePlacementPolicy.java:192)
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.getExcludedBookies(IsolatedBookieEnsemblePlacementPolicy.java:141)
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.replaceBookie(IsolatedBookieEnsemblePlacementPolicy.java:127)
    at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:316)
    at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:69)
    at org.apache.bookkeeper.client.ReadOnlyLedgerHandle.handleBookieFailure(ReadOnlyLedgerHandle.java:222)
    at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:353)
    at org.apache.bookkeeper.proto.BookieClientImpl.completeAdd(BookieClientImpl.java:287)
    at org.apache.bookkeeper.proto.BookieClientImpl.access$200(BookieClientImpl.java:79)
    at org.apache.bookkeeper.proto.BookieClientImpl$ChannelReadyForAddEntryCallback.lambda$operationComplete$0(BookieClientImpl.java:405)
    at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)

This error causes topics to time out during loading. Since there is no automatic recovery path, the only way to restore service is by restarting the broker.

Modifications

Two complementary fixes:

  1. getIsolationGroup(): Initialize MutablePair with empty sets as defaults so the returned Pair always has non-null values, and remove the now-redundant else branches that were explicitly setting the same empty sets.

  2. getExcludedBookiesWithIsolationGroups() (defensive programming): Resolve primaryIsolationGroup and secondaryIsolationGroup at the top of the method using ObjectUtils.getIfNull, before any use. This eliminates both the original NPE at the early-return contains check and a second latent NPE at lines 218–219 where the pair values were previously assigned and dereferenced without null guards.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added testReplaceBookieWithNonMatchingPolicyClassShouldNotThrowNPE to reproduce the exact production failure: calling replaceBookie with custom metadata whose EnsemblePlacementPolicyConfig policy class does not match IsolatedBookieEnsemblePlacementPolicy, which previously caused the NPE.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

…y class does not match

MutablePair was initialized without default values, leaving left/right as null
when the ensemble placement policy class name did not match. This caused a
NullPointerException when getLeft().contains() was called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
}
for (String group : allGroups) {
Set<String> bookiesInGroup = allGroupsBookieMapping.get(group).keySet();
if (!primaryIsolationGroup.contains(group)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the problematic behavior for the non-matching policyClass case. After this PR, getIsolationGroup() returns empty primary/secondary groups when the metadata policyClass is not
IsolatedBookieEnsemblePlacementPolicy. With primaryIsolationGroup empty, this condition is true for every configured group, so all grouped bookies are added to excludedBookies below. That avoids the NPE but can
still make replaceBookie fail with BKNotEnoughBookiesException. A non-matching policyClass should be ignored or fall back to defaultIsolationGroups, not be treated as an empty isolation group.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shawyeok Please check this. I wonder if there's a test case that would cover the explained scenario?

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants