Skip to content

feat: Bring Pollers & Workers on par with Go implementations #998#1055

Merged
Shaddoll merged 11 commits intomasterfrom
p2
May 6, 2026
Merged

feat: Bring Pollers & Workers on par with Go implementations #998#1055
Shaddoll merged 11 commits intomasterfrom
p2

Conversation

@Shaddoll
Copy link
Copy Markdown
Member

@Shaddoll Shaddoll commented May 1, 2026

What changed?

Cherry pick several changes from temporal to cadence

relates to temporalio/sdk-java#998

Why?
See temporalio/sdk-java#998.

Another reason is that we want to enforce rate limiting on sticky pollers.

How did you test it?
./gradlew test

Potential risks

This changes several components in java client. It could make workflow/poller to be stuck if there is anything wrong.

Release notes

Documentation Changes

Comment thread src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java
Comment thread src/main/java/com/uber/cadence/internal/worker/WorkflowWorker.java Outdated
Comment thread src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java Outdated
Comment thread src/main/java/com/uber/cadence/internal/worker/StickyQueueBalancer.java Outdated
@Shaddoll Shaddoll changed the title P2 Bring Pollers & Workers on par with Go implementations #998 May 1, 2026
@Shaddoll Shaddoll changed the title Bring Pollers & Workers on par with Go implementations #998 feat: Bring Pollers & Workers on par with Go implementations #998 May 1, 2026
@Shaddoll Shaddoll marked this pull request as ready for review May 1, 2026 22:20
Comment thread src/main/java/com/uber/cadence/internal/worker/WorkflowWorker.java Outdated
Comment thread src/main/java/com/uber/cadence/worker/WorkerFactoryOptions.java Outdated
Comment thread src/main/java/com/uber/cadence/worker/WorkerOptions.java Outdated
// situation that too many pollers (all of them in the worst case) will open only sticky queue
// polls observing a stickyBacklogSize == 1 for example (which actually can be 0 already at
// that moment) and get stuck causing dip in worker load.
if (stickyBacklogSize > pollersCount || stickyPollers <= normalPollers) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that we don't guarantee at least one Poll for each Kind. That seems like a safer/more conservative behavior.

Comment thread src/main/java/com/uber/cadence/internal/worker/WorkflowPollTask.java Outdated
Comment thread src/main/java/com/uber/cadence/internal/worker/WorkflowPollTask.java Outdated
// Unique id is needed to avoid collisions with other workers that may be created for the same
// task list and with the same identity.
UUID uniqueId = UUID.randomUUID();
return String.format("%s:%s", workerIdentity, uniqueId);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks slightly different from the previous TaskList naming, does it match the Go client?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit different from go client because it also includes the pid

workerFactory.newWorker(
TASK_LIST, WorkerOptions.newBuilder().setMaxConcurrentWorkflowExecutionSize(2).build());
TASK_LIST,
WorkerOptions.newBuilder().setMaxConcurrentWorkflowExecutionSize(20).build());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is this still needed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the concurrent size needs to be larger than 2

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why? Shouldn't we be evicting sticky workflows for incoming workflows? This shouldn't impact it

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 6, 2026

Code Review ⚠️ Changes requested 5 resolved / 6 findings

Synchronizes Pollers and Workers with Go implementations, addressing race conditions and stopwatch management. However, the completion callback failure in the finally block requires resolution to prevent runLock leakage.

⚠️ Bug: Completion callback failure in finally block prevents runLock release

📄 src/main/java/com/uber/cadence/internal/worker/WorkflowWorker.java:321-335

In TaskHandlerImpl.handle(), both task.getCompletionCallback().apply() (line 326) and runLocks.unlock() (line 334) are in the same finally block. If the completion callback throws an exception, the runLock will never be released, causing a deadlock for all subsequent decision tasks for that workflow run. The completion callback releases a semaphore permit (which shouldn't throw), but defensive coding would protect against this.

Suggested fix
Wrap the completion callback in its own try-finally or move the runLock unlock into a nested finally block:

} finally {
  MDC.remove(LoggerTag.WORKFLOW_ID);
  MDC.remove(LoggerTag.WORKFLOW_TYPE);
  MDC.remove(LoggerTag.RUN_ID);
  try {
    task.getCompletionCallback().apply();
  } finally {
    if (runLock != null) {
      runLocks.unlock(response.getWorkflowExecution().getRunId());
    }
  }
}
✅ 5 resolved
Bug: awaitTermination may wait up to 2x the specified timeout

📄 src/main/java/com/uber/cadence/internal/worker/WorkflowWorker.java:237-244
In WorkflowWorker.awaitTermination(), the full timeout is passed to super.awaitTermination() and then the same full timeout is passed to stickyPoller.awaitTermination(). This means the method could block for up to 2x the requested timeout. The WorkerFactory correctly handles this by tracking elapsed time using InternalUtils.awaitTermination.

Bug: Race condition in StickyQueueBalancer.makePoll() check-then-act

📄 src/main/java/com/uber/cadence/internal/worker/StickyQueueBalancer.java:37-46
The makePoll() method reads stickyPollers and normalPollers atomically as individual operations, but the compound check stickyPollers.get() <= normalPollers.get() followed by stickyPollers.incrementAndGet() is not atomic. Under contention, multiple threads can observe the same counter values and all decide to poll sticky, temporarily starving the normal queue. While this is a polling balancer (so consequences are suboptimal distribution rather than data corruption), it contradicts the @ThreadSafe annotation and could cause normal-queue starvation bursts.

Bug: Stopwatch never stopped on error or null-result paths in doPoll

📄 src/main/java/com/uber/cadence/internal/worker/WorkflowPollTask.java:131-133 📄 src/main/java/com/uber/cadence/internal/worker/WorkflowPollTask.java:189
In doPoll(), a DualStopwatch is started at line 131 but sw.stop() is only called at line 189 on the success path. If an exception is thrown (lines 142-156) or the result is null/has no task token (line 174), the stopwatch is never stopped. Since DualStopwatch does not implement AutoCloseable and requires explicit stopping, this leaks the latency metric — it either never records or records an incorrect value on error paths.

Bug: TaskQueue backlog uses LIFO (push/pop) instead of FIFO ordering

📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:51 📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:69 📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:46 📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:66
The TaskQueue.add() method uses backlog.push(element) (adds to head) and poll() uses backlog.pop() (removes from head), resulting in LIFO/stack behavior. This contradicts the Javadoc which states "Adds the provided element to the tail of this queue" and the FIFO comment for waiters. In a task queue, LIFO ordering can cause starvation of older tasks. While this only affects the test service, it could lead to flaky or incorrect test behavior.

Quality: Redundant double synchronization in TaskQueue.poll()

📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:61 📄 src/main/java/com/uber/cadence/internal/testservice/TaskQueue.java:64
The poll() method is declared synchronized (line 61) and then contains a synchronized (this) block (line 64), which locks on the same monitor. The inner synchronized block is redundant and adds confusion.

🤖 Prompt for agents
Code Review: Synchronizes Pollers and Workers with Go implementations, addressing race conditions and stopwatch management. However, the completion callback failure in the finally block requires resolution to prevent runLock leakage.

1. ⚠️ Bug: Completion callback failure in finally block prevents runLock release
   Files: src/main/java/com/uber/cadence/internal/worker/WorkflowWorker.java:321-335

   In `TaskHandlerImpl.handle()`, both `task.getCompletionCallback().apply()` (line 326) and `runLocks.unlock()` (line 334) are in the same `finally` block. If the completion callback throws an exception, the runLock will never be released, causing a deadlock for all subsequent decision tasks for that workflow run. The completion callback releases a semaphore permit (which shouldn't throw), but defensive coding would protect against this.

   Suggested fix:
   Wrap the completion callback in its own try-finally or move the runLock unlock into a nested finally block:
   
   } finally {
     MDC.remove(LoggerTag.WORKFLOW_ID);
     MDC.remove(LoggerTag.WORKFLOW_TYPE);
     MDC.remove(LoggerTag.RUN_ID);
     try {
       task.getCompletionCallback().apply();
     } finally {
       if (runLock != null) {
         runLocks.unlock(response.getWorkflowExecution().getRunId());
       }
     }
   }

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 76.71756% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.56%. Comparing base (b0f0a7e) to head (3919fd3).
⚠️ Report is 37 commits behind head on master.

Files with missing lines Patch % Lines
...ence/internal/testservice/TestWorkflowService.java 37.50% 18 Missing and 2 partials ⚠️
...m/uber/cadence/internal/testservice/TaskQueue.java 65.38% 11 Missing and 7 partials ⚠️
...uber/cadence/internal/worker/WorkflowPollTask.java 81.70% 12 Missing and 3 partials ⚠️
...in/java/com/uber/cadence/worker/WorkerOptions.java 60.00% 4 Missing ⚠️
...r/cadence/internal/worker/StickyQueueBalancer.java 87.50% 1 Missing and 2 partials ⚠️
...m/uber/cadence/internal/worker/WorkflowWorker.java 95.00% 0 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (76.71%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines Coverage Δ Complexity Δ
...uber/cadence/internal/sync/SyncWorkflowWorker.java 83.11% <100.00%> (+1.29%) 14.00 <0.00> (ø)
...rnal/testservice/TestWorkflowMutableStateImpl.java 82.16% <100.00%> (-0.73%) 185.00 <1.00> (ø)
...adence/internal/testservice/TestWorkflowStore.java 78.78% <ø> (ø) 0.00 <0.00> (ø)
...ce/internal/testservice/TestWorkflowStoreImpl.java 82.55% <100.00%> (-2.87%) 33.00 <0.00> (+1.00) ⬇️
...com/uber/cadence/internal/worker/DecisionTask.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
...uber/cadence/internal/worker/PollTaskExecutor.java 94.59% <ø> (+0.30%) 7.00 <0.00> (ø)
...r/cadence/internal/worker/SingleWorkerOptions.java 100.00% <100.00%> (ø) 14.00 <2.00> (+2.00)
src/main/java/com/uber/cadence/worker/Worker.java 84.68% <100.00%> (+1.18%) 17.00 <1.00> (+2.00)
...in/java/com/uber/cadence/worker/WorkerFactory.java 86.72% <100.00%> (-0.44%) 32.00 <0.00> (-9.00)
.../com/uber/cadence/worker/WorkerFactoryOptions.java 81.81% <ø> (+1.33%) 9.00 <0.00> (-1.00) ⬆️
... and 6 more

... and 94 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d67cf58...3919fd3. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pollRequest.setTaskList(tl);
TaskListKind taskListKind = stickyQueueBalancer.makePoll();
boolean isSticky = TaskListKind.STICKY.equals(taskListKind);
PollForDecisionTaskRequest request =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move to separate method, pass the TaskList as a parameter

@Shaddoll Shaddoll merged commit a582d59 into master May 6, 2026
14 of 15 checks passed
@Shaddoll Shaddoll deleted the p2 branch May 6, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants