Skip to content

[AI-8th] fix: replace WeakOrderQueue with WeakHashMap+MPSC to fix Recycler memory leak#1259

Open
jervyclaw wants to merge 4 commits intosofastack:masterfrom
jervyclaw:pr/fix/recycler-memory-leak
Open

[AI-8th] fix: replace WeakOrderQueue with WeakHashMap+MPSC to fix Recycler memory leak#1259
jervyclaw wants to merge 4 commits intosofastack:masterfrom
jervyclaw:pr/fix/recycler-memory-leak

Conversation

@jervyclaw
Copy link
Copy Markdown

Problem

The old Recycler used WeakOrderQueue + per-thread ThreadLocal WeakHashMap for cross-thread object recycling. When a thread terminated, its WeakOrderQueue entries accumulated because the WeakHashMap held strong references to the queues, causing unbounded memory growth (SOFAJRaft issue #1240).

Solution

Replaced the per-thread ThreadLocal WeakHashMap design with a single global WeakHashMap: Stack → MPSC queue.

Key design decisions:

  • WeakHashMap instead of ConcurrentHashMap: ConcurrentHashMap holds strong references to keys (Stack), preventing GC of dead threads. WeakHashMap's weak keys allow Stack to be GC'd when only the map holds a reference.
  • WeakHashMap's internal cleanup: WeakHashMap runs its own internal expungeStaleEntries() on every public map operation, automatically removing entries whose keys have been GC'd.
  • handle.stack cleared before enqueue: Prevents stale cross-thread references (matching old WeakOrderQueue semantics).
  • Owner isAlive() check: Dead threads discard incoming items to prevent unbounded queue growth.
  • Transfer limit checked before poll: Avoids unnecessary queue.poll() calls.

Testing

All 6 existing RecyclersTest cases pass.

jervyclaw and others added 3 commits April 4, 2026 22:30
… memory leak

Sync from Netty 4.1.108+ Recycler PR #11858 (partial).

Problem:
- The old Recycler used WeakOrderQueue + WeakHashMap for cross-thread object recycling
- When a thread terminated, its WeakOrderQueue entries remained in WeakHashMap
  due to strong references from the map, causing unbounded memory growth
- This manifested as memory leaks in long-running SOFAJRaft servers with
  many thread pool operations

Solution:
- Replaced WeakOrderQueue with JCTools MpscGrowableAtomicArrayQueue
- Replaced ThreadLocal<WeakHashMap> with ConcurrentHashMap<Stack<?>, MpscQueue>
- When the owning thread dies, the ThreadLocal entry becomes GC-reachable,
  and so does the MPSC queue (no strong references held externally)
- The bounded MPSC queue (1024 max) prevents unbounded growth even if
  the owner thread is slow to consume cross-thread returns

Trade-offs:
- WeakOrderQueue had free garbage collection of dead thread queues
  (via WeakReference to Thread)
- New design requires the queue to be bounded; items are dropped if
  the queue fills up before the owner can drain it
- For the intended use case (thread pool with bounded queues), this
  is acceptable since threads are long-lived and actively consuming

Tested:
- All 6 existing RecyclersTest cases pass
- Same-thread recycle/reuse behavior preserved
- Cross-thread recycle behavior preserved
- Max capacity enforcement preserved
…Map + ReferenceQueue

Claude Code review identified critical issues:

1. Critical: ConcurrentHashMap holds strong refs to Stack keys — Stack was
   NEVER GC'd when owner thread died. Switched to WeakHashMap with
   ReferenceQueue so dead threads' stacks are properly garbage collected.

2. Bug: Map entries for dead threads were never removed from ConcurrentHashMap.
   Now expungeStaleEntries() is called on every map access, cleaning up
   WeakReference entries that have been cleared by GC.

3. Bug: handle.stack not nulled before enqueue. Fixed — now nulled BEFORE
   offer() to prevent stale cross-thread references (matching old WeakOrderQueue
   lazySet semantics).

4. Minor: mpscQueueHead field was dead code — removed.

5. Minor: scavengeSome() silently skipped double-recycle handles. Now throws
   IllegalStateException to catch bugs early.

All 6 RecyclersTest cases still pass.
Dead code removal per Claude Code review:

- STACK_REF_QUEUE was never connected to any WeakReference (WeakHashMap
  creates its own internal ReferenceQueue automatically)
- expungeStaleEntries() always found STACK_REF_QUEUE.poll() == null
- WeakHashMap's internal expungeStaleEntries() handles cleanup on every
  public map operation — no manual cleanup needed
- Removed: STACK_REF_QUEUE field, expungeStaleEntries() method, and its
  two call sites

Updated comment to reflect that WeakHashMap's internal cleanup runs
automatically on every map operation.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bef670ceb6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

final int limit = 64; // Transfer up to 64 items per call (prevents long pauses)

// Transfer items from MPSC queue to local buffer
while ((handle = queue.poll()) != null && transferred < limit) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check transfer bounds before dequeuing handles

In Stack.scavengeSome, the loop condition calls queue.poll() before verifying transferred < limit, so once transferred reaches 64 the method still dequeues one additional handle and then exits, dropping that handle. The same pattern also drops one handle when size >= maxCapacity because capacity is checked only after dequeue. Under heavy cross-thread recycling this silently discards reusable objects and increases allocation churn; bounds/capacity checks need to happen before consuming from the queue (or the extra item must be preserved).

Useful? React with 👍 / 👎.

…dle drop

When the local buffer is full (size >= maxCapacity), the old code had
already called queue.poll() to dequeue the handle before discovering
the buffer was full, causing the handle to be silently dropped.
Restructure the loop to check transferred < limit and size < maxCapacity
BEFORE polling, so no item is consumed unless it can be stored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 612fa97a70

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


private static final class SystemPropertyUtil {
static int getInt(final String name, final int defaultValue) {
final String value = System.getProperty(name);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore guarded property access for recycler config

The new inner SystemPropertyUtil reads System.getProperty directly, so a denied property read now throws SecurityException during Recyclers static initialization (MAX_CAPACITY_PER_THREAD setup) instead of falling back to defaults. In JVMs still running with a security policy, this can fail class loading with ExceptionInInitializerError and disable recycling entirely; the previous shared com.alipay.sofa.jraft.util.SystemPropertyUtil caught access failures and avoided this regression.

Useful? React with 👍 / 👎.

@killme2008
Copy link
Copy Markdown
Contributor

Thank you for your contribution.

We generally cannot accept pull requests primarily written by AI due to limited review time.

Before submitting a PR, please ensure you understand the problem, potential solutions, and the code itself. Create an issue beforehand if necessary.

Thanks.

@shihuili1218
Copy link
Copy Markdown
Collaborator

  1. The STACK_REF_QUEUE is never wired to any WeakReference, so poll() always returns null and the whole loop is a no-op.
  2. The old design gave each thread its own ThreadLocal — threads never blocked each other. Your replacement is a single Collections.synchronizedMap(WeakHashMap) shared by all threads.
  3. recycle(T o, Handle) was changed to recycle(Object o, Handle). This shouldn't be changed without a good reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Sync Recycler improvements from Netty to fix memory leak

3 participants