fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2613

Open
nan-li wants to merge 4 commits into feat/identity_verification_5.8 from fix/jwt_misc_race_conditions
Conversation


@nan-li nan-li commented Apr 15, 2026

Description

One Line Summary

Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.

Details

Motivation

With identity verification enabled, several issues surfaced during testing:

  • Anonymous operations were sometimes not purged on cold start due to a startup timing race
  • The onUserJwtInvalidated listener might not be called when the JWT was missing at SDK init time
  • In-App Messages might not be fetched after the first login
  • If a JWT expired mid-session, the IAM fetch could silently fail with no recovery

Scope

All changes are gated behind identity verification being enabled (useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.

Supersedes #2609 — squashed into clean commits after rebasing on the updated base branch.

Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded

  • IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue
  • Fixed by wrapping the HYDRATE handler in suspendifyOnIO + awaitInitialized(), following the same pattern as RecoverFromDroppedLoginBug
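The gating pattern can be modeled with plain threads. The sketch below is illustrative only: a CountDownLatch stands in for OperationRepo.awaitInitialized(), and loadSavedOperations/purgeAnonymousOps are hypothetical names, not the SDK's real signatures.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

class OperationQueueSketch {
    private val initialized = CountDownLatch(1)
    private val queue = mutableListOf<String>()   // ops; anonymous ops modeled as "anon"

    fun loadSavedOperations(saved: List<String>) {
        synchronized(queue) { queue.addAll(saved) }
        initialized.countDown()                   // signal: queue is fully populated
    }

    // The HYDRATE handler must wait for the load to finish; otherwise it
    // purges against an empty queue and the anonymous ops survive.
    fun purgeAnonymousOps() {
        initialized.await()
        synchronized(queue) { queue.removeAll { it == "anon" } }
    }

    fun snapshot(): List<String> = synchronized(queue) { queue.toList() }
}

fun main() {
    val repo = OperationQueueSketch()
    // HYDRATE fires first on another thread...
    val purger = thread { repo.purgeAnonymousOps() }
    // ...but the purge only runs after the saved queue is loaded.
    repo.loadSavedOperations(listOf("anon", "login:alice", "anon"))
    purger.join()
    println(repo.snapshot())  // [login:alice]
}
```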

Commit 2 -- Replay JWT invalidated event to late-registered listeners

  • fireJwtInvalidated now buffers the externalId under a synchronized lock when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added
  • Clears pending event on user switch (onModelReplaced) to prevent stale replay
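A minimal, self-contained sketch of this buffer-and-replay pattern, with illustrative names rather than the SDK's actual UserManager members:

```kotlin
class JwtInvalidatedSketch {
    private val lock = Any()
    private var pendingExternalId: String? = null
    private val listeners = mutableListOf<(String) -> Unit>()

    fun fireJwtInvalidated(externalId: String) {
        val toNotify: List<(String) -> Unit>
        synchronized(lock) {
            if (listeners.isEmpty()) {           // e.g. during SDK init HYDRATE
                pendingExternalId = externalId   // buffer for later replay
                return
            }
            toNotify = listeners.toList()
        }
        toNotify.forEach { it(externalId) }
    }

    fun addListener(listener: (String) -> Unit) {
        val replay: String?
        synchronized(lock) {
            listeners.add(listener)
            replay = pendingExternalId
            pendingExternalId = null
        }
        replay?.let(listener)                    // replay the buffered event
    }

    fun onUserSwitch() = synchronized(lock) { pendingExternalId = null }
}

fun main() {
    val sketch = JwtInvalidatedSketch()
    sketch.fireJwtInvalidated("alice")           // no listeners yet: buffered
    val seen = mutableListOf<String>()
    sketch.addListener { seen.add(it) }          // late listener gets the replay
    println(seen)                                // [alice]
}
```

Note that this sketch mirrors the PR as written; the review comments below argue the replay should instead be scheduled asynchronously while still holding the lock.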

Commit 3 -- Fix IAM fetch stuck after login with IV enabled

  • LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing IamFetchReadyCondition to not resolve
  • Added rywData field to CreateUserResponse, parsed ryw_token/ryw_delay in JSONConverter, and set RYW data in ConsistencyManager after successful createUser
  • Fixed resolveConditionsWithID to only remove matching conditions (was removing all)
  • Wrapped resolveConditionsWithID in mutex for thread safety
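The last two bullets can be sketched together. This is a simplified model with hypothetical shapes: the real ConsistencyManager uses suspending deferreds and a coroutine Mutex rather than callbacks and synchronized.

```kotlin
class ConsistencySketch {
    data class Condition(val id: String, val onResolved: (String?) -> Unit)

    private val lock = Any()
    private val pending = mutableListOf<Condition>()

    fun registerCondition(c: Condition) = synchronized(lock) { pending.add(c) }

    fun resolveConditionsWithID(id: String, rywData: String?) {
        val matched: List<Condition>
        synchronized(lock) {                       // thread safety (was unguarded)
            matched = pending.filter { it.id == id }
            pending.removeAll { it.id == id }      // only matching ones (was: all)
        }
        matched.forEach { it.onResolved(rywData) }
    }

    fun pendingCount() = synchronized(lock) { pending.size }
}

fun main() {
    val cm = ConsistencySketch()
    val results = mutableListOf<String?>()
    cm.registerCondition(ConsistencySketch.Condition("IamFetchReadyCondition") { results.add(it) })
    cm.registerCondition(ConsistencySketch.Condition("SomeOtherCondition") { results.add(it) })
    cm.resolveConditionsWithID("IamFetchReadyCondition", "ryw-token-123")
    println(results)           // [ryw-token-123]
    println(cm.pendingCount()) // 1  (the unrelated condition survives)
}
```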

Commit 4 -- Retry IAM fetch after JWT refresh on 401/403 response

  • InAppBackendService now throws BackendException on unauthorized (401/403) responses instead of returning null
  • InAppMessagesManager catches the exception, stores a pending retry state with @Volatile fields, and retries the fetch when JwtTokenStore notifies via IJwtUpdateListener that the JWT has been refreshed for the same user
  • Pending retry is cleared on user switch to avoid stale retries
  • Rate-limiter reset on 401 so retry is not throttled; RYW-aware fetches bypass rate limiter
  • Handles 401 in fetchInAppMessagesWithoutRywToken fallback path
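The pending-retry lifecycle described above can be sketched as follows. This is a reduced model, not the SDK's real InAppMessagesManager: the fetch is a plain callback returning a status code, and only the externalId half of the pending state is shown.

```kotlin
class IamRetrySketch(private val fetch: (externalId: String) -> Int) {
    @Volatile private var pendingRetryExternalId: String? = null

    fun fetchMessages(externalId: String) {
        val status = fetch(externalId)
        if (status == 401 || status == 403) {
            pendingRetryExternalId = externalId   // remember who needs a retry
        }
    }

    // Called by the JWT store after updateUserJwt succeeds.
    fun onJwtUpdated(externalId: String) {
        val pending = pendingRetryExternalId ?: return   // no-op if nothing pending
        if (pending != externalId) return                // different user: ignore
        pendingRetryExternalId = null
        fetchMessages(externalId)                        // retry with fresh JWT
    }

    fun onUserSwitch() { pendingRetryExternalId = null } // drop stale retries
}

fun main() {
    var calls = 0
    val sketch = IamRetrySketch { if (calls++ == 0) 401 else 200 }
    sketch.fetchMessages("alice")   // first attempt: 401, retry stored
    sketch.onJwtUpdated("bob")      // wrong user: ignored
    sketch.onJwtUpdated("alice")    // matching user: retried, succeeds
    println(calls)                  // 2
}
```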

Testing

Unit testing

  • Updated LoginUserOperationExecutorTests to include the new IConsistencyManager dependency
  • Added UserManagerTests for JWT invalidated replay behavior
  • Added test in InAppBackendServiceTests: 401 response throws BackendException
  • Added test in InAppBackendServiceTests: 401 from fallback (no RYW token) path throws BackendException
  • Added 5 tests in InAppMessagesManagerTests covering the JWT 401 retry flow:
    • 401 stores pending retry state
    • JWT refresh for matching user triggers retry
    • JWT refresh for different user does not retry
    • No-op when no pending retry exists
    • User switch clears pending retry state

Manual testing

  • Tested with IV enabled on a fresh install: verified no IAM fetch for anonymous user, IAM fetched after login
  • Tested JWT expiry mid-session: verified IAM fetch retries after updateUserJwt is called
  • Tested login/logout cycle: verified pending retry is cleared on user switch

Affected code checklist

  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing (fix IV-related issues for 5.8 release)
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible
  • I have reviewed this PR myself, ensuring it meets each checklist item

@nan-li nan-li changed the title Fix/jwt misc race conditions fix(jwt): race conditions and IAM reliability issues with identity verification enabled Apr 15, 2026

@claude claude bot left a comment

This PR touches JWT retry logic and IAM fetch pipeline with multiple concurrency concerns — a human review is warranted before merging.


Overview

The PR modifies 17 files across JWT token management (JwtTokenStore, UserManager), the in-app message fetch pipeline (InAppMessagesManager, InAppBackendService), the consistency manager, and the login executor. It introduces a new IJwtUpdateListener pub/sub mechanism, a pending-retry pattern for 401 IAM fetches, and plugs the ConsistencyManager's missing mutex in resolveConditionsWithID.

Security risks

JWT handling and authentication gating are directly affected. The new retry path (storing externalId + rywData and re-triggering fetchMessages on JWT refresh) must be correct to avoid fetching IAMs with stale or wrong credentials. The non-atomic @Volatile pair identified by the bug hunter could, under a concurrent user switch, cause fetchMessages to run with a mismatched externalId/rywData combination.

Level of scrutiny

High. This is production-critical code: incorrect JWT retry behavior can silently degrade IAM delivery or, in edge cases, fetch IAMs for the wrong user context. The changes are logically sound in the common case but have race conditions in corner cases (concurrent user switch during 401 handling, and putJwt called with unchanged value triggering a spurious retry).

Other factors

The PR description template was never filled out — the required One Line Summary, Motivation, and Scope sections remain as placeholder text, making it difficult to understand intent without reading the code. The subscribe/unsubscribe lifecycle for IJwtUpdateListener is asymmetric (InAppMessagesManager subscribes in start() but never unsubscribes). These factors, combined with the two identified bugs, make human review necessary.

@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch 2 times, most recently from b4b0da9 to a77a264 Compare April 15, 2026 17:31

nan-li commented Apr 15, 2026

@claude review

@nan-li nan-li mentioned this pull request Apr 15, 2026
Comment on lines 343 to 360

      // lambda so that it is updated on each potential retry
      val sessionDurationProvider = { _time.currentTimeMillis - _sessionService.startTime }
-     val newMessages = _backend.listInAppMessages(appId, aliasLabel, aliasValue, subscriptionId, rywData, sessionDurationProvider, jwt)
+     val newMessages =
+         try {
+             _backend.listInAppMessages(appId, aliasLabel, aliasValue, subscriptionId, rywData, sessionDurationProvider, jwt)
+         } catch (ex: BackendException) {
+             if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
+                 Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId")
+                 lastTimeFetchedIAMs = null
+                 pendingJwtRetryRywData = rywData
+                 pendingJwtRetryExternalId = externalId
+             }
+             null
+         }

      if (newMessages != null) {
          this.messages = newMessages as MutableList<InAppMessage>

🔴 In the 401 exception handler in fetchMessages, lastTimeFetchedIAMs = null is written outside fetchIAMMutex, while the corresponding rate-limit read-and-write (lastTimeFetchedIAMs = now) runs inside the mutex. A concurrent fetchMessages call can acquire the mutex after the null write, set a fresh timestamp, and cause the JWT retry path (via onJwtUpdated -> fetchMessages(null rywData)) to be silently throttled by the rate limiter — permanently dropping the retry since pendingJwtRetryExternalId was already cleared. Fix: move lastTimeFetchedIAMs = null inside fetchIAMMutex.withLock {}.


What the bug is and how it manifests

In InAppMessagesManager.fetchMessages, the rate-limiter field lastTimeFetchedIAMs is protected by fetchIAMMutex for the read-then-write in the rate-limit check (the withLock block around lines 326–334 in the diff). However, in the 401/403 exception handler added by this PR, the reset lastTimeFetchedIAMs = null is performed outside the mutex (in the catch (ex: BackendException) block, ~line 352).

The specific code path that triggers it

Two paths can concurrently invoke fetchMessages via suspendifyOnIO: session start, subscription change, config model replacement, and identity model update. When one call gets a 401 and executes the catch block, it writes null to lastTimeFetchedIAMs without holding the mutex. A second concurrently-executing call can acquire the mutex in the window between the two writes and update lastTimeFetchedIAMs to a fresh now value.

Why existing code does not prevent it

@Volatile guarantees that each individual read/write to lastTimeFetchedIAMs is immediately visible across threads, but it provides no atomicity for the sequence: (null write outside mutex) → (mutex acquired) → (now write inside mutex). The mutex only guards the read-check-and-set inside withLock; it does not prevent non-mutex code from interleaving.

Step-by-step proof

  1. Coroutine A (rywData == null, triggered by onSessionStarted) acquires fetchIAMMutex, writes lastTimeFetchedIAMs = T_A, releases the mutex, then calls the backend.
  2. The backend returns 401. Coroutine A's catch block writes lastTimeFetchedIAMs = null without the mutex and sets pendingJwtRetryExternalId = externalId.
  3. Concurrently, Coroutine B (rywData != null, triggered by a subscription change or session event) acquires fetchIAMMutex after A's null write. It sees lastTimeFetchedIAMs == null, skips the rate-limit check (condition is false), writes lastTimeFetchedIAMs = T_B (a recent timestamp), and exits the mutex.
  4. The developer calls updateUserJwt, which fires onJwtUpdated -> fetchMessages(retryRywData=null). pendingJwtRetryExternalId is cleared before fetchMessages is called.
  5. Inside fetchMessages(null), the rate limiter evaluates rywData == null && lastTimeFetchedIAMs == T_B (not null) && (now - T_B) < fetchIAMMinInterval, so it returns early, silently dropping the retry.
  6. Because pendingJwtRetryExternalId was already cleared in step 4, no future JWT update will re-trigger the retry. The IAM fetch is permanently lost until an unrelated event (e.g., new session) fires fetchMessages again.

Why the impact is real

The scenario at login is realistic: an anonymous-user IAM fetch (null rywData) may get 401 during startup, while the login flow concurrently triggers a non-null rywData fetch (e.g., via onModelUpdated for the new OneSignal ID). The default fetchIAMMinInterval is 30 seconds, so the retry delay is significant. Multiple verifiers independently confirmed the race is exploitable in normal login flows with identity verification enabled.

How to fix

Move lastTimeFetchedIAMs = null inside fetchIAMMutex.withLock {} so the reset is serialized with the rate-limit check:

} catch (ex: BackendException) {
    if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
        Logging.debug("...")
        fetchIAMMutex.withLock { lastTimeFetchedIAMs = null }  // serialized reset
        pendingJwtRetryRywData = rywData
        pendingJwtRetryExternalId = externalId
    }
    null
}

This ensures that any concurrent fetchMessages call that acquires the mutex will see null (or its own subsequent write), not a stale now written after the 401 catch cleared it.

Comment on lines 61 to 71
  fun addJwtInvalidatedListener(listener: IUserJwtInvalidatedListener) {
-     jwtInvalidatedNotifier.subscribe(listener)
+     val pendingExternalId: String?
+     synchronized(jwtInvalidatedLock) {
+         jwtInvalidatedNotifier.subscribe(listener)
+         pendingExternalId = pendingJwtInvalidatedExternalId
+         pendingJwtInvalidatedExternalId = null
+     }
+     pendingExternalId?.let {
+         listener.onUserJwtInvalidated(UserJwtInvalidatedEvent(it))
+     }
  }

🔴 In addJwtInvalidatedListener, the listener is subscribed (making hasSubscribers=true) inside jwtInvalidatedLock, but the pending-replay delivery happens synchronously outside the lock, creating two problems. First, a concurrent fireJwtInvalidated call in that narrow window will also fire to the newly-registered listener, so the developer's onUserJwtInvalidated may be invoked twice in rapid succession. Second, the replay runs synchronously on the caller's thread rather than through jwtInvalidatedAppCallbackScope.launch (Dispatchers.Default), violating the async delivery contract documented on fireJwtInvalidated and missing the runCatching protection present on the normal path. Fix: schedule the pending replay via jwtInvalidatedAppCallbackScope.launch { runCatching { ... } } inside the synchronized block, matching the documented asynchronous contract.


What the bugs are and how they manifest

addJwtInvalidatedListener (UserManager.kt lines 61–71) has two related defects introduced in the pending-replay logic.

Bug 1 — double-delivery race: The listener is subscribed to jwtInvalidatedNotifier inside jwtInvalidatedLock (which makes hasSubscribers == true visible to other threads immediately). The lock is then released, and only after that release does the method call listener.onUserJwtInvalidated(UserJwtInvalidatedEvent(it)) directly. In the window between lock release and that direct call, a concurrent thread running fireJwtInvalidated(externalId) can acquire the lock, observe hasSubscribers == true, and schedule a coroutine (via jwtInvalidatedAppCallbackScope.launch) that also fires to every subscriber. When that coroutine executes it calls the same listener for the same (or coincident) externalId, resulting in two onUserJwtInvalidated invocations in quick succession.

Bug 2 — synchronous replay on caller thread: The fireJwtInvalidated method is documented as delivering events "asynchronously so the caller … is not blocked by developer code" and uses jwtInvalidatedAppCallbackScope.launch throughout. The replay path at lines 68–70 is synchronous and unprotected by runCatching. If a developer registers the listener on the main thread and a pending event exists, the callback fires synchronously on the main thread before addJwtInvalidatedListener returns — and any exception from developer code propagates up to the registration call, unlike the normal path that catches and logs it.

Specific code path that triggers it

Between (A)'s lock release and (B)'s synchronous call, step (C) can observe hasSubscribers == true and schedule an async coroutine. Both (B) and the coroutine then deliver to the same listener.

Addressing the refutation

A refutation argues that the concurrent fireJwtInvalidated in the race window represents a new 401 event rather than a re-delivery of the buffered startup event, so two callbacks are technically correct. This is valid for the case where a genuine second 401 fires while the listener is being registered — those truly are separate events. However, the race is indistinguishable to the developer (both deliveries carry the same externalId), and the probability of a genuine concurrent 401 during the few-instruction window is low enough that in practice developers will treat this as a spurious duplicate. More importantly, both bugs are independently present regardless of this race: Bug 2 (synchronous delivery + missing runCatching) exists even without any concurrent thread.

Why existing code doesn't prevent it

EventProducer.subscribe uses the synchronized list's own internal lock, separate from jwtInvalidatedLock. Once the listener is added inside jwtInvalidatedLock, it becomes visible to jwtInvalidatedNotifier.hasSubscribers on other threads immediately, before the outer lock is released. There is no mechanism to suppress concurrent fireJwtInvalidated calls in the window between the lock release and the synchronous delivery.

Impact

A developer's onUserJwtInvalidated callback called twice causes duplicate token-refresh requests, double-posted analytics, or unexpected state-machine transitions. The synchronous delivery can cause an ANR if the callback does I/O and is called on the main thread. An unguarded developer exception in the replay path crashes or corrupts state at the registration call site instead of being logged and swallowed.

Step-by-step proof (double delivery)

  1. SDK starts; no listeners. fireJwtInvalidated('alice') is called → buffered: pendingJwtInvalidatedExternalId = 'alice'.
  2. Developer calls addJwtInvalidatedListener(listener). Lock acquired; listener subscribed (hasSubscribers=true); pendingExternalId='alice' captured; pending cleared; lock released.
  3. OperationRepo processes a 401 and calls fireJwtInvalidated('alice'). Lock acquired; hasSubscribers=true → launches coroutine to fire 'alice' to all listeners. Lock released.
  4. addJwtInvalidatedListener delivers 'alice' synchronously to listener (step B above).
  5. The coroutine from step 3 executes and fires 'alice' to all listeners, including our listener.
  6. listener.onUserJwtInvalidated('alice') has been called twice.

How to fix

Schedule the replay via the same async path used for normal delivery, inside the lock, before releasing it:
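A self-contained sketch of that approach, with a single-thread ExecutorService standing in for jwtInvalidatedAppCallbackScope and illustrative names throughout (the buffered "alice" event simulates a 401 that arrived before any listener registered):

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

class FixedReplaySketch {
    private val lock = Any()
    private var pendingExternalId: String? = "alice"          // buffered at init
    private val listeners = mutableListOf<(String) -> Unit>()
    private val callbackExecutor = Executors.newSingleThreadExecutor()

    fun addListener(listener: (String) -> Unit) {
        synchronized(lock) {
            listeners.add(listener)
            pendingExternalId?.let { externalId ->
                pendingExternalId = null
                // Scheduled inside the lock: a concurrent fire() acquiring the
                // lock afterwards sees the pending state already cleared, so
                // only its own delivery of any NEW event remains.
                callbackExecutor.execute {
                    runCatching { listener(externalId) }      // guard developer code
                }
            }
        }
    }

    fun shutdownAndAwait() {
        callbackExecutor.shutdown()
        callbackExecutor.awaitTermination(5, TimeUnit.SECONDS)
    }
}

fun main() {
    val sketch = FixedReplaySketch()
    val seen = mutableListOf<String>()
    sketch.addListener { seen.add(it) }   // replay delivered asynchronously
    sketch.shutdownAndAwait()
    println(seen)                         // [alice]
}
```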

Scheduling the coroutine inside the lock ensures that any concurrent fireJwtInvalidated that acquires the lock afterwards will see the pending state already cleared, and its own coroutine will be the only remaining delivery of that event.

Comment on lines 228 to 232

      backendSubscriptions.remove(backendSubscription)
  }

+ if (response.rywData != null) {
+     _consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)

🔴 In LoginUserOperationExecutor.createUser, calling resolveConditionsWithID(IamFetchReadyCondition.ID) resolves ALL pending IamFetchReadyCondition instances regardless of which user they belong to, because every instance shares the same static id string 'IamFetchReadyCondition'. On cold start, when fetchMessagesWhenConditionIsMet (running on Dispatchers.IO) has already registered an IamFetchReadyCondition for the currently-active User B while a persisted User A LoginUserOperation from a prior session is still being processed, User A's createUser completion forcibly resolves User B's condition with null — triggering a premature non-RYW IAM fetch for User B that may show stale in-app messages. Fix: scope the condition id to include the onesignalId (e.g. "IamFetchReadyCondition:$onesignalId") so resolveConditionsWithID only affects the specific user whose operation completed.


What the bug is and how it manifests

IamFetchReadyCondition.ID is the fixed constant string 'IamFetchReadyCondition' (line 17 of IamFetchReadyCondition.kt), and every instance returns this same value via override val id: String get() = ID. ConsistencyManager.resolveConditionsWithID(id) iterates all pending conditions and completes every entry where condition.id == id. Since all IamFetchReadyCondition instances share the same id, a call to resolveConditionsWithID(IamFetchReadyCondition.ID) in LoginUserOperationExecutor.createUser resolves all pending IamFetchReadyCondition entries in the manager — not just the one for the user whose createUser just completed.

The specific code path that triggers it

fetchMessagesWhenConditionIsMet runs via suspendifyOnIO on Dispatchers.IO, completely independent of OperationRepo's single-threaded execution scope. It registers an IamFetchReadyCondition keyed to the current _userManager.onesignalId and then awaits the deferred. This registration happens eagerly on any onSessionStarted, onModelReplaced (config), or subscription-id change event — none of which are gated on OperationRepo's queue state.

Why the refutation does not fully hold

The refutation argues that OperationRepo's sequential, single-threaded execution prevents the cross-user overlap: User A's createUser + resolveConditionsWithID completes before User B's LoginUserOperation even starts, and IamFetchReadyCondition for User B is only registered after User B's createUser updates the onesignalId from local to remote.

This reasoning covers the rapid user switch path where both users' backend IDs are unknown at the time. However, it misses the cold-start scenario: if the app restarts while User B (with a pre-existing backend ID) is active, onModelReplaced fires immediately during startup on the main/IO thread and registers IamFetchReadyCondition(B-backend-id). OperationRepo then loads and processes a persisted LoginUserOperation for User A from a prior session. When User A's createUser completes, resolveConditionsWithID('IamFetchReadyCondition') is called, which matches and forcibly completes User B's deferred with null. The identityModelChangeHandler.onModelUpdated path that registers conditions does NOT cover this case — User B's ID is already a backend ID, so the local→remote transition never fires; only fetchMessagesWhenConditionIsMet via onModelReplaced is responsible.

Step-by-step proof (cold-start scenario)

  1. App cold-starts with User B (backend id b-remote) active. Config model is replaced → InAppMessagesManager.onModelReplaced fires → fetchMessagesWhenConditionIsMet runs on Dispatchers.IO. _userManager.onesignalId = b-remote. Registers IamFetchReadyCondition("b-remote") in ConsistencyManager. indexedTokens is empty so isMet returns false; the deferred is parked awaiting resolution.
  2. OperationRepo loads a persisted LoginUserOperation for User A (never completed before app was killed). It runs on its single-threaded scope and calls createUser.
  3. Backend responds with b-a-remote for User A. resolveConditionsWithID("IamFetchReadyCondition") is called. It iterates all conditions; IamFetchReadyCondition("b-remote").id == "IamFetchReadyCondition" — match. Deferred is completed with null.
  4. User B's coroutine in step 1 unblocks with null rywData and calls fetchMessages(null) — a non-RYW-aware IAM fetch. Since User B's backend state may not have fully settled or RYW tokens are unavailable, stale in-app messages may be returned.
  5. User B will eventually get a correct RYW-aware fetch (after setRywData is called for b-remote), but there is a window where stale IAMs may have already been evaluated and displayed.

Impact

This is a correctness regression introduced by this PR. The consequence is a spurious non-RYW IAM fetch for the currently-active user triggered by a different user's completed login operation. The severity is bounded — the correct fetch follows shortly — but in-app messages shown in the gap may be stale.

How to fix

Scope the condition id to include the onesignalId, e.g. change IamFetchReadyCondition to return "IamFetchReadyCondition:$key" as its id. Then resolveConditionsWithID should be called with "IamFetchReadyCondition:$backendOneSignalId" so only the condition for the completed user is released. Alternatively, add an onesignalId field to the resolve call and filter on both id and a user key.
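The suggested scoping can be sketched in a few lines. The class below is a hypothetical stand-in mirroring only the `id` property of the real IamFetchReadyCondition:

```kotlin
class ScopedIamFetchReadyCondition(private val onesignalId: String) {
    companion object { const val BASE_ID = "IamFetchReadyCondition" }
    // was: every instance returned the shared BASE_ID constant
    val id: String get() = "$BASE_ID:$onesignalId"
}

fun main() {
    val userA = ScopedIamFetchReadyCondition("a-remote")
    val userB = ScopedIamFetchReadyCondition("b-remote")
    // resolveConditionsWithID("IamFetchReadyCondition:a-remote") would now
    // match only User A's condition:
    println(userA.id == "IamFetchReadyCondition:a-remote")  // true
    println(userB.id == "IamFetchReadyCondition:a-remote")  // false
}
```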

nan-li added 4 commits April 16, 2026 01:14

Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded

IdentityVerificationService.onModelReplaced (config HYDRATE) could
fire before OperationRepo.loadSavedOperations() finished, causing
removeOperationsWithoutExternalId() to run against an empty queue.

Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so
the purge waits for the queue to be fully populated, following the
same pattern as RecoverFromDroppedLoginBug.

Commit 2 -- Replay JWT invalidated event to late-registered listeners

fireJwtInvalidated now buffers the externalId under a synchronized
lock when no listeners are subscribed (e.g. during SDK init HYDRATE)
and replays it when the first IUserJwtInvalidatedListener is added.
Clears pending event on user switch (onModelReplaced) to prevent
stale replay.

Commit 3 -- Fix IAM fetch stuck after login with IV enabled

LoginUserOperationExecutor.createUser() was not passing the RYW token
from the backend response to the ConsistencyManager, causing the
IamFetchRywTokenKey condition to never resolve after login. This meant
InAppMessagesManager coroutines awaited forever and IAMs were never
fetched for the logged-in user.

- Add rywData field to CreateUserResponse
- Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse
- Set RYW data in ConsistencyManager after successful createUser
- Fix resolveConditionsWithID to only remove matching conditions
- Wrap resolveConditionsWithID in mutex for thread safety

Commit 4 -- Retry IAM fetch after JWT refresh on 401/403 response

When an IAM fetch returns an unauthorized response (401 or 403), the
SDK now saves the pending fetch state and automatically retries once
the JWT is refreshed for the same user via updateUserJwt. Switching
users clears any stale retry.

- Add IJwtUpdateListener to JwtTokenStore for post-putJwt notification
- InAppMessagesManager subscribes and retries on JWT update
- Reset rate-limiter on 401 so retry is not throttled
- Use @Volatile on cross-thread fields (lastTimeFetchedIAMs,
  pendingJwtRetryExternalId, pendingJwtRetryRywData)
- RYW-aware fetches bypass rate limiter to avoid stale data
- Handle 401 in fetchInAppMessagesWithoutRywToken fallback path
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from a77a264 to f94d780 Compare April 16, 2026 17:21