fix(jwt): race conditions and IAM reliability issues with identity verification enabled #2613
nan-li wants to merge 4 commits into feat/identity_verification_5.8
Conversation
This PR touches JWT retry logic and IAM fetch pipeline with multiple concurrency concerns — a human review is warranted before merging.
Extended reasoning...
Overview
The PR modifies 17 files across JWT token management (JwtTokenStore, UserManager), the in-app message fetch pipeline (InAppMessagesManager, InAppBackendService), the consistency manager, and the login executor. It introduces a new IJwtUpdateListener pub/sub mechanism, a pending-retry pattern for 401 IAM fetches, and plugs the ConsistencyManager's missing mutex in resolveConditionsWithID.
Security risks
JWT handling and authentication gating are directly affected. The new retry path (storing externalId + rywData and re-triggering fetchMessages on JWT refresh) must be correct to avoid fetching IAMs with stale or wrong credentials. The non-atomic @volatile pair identified by the bug hunter could, under a concurrent user-switch, cause fetchMessages to run with a mismatched externalId/rywData combination.
Level of scrutiny
High. This is production-critical code: incorrect JWT retry behavior can silently degrade IAM delivery or, in edge cases, fetch IAMs for the wrong user context. The changes are logically sound in the common case but have race conditions in corner cases (concurrent user switch during 401 handling, and putJwt called with unchanged value triggering a spurious retry).
Other factors
The PR description template was never filled out — the required One Line Summary, Motivation, and Scope sections remain as placeholder text, making it difficult to understand intent without reading the code. The subscribe/unsubscribe lifecycle for IJwtUpdateListener is asymmetric (InAppMessagesManager subscribes in start() but never unsubscribes). These factors, combined with the two identified bugs, make human review necessary.
b4b0da9 to a77a264
@claude review
// lambda so that it is updated on each potential retry
val sessionDurationProvider = { _time.currentTimeMillis - _sessionService.startTime }
val newMessages =
    try {
        _backend.listInAppMessages(appId, aliasLabel, aliasValue, subscriptionId, rywData, sessionDurationProvider, jwt)
    } catch (ex: BackendException) {
        if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
            Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId")
            lastTimeFetchedIAMs = null
            pendingJwtRetryRywData = rywData
            pendingJwtRetryExternalId = externalId
        }
        null
    }

if (newMessages != null) {
    this.messages = newMessages as MutableList<InAppMessage>
🔴 In the 401 exception handler in fetchMessages, lastTimeFetchedIAMs = null is written outside fetchIAMMutex, while the corresponding rate-limit read-and-write (lastTimeFetchedIAMs = now) runs inside the mutex. A concurrent fetchMessages call can acquire the mutex after the null write, set a fresh timestamp, and cause the JWT retry path (via onJwtUpdated -> fetchMessages(null rywData)) to be silently throttled by the rate limiter — permanently dropping the retry since pendingJwtRetryExternalId was already cleared. Fix: move lastTimeFetchedIAMs = null inside fetchIAMMutex.withLock {}.
Extended reasoning...
What the bug is and how it manifests
In InAppMessagesManager.fetchMessages, the rate-limiter field lastTimeFetchedIAMs is protected by fetchIAMMutex for the read-then-write in the rate-limit check (the withLock block around lines 326–334 in the diff). However, in the 401/403 exception handler added by this PR, the reset lastTimeFetchedIAMs = null is performed outside the mutex (in the catch (ex: BackendException) block, ~line 352).
The specific code path that triggers it
Several paths can concurrently invoke fetchMessages via suspendifyOnIO: session start, subscription change, config model replacement, and identity model update. When one call gets a 401 and executes the catch block, it writes null to lastTimeFetchedIAMs without holding the mutex. A second concurrently-executing call can acquire the mutex in the window between the two writes and update lastTimeFetchedIAMs to a fresh now value.
Why existing code does not prevent it
@Volatile guarantees that each individual read/write to lastTimeFetchedIAMs is immediately visible across threads, but it provides no atomicity for the sequence: (null write outside mutex) → (mutex acquired) → (now write inside mutex). The mutex only guards the read-check-and-set inside withLock; it does not prevent non-mutex code from interleaving.
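To make the gap concrete, here is a stripped-down Kotlin sketch of the two code paths (field and mutex names from the diff; the 30-second interval and surrounding structure are simplifications, not the SDK's actual implementation). Nothing stops the reset in on401() from being immediately overwritten by a rateLimitedFetch that acquires the mutex afterwards.

```kotlin
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Simplified model of the race. @Volatile makes each write visible,
// but only the rate-limit check-and-set below holds the mutex.
@Volatile var lastTimeFetchedIAMs: Long? = null
val fetchIAMMutex = Mutex()

suspend fun rateLimitedFetch(now: Long): Boolean =
    fetchIAMMutex.withLock {
        val last = lastTimeFetchedIAMs
        if (last != null && now - last < 30_000) return@withLock false // throttled
        lastTimeFetchedIAMs = now // the T_B write in the step-by-step proof
        true
    }

// The 401 handler as written in the PR: the reset runs OUTSIDE the mutex,
// so a concurrent rateLimitedFetch can interleave after it and restore a
// fresh timestamp, leaving the JWT-refresh retry throttled.
fun on401() {
    lastTimeFetchedIAMs = null
}
```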
Step-by-step proof
1. Coroutine A (rywData == null, triggered by onSessionStarted) acquires fetchIAMMutex, writes lastTimeFetchedIAMs = T_A, releases the mutex, then calls the backend.
2. The backend returns 401. Coroutine A's catch block writes lastTimeFetchedIAMs = null without the mutex and sets pendingJwtRetryExternalId = externalId.
3. Concurrently, Coroutine B (rywData != null, triggered by a subscription change or session event) acquires fetchIAMMutex after A's null write. It sees lastTimeFetchedIAMs == null, skips the rate-limit check (condition is false), writes lastTimeFetchedIAMs = T_B (a recent timestamp), and exits the mutex.
4. The developer calls updateUserJwt, which fires onJwtUpdated -> fetchMessages(retryRywData = null). pendingJwtRetryExternalId is cleared before fetchMessages is called.
5. Inside fetchMessages(null), the rate limiter evaluates rywData == null && lastTimeFetchedIAMs == T_B (not null) && (now - T_B) < fetchIAMMinInterval → returns early, silently dropping the retry.
6. Because pendingJwtRetryExternalId was already cleared in step 4, no future JWT update will re-trigger the retry. The IAM fetch is permanently lost until an unrelated event (e.g., a new session) fires fetchMessages again.
Why the impact is real
The scenario at login is realistic: an anonymous-user IAM fetch (null rywData) may get 401 during startup, while the login flow concurrently triggers a non-null rywData fetch (e.g., via onModelUpdated for the new OneSignal ID). The default fetchIAMMinInterval is 30 seconds, so the retry delay is significant. Multiple verifiers independently confirmed the race is exploitable in normal login flows with identity verification enabled.
How to fix
Move lastTimeFetchedIAMs = null inside fetchIAMMutex.withLock {} so the reset is serialized with the rate-limit check:
} catch (ex: BackendException) {
    if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
        Logging.debug("...")
        fetchIAMMutex.withLock { lastTimeFetchedIAMs = null } // serialized reset
        pendingJwtRetryRywData = rywData
        pendingJwtRetryExternalId = externalId
    }
    null
}

This ensures that any concurrent fetchMessages call that acquires the mutex will see null (or its own subsequent write), not a stale now written after the 401 catch cleared it.
fun addJwtInvalidatedListener(listener: IUserJwtInvalidatedListener) {
    val pendingExternalId: String?
    synchronized(jwtInvalidatedLock) {
        jwtInvalidatedNotifier.subscribe(listener)
        pendingExternalId = pendingJwtInvalidatedExternalId
        pendingJwtInvalidatedExternalId = null
    }
    pendingExternalId?.let {
        listener.onUserJwtInvalidated(UserJwtInvalidatedEvent(it))
    }
}
🔴 In addJwtInvalidatedListener, the listener is subscribed (making hasSubscribers=true) inside jwtInvalidatedLock, but the pending-replay delivery happens synchronously outside the lock, creating two problems. First, a concurrent fireJwtInvalidated call in that narrow window will also fire to the newly-registered listener, so the developer's onUserJwtInvalidated may be invoked twice in rapid succession. Second, the replay runs synchronously on the caller's thread rather than through jwtInvalidatedAppCallbackScope.launch (Dispatchers.Default), violating the async delivery contract documented on fireJwtInvalidated and missing the runCatching protection present on the normal path. Fix: schedule the pending replay via jwtInvalidatedAppCallbackScope.launch { runCatching { ... } } inside the synchronized block, matching the documented asynchronous contract.
Extended reasoning...
What the bugs are and how they manifest
addJwtInvalidatedListener (UserManager.kt lines 61–71) has two related defects introduced in the pending-replay logic.
Bug 1 — double-delivery race: The listener is subscribed to jwtInvalidatedNotifier inside jwtInvalidatedLock (which makes hasSubscribers == true visible to other threads immediately). The lock is then released, and only after that release does the method call listener.onUserJwtInvalidated(UserJwtInvalidatedEvent(it)) directly. In the window between lock release and that direct call, a concurrent thread running fireJwtInvalidated(externalId) can acquire the lock, observe hasSubscribers == true, and schedule a coroutine (via jwtInvalidatedAppCallbackScope.launch) that also fires to every subscriber. When that coroutine executes it calls the same listener for the same (or coincident) externalId, resulting in two onUserJwtInvalidated invocations in quick succession.
Bug 2 — synchronous replay on caller thread: The fireJwtInvalidated method is documented as delivering events "asynchronously so the caller … is not blocked by developer code" and uses jwtInvalidatedAppCallbackScope.launch throughout. The replay path at lines 68–70 is synchronous and unprotected by runCatching. If a developer registers the listener on the main thread and a pending event exists, the callback fires synchronously on the main thread before addJwtInvalidatedListener returns — and any exception from developer code propagates up to the registration call, unlike the normal path that catches and logs it.
Specific code path that triggers it
Between (A)'s lock release and (B)'s synchronous call, step (C) can observe hasSubscribers == true and schedule an async coroutine. Both (B) and the coroutine then deliver to the same listener.
Addressing the refutation
A refutation argues that the concurrent fireJwtInvalidated in the race window represents a new 401 event rather than a re-delivery of the buffered startup event, so two callbacks are technically correct. This is valid for the case where a genuine second 401 fires while the listener is being registered — those truly are separate events. However, the race is indistinguishable to the developer (both deliveries carry the same externalId), and the probability of a genuine concurrent 401 during the few-instruction window is low enough that in practice developers will treat this as a spurious duplicate. More importantly, both bugs are independently present regardless of this race: Bug 2 (synchronous delivery + missing runCatching) exists even without any concurrent thread.
Why existing code doesn't prevent it
EventProducer.subscribe uses the synchronized list's own internal lock, separate from jwtInvalidatedLock. Once the listener is added inside jwtInvalidatedLock, it becomes visible to jwtInvalidatedNotifier.hasSubscribers on other threads immediately, before the outer lock is released. There is no mechanism to suppress concurrent fireJwtInvalidated calls in the window between the lock release and the synchronous delivery.
Impact
A developer's onUserJwtInvalidated callback called twice causes duplicate token-refresh requests, double-posted analytics, or unexpected state-machine transitions. The synchronous delivery can cause an ANR if the callback does I/O and is called on the main thread. An unguarded developer exception in the replay path crashes or corrupts state at the registration call site instead of being logged and swallowed.
Step-by-step proof (double delivery)
1. SDK starts; no listeners. fireJwtInvalidated('alice') is called → buffered: pendingJwtInvalidatedExternalId = 'alice'.
2. Developer calls addJwtInvalidatedListener(listener). Lock acquired; listener subscribed (hasSubscribers = true); pendingExternalId = 'alice' captured; pending cleared; lock released.
3. OperationRepo processes a 401 and calls fireJwtInvalidated('alice'). Lock acquired; hasSubscribers = true → launches a coroutine to fire 'alice' to all listeners. Lock released.
4. addJwtInvalidatedListener delivers 'alice' synchronously to the listener (step B above).
5. The coroutine from step 3 executes and fires 'alice' to all listeners, including ours.
6. listener.onUserJwtInvalidated('alice') has been called twice.
How to fix
Schedule the replay via the same async path used for normal delivery, inside the lock, before releasing it:
Scheduling the coroutine inside the lock ensures that any concurrent fireJwtInvalidated that acquires the lock afterwards will see the pending state already cleared, and its own coroutine will be the only remaining delivery of that event.
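The fix the review describes could look like the following sketch (identifiers taken from the diff above; the error-log message is illustrative):

```kotlin
fun addJwtInvalidatedListener(listener: IUserJwtInvalidatedListener) {
    synchronized(jwtInvalidatedLock) {
        jwtInvalidatedNotifier.subscribe(listener)
        val pending = pendingJwtInvalidatedExternalId ?: return
        pendingJwtInvalidatedExternalId = null
        // Same async path as fireJwtInvalidated: developer code runs off the
        // caller's thread, and its exceptions are caught and logged.
        jwtInvalidatedAppCallbackScope.launch {
            runCatching {
                listener.onUserJwtInvalidated(UserJwtInvalidatedEvent(pending))
            }.onFailure { Logging.error("onUserJwtInvalidated threw: $it") }
        }
    }
}
```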
    backendSubscriptions.remove(backendSubscription)
}

if (response.rywData != null) {
    _consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
🔴 In LoginUserOperationExecutor.createUser, calling resolveConditionsWithID(IamFetchReadyCondition.ID) resolves ALL pending IamFetchReadyCondition instances regardless of which user they belong to, because every instance shares the same static id string 'IamFetchReadyCondition'. On cold start, when fetchMessagesWhenConditionIsMet (running on Dispatchers.IO) has already registered an IamFetchReadyCondition for the currently-active User B while a persisted User A LoginUserOperation from a prior session is still being processed, User A's createUser completion forcibly resolves User B's condition with null — triggering a premature non-RYW IAM fetch for User B that may show stale in-app messages. Fix: scope the condition id to include the onesignalId (e.g. "IamFetchReadyCondition:$onesignalId") so resolveConditionsWithID only affects the specific user whose operation completed.
Extended reasoning...
What the bug is and how it manifests
IamFetchReadyCondition.ID is the fixed constant string 'IamFetchReadyCondition' (line 17 of IamFetchReadyCondition.kt), and every instance returns this same value via override val id: String get() = ID. ConsistencyManager.resolveConditionsWithID(id) iterates all pending conditions and completes every entry where condition.id == id. Since all IamFetchReadyCondition instances share the same id, a call to resolveConditionsWithID(IamFetchReadyCondition.ID) in LoginUserOperationExecutor.createUser resolves all pending IamFetchReadyCondition entries in the manager — not just the one for the user whose createUser just completed.
The specific code path that triggers it
fetchMessagesWhenConditionIsMet runs via suspendifyOnIO on Dispatchers.IO, completely independent of OperationRepo's single-threaded execution scope. It registers an IamFetchReadyCondition keyed to the current _userManager.onesignalId and then awaits the deferred. This registration happens eagerly on any onSessionStarted, onModelReplaced (config), or subscription-id change event — none of which are gated on OperationRepo's queue state.
Why the refutation does not fully hold
The refutation argues that OperationRepo's sequential, single-threaded execution prevents the cross-user overlap: User A's createUser + resolveConditionsWithID completes before User B's LoginUserOperation even starts, and IamFetchReadyCondition for User B is only registered after User B's createUser updates the onesignalId from local to remote.
This reasoning covers the rapid user switch path where both users' backend IDs are unknown at the time. However, it misses the cold-start scenario: if the app restarts while User B (with a pre-existing backend ID) is active, onModelReplaced fires immediately during startup on the main/IO thread and registers IamFetchReadyCondition(B-backend-id). OperationRepo then loads and processes a persisted LoginUserOperation for User A from a prior session. When User A's createUser completes, resolveConditionsWithID('IamFetchReadyCondition') is called, which matches and forcibly completes User B's deferred with null. The identityModelChangeHandler.onModelUpdated path that registers conditions does NOT cover this case — User B's ID is already a backend ID, so the local→remote transition never fires; only fetchMessagesWhenConditionIsMet via onModelReplaced is responsible.
Step-by-step proof (cold-start scenario)
1. App cold-starts with User B (backend id b-remote) active. The config model is replaced → InAppMessagesManager.onModelReplaced fires → fetchMessagesWhenConditionIsMet runs on Dispatchers.IO. _userManager.onesignalId = b-remote. Registers IamFetchReadyCondition("b-remote") in ConsistencyManager. indexedTokens is empty so isMet returns false; the deferred is parked awaiting resolution.
2. OperationRepo loads a persisted LoginUserOperation for User A (never completed before the app was killed). It runs on its single-threaded scope and calls createUser.
3. The backend responds with b-a-remote for User A. resolveConditionsWithID("IamFetchReadyCondition") is called. It iterates all conditions; IamFetchReadyCondition("b-remote").id == "IamFetchReadyCondition" — match. The deferred is completed with null.
4. User B's coroutine in step 1 unblocks with null rywData and calls fetchMessages(null) — a non-RYW-aware IAM fetch. Since User B's backend state may not have fully settled or RYW tokens are unavailable, stale in-app messages may be returned.
5. User B will eventually get a correct RYW-aware fetch (after setRywData is called for b-remote), but there is a window where stale IAMs may already have been evaluated and displayed.
Impact
This is a correctness regression introduced by this PR. The consequence is a spurious non-RYW IAM fetch for the currently-active user triggered by a different user's completed login operation. The severity is bounded — the correct fetch follows shortly — but in-app messages shown in the gap may be stale.
How to fix
Scope the condition id to include the onesignalId, e.g. change IamFetchReadyCondition to return "IamFetchReadyCondition:$key" as its id. Then resolveConditionsWithID should be called with "IamFetchReadyCondition:$backendOneSignalId" so only the condition for the completed user is released. Alternatively, add an onesignalId field to the resolve call and filter on both id and a user key.
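A sketch of the scoped-id variant (assuming IamFetchReadyCondition already takes the user's onesignalId as its constructor key; scopedId is a hypothetical helper, and ICondition stands in for the SDK's condition interface):

```kotlin
class IamFetchReadyCondition(private val key: String) : ICondition {
    companion object {
        const val ID = "IamFetchReadyCondition"
        // Hypothetical helper so registration and resolution build the same id.
        fun scopedId(onesignalId: String) = "$ID:$onesignalId"
    }

    override val id: String
        get() = scopedId(key)

    // isMet / getRywData unchanged ...
}

// In LoginUserOperationExecutor.createUser, release only this user's condition:
_consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.scopedId(backendOneSignalId))
```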
IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue. Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so the purge waits for the queue to be fully populated, following the same pattern as RecoverFromDroppedLoginBug.
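The Commit 1 pattern could look like this sketch (the handler signature, ConfigModel, and ModelChangeTags.HYDRATE are assumed names; suspendifyOnIO, awaitInitialized, and removeOperationsWithoutExternalId follow the PR's description):

```kotlin
override fun onModelReplaced(model: ConfigModel, tag: String) {
    if (tag != ModelChangeTags.HYDRATE) return
    // Defer the purge to IO and wait for the persisted queue to load,
    // mirroring RecoverFromDroppedLoginBug.
    suspendifyOnIO {
        _operationRepo.awaitInitialized()
        removeOperationsWithoutExternalId()
    }
}
```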
fireJwtInvalidated now buffers the externalId under a synchronized lock when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added. Clears pending event on user switch (onModelReplaced) to prevent stale replay.
LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing the IamFetchRywTokenKey condition to never resolve after login. This meant InAppMessagesManager coroutines awaited forever and IAMs were never fetched for the logged-in user.
- Add rywData field to CreateUserResponse
- Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse
- Set RYW data in ConsistencyManager after successful createUser
- Fix resolveConditionsWithID to only remove matching conditions
- Wrap resolveConditionsWithID in mutex for thread safety

When an IAM fetch returns an unauthorized response (401 or 403), the SDK now saves the pending fetch state and automatically retries once the JWT is refreshed for the same user via updateUserJwt. Switching users clears any stale retry.
- Add IJwtUpdateListener to JwtTokenStore for post-putJwt notification
- InAppMessagesManager subscribes and retries on JWT update
- Reset rate-limiter on 401 so retry is not throttled
- Use @Volatile on cross-thread fields (lastTimeFetchedIAMs, pendingJwtRetryExternalId, pendingJwtRetryRywData)
- RYW-aware fetches bypass rate limiter to avoid stale data
- Handle 401 in fetchInAppMessagesWithoutRywToken fallback path
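The retry wiring from this commit can be sketched as follows (IJwtUpdateListener and the @Volatile field names appear in the PR; the callback signature and the fetchMessages parameter are assumptions for illustration):

```kotlin
interface IJwtUpdateListener {
    fun onJwtUpdated(externalId: String, jwt: String?)
}

class InAppMessagesManager(
    private val _jwtTokenStore: JwtTokenStore,
) : IJwtUpdateListener {
    @Volatile private var pendingJwtRetryExternalId: String? = null
    @Volatile private var pendingJwtRetryRywData: String? = null

    fun start() {
        _jwtTokenStore.subscribe(this) // note: never unsubscribed (see review)
    }

    override fun onJwtUpdated(externalId: String, jwt: String?) {
        // Retry only when the refreshed JWT belongs to the user whose fetch got a 401.
        if (externalId != pendingJwtRetryExternalId) return
        val rywData = pendingJwtRetryRywData
        pendingJwtRetryExternalId = null
        pendingJwtRetryRywData = null
        suspendifyOnIO { fetchMessages(rywData) }
    }

    private suspend fun fetchMessages(rywData: String?) { /* ... */ }
}
```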
a77a264 to f94d780
Description
One Line Summary
Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.
Details
Motivation
With identity verification enabled, several issues surfaced during testing:
- onUserJwtInvalidated listener may not be called when JWT was missing at SDK init time

Scope
All changes are gated behind identity verification being enabled (useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.
Supersedes #2609 — squashed into clean commits after rebasing on the updated base branch.
Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded
- IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue
- Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized(), following the same pattern as RecoverFromDroppedLoginBug

Commit 2 -- Replay JWT invalidated event to late-registered listeners
- fireJwtInvalidated now buffers the externalId under a synchronized lock when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added
- Clears the pending event on user switch (onModelReplaced) to prevent stale replay

Commit 3 -- Fix IAM fetch stuck after login with IV enabled
- LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing IamFetchReadyCondition to not resolve
- Added a rywData field to CreateUserResponse, parsed ryw_token/ryw_delay in JSONConverter, and set RYW data in ConsistencyManager after successful createUser
- Fixed resolveConditionsWithID to only remove matching conditions (was removing all)
- Wrapped resolveConditionsWithID in a mutex for thread safety

Commit 4 -- Retry IAM fetch after JWT refresh on 401/403 response
- InAppBackendService now throws BackendException on unauthorized (401/403) responses instead of returning null
- InAppMessagesManager catches the exception, stores pending retry state in @Volatile fields, and retries the fetch when JwtTokenStore notifies via IJwtUpdateListener that the JWT has been refreshed for the same user
- Handle 401 in the fetchInAppMessagesWithoutRywToken fallback path

Testing
Unit testing
- Updated LoginUserOperationExecutorTests to include the new IConsistencyManager dependency
- Added UserManagerTests for JWT invalidated replay behavior
- InAppBackendServiceTests: 401 response throws BackendException
- InAppBackendServiceTests: 401 from fallback (no RYW token) path throws BackendException
- Added InAppMessagesManagerTests covering the JWT 401 retry flow

Manual testing
- Verified that the pending IAM fetch is retried after updateUserJwt is called

Affected code checklist
Checklist
- Overview
- Testing
- Final pass