fix: throttle retriable transport upload failures#770
Conversation
Detect retriable transport errors via a shared Swift detector and apply throttling for message/alias retry paths while preserving existing HTTP retry ordering.
Remove the dedicated transport-throttle helper and handle transport retry throttling inline for message and alias uploads to keep the retry flow straightforward.
PR SummaryMedium Risk Overview Updates Reviewed by Cursor Bugbot for commit 3c29a84. Bugbot is set up for automated code reviews on this repo. Configure here. |
📦 SDK Size Impact ReportMeasures how much the SDK adds to an app's size (with-SDK minus without-SDK).
➡️ SDK size impact change is minimal. Raw measurementsTarget branch (main): {"baseline_app_size_kb":84,"baseline_executable_size_bytes":75464,"with_sdk_app_size_kb":1880,"with_sdk_executable_size_bytes":76312,"sdk_impact_kb":1796,"sdk_executable_impact_bytes":848,"xcframework_size_kb":6576}This PR: {"baseline_app_size_kb":84,"baseline_executable_size_bytes":75464,"with_sdk_app_size_kb":1896,"with_sdk_executable_size_bytes":76312,"sdk_impact_kb":1812,"sdk_executable_impact_bytes":848,"xcframework_size_kb":6592} |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 42b6d59. Configure here.
|
|
||
| if error.code == noConnectionErrorCode { | ||
| return true | ||
| } |
There was a problem hiding this comment.
Domain-agnostic error code check misclassifies build failures as transport errors
Medium Severity
The noConnectionErrorCode check (error.code == 1) matches errors from any domain, but the only code-1 errors in the codebase come from MPConnector.m with domain @"MPConnector" — these represent URL request build failures (e.g., nil URL), not transient transport errors. When a URL builder returns nil, this misclassification triggers throttleWithHTTPResponse: with a nil HTTP response, applying a default 2-hour upload throttle for what is a configuration-related failure, not a network issue.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 42b6d59. Configure here.
Keep transport throttling focused on host/connectivity failures by excluding not-connected and timed-out NSURL errors, and update tests to lock in the new retry classification behavior.
BrandonStalnaker
left a comment
There was a problem hiding this comment.
The code looks solid to me. I'm somewhat worried the change is a bit broad. I'm worried that for a few of these error types we would want to retry sooner than 2 hours from the failure (I think thats the minimum right). I'd like anothers opinion before I'll approve.
NSURLErrorCannotFindHost,
NSURLErrorCannotConnectToHost,
NSURLErrorNetworkConnectionLost,
NSURLErrorDNSLookupFailed,
NSURLErrorCannotLoadFromNetwork,
NSURLErrorSecureConnectionFailed,
NSURLErrorInternationalRoamingOff,
NSURLErrorDataNotAllowed,
NSURLErrorCallIsActive,
NSURLErrorAppTransportSecurityRequiresSecureConnection:
| } | ||
| } | ||
|
|
||
| return error.domain == "com.mparticle" && error.code == 0 |
There was a problem hiding this comment.
We should extract these magic strings and number
Edit: Changed from Nit as a simple change in MPConnector.m could could an issue for us but the tests would still pass. We should have this and MPConnector.m share the same constant
| if (!isSuccessCode && !isInvalidCode) { | ||
| if ([self isRetriableTransportError:error]) { | ||
| MPILogWarning(@"Throttling uploads after transport error."); | ||
| [self throttleWithHTTPResponse:httpResponse uploadType:MPUploadTypeMessage]; |
There was a problem hiding this comment.
Agree with Brandon. These retriable transport errors shouldn't be treated the same as a 429 or 503.
Those status codes are deliberate server-side backoffs to avoid DDoS'ing ourselves. Transport errors are transient client-side conditions that typically clear in seconds. With this approach, a user on the subway losing signal for 10 seconds would freeze uploads for 2 hours.
Can we use a separate, shorter window for transport errors e.g. 30s with exponential backoff, capped at a few minutes?
| } | ||
| } | ||
|
|
||
| return error.domain == "com.mparticle" && error.code == 0 |
There was a problem hiding this comment.
It looks like com.mparticle code 0 is the SDK's own semaphore timeout (MPConnector.m:276/:321), and the test naming confirms it's treated as a timeout. But NSURLErrorTimedOut isn't in the switch above and falls through to false.
May be mistaken but it looks like these are the same condition at different layers, the SDK timeout fires precisely when the URLSession callback didn't return in time. We should treat both the same way: either add NSURLErrorTimedOut to the retriable list, or drop this last return.
jamesnrokt
left a comment
There was a problem hiding this comment.
Requesting changes based on the above comments.
My core concern: as it stands, any retriable transport error puts the SDK into a 2-hour upload freeze with no recovery hook as minUploadDate is only checked, never cleared early.
Even a brief connectivity drop costs 2 hours of buffered events. Worse, because every client uses the same hardcoded default with no jitter, a brief endpoint blip triggers a synchronized thundering herd of requests hitting our servers 2 hours later.


Background
Some transport/network failures can produce excessive retry traffic when uploads cannot reach mParticle endpoints. This update introduces explicit transport-error detection and keeps throttling behavior aligned with existing retry flow.
What Has Changed
MPTransportErrorDetectorto centralize retriable transport error classification.Screenshots/Video
N/A
Checklist
Additional Notes