Skip to content

fix: throttle retriable transport upload failures#770

Open
denischilik wants to merge 3 commits into
mainfrom
fix/retry-transport-detection
Open

fix: throttle retriable transport upload failures#770
denischilik wants to merge 3 commits into
mainfrom
fix/retry-transport-detection

Conversation

@denischilik
Copy link
Copy Markdown
Contributor

@denischilik denischilik commented May 14, 2026

Background

Some transport/network failures can produce excessive retry traffic when uploads cannot reach mParticle endpoints. This update introduces explicit transport-error detection and keeps throttling behavior aligned with existing retry flow.

What Has Changed

  • Added a Swift MPTransportErrorDetector to centralize retriable transport error classification.
  • Applied transport-aware throttling in both message and alias upload retry paths.
  • Kept existing 429/503 handling order while adding transport-failure throttling in the non-success retry branch.
  • Added unit tests for transport error detection and wired them into the iOS test target.

Screenshots/Video

N/A

Checklist

  • I have performed a self-review of my own code.
  • I have made corresponding changes to the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have tested this locally.

Additional Notes

Detect retriable transport errors via a shared Swift detector and apply throttling for message/alias retry paths while preserving existing HTTP retry ordering.
Remove the dedicated transport-throttle helper and handle transport retry throttling inline for message and alias uploads to keep the retry flow straightforward.
@denischilik denischilik requested a review from a team as a code owner May 14, 2026 17:26
@cursor
Copy link
Copy Markdown

cursor Bot commented May 14, 2026

PR Summary

Medium Risk
Touches the upload retry/throttling path for both message and alias requests, so misclassification of errors could delay uploads or increase retry traffic. Scope is localized and covered by new unit tests for the classifier.

Overview
Adds a new Swift MPTransportErrorDetector to centralize detection of retriable transport errors (selected NSURLErrorDomain codes, a legacy no-connection code, and com.mparticle timeout).

Updates MPNetworkCommunication.m so message and alias upload retries also throttle when failures are identified as transport-level retriable errors (in addition to existing 429/503 throttling), reducing excessive retry traffic on network failures. Includes new unit tests for the detector and wires them into the Xcode test target.

Reviewed by Cursor Bugbot for commit 3c29a84. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

📦 SDK Size Impact Report

Measures how much the SDK adds to an app's size (with-SDK minus without-SDK).

Metric Target Branch This PR Change
App Bundle Impact 1.75 MB 1.77 MB +16 KB
Executable Impact 848 bytes 848 bytes +N/A
XCFramework Size 6.42 MB 6.44 MB +16 KB

➡️ SDK size impact change is minimal.

Raw measurements

Target branch (main):

{"baseline_app_size_kb":84,"baseline_executable_size_bytes":75464,"with_sdk_app_size_kb":1880,"with_sdk_executable_size_bytes":76312,"sdk_impact_kb":1796,"sdk_executable_impact_bytes":848,"xcframework_size_kb":6576}

This PR:

{"baseline_app_size_kb":84,"baseline_executable_size_bytes":75464,"with_sdk_app_size_kb":1896,"with_sdk_executable_size_bytes":76312,"sdk_impact_kb":1812,"sdk_executable_impact_bytes":848,"xcframework_size_kb":6592}

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 42b6d59. Configure here.


if error.code == noConnectionErrorCode {
return true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Domain-agnostic error code check misclassifies build failures as transport errors

Medium Severity

The noConnectionErrorCode check (error.code == 1) matches errors from any domain, but the only code-1 errors in the codebase come from MPConnector.m with domain @"MPConnector" — these represent URL request build failures (e.g., nil URL), not transient transport errors. When a URL builder returns nil, this misclassification triggers throttleWithHTTPResponse: with a nil HTTP response, applying a default 2-hour upload throttle for what is a configuration-related failure, not a network issue.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 42b6d59. Configure here.

Keep transport throttling focused on host/connectivity failures by excluding not-connected and timed-out NSURL errors, and update tests to lock in the new retry classification behavior.
Copy link
Copy Markdown
Collaborator

@BrandonStalnaker BrandonStalnaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks solid to me. I'm somewhat worried the change is a bit broad. I'm worried that for a few of these error types we would want to retry sooner than 2 hours from the failure (I think thats the minimum right). I'd like anothers opinion before I'll approve.
NSURLErrorCannotFindHost,
NSURLErrorCannotConnectToHost,
NSURLErrorNetworkConnectionLost,
NSURLErrorDNSLookupFailed,
NSURLErrorCannotLoadFromNetwork,
NSURLErrorSecureConnectionFailed,
NSURLErrorInternationalRoamingOff,
NSURLErrorDataNotAllowed,
NSURLErrorCallIsActive,
NSURLErrorAppTransportSecurityRequiresSecureConnection:

}
}

return error.domain == "com.mparticle" && error.code == 0
Copy link
Copy Markdown
Collaborator

@jamesnrokt jamesnrokt May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should extract these magic strings and number

Edit: Changed from Nit as a simple change in MPConnector.m could could an issue for us but the tests would still pass. We should have this and MPConnector.m share the same constant

if (!isSuccessCode && !isInvalidCode) {
if ([self isRetriableTransportError:error]) {
MPILogWarning(@"Throttling uploads after transport error.");
[self throttleWithHTTPResponse:httpResponse uploadType:MPUploadTypeMessage];
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Brandon. These retriable transport errors shouldn't be treated the same as a 429 or 503.

Those status codes are deliberate server-side backoffs to avoid DDoS'ing ourselves. Transport errors are transient client-side conditions that typically clear in seconds. With this approach, a user on the subway losing signal for 10 seconds would freeze uploads for 2 hours.

Can we use a separate, shorter window for transport errors e.g. 30s with exponential backoff, capped at a few minutes?

}
}

return error.domain == "com.mparticle" && error.code == 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like com.mparticle code 0 is the SDK's own semaphore timeout (MPConnector.m:276/:321), and the test naming confirms it's treated as a timeout. But NSURLErrorTimedOut isn't in the switch above and falls through to false.

May be mistaken but it looks like these are the same condition at different layers, the SDK timeout fires precisely when the URLSession callback didn't return in time. We should treat both the same way: either add NSURLErrorTimedOut to the retriable list, or drop this last return.

Copy link
Copy Markdown
Collaborator

@jamesnrokt jamesnrokt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes based on the above comments.

My core concern: as it stands, any retriable transport error puts the SDK into a 2-hour upload freeze with no recovery hook as minUploadDate is only checked, never cleared early.

Even a brief connectivity drop costs 2 hours of buffered events. Worse, because every client uses the same hardcoded default with no jitter, a brief endpoint blip triggers a synchronized thundering herd of requests hitting our servers 2 hours later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants