[Fix] recover stale external db requests #1428
Conversation
CodeRabbit: No actionable comments were generated in the recent review. 🎉
📝 Walkthrough: Consolidates stale outgoing-request recovery into an exported transactional `recoverStaleOutgoingRequests()`.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Greptile Summary: This PR replaces the old read-only stale-request detection with an active recovery mechanism: stale claimed rows are either reset for retry or deleted when a pending or duplicate sibling exists.
Confidence Score: 5/5 — Safe to merge. Recovery is best-effort and isolated from the main pending-queue processing path. The CTE logic is correct: staleness re-checks in the mutation CTEs guard against concurrent recovery, ROW_NUMBER correctly identifies the oldest stale row to reset among duplicates, and the EXISTS predicate excludes stale siblings from the active sibling count. The try/catch in handleStaleRequests ensures a P2010 unique-constraint race or any other recovery failure returns empty arrays rather than crashing the iteration. Both previously flagged concerns appear addressed in this rebase. No files require special attention.
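For concreteness, here is a hedged sketch of what a single-statement recovery query with those properties could look like, assembled from the behaviors the reviews describe. The table and column names come from snippets in this PR; the `LIMIT`, the `cutoff` parameter, and the exact sibling predicates are assumptions, not the PR's actual query:

```ts
// Illustrative only: classify stale claimed rows, then delete or reset them
// in one statement. `cutoff` marks the staleness threshold (assumed).
type Row = { action: "reset" | "delete", id: string };
const cutoff = new Date(Date.now() - STALE_REQUEST_THRESHOLD_MS);
const rows = await tx.$queryRaw<Row[]>`
  WITH stale AS (
    SELECT
      o."id",
      CASE
        -- A pending or freshly claimed sibling exists: this row is a duplicate.
        WHEN EXISTS (
          SELECT 1 FROM "OutgoingRequest" s2
          WHERE s2."deduplicationKey" = o."deduplicationKey"
            AND s2."id" <> o."id"
            AND (s2."startedFulfillingAt" IS NULL           -- pending sibling
              OR s2."startedFulfillingAt" >= ${cutoff})     -- fresh in-flight claim
        ) THEN 'delete'
        -- Among stale duplicates, keep (reset) only the oldest row.
        WHEN o."deduplicationKey" IS NOT NULL AND ROW_NUMBER() OVER (
          PARTITION BY o."deduplicationKey" ORDER BY o."createdAt", o."id"
        ) > 1 THEN 'delete'
        ELSE 'reset'
      END AS action
    FROM "OutgoingRequest" o
    WHERE o."startedFulfillingAt" < ${cutoff}
    LIMIT 100  -- bounded per call; a backlog drains over multiple iterations
  ),
  deleted AS (
    DELETE FROM "OutgoingRequest" o USING stale s
    WHERE o."id" = s."id" AND s.action = 'delete'
      AND o."startedFulfillingAt" < ${cutoff}  -- re-check: skip reclaimed rows
    RETURNING o."id", 'delete'::text AS action
  ),
  reset AS (
    UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
    FROM stale s
    WHERE o."id" = s."id" AND s.action = 'reset'
      AND o."startedFulfillingAt" < ${cutoff}  -- re-check: skip reclaimed rows
    RETURNING o."id", 'reset'::text AS action
  )
  SELECT "id", action FROM deleted
  UNION ALL
  SELECT "id", action FROM reset
`;
```

Note that `ROW_NUMBER()` is computed over the full stale set before `LIMIT` applies, and the repeated cutoff predicate in the mutation CTEs is what makes a concurrently reclaimed row a no-op.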
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Poller Iteration Start] --> B{pollerEnabled?}
    B -- No --> C[Return: disabled]
    B -- Yes --> D[handleStaleRequests]
    D --> E[recoverStaleOutgoingRequests CTE]
    E --> F{Error?}
    F -- Yes --> G[captureError + return empty arrays]
    F -- No --> H{total > 0?}
    H -- Yes --> I[captureError to Sentry with ID samples]
    H -- No --> J[Return ResetIds + DeletedIds]
    I --> J
    J --> K[claimPendingRequests FOR UPDATE SKIP LOCKED]
    K --> L{directSyncEnabled?}
    L -- Yes --> M[processRequest one-by-one via upstash.publishJSON]
    L -- No --> N{requests.length == 0?}
    N -- Yes --> O[Return 0, skip batch call]
    N -- No --> P[upstash.batchJSON then deleteOutgoingRequests]
```
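A rough TypeScript rendering of the iteration shown in the flowchart. Helper names are taken from the diagram; their signatures, the return shapes, and `toMessage` are assumptions:

```ts
// Sketch of one poller iteration per the flowchart above; all helpers are
// assumed to exist with roughly these shapes.
async function runPollerIteration() {
  if (!pollerEnabled) return { status: "disabled" as const };
  // Best-effort recovery: errors inside are captured and yield empty arrays,
  // so a recovery failure never blocks pending-queue processing.
  const { resetIds, deletedIds } = await handleStaleRequests();
  // Claim with FOR UPDATE SKIP LOCKED so concurrent pollers don't contend.
  const requests = await claimPendingRequests();
  if (directSyncEnabled) {
    for (const request of requests) {
      await processRequest(request); // one-by-one via upstash.publishJSON
    }
  } else if (requests.length === 0) {
    return { processed: 0, resetIds, deletedIds }; // skip the batch call
  } else {
    await upstash.batchJSON(requests.map(toMessage)); // toMessage is hypothetical
    await deleteOutgoingRequests(requests.map((r) => r.id));
  }
  return { processed: requests.length, resetIds, deletedIds };
}
```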
Reviews (3). Last reviewed commit: "fix: recover stale external db requests"
Pull request overview
This PR addresses a reliability issue in the external DB sync poller where failures between “claiming” outgoing requests and deleting them can leave rows permanently stuck as in-flight. It adds a recovery step that resets or deletes stale claimed OutgoingRequest rows so the queue can make forward progress.
Changes:
- Added `recoverStaleOutgoingRequests()` to classify stale claimed rows and either reset them (retry) or delete them (if a pending sibling exists / duplicates).
- Integrated stale-recovery into the external-db-sync poller iteration loop and added telemetry plus bounded ID sampling in error reporting.
- Removed the previous "detect only" stale-query logic from the poller.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| apps/backend/src/lib/external-db-sync-queue.ts | Introduces stale-claim recovery logic for OutgoingRequest, including retry handling for partial-unique-index races. |
| apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts | Calls the new recovery logic each poll iteration and records recovery metrics / diagnostic error payloads. |
Actionable comments posted: 2
🧹 Nitpick comments (1)
apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts (1)
127-152: ⚡ Quick win — Emit one stale-recovery event per poller run instead of per batch.

`recoverStaleOutgoingRequests()` is capped per call, so a backlog can generate several near-identical `captureError(...)` events in one cron execution. Aggregating the counts and sampled IDs across iterations, then reporting once near the final response, will keep the signal useful without flooding error monitoring.
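One way to implement the suggested aggregation, sketched under the assumption that the poller loops over batches within a single run. `ID_SAMPLE_LIMIT`, `STALE_REQUEST_THRESHOLD_MS`, `handleStaleRequests`, `captureError`, and `StackAssertionError` are named in this PR; the loop shape and the constructor signature are assumptions:

```ts
// Accumulate recovery results across the whole poller run, then report once.
let staleTotal = 0;
const resetIdSample: string[] = [];
const deletedIdSample: string[] = [];

while (await shouldContinuePolling()) { // assumed loop condition
  const { resetIds, deletedIds } = await handleStaleRequests();
  staleTotal += resetIds.length + deletedIds.length;
  // Keep the samples bounded so the error payload stays small.
  resetIdSample.push(...resetIds.slice(0, Math.max(0, ID_SAMPLE_LIMIT - resetIdSample.length)));
  deletedIdSample.push(...deletedIds.slice(0, Math.max(0, ID_SAMPLE_LIMIT - deletedIdSample.length)));
  // ... claim and process pending requests ...
}

// One event per run instead of one per batch.
if (staleTotal > 0) {
  captureError("poller-stale-requests", new StackAssertionError( // constructor shape assumed
    `Recovered ${staleTotal} stale outgoing requests (threshold: ${STALE_REQUEST_THRESHOLD_MS}ms)`,
    { staleTotal, resetIdSample, deletedIdSample },
  ));
}
```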
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 78-91: The CASE must also treat rows as duplicates when there
exists any other in-flight sibling for the same deduplicationKey so we don't
reset a stale row while a newer claim is active; update the CASE in the query
that computes action for OutgoingRequest (references o."deduplicationKey",
o."startedFulfillingAt", o."createdAt", o."id") to add an EXISTS clause that
checks for any other row with the same deduplicationKey and startedFulfillingAt
IS NOT NULL (and id != o.id) and return 'delete' in that branch before the
'reset' branch (keep the existing pending-sibling check too and still use
staleThresholdMs for the stale check).
- Around line 117-128: The error-handling branch in recoverStaleOutgoingRequests
is checking for Prisma.PrismaClientKnownRequestError with code "P2002" which
never matches unique-constraint errors thrown by tx.$queryRaw (they surface as
"P2010"); change the check to error.code === "P2010" (keep the existing
PARTIAL_UNIQUE_INDEX_NAME and MAX_PARTIAL_UNIQUE_RETRIES logic intact) so the
retry/StackAssertionError path executes correctly when $queryRaw hits the
partial unique index contention.
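The P2010 fix described above might look like the following sketch. The retry-loop shape and both constants (`PARTIAL_UNIQUE_INDEX_NAME`, `MAX_PARTIAL_UNIQUE_RETRIES`) are taken from the comment, not from verified source:

```ts
import { Prisma } from "@prisma/client";

// Raw-query failures surface as P2010 ("raw query failed"), not P2002, so the
// partial-unique-index race is detected via the code plus the message text.
async function withPartialUniqueRetry<T>(fn: () => Promise<T>): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isIndexRace =
        error instanceof Prisma.PrismaClientKnownRequestError &&
        error.code === "P2010" &&
        error.message.includes(PARTIAL_UNIQUE_INDEX_NAME);
      if (!isIndexRace || attempt >= MAX_PARTIAL_UNIQUE_RETRIES) throw error;
      // Otherwise loop: a concurrent recovery won the race, and retrying is
      // safe because the statement skips already-mutated rows.
    }
  }
}
```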
---
Nitpick comments:
In `@apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts`:
- Around line 127-152: The poller currently calls captureError inside each batch
of recoverStaleOutgoingRequests(), causing many near-duplicate stale-recovery
events; instead aggregate totals and ID samples across all
recoverStaleOutgoingRequests() calls within a single poller run and emit one
captureError at the end of the poller route. Modify the poller/route code that
calls recoverStaleOutgoingRequests() to accumulate total, resetIds and
deletedIds (keep sampling logic with ID_SAMPLE_LIMIT and the notes like
staleResetIdsSampleNote/staleDeletedIdsSampleNote) across iterations, then call
captureError once with the combined counts, combined sampled id slices, and
STALE_REQUEST_THRESHOLD_MS in the message; keep the current StackAssertionError
payload shape and keys so downstream tooling is unchanged.
Force-pushed from adf33e9 to 638f49b.
@greptile review again please. Please check the sql query in detail and consider edge cases
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 101-118: The mutation CTEs deleted and reset currently only join
on id/action and must also enforce the staleness predicate so they become no-ops
if a row was reclaimed before mutation; update the deleted CTE (DELETE FROM
"OutgoingRequest" o USING stale s WHERE o."id" = s."id" AND s.action = 'delete')
and the reset CTE (UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
FROM stale s WHERE o."id" = s."id" AND s.action = 'reset') to also include the
same stale condition used in the planning CTE (e.g., check the stale predicate
fields from stale) so the WHERE clauses guard by both id/action and staleness.
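Sketched concretely, the guard this comment asks for repeats the staleness cutoff inside both mutation CTEs. `${cutoff}` is an assumed timestamp parameter, and the classification logic in `stale` is elided with a placeholder:

```ts
await tx.$queryRaw`
  WITH stale AS (
    SELECT o."id", 'reset' AS action  -- real query classifies reset vs delete
    FROM "OutgoingRequest" o
    WHERE o."startedFulfillingAt" < ${cutoff}
  ),
  deleted AS (
    DELETE FROM "OutgoingRequest" o USING stale s
    WHERE o."id" = s."id" AND s.action = 'delete'
      AND o."startedFulfillingAt" < ${cutoff}  -- no-op if reclaimed meanwhile
    RETURNING o."id"
  ),
  reset AS (
    UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
    FROM stale s
    WHERE o."id" = s."id" AND s.action = 'reset'
      AND o."startedFulfillingAt" < ${cutoff}  -- no-op if reclaimed meanwhile
    RETURNING o."id"
  )
  SELECT
    (SELECT count(*)::int FROM deleted) AS deleted,
    (SELECT count(*)::int FROM reset) AS reset
`;
```

Under READ COMMITTED, the UPDATE/DELETE re-evaluates its WHERE clause against the latest row version after acquiring the lock, so a row reclaimed between planning and mutation fails the cutoff check and is skipped.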
Force-pushed from 255a197 to 720128e.
@greptile review again please. Please check the sql query in detail and consider edge cases
SQL Review:

| Concern | Severity | Status |
|---|---|---|
| Oldest row outside LIMIT deleted before reset | Low | Safe; multi-iteration drain |
| NULL-key stale rows all reset unconditionally | Low | Safe; potential queue flood edge case |
| 23505 swallowed, stale row persists one tick | Low | Acceptable; well-documented |
| `stale` CTE potentially inlined twice | Low | Add MATERIALIZED to force single eval |
| ROW_NUMBER computed pre-LIMIT (correct) | ✅ N/A | Works as intended |
| Mutation CTE re-checks for READ COMMITTED | ✅ N/A | Correct concurrency guard |

The most actionable fix is adding `MATERIALIZED` to the `stale` CTE (Issue 4). The rest are well-handled by the existing design or acceptable trade-offs given the poller's cron cadence.
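The suggested change is a one-word addition (PostgreSQL 12+ syntax), shown here on a trimmed version of the query; in the real statement `stale` is referenced from both mutation CTEs:

```ts
// MATERIALIZED makes the single evaluation of the planning CTE explicit
// rather than relying on planner defaults for multi-referenced CTEs.
await tx.$queryRaw`
  WITH stale AS MATERIALIZED (
    SELECT o."id", 'reset' AS action  -- classification elided
    FROM "OutgoingRequest" o
    WHERE o."startedFulfillingAt" < ${cutoff}
    LIMIT 100
  ),
  reset AS (
    UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
    FROM stale s WHERE o."id" = s."id" AND s.action = 'reset'
    RETURNING o."id"
  )
  SELECT (SELECT count(*)::int FROM reset) AS reset
`;
```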
Force-pushed from 720128e to cd9431a.
Review excerpt:

```ts
type Row = { action: "reset" | "delete", id: string };
const rows = await retryTransaction(globalPrismaClient, async (tx) => {
  return await tx.$queryRaw<Row[]>`
    WITH stale AS (
```
Review excerpt:

```ts
} catch (error) {
  staleSpan.setAttribute("stack.external-db-sync.stale-recovery-error", true);
  captureError("poller-stale-recovery-error", error);
  return { resetIds: [], deletedIds: [] };
```
Failures between claiming outgoing requests and deleting them from the handler can leave requests permanently stale, with no cleanup path. Some of these requests may also have fresh duplicates in the outgoing queue. These requests need to be deleted or retried.
It's important to still log the stale requests to Sentry so the root cause can be investigated.