execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit#21039
Conversation
…oop exit

When the parallel exec loop hits its batch size limit mid-fixture, it returns nil to the apply loop with `reachedMaxBlock=false`. The apply loop's completeness check then ran `applyLoopMissingBlocks` with a fallback clause that flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — turning every legitimate partial-batch exit into a spurious `InvalidBlock` error.

The size-limit path is normal: it returns `ErrLoopExhausted` and the stage loop resumes the next batch from `lastBlockResult+1`. Each block still executes exactly once. The fallback clause was unnecessary and incorrect.

Reproduces deterministically as the github.com/erigontech/erigon/rpc/gasprice `BenchmarkFeeHistory` parallel-mode failure: 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114; the first batch covers blocks 1..114, the second covers 115..200, both clean.

Drops the fallback clause, simplifies `applyLoopMissingBlocks` to a single map comparison, and tightens the message string.

Adds:

* `TestApplyLoopMissingBlocks` "partial batch — size-limit hit" case
* `TestApplyLoopPartialBatchReturnsErrLoopExhausted` exercising the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock)
* `TestApplyLoopChannelCloseOrder` pinning the documented commitResults-then-applyResults shutdown order

Precursor for #21017 (CI matrix `exec_mode={serial,parallel}`); without this, every workflow that runs a parallel-exec test against a chain that exceeds the batch budget fails on the first such size-limit boundary.
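The "single map comparison" shape of the simplified check can be sketched as follows — a self-contained model, not the production code (`txResultBlocks` and `appliedBlocks` are illustrative names, not the real fields):

```go
package main

import (
	"fmt"
	"sort"
)

// missingBlocks models the simplified completeness check: a block is
// "missing" only if it produced tx-results but never had its blockResult
// applied. Blocks the batch never reached (e.g. past a size-limit exit)
// are deliberately NOT flagged — that is the partial-batch fix.
func missingBlocks(txResultBlocks, appliedBlocks map[uint64]struct{}) []uint64 {
	var missing []uint64
	for num := range txResultBlocks {
		if _, ok := appliedBlocks[num]; !ok {
			missing = append(missing, num)
		}
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	return missing
}

func main() {
	// Partial batch: blocks 1..3 executed and applied; blocks 4..maxBlockNum
	// were never dispatched. Nothing is missing — no spurious InvalidBlock.
	executed := map[uint64]struct{}{1: {}, 2: {}, 3: {}}
	applied := map[uint64]struct{}{1: {}, 2: {}, 3: {}}
	fmt.Println(missingBlocks(executed, applied)) // []

	// Genuine gap: block 2 produced tx-results but was never applied.
	delete(applied, 2)
	fmt.Println(missingBlocks(executed, applied)) // [2]
}
```

The removed fallback would have additionally flagged `maxBlockNum` whenever the batch stopped early; the map comparison alone captures every real gap without that false positive.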
Same class of partial-batch orchestration bug as the apply-loop fix in this PR. `executeBlocks` marks the final dispatched block with `Exhausted` when the per-cycle block limit fires, then exits its goroutine — without closing `pe.execRequests`, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for `execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire. The apply loop never got the channel-close signal it needed to return `ErrLoopExhausted`.

Reproduced as a chiado parallel-exec stall at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block; with the exhaust signal honored, the underlying `ErrWrongTrieRoot` now surfaces cleanly through the apply loop.

The new branch sits between the maxBlockNum-reached check and the `StopAfterBlock` debug exit, matching the precedence of the existing partial-batch exits.

## Tests

* `TestExecLoopHonorsBlockResultExhausted` — orchestration test with two models of the exec loop's blockResult decision tree (post-fix and pre-fix). The pre-fix model must hang past the timeout — proving the scaffolding genuinely surfaces the bug rather than passing vacuously.
* `TestExecLoopExhaustedOnlySetOnFinalBlock` — locks in the dispatcher convention (Exhausted only on the trailing block of a partial batch) so future `executeBlocks` changes can't silently drop already-queued work by setting it mid-stream.
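A minimal model of the fixed exit path — the type and field names are illustrative (production carries a richer `blockResult`), but the shape of the bug is the same: without the new branch, the loop parks on a channel that nobody will ever write to or close.

```go
package main

import "fmt"

type blockResult struct {
	BlockNum  uint64
	Exhausted bool // sketch: production carries a richer exhaustion signal
}

// execLoopModel models the fixed branch: when a blockResult carries the
// Exhausted mark, the loop returns instead of parking on channels that
// will never fire. The caller's deferred channel closes then unblock the
// apply loop, which can return ErrLoopExhausted.
func execLoopModel(results <-chan blockResult) (last uint64, exhausted bool) {
	for br := range results {
		last = br.BlockNum
		if br.Exhausted {
			// New branch: honor the trailing partial-batch signal and
			// exit cleanly rather than waiting for more work.
			return last, true
		}
	}
	return last, false
}

func main() {
	results := make(chan blockResult, 3)
	results <- blockResult{BlockNum: 1}
	results <- blockResult{BlockNum: 2}
	results <- blockResult{BlockNum: 3, Exhausted: true}
	// Note: the channel is never closed — pre-fix, this loop would hang
	// after draining the three buffered results.
	last, exhausted := execLoopModel(results)
	fmt.Println(last, exhausted) // 3 true
}
```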
Pull request overview
This PR fixes partial-batch orchestration in the parallel execution path so that normal “stop early and resume next cycle” exits are not misclassified as InvalidBlock, and ensures the exec loop honors an explicit “exhausted” signal to avoid stalling.
Changes:
- Removes the `maxBlockNum` fallback from `applyLoopMissingBlocks`, so only blocks that actually produced tx-results are considered "missing" when the apply loop exits.
- Updates `execLoop` to exit cleanly when `blockResult.Exhausted` is observed, allowing deferred channel closes to unblock the apply loop and propagate `ErrLoopExhausted`.
- Extends robustness tests around partial-batch exits and the exhausted signal (with some tests currently modeling behavior rather than exercising production code).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `execution/stagedsync/exec3_parallel.go` | Adjusts missing-block completeness logic and adds `blockResult.Exhausted` handling in `execLoop` to prevent partial-batch stalls. |
| `execution/stagedsync/exec3_parallel_robustness_test.go` | Updates/expands tests for partial-batch behavior and exec-loop exhausted handling. |
```diff
@@ -352,8 +352,8 @@ func (pe *parallelExecutor) exec(ctx context.Context, execStage *StageState, u U
 // (no blockResult for it) but the channel closed cleanly: that
 // means the exec loop signaled "done" without delivering the
 // final block.
```

```go
// TestApplyLoopChannelCloseOrder exercises the documented invariant
// that the exec loop closes commitResults BEFORE applyResults on
// shutdown. The calculator drains commitResults and signals the apply
// loop via rootResults; if applyResults closes first, the apply loop
// can race with the calculator's final commitment write.
//
// The exec loop's deferred close in execLoop() does the right thing
// (commitResults first), but the close ordering is load-bearing —
// changing it silently corrupts the shutdown sequence. This test
// captures the shape of the deferred close to pin the order down.
func TestApplyLoopChannelCloseOrder(t *testing.T) {
	commitResults := make(chan struct{})
	applyResults := make(chan struct{})

	closeOrder := make(chan string, 2)

	// Mimic execLoop's deferred close: commitResults first, then
	// applyResults.
	closeBoth := func() {
		close(commitResults)
		closeOrder <- "commitResults"
		close(applyResults)
		closeOrder <- "applyResults"
	}

	go closeBoth()

	first := <-closeOrder
	second := <-closeOrder

	if first != "commitResults" {
		t.Fatalf("commitResults must close first; got close order [%s, %s]", first, second)
	}
	if second != "applyResults" {
		t.Fatalf("applyResults must close second; got close order [%s, %s]", first, second)
	}

	// Sanity: both channels actually closed.
	if _, ok := <-commitResults; ok {
		t.Fatal("commitResults not actually closed")
	}
	if _, ok := <-applyResults; ok {
		t.Fatal("applyResults not actually closed")
	}
}
```
```go
// Production decision tree, mirroring exec3_parallel.go's
// if blockResult != nil { ... } block:
//  1. sizeEst > batchLimit → trigger + return
//  2. blockResult.BlockNum >= maxBlockNum → reachedMaxBlock + trigger + return
//  3. blockResult.Exhausted != nil → trigger + return
//  4. dbg.StopAfterBlock → trigger + return
//  5. fall through to next-block scheduling.
withFix := func(b *blockResult, trigger func()) (done bool) {
```

```go
// Inner ctx lets the goroutine cooperate with the outer timeout
// without leaking when the model hangs on <-br. We translate
// "exited via ctx" to result.hung — that is the hang signal.
ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
defer cancel()
```
```go
// TestExecLoopExhaustedOnlySetOnFinalBlock is a guard against future
// regressions in the dispatcher: blockResult.Exhausted being set on a
// non-final block would cause the exec loop to drop later blocks that
// have already been queued. Locks down the precedence: among blocks in
// flight, only the trailing dispatched block's Exhausted is honored —
// any earlier Exhausted signal would short-circuit the batch and lose
// already-scheduled work. This test pins the convention to exec3.go's
// `if exhausted != nil { break }` after dispatch: the break ensures
// no later blockResult is produced once Exhausted is set on a
// dispatched request.
func TestExecLoopExhaustedOnlySetOnFinalBlock(t *testing.T) {
	// Simulate executeBlocks' loop: it sets exhausted, dispatches the
	// blockResult containing it, and breaks out — no further blocks.
	// Verify the convention: count(Exhausted != nil) == 1 always, and
	// it's always the last blockResult.
	type dispatched struct {
		num       uint64
		exhausted bool
	}
	dispatch := func(blocks []dispatched) error {
		var prevExhausted bool
		for _, b := range blocks {
			if prevExhausted {
				return errors.New("dispatched a block after Exhausted was set — exec loop would drop it")
			}
			if b.exhausted {
				prevExhausted = true
			}
		}
		return nil
	}

	t.Run("Exhausted only on last block — valid", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: false},
			{num: 3, exhausted: true},
		})
		if err != nil {
			t.Fatalf("valid case rejected: %v", err)
		}
	})

	t.Run("no Exhausted — valid (full-batch path)", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: false},
		})
		if err != nil {
			t.Fatalf("valid case rejected: %v", err)
		}
	})

	t.Run("Exhausted mid-stream — invalid (regression guard)", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: true},
			{num: 3, exhausted: false},
		})
		if err == nil {
			t.Fatal("dispatch convention violated: blockResult emitted after Exhausted")
		}
	})
}
```
yperbasis left a comment
Major
1. Stale comment at the apply-loop close site (real bug, easy fix)
exec3_parallel.go:351-354 still says:

```go
// Also flag the case where maxBlockNum was never reached at all
// (no blockResult for it) but the channel closed cleanly: that
// means the exec loop signaled "done" without delivering the
// final block.
```

This is exactly the behavior commit 550cd1d574 removed. The function-level comment on `applyLoopMissingBlocks` was correctly updated; the call-site comment was missed. It should be deleted or replaced to explain that not reaching `maxBlockNum` is now a normal partial-batch state.
Minor (optional)
2. All new tests are decision-tree models, not the real execLoop
The PR description acknowledges this — full integration would need workers + calculator + state setup. But the consequence is real: `TestExecLoopHonorsBlockResultExhausted`'s `withFix` model is hand-coded to mirror production precedence. If the production order changes, the test passes vacuously while the model drifts. Worth a comment in the test pointing at the exact production lines being mirrored, so a future change is more likely to update both. Same for `TestApplyLoopPartialBatchReturnsErrLoopExhausted`.
3. Comment slightly imprecise
Lines 901-902:

```go
// (initialCycle gates it on crossing a step boundary;
// later cycles enforce it unconditionally)
```

The initial-cycle path also requires `!dbg.DiscardCommitment()` (exec3.go:675). Minor.
4. Unused field in test result type
`TestApplyLoopPartialBatchReturnsErrLoopExhausted`'s `errSubstring` is set but never asserted on. Trivial cleanup.
@yperbasis CI is green now — ready for re-review. The earlier failure was
- Replace the stale comment at the apply-loop close site (exec3_parallel.go:351-354) that still described the removed maxBlockNum fallback.
- Refine the executeBlocks-Exhausted comment to mention the `!dbg.DiscardCommitment()` initial-cycle gate at exec3.go:675.
- Add a cross-reference comment to TestApplyLoopPartialBatchReturnsErrLoopExhausted pinning it to exec3_parallel.go's apply-loop exit branch around line 355, matching the convention already used in TestExecLoopHonorsBlockResultExhausted.
- Drop the unused errSubstring field from the test's result struct.
- Reduce the TestExecLoopHonorsBlockResultExhausted hang-detection timeout from 1500ms inner / 2000ms outer to 300ms / 500ms — in-memory channel + closure dispatch resolves on a microsecond timescale, so the larger budget was pure test runtime overhead.
@yperbasis pushed 686e11b addressing the review:
The Copilot suggestions about extracting production logic into helpers and unit-testing the helpers (so the tests drive real code rather than a model) are good, but a larger refactor than fits this PR — would prefer to leave them as a follow-up. Lint clean, full
…ction

Addresses Copilot's review feedback on #21039: the original tests modeled the exec-loop and apply-loop decision trees in local closures, which would let the model drift from production silently. This follow-up extracts three pure helpers from the production code and rewires the tests to drive them directly.

Helpers extracted:

* `closeApplyChannels` (pe method) — pins commitResults-then-applyResults close ordering. Returns the close sequence as a string slice so tests can verify the order deterministically without racing on observer-goroutine wakeups (the previous test had a real flake here).
* `execLoopShouldExit` (pure func) — encodes the per-blockResult exit precedence: sizeLimit > maxBlockNum > Exhausted > StopAfterBlock. The production exec loop now switches on the returned decision; tests pin each precedence rule with table-driven cases.
* `shouldMarkExhaustedAtBlock` (pure func) — encodes executeBlocks's per-cycle block-limit gate (initial-cycle protections + blockLimit span check + maxBlockNum guard). Replaces the old "dispatch convention" model test that didn't drive any production code.

Tests updated:

* `TestApplyLoopChannelCloseOrder` — calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out.
* `TestExecLoopShouldExitPriority` — table-driven coverage of the exit-priority decisions, including the cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win).
* `TestShouldMarkExhaustedAtBlock` — table-driven coverage of the initial-cycle gates, the blockLimit==0 disabled case, the span < limit case, and the maxBlockNum exact-hit no-mark case.

No production behaviour change — pure refactor + test rewire.
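The exit-precedence helper described above might be sketched like this — the `shouldExit` signature, parameters, and `exitReason` constants are illustrative stand-ins, not the real `execLoopShouldExit` API:

```go
package main

import "fmt"

type exitReason int

const (
	exitNone exitReason = iota
	exitSizeLimit
	exitMaxBlock
	exitExhausted
	exitStopAfter
)

// shouldExit sketches the extracted precedence helper: the checks run in
// a fixed order, so when several conditions hold at once the
// higher-priority reason wins (e.g. maxBlockNum beats Exhausted on the
// final block, which keeps the clean reachedMaxBlock path).
func shouldExit(sizeEst, batchLimit, blockNum, maxBlockNum uint64, exhausted, stopAfter bool) exitReason {
	switch {
	case sizeEst > batchLimit:
		return exitSizeLimit
	case blockNum >= maxBlockNum:
		return exitMaxBlock
	case exhausted:
		return exitExhausted
	case stopAfter:
		return exitStopAfter
	}
	return exitNone
}

func main() {
	// maxBlockNum AND Exhausted both set on the final block: maxBlockNum wins.
	fmt.Println(shouldExit(0, 1000, 200, 200, true, false) == exitMaxBlock) // true
	// Exhausted alone on a mid-chain block: the exhausted branch fires.
	fmt.Println(shouldExit(0, 1000, 150, 200, true, false) == exitExhausted) // true
}
```

A pure function like this is what makes the table-driven precedence tests possible: each case is a single call with explicit inputs, and the production loop switches on the same result.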
Follow-up #21046 opens against this branch — extracts
…ction (#21046)

Follow-up to #21039 addressing Copilot's review feedback that the new robustness tests were modelling the exec-loop and apply-loop decision trees in local closures — which would let the model drift from production silently. This PR extracts three pure helpers from the production code and rewires the tests to drive them directly. **No production behaviour change** — the helpers wrap the exact same logic that was inline before.

## Helpers extracted

| Helper | Production call site | What it pins |
|---|---|---|
| `closeApplyChannels` (pe method) | `execLoop` deferred close | commitResults-then-applyResults order |
| `execLoopShouldExit` (pure) | `execLoop` per-blockResult branch | sizeLimit > maxBlockNum > Exhausted > StopAfterBlock precedence |
| `shouldMarkExhaustedAtBlock` (pure) | `executeBlocks` per-cycle limit gate | initial-cycle gates + blockLimit + maxBlockNum guard |

`closeApplyChannels` returns the actual close order as a string slice — the previous test was racy because observer goroutines could wake up out of order even when the helper closed the channels in the right order; the slice removes that race entirely.

## Tests rewired

* **`TestApplyLoopChannelCloseOrder`** — now calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out.
* **`TestCloseApplyChannelsDoubleCloseRecovers`** — verifies the recover-on-double-close path doesn't count pre-closed channels as freshly closed.
* **`TestExecLoopShouldExitPriority`** — table-driven coverage of each branch, including cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win for the clean reachedMaxBlock path).
* **`TestShouldMarkExhaustedAtBlock`** — table-driven coverage of the initial-cycle gates (`lastExecutedStep`/`lastFrozenStep`/`DiscardCommitment`), the `blockLimit==0` disabled case, the span < limit case, and the `maxBlockNum` exact-hit no-mark case.

The previous `TestExecLoopHonorsBlockResultExhausted` and `TestExecLoopExhaustedOnlySetOnFinalBlock` tests are removed — both are subsumed by the new `TestExecLoopShouldExitPriority` and `TestShouldMarkExhaustedAtBlock` cases that drive production code.

## Test plan

- [x] All four new tests pass: `TestApplyLoopChannelCloseOrder`, `TestCloseApplyChannelsDoubleCloseRecovers`, `TestExecLoopShouldExitPriority`, `TestShouldMarkExhaustedAtBlock`
- [x] Existing tests in the same file (`TestApplyLoopMissingBlocks`, `TestExecLoopExitCheck`, `TestApplyLoopRootResultsCloseDoesNotRace`, etc.) still pass
- [x] `make lint` clean
- [x] No production behaviour change (`go build ./...` clean)

## Stack

Stacked on #21039 (the apply-loop fix). Both should land together — this PR alone doesn't make sense without #21039's helper-using tests as context.
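The close-order-as-slice idea can be illustrated with a standalone sketch (the real `closeApplyChannels` is a `parallelExecutor` method with recovery handling; this free function only shows the deterministic-ordering technique):

```go
package main

import "fmt"

// closeApplyChannels sketches the follow-up's helper: close commitResults
// before applyResults and report the order as a slice, so a test can
// assert on the returned value deterministically instead of racing
// observer goroutines against the closes.
func closeApplyChannels(commitResults, applyResults chan struct{}) []string {
	var order []string
	if commitResults != nil {
		close(commitResults)
		order = append(order, "commitResults")
	}
	if applyResults != nil {
		close(applyResults)
		order = append(order, "applyResults")
	}
	return order
}

func main() {
	order := closeApplyChannels(make(chan struct{}), make(chan struct{}))
	fmt.Println(order) // [commitResults applyResults]
}
```

Because the slice is built by the closer itself, a reordering bug shows up directly in the returned value — no scheduler-dependent observation window.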
…op-partial-batch-no-spurious-flag
Re-pushed with main merged in (
The merge with main + re-run should clear both. If
Filed #21062 to fix the underlying data race surfaced on this PR's race-tests run ( |
Blocked by #21062 — that PR fixes the underlying data race in the aggregator. Once #21062 merges to main and we rebase, this PR's CI clears. I deliberately did not fold #21062's commit into this branch — that would muddy two unrelated PRs (apply-loop fix vs aggregator race fix) and force a single review/revert decision on both. Reviewing them separately preserves cleaner attribution and rollback boundaries.
## Summary

Drops the fallback clause in `applyLoopMissingBlocks` that turned every legitimate partial-batch exit into a spurious `InvalidBlock` error.

When the parallel exec loop hits its size limit mid-fixture, it returns nil with `reachedMaxBlock=false`. The previous completeness check flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — but the size-limit path is normal: the apply loop returns `ErrLoopExhausted`, the stage loop resumes the next batch from `lastBlockResult+1`, and each block still executes exactly once. No re-execution.

### Update — additional fix in commit 310a094514 (execution/stagedsync: honor blockResult.Exhausted in exec loop)

Same class of partial-batch orchestration bug, on the exec-loop side. `executeBlocks` marks the final dispatched block with `Exhausted` when the per-cycle block limit fires, then exits its goroutine — without closing `pe.execRequests`, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for `execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire. The apply loop never got the channel-close signal it needs to return `ErrLoopExhausted`, so the original (apply-loop-side) fix in this PR couldn't even run.

Reproduced as a chiado parallel-exec silent stall at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying `ErrWrongTrieRoot` now surfaces cleanly through the apply loop and the original-issue symptom appears as expected.

The new branch sits between the `maxBlockNum`-reached check and the `StopAfterBlock` debug exit, matching the precedence of the existing partial-batch exits.

Tests:

* `TestExecLoopHonorsBlockResultExhausted` runs two models of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proving the test scaffolding genuinely surfaces the bug rather than passing vacuously.
* `TestExecLoopExhaustedOnlySetOnFinalBlock` locks in the dispatcher convention so future `executeBlocks` changes can't silently drop already-queued work by setting `Exhausted` mid-stream.

## Why
Reproduces deterministically as the parallel-mode `BenchmarkFeeHistory` failure on PR #21017 (the CI matrix that runs every test under both `serial` and `parallel` exec modes): 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms:

* the first batch exits with `ErrLoopExhausted` at the size-limit boundary
* the second batch exits with `reachedMaxBlock=true` and a clean nil
* `Sd.mem` is not contaminated by in-flight future-block writes — `txResultBlocks` at exit was exactly `[1..114]`. Block 200 was flagged solely by the fallback clause.

## Shutdown sequence verified
Tracing the deferred close in `execLoop`: `commitResults` closes first, the calculator drains and closes `rootResults`, the apply loop sees both closes, drains, runs the completeness check, and returns `ErrLoopExhausted`. `executorCancel` then cancels `execLoopCtx`, `executeBlocks` and the workers exit, and `pe.wait` joins. Three batches in `BenchmarkFeeHistory` run end-to-end with no goroutine leaks.

## Tests
* `TestApplyLoopMissingBlocks` — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case
* `TestApplyLoopPartialBatchReturnsErrLoopExhausted` — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios
* `TestApplyLoopChannelCloseOrder` — pins the load-bearing `commitResults`-before-`applyResults` close order
* `TestExecLoopHonorsBlockResultExhausted` — orchestration test for the exec-loop side of the partial-batch exit (added in commit 310a094514)
* `TestExecLoopExhaustedOnlySetOnFinalBlock` — pins the `executeBlocks` dispatcher convention (added in commit 310a094514)

## Test plan
- [x] `BenchmarkFeeHistory/full/blocks=200` passes with `ERIGON_EXEC3_PARALLEL=true`
- [x] `make lint` clean
- [x] `go test -race ./execution/stagedsync/...` for the new tests passes
- [x] `TestApplyLoopRootResultsCloseDoesNotRace`, `TestApplyLoopDoesNotHangAfterRootResultsClose`, `TestExecLoopExitCheck*` still pass
- [x] `EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive` from block 0 reaches the underlying `ErrWrongTrieRoot` at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires)

## CI fix linkage
Precursor for #21017 (CI matrix `exec_mode={serial,parallel}`). Without this, `bench / benchmarks (parallel)` and any other workflow exercising a chain large enough to cross the batch-size boundary fails on the first crossing. Companion to #21032 (incarnation/SD differentiator fix).