execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit#21039

Open
mh0lt wants to merge 6 commits into main from fix/parallel-apply-loop-partial-batch-no-spurious-flag
Conversation


@mh0lt mh0lt commented May 7, 2026

Summary

Drops the fallback clause in applyLoopMissingBlocks that turned every legitimate partial-batch exit into a spurious InvalidBlock error.

When the parallel exec loop hits its size-limit mid-fixture, it returns nil with reachedMaxBlock=false. The previous completeness check flagged maxBlockNum as "missing" whenever !reachedMaxBlock — but the size-limit path is normal: the apply loop returns ErrLoopExhausted, the stage loop resumes the next batch from lastBlockResult+1, and each block still executes exactly once. No re-execution.

Update — additional fix in commit 310a094514 (execution/stagedsync: honor blockResult.Exhausted in exec loop)

Same class of partial-batch orchestration bug, on the exec-loop side. executeBlocks marks the final dispatched block with Exhausted when the per-cycle block limit fires, then exits its goroutine — without closing pe.execRequests, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for execRequests / rws.ResultCh / ctx.Done — none of which ever fire. The apply loop never got the channel-close signal it needs to return ErrLoopExhausted, so the original (apply-loop-side) fix in this PR couldn't even run.

Reproduced as a chiado parallel-exec silent stall at block 150662 with EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000 from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying ErrWrongTrieRoot now surfaces cleanly through the apply loop and the original-issue symptom appears as expected.

The new branch sits between the maxBlockNum-reached check and the StopAfterBlock debug exit, matching the precedence of the existing partial-batch exits.

Tests: TestExecLoopHonorsBlockResultExhausted runs two models of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously. TestExecLoopExhaustedOnlySetOnFinalBlock locks in the dispatcher convention so future executeBlocks changes can't silently drop already-queued work by setting Exhausted mid-stream.

Why

Reproduces deterministically as the parallel-mode BenchmarkFeeHistory failure on PR #21017 (the CI matrix that runs every test under both serial and parallel exec modes):

apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200]

200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms:

  • Batch 1: blocks 1..114, size-limit fires, ErrLoopExhausted
  • Batch 2: blocks 115..200, reachedMaxBlock=true, clean nil

Sd.mem is not contaminated by in-flight future-block writes — txResultBlocks at exit was exactly [1..114]. Block 200 was flagged solely from the fallback clause.

Shutdown sequence verified

Tracing the deferred-close in execLoop: commitResults closes first, calculator drains and closes rootResults, apply loop sees both closes, drains, runs the completeness check, returns ErrLoopExhausted. executorCancel then cancels execLoopCtx, executeBlocks and workers exit, pe.wait joins. Three batches in BenchmarkFeeHistory run end-to-end with no goroutine leaks.

Tests

  • TestApplyLoopMissingBlocks — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case
  • TestApplyLoopPartialBatchReturnsErrLoopExhausted — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios
  • TestApplyLoopChannelCloseOrder — pins the load-bearing commitResults-before-applyResults close order
  • TestExecLoopHonorsBlockResultExhausted — orchestration test for the exec-loop side of the partial-batch exit (added in commit 310a094514)
  • TestExecLoopExhaustedOnlySetOnFinalBlock — pins the executeBlocks dispatcher convention (added in commit 310a094514)

Test plan

  • BenchmarkFeeHistory/full/blocks=200 passes with ERIGON_EXEC3_PARALLEL=true
  • make lint clean
  • go test -race ./execution/stagedsync/... for new tests passes
  • Existing TestApplyLoopRootResultsCloseDoesNotRace, TestApplyLoopDoesNotHangAfterRootResultsClose, TestExecLoopExitCheck* still pass
  • Chiado EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive from block 0 reaches the underlying ErrWrongTrieRoot at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires)

CI fix linkage

Precursor for #21017 (CI matrix exec_mode={serial,parallel}). Without this, bench / benchmarks (parallel) and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to #21032 (incarnation/SD differentiator fix).

…oop exit

When the parallel exec loop hits its batch size limit mid-fixture, it
returns nil to the apply loop with reachedMaxBlock=false. The apply
loop's completeness check then ran applyLoopMissingBlocks with a
fallback clause that flagged maxBlockNum as "missing" whenever
!reachedMaxBlock — turning every legitimate partial-batch exit into a
spurious InvalidBlock error.

The size-limit path is normal: it returns ErrLoopExhausted and the
stage loop resumes the next batch from lastBlockResult+1. Each block
still executes exactly once. The fallback clause was unnecessary and
incorrect.

Reproduces deterministically as the
github.com/erigontech/erigon/rpc/gasprice BenchmarkFeeHistory
parallel-mode failure: 200 blocks × 100 simple transfers exceed the
test fixture's 5MB batch budget at block 114; first batch covers
blocks 1..114, second batch covers 115..200, both clean.

Drops the fallback clause, simplifies applyLoopMissingBlocks to a
single map-comparison, and tightens the message string. Adds:

* TestApplyLoopMissingBlocks "partial batch — size-limit hit" case
* TestApplyLoopPartialBatchReturnsErrLoopExhausted exercising the
  apply-loop exit decision tree (exhausted vs nil vs InvalidBlock)
* TestApplyLoopChannelCloseOrder pinning the documented
  commitResults-then-applyResults shutdown order

Precursor for #21017 (CI matrix exec_mode={serial,parallel}); without
this, every workflow that runs a parallel-exec test against a chain
that exceeds the batch budget fails on the first such size-limit boundary.

@yperbasis yperbasis left a comment


CI is red

Same class of partial-batch orchestration bug as the apply-loop
fix in this PR. executeBlocks marks the final dispatched block
with Exhausted when the per-cycle block limit fires, then exits
its goroutine — without closing pe.execRequests, without
cancelling ctx. The exec loop had no branch to react to that
signal: after the last blockResult was applied, the main select
parked waiting for execRequests / rws.ResultCh / ctx.Done — none
of which ever fire. The apply loop never got the channel-close
signal it needed to return ErrLoopExhausted.

Reproduced as a chiado parallel-exec stall at block 150662 with
`EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from
block 0 (issue #20711). The hang was masking a wrong-trie-root
divergence at the same block; with the exhaust signal honored
the underlying ErrWrongTrieRoot now surfaces cleanly through
the apply loop.

The new branch sits between the maxBlockNum-reached check and
the StopAfterBlock debug exit, matching the precedence of the
existing partial-batch exits.

## Tests

* TestExecLoopHonorsBlockResultExhausted — orchestration test
  with two models of the exec loop's blockResult decision tree
  (post-fix and pre-fix). The pre-fix model must hang past the
  timeout — proves the scaffolding genuinely surfaces the bug
  rather than passing vacuously.
* TestExecLoopExhaustedOnlySetOnFinalBlock — locks in the
  dispatcher convention (Exhausted only on the trailing block
  of a partial batch) so future executeBlocks changes can't
  silently drop already-queued work by setting it mid-stream.

mh0lt commented May 7, 2026

Linked as a precursor for #21017 alongside #21032 (SD/incarnation differentiator) and #21010 (asOfReader.txNum lazy-load fix on snapshot-loaded chains).

Copilot AI left a comment


Pull request overview

This PR fixes partial-batch orchestration in the parallel execution path so that normal “stop early and resume next cycle” exits are not misclassified as InvalidBlock, and ensures the exec loop honors an explicit “exhausted” signal to avoid stalling.

Changes:

  • Removes the maxBlockNum fallback from applyLoopMissingBlocks, so only blocks that actually produced tx-results are considered “missing” when the apply loop exits.
  • Updates execLoop to exit cleanly when blockResult.Exhausted is observed, allowing deferred channel closes to unblock the apply loop and propagate ErrLoopExhausted.
  • Extends robustness tests around partial-batch exits and the exhausted signal (with some tests currently modeling behavior rather than exercising production code).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| execution/stagedsync/exec3_parallel.go | Adjusts missing-block completeness logic and adds blockResult.Exhausted handling in execLoop to prevent partial-batch stalls. |
| execution/stagedsync/exec3_parallel_robustness_test.go | Updates/expands tests for partial-batch behavior and exec-loop exhausted handling. |


Comment thread execution/stagedsync/exec3_parallel.go Outdated
Comment on lines 351 to 354
```diff
@@ -352,8 +352,8 @@ func (pe *parallelExecutor) exec(ctx context.Context, execStage *StageState, u U
 // (no blockResult for it) but the channel closed cleanly: that
 // means the exec loop signaled "done" without delivering the
 // final block.
```
Comment on lines +436 to +479
```go
// TestApplyLoopChannelCloseOrder exercises the documented invariant
// that the exec loop closes commitResults BEFORE applyResults on
// shutdown. The calculator drains commitResults and signals the apply
// loop via rootResults; if applyResults closes first, the apply loop
// can race with the calculator's final commitment write.
//
// The exec loop's deferred close in execLoop() does the right thing
// (commitResults first), but the close ordering is load-bearing —
// changing it silently corrupts the shutdown sequence. This test
// captures the shape of the deferred close to pin the order down.
func TestApplyLoopChannelCloseOrder(t *testing.T) {
	commitResults := make(chan struct{})
	applyResults := make(chan struct{})

	closeOrder := make(chan string, 2)

	// Mimic execLoop's deferred close: commitResults first, then
	// applyResults.
	closeBoth := func() {
		close(commitResults)
		closeOrder <- "commitResults"
		close(applyResults)
		closeOrder <- "applyResults"
	}

	go closeBoth()

	first := <-closeOrder
	second := <-closeOrder

	if first != "commitResults" {
		t.Fatalf("commitResults must close first; got close order [%s, %s]", first, second)
	}
	if second != "applyResults" {
		t.Fatalf("applyResults must close second; got close order [%s, %s]", first, second)
	}

	// Sanity: both channels actually closed.
	if _, ok := <-commitResults; ok {
		t.Fatal("commitResults not actually closed")
	}
	if _, ok := <-applyResults; ok {
		t.Fatal("applyResults not actually closed")
	}
}
```
Comment on lines +589 to +596
```go
// Production decision tree, mirroring exec3_parallel.go's
// if blockResult != nil { ... } block:
// 1. sizeEst > batchLimit → trigger + return
// 2. blockResult.BlockNum >= maxBlockNum → reachedMaxBlock + trigger + return
// 3. blockResult.Exhausted != nil → trigger + return
// 4. dbg.StopAfterBlock → trigger + return
// 5. fall through to next-block scheduling.
withFix := func(b *blockResult, trigger func()) (done bool) {
```
Comment on lines +553 to +557
```go
// Inner ctx lets the goroutine cooperate with the outer timeout
// without leaking when the model hangs on <-br. We translate
// "exited via ctx" to result.hung — that is the hang signal.
ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
defer cancel()
```
Comment on lines +659 to +721
```go
// TestExecLoopExhaustedOnlySetOnFinalBlock is a guard against future
// regressions in the dispatcher: blockResult.Exhausted being set on a
// non-final block would cause the exec loop to drop later blocks that
// have already been queued. Locks down the precedence: among blocks in
// flight, only the trailing dispatched block's Exhausted is honored —
// any earlier Exhausted signal would short-circuit the batch and lose
// already-scheduled work. This test pins the convention to exec3.go's
// `if exhausted != nil { break }` after dispatch: the break ensures
// no later blockResult is produced once Exhausted is set on a
// dispatched request.
func TestExecLoopExhaustedOnlySetOnFinalBlock(t *testing.T) {
	// Simulate executeBlocks' loop: it sets exhausted, dispatches the
	// blockResult containing it, and breaks out — no further blocks.
	// Verify the convention: count(Exhausted != nil) == 1 always, and
	// it's always the last blockResult.
	type dispatched struct {
		num       uint64
		exhausted bool
	}
	dispatch := func(blocks []dispatched) error {
		var prevExhausted bool
		for _, b := range blocks {
			if prevExhausted {
				return errors.New("dispatched a block after Exhausted was set — exec loop would drop it")
			}
			if b.exhausted {
				prevExhausted = true
			}
		}
		return nil
	}

	t.Run("Exhausted only on last block — valid", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: false},
			{num: 3, exhausted: true},
		})
		if err != nil {
			t.Fatalf("valid case rejected: %v", err)
		}
	})

	t.Run("no Exhausted — valid (full-batch path)", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: false},
		})
		if err != nil {
			t.Fatalf("valid case rejected: %v", err)
		}
	})

	t.Run("Exhausted mid-stream — invalid (regression guard)", func(t *testing.T) {
		err := dispatch([]dispatched{
			{num: 1, exhausted: false},
			{num: 2, exhausted: true},
			{num: 3, exhausted: false},
		})
		if err == nil {
			t.Fatal("dispatch convention violated: blockResult emitted after Exhausted")
		}
	})
}
```

@yperbasis yperbasis left a comment


Major

1. Stale comment at the apply-loop close site (real bug, easy fix)

exec3_parallel.go:351-354 still says:

```go
// Also flag the case where maxBlockNum was never reached at all
// (no blockResult for it) but the channel closed cleanly: that
// means the exec loop signaled "done" without delivering the
// final block.
```

This is exactly the behavior commit 550cd1d574 removed. The function-level comment on applyLoopMissingBlocks was correctly updated; the call-site comment was missed. Should be deleted or replaced to explain that not-reaching-maxBlockNum is now a normal partial-batch state.

Minor (optional)

2. All new tests are decision-tree models, not the real execLoop

The PR description acknowledges this — full integration would need workers + calculator + state setup. But the consequence is real: TestExecLoopHonorsBlockResultExhausted's withFix model is hand-coded to mirror production precedence. If the production order changes, the test passes vacuously while the model drifts. Worth a comment in the test pointing at the exact production lines being mirrored, so a future change is more likely to update both. Same for TestApplyLoopPartialBatchReturnsErrLoopExhausted.

3. Comment slightly imprecise

Lines 901-902:

```
(initialCycle gates it on crossing a step boundary;
 later cycles enforce it unconditionally)
```

The initial-cycle path also requires !dbg.DiscardCommitment() (exec3.go:675). Minor.

4. Unused field in test result type

TestApplyLoopPartialBatchReturnsErrLoopExhausted's errSubstring is set but never asserted on. Trivial cleanup.


mh0lt commented May 7, 2026

@yperbasis CI is green now — ready for re-review. The earlier failure was kurtosis assertoor_regular_test flake (context-canceled in the assertoor task runner, not a test assertion); the re-run passed without any code change.

- Replace stale comment at the apply-loop close site (exec3_parallel.go:351-354)
  that still described the removed maxBlockNum fallback.
- Refine the executeBlocks-Exhausted comment to mention the
  !dbg.DiscardCommitment() initial-cycle gate at exec3.go:675.
- Add a cross-reference comment to TestApplyLoopPartialBatchReturnsErrLoopExhausted
  pinning it to exec3_parallel.go's apply-loop exit branch around line 355,
  matching the convention already used in TestExecLoopHonorsBlockResultExhausted.
- Drop unused errSubstring field from the test's result struct.
- Reduce TestExecLoopHonorsBlockResultExhausted hang-detection timeout from
  1500ms inner / 2000ms outer to 300ms / 500ms — in-memory channel + closure
  dispatch resolves on a microsecond timescale, so the larger budget was
  pure test runtime overhead.

mh0lt commented May 7, 2026

@yperbasis pushed 686e11b addressing the review:

| Item | Fix |
|---|---|
| Major #1 — stale comment at exec3_parallel.go:351-354 describing the removed maxBlockNum fallback | Replaced with a short explanation of why not-reaching-maxBlockNum is now a normal partial-batch state |
| Minor #2 — tests are decision-tree models; risk of model drifting from production | Added a cross-reference comment in TestApplyLoopPartialBatchReturnsErrLoopExhausted pinning the closure to exec3_parallel.go around line 355 (matching the convention TestExecLoopHonorsBlockResultExhausted already used) |
| Minor #3 — initialCycle comment imprecision | Updated to mention the !dbg.DiscardCommitment() gate and reference exec3.go:675 |
| Minor #4 — unused errSubstring field | Dropped |
| Copilot test-runtime concern — 1.5s+2s budget per test | Reduced to 300ms inner / 500ms outer; in-memory channel dispatch resolves on a microsecond timescale, so the larger budget was pure overhead |

The Copilot suggestions about extracting production logic into helpers and unit-testing the helpers (so the tests drive real code rather than a model) are good but a larger refactor than fits this PR — would prefer to leave them as a follow-up.

Lint clean, full go test -race ./execution/stagedsync/ for the robustness tests passes.

…ction

Addresses Copilot's review feedback on #21039: the original tests
modeled the exec-loop and apply-loop decision trees in local closures,
which would let the model drift from production silently. This
follow-up extracts three pure helpers from the production code and
rewires the tests to drive them directly.

Helpers extracted:

* `closeApplyChannels` (pe method) — pins commitResults-then-applyResults
  close ordering. Returns the close sequence as a string slice so tests
  can verify the order deterministically without racing on observer-
  goroutine wakeups (the previous test had a real flake here).

* `execLoopShouldExit` (pure func) — encodes the per-blockResult exit
  precedence: sizeLimit > maxBlockNum > Exhausted > StopAfterBlock.
  Production exec-loop now switches on the returned decision; tests
  pin each precedence rule with table-driven cases.

* `shouldMarkExhaustedAtBlock` (pure func) — encodes executeBlocks's
  per-cycle block-limit gate (initial-cycle protections + blockLimit
  span check + maxBlockNum guard). Replaces the old "dispatch
  convention" model test that didn't drive any production code.

Tests updated:

* `TestApplyLoopChannelCloseOrder` — calls `pe.closeApplyChannels()`
  on a real `parallelExecutor` with mock channels and asserts on the
  returned close-order slice. Also covers double-close recovery and
  the post-close nil-out.
* `TestExecLoopShouldExitPriority` — table-driven coverage of the
  exit-priority decisions including the cross-condition cases (e.g.
  maxBlockNum AND Exhausted both set on the final block — maxBlockNum
  must win).
* `TestShouldMarkExhaustedAtBlock` — table-driven coverage of the
  initial-cycle gates, blockLimit==0 disabled case, span < limit
  case, and the maxBlockNum exact-hit no-mark case.

No production behaviour change — pure refactor + test rewire.

mh0lt commented May 7, 2026

Follow-up #21046 opens against this branch — extracts closeApplyChannels / execLoopShouldExit / shouldMarkExhaustedAtBlock from production and rewires the robustness tests to drive them directly, addressing Copilot's "tests model rather than drive production" feedback. No production behaviour change in the follow-up; once this PR merges I'll rebase #21046 on main.

mh0lt and others added 2 commits May 7, 2026 21:26
…ction (#21046)

Follow-up to #21039 addressing Copilot's review feedback that the new
robustness tests were modelling the exec-loop and apply-loop decision
trees in local closures — which would let the model drift from
production silently.

This PR extracts three pure helpers from the production code and rewires
the tests to drive them directly. **No production behaviour change** —
the helpers wrap the exact same logic that was inline before.

## Helpers extracted

| Helper | Production call site | What it pins |
|---|---|---|
| `closeApplyChannels` (pe method) | `execLoop` deferred close | commitResults-then-applyResults order |
| `execLoopShouldExit` (pure) | `execLoop` per-blockResult branch | sizeLimit > maxBlockNum > Exhausted > StopAfterBlock precedence |
| `shouldMarkExhaustedAtBlock` (pure) | `executeBlocks` per-cycle limit gate | initial-cycle gates + blockLimit + maxBlockNum guard |

`closeApplyChannels` returns the actual close-order as a string slice —
the previous test was racy because observer goroutines could wake up out
of order even when the helper closed the channels in the right order;
the slice removes that race entirely.

## Tests rewired

* **`TestApplyLoopChannelCloseOrder`** — now calls
`pe.closeApplyChannels()` on a real `parallelExecutor` with mock
channels and asserts on the returned close-order slice. Also covers
double-close recovery and the post-close nil-out.
* **`TestCloseApplyChannelsDoubleCloseRecovers`** — verifies the
recover-on-double-close path doesn't count pre-closed channels as
freshly closed.
* **`TestExecLoopShouldExitPriority`** — table-driven coverage of each
branch, including cross-condition cases (e.g. maxBlockNum AND Exhausted
both set on the final block — maxBlockNum must win for the clean
reachedMaxBlock path).
* **`TestShouldMarkExhaustedAtBlock`** — table-driven coverage of the
initial-cycle gates
(`lastExecutedStep`/`lastFrozenStep`/`DiscardCommitment`), the
`blockLimit==0` disabled case, the span < limit case, and the
`maxBlockNum` exact-hit no-mark case.

The previous `TestExecLoopHonorsBlockResultExhausted` and
`TestExecLoopExhaustedOnlySetOnFinalBlock` tests are removed — both
are subsumed by the new `TestExecLoopShouldExitPriority` and
`TestShouldMarkExhaustedAtBlock` cases that drive production code.

## Test plan

- [x] All four new tests pass: `TestApplyLoopChannelCloseOrder`,
`TestCloseApplyChannelsDoubleCloseRecovers`,
`TestExecLoopShouldExitPriority`, `TestShouldMarkExhaustedAtBlock`
- [x] Existing tests in the same file (`TestApplyLoopMissingBlocks`,
`TestExecLoopExitCheck`, `TestApplyLoopRootResultsCloseDoesNotRace`,
etc.) still pass
- [x] `make lint` clean
- [x] No production behaviour change (`go build ./...` clean)

## Stack

Stacked on #21039 (the apply-loop fix). Both should land together — this
PR alone doesn't make sense without #21039's helper-using tests as
context.

mh0lt commented May 8, 2026

Re-pushed with main merged in (494b1446f4). The two CI failures on the previous run are both unrelated to this PR:

  1. race-tests / execution-other — TestHistoryVerification_SimpleBlocks: pre-existing data race in db/state/aggregator.go:623 (SetCompressWorkers) ↔ db/state/inverted_index.go:1003 ((*InvertedIndex).collate()). Stack trace shows the race is between ExecV3 calling agg.PresetChainTipConcurrency() (write) and a background-aggregator collation goroutine reading compressWorkers (read). Neither file is touched by this PR — the 3-dot diff vs main is purely exec3_parallel.go + exec3_parallel_robustness_test.go. Race-tests is also a fresh job since this PR's previous run, so the race may have surfaced from another concurrent change.
  2. mainnet-rpc-integ-tests: ran for 6h before the runner cleaned up orphan processes — clearly an infra timeout / disconnected runner, not a test assertion failure.

The merge with main + re-run should clear both. If TestHistoryVerification_SimpleBlocks keeps racing, that's a separate issue to file against the aggregator code.


mh0lt commented May 8, 2026

Filed #21062 to fix the underlying data race surfaced on this PR's race-tests run (TestHistoryVerification_SimpleBlocks — race between SetCompressWorkers and InvertedIndex.collate's read of CompressorCfg). Verified locally: -race passes with the fix. Once #21062 lands and this branch picks it up via main-merge, the race-tests entry should clear.


mh0lt commented May 8, 2026

Blocked by #21062 — that PR fixes the underlying data race in db/state/aggregator.go / db/state/inverted_index.go that surfaces here as the race-tests / execution-other failure. The race is not introduced by this PR (verified locally: the 3-dot diff vs main only touches exec3*.go, no db/state/ files); it's a pre-existing race that was hidden until race-tests started running on this PR's matrix.

Once #21062 merges to main and we rebase, this PR's CI clears.

I deliberately did not fold #21062's commit into this branch — that'd muddy two unrelated PRs (apply-loop fix vs aggregator race fix) and force a single review/revert decision on both. Reviewing them separately preserves cleaner attribution and rollback boundaries.
