
fix(nodes): retry daemon status during boot race to kill 10s flicker #100

Open

Nic-dorman wants to merge 2 commits into main from fix/daemon-connect-boot-flicker

fix(nodes): retry daemon status during boot race to kill 10s flicker#100
Nic-dorman wants to merge 2 commits into
mainfrom
fix/daemon-connect-boot-flicker

Conversation

@Nic-dorman (Contributor)

Summary

The "Cannot connect to node daemon" message flashes for ~10 seconds on every cold launch. Root cause: a boot race between the daemon's port-file write and its HTTP-server bind.

Cold-launch timing today

| t (s) | Step |
| --- | --- |
| 0 | app boot, `nodesStore.init()` runs |
| ~0.5 | `ensure_daemon_running` spawns daemon, waits for port file |
| ~1 | port file written → returns URL, but the daemon's HTTP server hasn't bound the port yet |
| ~1 | first `fetchDaemonStatus` → connection refused → throws |
| ~1 | outer catch: `daemonConnected=false`, `initializing=false`, `scheduleReconnect()` (10 s timer) |
| 1–11 | UI: "Cannot connect to node daemon" |
| ~11 | reconnect timer fires, retry succeeds, UI flips to nodes |
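For orientation, a minimal sketch of the pre-fix `init()` shape. Only `init()`, `fetchDaemonStatus`, `daemonConnected`, `initializing`, and `scheduleReconnect` come from this PR; the `DaemonApi` wrapper and the state object below are illustrative assumptions, not the actual store code:

```ts
// Hedged sketch of the pre-fix boot flow, not the real implementation.
interface DaemonApi {
  ensureDaemonRunning(): Promise<string>;        // wraps the ensure_daemon_running command
  fetchDaemonStatus(url: string): Promise<void>; // single status call, no retry
}

interface NodesState {
  initializing: boolean;
  daemonConnected: boolean;
  scheduleReconnect(): void; // existing 10 s reconnect loop
}

async function init(api: DaemonApi, state: NodesState): Promise<void> {
  state.initializing = true;
  try {
    // Resolves as soon as the port file exists; on a cold boot the daemon's
    // HTTP server may not have bound the port yet.
    const url = await api.ensureDaemonRunning();
    await api.fetchDaemonStatus(url); // can be refused during the race window
    state.daemonConnected = true;
  } catch {
    // This outer catch is what flashes "Cannot connect to node daemon"
    // and parks the store on the 10 s reconnect timer.
    state.daemonConnected = false;
    state.scheduleReconnect();
  } finally {
    state.initializing = false;
  }
}
```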

Fix

A new `fetchDaemonStatusWithRetry({ attempts, delayMs })` helper retries for up to ~5 s (20 × 250 ms) inside `init()`. `initializing` stays true through the retries, so the user sees "Starting node daemon..." until the daemon actually serves or the retry budget is exhausted.

The disconnected path and the existing `scheduleReconnect` 10 s loop are untouched — those still kick in for genuine outages.
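A minimal sketch of the helper's shape, assuming it wraps the existing single-shot status call; the retry internals and the explicit `fetchDaemonStatus` parameter below are illustrative, only the name and the `{ attempts, delayMs }` options come from this PR:

```ts
// Hedged sketch, not the actual implementation: retry the single-shot status
// call until it succeeds or the ~5 s budget (20 × 250 ms) runs out.
async function fetchDaemonStatusWithRetry(
  fetchDaemonStatus: () => Promise<void>,
  { attempts = 20, delayMs = 250 }: { attempts?: number; delayMs?: number } = {},
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Succeeds as soon as the daemon has actually bound its HTTP port.
      return await fetchDaemonStatus();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Budget exhausted: rethrow so the existing outage path (daemonConnected=false
  // plus the 10 s scheduleReconnect loop) takes over unchanged.
  throw lastError;
}
```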

Test plan

  • Unit: a new test mocks `status` to fail once then succeed, and asserts that no transient `daemonConnected=false` escapes `init()` and the store ends up connected after 2 status calls (see the sketch after this list)
  • Manual cold launch: kill any `ant.exe` / `ant-gui.exe`, delete `%APPDATA%\ant\daemon.{pid,port}`, run `npm run tauri:dev`. Expect "Starting node daemon..." for ~1 s, then nodes list. No "Cannot connect" flash.
  • Genuine outage: rename the bundled `ant.exe` to break `ensure_daemon_running`. Launch. Expect "Cannot connect to node daemon" + retry-every-10s behavior unchanged.
  • Restart-daemon flow: click "Restart Daemon" in settings. Expect "Restarting node daemon..." (existing) — not affected by this PR.
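For reference, a hedged sketch of the unit test's shape, exercising the retry helper sketched above under vitest fake timers; the actual test drives the Pinia store's `init()`, which isn't reproduced here:

```ts
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';

describe('daemon status retry during the boot race', () => {
  beforeEach(() => {
    vi.clearAllMocks();
    vi.useFakeTimers();
  });
  afterEach(() => {
    vi.useRealTimers();
  });

  it('absorbs one refused call and connects on the second', async () => {
    let calls = 0;
    // Counter-based mock: throw on call 1, succeed on call >= 2.
    const status = vi.fn(async () => {
      calls += 1;
      if (calls === 1) throw new Error('connection refused');
    });

    const done = fetchDaemonStatusWithRetry(status, { attempts: 20, delayMs: 250 });
    await vi.advanceTimersByTimeAsync(250); // one retry delay is enough here
    await done;

    expect(status).toHaveBeenCalledTimes(2); // no stray calls leaked into the count
  });
});
```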

🤖 Generated with Claude Code

Nic-dorman and others added 2 commits May 12, 2026 12:04
Cold-launch flow before this fix:

  t=0     app boot, nodesStore.init() runs
  t=~0.5s ensure_daemon_running spawns daemon, polls for port file
  t=~1s   port file appears, ensure_daemon_running returns URL
          (daemon's HTTP server may not have bound the port yet)
  t=~1s   first fetchDaemonStatus call → connection refused → throws
  t=~1s   outer catch: daemonConnected=false, initializing=false,
          scheduleReconnect (10s timer)
  t=1-11s UI shows "Cannot connect to node daemon" for ~10s
  t=~11s  reconnect timer fires, retry succeeds, UI flips to nodes

The port file is written before the HTTP server binds, so a single
status fetch immediately after `ensure_daemon_running` reliably arrives
before the bind on a cold boot and gets connection-refused.

New behavior: `fetchDaemonStatusWithRetry` retries up to ~5s (20 ×
250ms) inside init() with `initializing=true` throughout. The user
sees "Starting node daemon..." until the daemon actually serves or
the budget is exhausted. The disconnected path and scheduleReconnect
loop are unchanged for genuine outages.

Test added: mocks a first-attempt failure + second-attempt success
under fake timers. Asserts no transient `daemonConnected=false`
escapes init() and the store ends up connected with two status calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original test relied on `.mockRejectedValueOnce → .mockResolvedValueOnce`
to script the daemon race, plus an implicit cross-test fallback for any
extra calls. CI hit 23 status() calls instead of 2 because:

1. `beforeEach` did not clear vue-i18n / daemon-api module mocks, so the
   `.mockResolvedValue` default fallback set by the prior test leaked
   into this one and silently absorbed extra calls.
2. `nodesStore.$reset()` clears state references like `_pollTimer` and
   `_reconnectTimer` but does not call `clearInterval`/`clearTimeout`
   on the underlying JS timers, so a previous test's polling or reconnect
   schedule could fire `fetchNodes`/`fetchDaemonStatus` against this
   test's mocks mid-run.

Fix:
- `beforeEach` now calls `vi.clearAllMocks`, restores real timers, and
  explicitly calls `stopPolling` / `cancelReconnect` on the store.
- The retry test uses a counter-based `mockImplementation` (throws on
  call 1, succeeds on call >= 2) instead of the Once queue, so the
  fixture is self-contained and the call-count assertion is meaningful.
- The retry test cleans up its own polling / reconnect timers before
  flipping back to real timers.

No behaviour change in the store under test — only test hygiene.
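For illustration, a hedged sketch of the `beforeEach` hygiene described above; the store module path, the Pinia setup, and the cross-test store handle are assumptions, only `vi.clearAllMocks`, the real-timer restore, and `stopPolling` / `cancelReconnect` come from the commit message:

```ts
import { beforeEach, vi } from 'vitest';
import { createPinia, setActivePinia } from 'pinia';
import { useNodesStore } from '@/stores/nodes'; // hypothetical module path

// Keep a handle on the store the previous test used so its timers can be
// stopped before they fire against this test's mocks.
let store: ReturnType<typeof useNodesStore> | undefined;

beforeEach(() => {
  // A leftover .mockResolvedValue default from a prior test must not silently
  // absorb this test's extra status() calls.
  vi.clearAllMocks();
  // Undo any fake timers a previous test left installed.
  vi.useRealTimers();

  if (store) {
    // $reset() only clears the _pollTimer/_reconnectTimer references; it does
    // not clearInterval/clearTimeout, so stop the schedules explicitly.
    store.stopPolling();
    store.cancelReconnect();
  }

  setActivePinia(createPinia());
  store = useNodesStore();
});
```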

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
