
fix(nodes): retry daemon status during boot race to kill 10s flicker #100

Open

Nic-dorman wants to merge 2 commits into main from fix/daemon-connect-boot-flicker

fix(nodes): retry daemon status during boot race to kill 10s flicker#100
Nic-dorman wants to merge 2 commits into
mainfrom
fix/daemon-connect-boot-flicker

Conversation

@Nic-dorman (Contributor)

Summary

The "Cannot connect to node daemon" message flashes for ~10 seconds on every cold launch. Root cause: a boot race between the daemon's port-file write and its HTTP-server bind.

Cold-launch timing today

| t (s) | Step |
| --- | --- |
| 0 | app boot, `nodesStore.init()` runs |
| ~0.5 | `ensure_daemon_running` spawns daemon, waits for port file |
| ~1 | port file written → returns URL, but the daemon's HTTP server hasn't bound the port yet |
| ~1 | first `fetchDaemonStatus` → connection refused → throws |
| ~1 | outer catch: `daemonConnected=false`, `initializing=false`, `scheduleReconnect()` (10 s timer) |
| 1–11 | UI: "Cannot connect to node daemon" |
| ~11 | reconnect timer fires, retry succeeds, UI flips to nodes |
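For orientation, a minimal sketch of the pre-fix `init()` shape. Only `init()`, `fetchDaemonStatus`, `daemonConnected`, `initializing`, and `scheduleReconnect` come from this PR; the `DaemonApi` wrapper and the state object below are illustrative assumptions, not the actual store code:

```ts
// Hedged sketch of the pre-fix boot flow, not the real implementation.
interface DaemonApi {
  ensureDaemonRunning(): Promise<string>;        // wraps the ensure_daemon_running command
  fetchDaemonStatus(url: string): Promise<void>; // single status call, no retry
}

interface NodesState {
  initializing: boolean;
  daemonConnected: boolean;
  scheduleReconnect(): void; // existing 10 s reconnect loop
}

async function init(api: DaemonApi, state: NodesState): Promise<void> {
  state.initializing = true;
  try {
    // Resolves as soon as the port file exists; on a cold boot the daemon's
    // HTTP server may not have bound the port yet.
    const url = await api.ensureDaemonRunning();
    await api.fetchDaemonStatus(url); // can be refused during the race window
    state.daemonConnected = true;
  } catch {
    // This outer catch is what flashes "Cannot connect to node daemon"
    // and parks the store on the 10 s reconnect timer.
    state.daemonConnected = false;
    state.scheduleReconnect();
  } finally {
    state.initializing = false;
  }
}
```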

Fix

A new `fetchDaemonStatusWithRetry({ attempts, delayMs })` helper retries for up to ~5 s (20 × 250 ms) inside `init()`. `initializing` stays true through the retries, so the user sees "Starting node daemon..." until the daemon actually serves or the retry budget is exhausted.

The disconnected path and the existing `scheduleReconnect` 10 s loop are untouched — those still kick in for genuine outages.
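A minimal sketch of the helper's shape, assuming it wraps the existing single-shot status call; the retry internals and the explicit `fetchDaemonStatus` parameter below are illustrative, only the name and the `{ attempts, delayMs }` options come from this PR:

```ts
// Hedged sketch, not the actual implementation: retry the single-shot status
// call until it succeeds or the ~5 s budget (20 × 250 ms) runs out.
async function fetchDaemonStatusWithRetry(
  fetchDaemonStatus: () => Promise<void>,
  { attempts = 20, delayMs = 250 }: { attempts?: number; delayMs?: number } = {},
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Succeeds as soon as the daemon has actually bound its HTTP port.
      return await fetchDaemonStatus();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Budget exhausted: rethrow so the existing outage path (daemonConnected=false
  // plus the 10 s scheduleReconnect loop) takes over unchanged.
  throw lastError;
}
```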

Test plan

  • Unit: a new test mocks `status` to fail once then succeed, and asserts that no transient `daemonConnected=false` escapes `init()` and the store ends up connected after 2 status calls (see the sketch after this list)
  • Manual cold launch: kill any `ant.exe` / `ant-gui.exe`, delete `%APPDATA%\ant\daemon.{pid,port}`, run `npm run tauri:dev`. Expect "Starting node daemon..." for ~1 s, then nodes list. No "Cannot connect" flash.
  • Genuine outage: rename the bundled `ant.exe` to break `ensure_daemon_running`. Launch. Expect "Cannot connect to node daemon" + retry-every-10s behavior unchanged.
  • Restart-daemon flow: click "Restart Daemon" in settings. Expect "Restarting node daemon..." (existing) — not affected by this PR.
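For reference, a hedged sketch of the unit test's shape, exercising the retry helper sketched above under vitest fake timers; the actual test drives the Pinia store's `init()`, which isn't reproduced here:

```ts
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';

describe('daemon status retry during the boot race', () => {
  beforeEach(() => {
    vi.clearAllMocks();
    vi.useFakeTimers();
  });
  afterEach(() => {
    vi.useRealTimers();
  });

  it('absorbs one refused call and connects on the second', async () => {
    let calls = 0;
    // Counter-based mock: throw on call 1, succeed on call >= 2.
    const status = vi.fn(async () => {
      calls += 1;
      if (calls === 1) throw new Error('connection refused');
    });

    const done = fetchDaemonStatusWithRetry(status, { attempts: 20, delayMs: 250 });
    await vi.advanceTimersByTimeAsync(250); // one retry delay is enough here
    await done;

    expect(status).toHaveBeenCalledTimes(2); // no stray calls leaked into the count
  });
});
```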

🤖 Generated with Claude Code

Nic-dorman and others added 2 commits May 12, 2026 12:04
Cold-launch flow before this fix:

  t=0     app boot, nodesStore.init() runs
  t=~0.5s ensure_daemon_running spawns daemon, polls for port file
  t=~1s   port file appears, ensure_daemon_running returns URL
          (daemon's HTTP server may not have bound the port yet)
  t=~1s   first fetchDaemonStatus call → connection refused → throws
  t=~1s   outer catch: daemonConnected=false, initializing=false,
          scheduleReconnect (10s timer)
  t=1-11s UI shows "Cannot connect to node daemon" for ~10s
  t=~11s  reconnect timer fires, retry succeeds, UI flips to nodes

The port file is written before the HTTP server binds, so a single
status fetch immediately after `ensure_daemon_running` reliably arrives
before the bind on a cold boot and gets connection-refused.

New behavior: `fetchDaemonStatusWithRetry` retries up to ~5s (20 ×
250ms) inside init() with `initializing=true` throughout. The user
sees "Starting node daemon..." until the daemon actually serves or
the budget is exhausted. The disconnected path and scheduleReconnect
loop are unchanged for genuine outages.

Test added: mocks a first-attempt failure + second-attempt success
under fake timers. Asserts no transient `daemonConnected=false`
escapes init() and the store ends up connected with two status calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original test relied on `.mockRejectedValueOnce → .mockResolvedValueOnce`
to script the daemon race, plus an implicit cross-test fallback for any
extra calls. CI hit 23 status() calls instead of 2 because:

1. `beforeEach` did not clear vue-i18n / daemon-api module mocks, so the
   `.mockResolvedValue` default fallback set by the prior test leaked
   into this one and silently absorbed extra calls.
2. `nodesStore.$reset()` clears state references like `_pollTimer` and
   `_reconnectTimer` but does not call `clearInterval`/`clearTimeout`
   on the underlying JS timers, so a previous test's polling or reconnect
   schedule could fire `fetchNodes`/`fetchDaemonStatus` against this
   test's mocks mid-run.

Fix:
- `beforeEach` now calls `vi.clearAllMocks`, restores real timers, and
  explicitly calls `stopPolling` / `cancelReconnect` on the store.
- The retry test uses a counter-based `mockImplementation` (throws on
  call 1, succeeds on call >= 2) instead of the Once queue, so the
  fixture is self-contained and the call-count assertion is meaningful.
- The retry test cleans up its own polling / reconnect timers before
  flipping back to real timers.

No behaviour change in the store under test — only test hygiene.
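For illustration, a hedged sketch of the `beforeEach` hygiene described above; the store module path, the Pinia setup, and the cross-test store handle are assumptions, only `vi.clearAllMocks`, the real-timer restore, and `stopPolling` / `cancelReconnect` come from the commit message:

```ts
import { beforeEach, vi } from 'vitest';
import { createPinia, setActivePinia } from 'pinia';
import { useNodesStore } from '@/stores/nodes'; // hypothetical module path

// Keep a handle on the store the previous test used so its timers can be
// stopped before they fire against this test's mocks.
let store: ReturnType<typeof useNodesStore> | undefined;

beforeEach(() => {
  // A leftover .mockResolvedValue default from a prior test must not silently
  // absorb this test's extra status() calls.
  vi.clearAllMocks();
  // Undo any fake timers a previous test left installed.
  vi.useRealTimers();

  if (store) {
    // $reset() only clears the _pollTimer/_reconnectTimer references; it does
    // not clearInterval/clearTimeout, so stop the schedules explicitly.
    store.stopPolling();
    store.cancelReconnect();
  }

  setActivePinia(createPinia());
  store = useNodesStore();
});
```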

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
