fix: orchestration monitor checks events.jsonl for lazy placeholder workers (fixes #400) #823

Open
github-actions[bot] wants to merge 1 commit into main from fix/issue-400-orchestration-relaunch


github-actions[bot] commented May 1, 2026

Problem

When a multi-agent worker runs relaunch.sh (or the app is rebuilt or relaunched for any other reason), MonitorAndSynthesizeAsync could immediately see the workers as idle and abort the orchestration, even though they were still actively processing on the CLI server.

Root Cause

Race condition between two threads during app relaunch recovery:

  1. Restore thread (RestoreSessionsInBackgroundAsync): Creates lazy placeholder SessionState objects with IsProcessing = false, then sets IsProcessing = true via InvokeOnUI(), which is asynchronous (the update is queued on the UI thread rather than applied immediately).

  2. Monitor thread (MonitorAndSynthesizeAsync): Starts on Task.Run and immediately checks state.Info.IsProcessing. Because the queued InvokeOnUI callbacks have not run yet, it sees false for every worker → treats them as idle → collects empty/stale results → aborts the orchestration.

The orchestrator never receives the completion signal, and the reflection loop hangs forever.
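The ordering problem can be reproduced deterministically in a few lines. This is a Python sketch, not the C# code from this PR: `UiDispatcher`, `invoke_on_ui`, `pump` and `restore_placeholder` are hypothetical stand-ins that model InvokeOnUI as a queue whose callbacks run only when the UI thread gets around to them.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class SessionInfo:
    is_processing: bool = False


class UiDispatcher:
    """Hypothetical stand-in for a UI thread: callbacks are queued, not run."""

    def __init__(self) -> None:
        self._queue: Deque[Callable[[], None]] = deque()

    def invoke_on_ui(self, callback: Callable[[], None]) -> None:
        # Like an async UI dispatch: returns before the callback has run.
        self._queue.append(callback)

    def pump(self) -> None:
        # Drain queued callbacks, as the UI thread eventually would.
        while self._queue:
            self._queue.popleft()()


def restore_placeholder(ui: UiDispatcher) -> SessionInfo:
    # The placeholder starts idle; "processing" is only set via the UI queue.
    info = SessionInfo(is_processing=False)

    def mark_processing() -> None:
        info.is_processing = True

    ui.invoke_on_ui(mark_processing)
    return info


ui = UiDispatcher()
worker = restore_placeholder(ui)

seen_by_monitor = worker.is_processing  # monitor polls before the queue drains
ui.pump()
seen_after_pump = worker.is_processing

print(seen_by_monitor, seen_after_pump)  # False True
```

Any reader of the flag between `invoke_on_ui` and `pump` observes the stale `False`, which is exactly the window the monitor thread falls into.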

Fix

1. IsWorkerIdleForMonitor(): new internal method

Checks worker idle state with a fallback for lazy placeholders:

  • If IsProcessing == true → not idle (normal path)
  • If Session == null (lazy placeholder) AND IsSessionStillProcessing(sessionId) returns true (events.jsonl shows active tool execution) → not idle (fixes the race)
  • Otherwise → idle
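The three rules above amount to a small decision function. This is a minimal Python sketch assuming a simplified state model: `WorkerInfo`, `WorkerState`, the `sessions` dictionary and the injected `is_session_still_processing` callable are hypothetical stand-ins for the C# SessionState and IsSessionStillProcessing, not the PR's actual signatures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class WorkerInfo:
    is_processing: bool = False
    session_id: Optional[str] = None


@dataclass
class WorkerState:
    info: WorkerInfo
    session: Optional[object] = None  # None models a lazy placeholder


def is_worker_idle_for_monitor(
    sessions: Dict[str, WorkerState],
    name: str,
    is_session_still_processing: Callable[[str], bool],
) -> bool:
    state = sessions.get(name)
    if state is None:
        return True  # session not found -> idle
    if state.info.is_processing:
        return False  # normal path: flag already set
    if state.session is None and state.info.session_id:
        # Lazy placeholder: IsProcessing may still be queued on the UI thread,
        # so fall back to the on-disk activity check.
        return not is_session_still_processing(state.info.session_id)
    return True  # connected-but-idle, or placeholder without a session ID


active_sessions = {"abc"}


def check(session_id: str) -> bool:
    # Hypothetical stand-in for the events.jsonl check.
    return session_id in active_sessions


sessions = {
    "placeholder": WorkerState(WorkerInfo(is_processing=False, session_id="abc")),
    "connected": WorkerState(WorkerInfo(is_processing=False, session_id="xyz"), session=object()),
}
print(is_worker_idle_for_monitor(sessions, "placeholder", check))  # False
print(is_worker_idle_for_monitor(sessions, "connected", check))    # True
```

Injecting the on-disk check as a callable keeps the decision logic free of filesystem access, so each scenario in the Testing list can be exercised in isolation.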

IsWorkerIdleForMonitor() bypasses the InvokeOnUI race by checking the source of truth (events.jsonl on disk) directly.
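The PR does not describe the events.jsonl format, so the following is only a sketch of what an on-disk activity check could look like. It assumes, hypothetically, one JSON object per line with `type` and ISO 8601 `timestamp` fields (carrying a UTC offset), treats a trailing `tool_start` event as in-flight work, and treats anything older than a made-up five-minute window as stale. The event name, field names and window are placeholders, not the real schema.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

# Hypothetical values for illustration; not the real events.jsonl schema.
ACTIVE_EVENT_TYPES = {"tool_start"}
STALE_AFTER = timedelta(minutes=5)


def is_session_still_processing(events_path: Path, now: Optional[datetime] = None) -> bool:
    """Return True if the last logged event looks like recent, in-flight work."""
    if not events_path.exists():
        return False  # no events -> idle
    last_event = None
    with events_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                last_event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partially written or corrupt line
    if last_event is None:
        return False  # empty log -> idle
    if last_event.get("type") not in ACTIVE_EVENT_TYPES:
        return False  # last event is not an in-flight tool call
    timestamp = datetime.fromisoformat(last_event["timestamp"])  # must carry a UTC offset
    now = now or datetime.now(timezone.utc)
    return now - timestamp <= STALE_AFTER  # stale activity -> idle
```

Passing `now` explicitly makes the staleness rule deterministic to test; in normal use it defaults to the current UTC time.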

2. Restore completion wait

MonitorAndSynthesizeAsync now waits for IsRestoring to become false before polling, plus a 2-second settle time for queued InvokeOnUI callbacks to execute.
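The restore wait can be sketched as a poll on the restoring flag followed by a fixed settle delay. This Python version assumes an `is_restoring` callable and a 100 ms poll interval, both hypothetical; only the 2-second settle time comes from the PR. Like the description, it sets no upper bound on the wait.

```python
import asyncio
from typing import Callable

SETTLE_SECONDS = 2.0  # settle time stated in the PR
POLL_INTERVAL = 0.1   # hypothetical poll interval


async def wait_for_restore(
    is_restoring: Callable[[], bool],
    poll_interval: float = POLL_INTERVAL,
    settle_seconds: float = SETTLE_SECONDS,
) -> None:
    # Poll until the background restore reports completion.
    while is_restoring():
        await asyncio.sleep(poll_interval)
    # Give callbacks already queued on the UI thread time to run
    # before the first idle check.
    await asyncio.sleep(settle_seconds)


async def main() -> None:
    state = {"restoring": True}
    # Simulate the restore finishing 50 ms from now.
    asyncio.get_running_loop().call_later(0.05, state.update, {"restoring": False})
    await wait_for_restore(lambda: state["restoring"], poll_interval=0.01, settle_seconds=0.01)
    print("restored:", not state["restoring"])  # prints "restored: True"


asyncio.run(main())
```

A fixed settle delay is a heuristic: it makes the race much less likely but does not guarantee ordering, which is presumably why it is paired with the events.jsonl fallback rather than used on its own.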

Files Changed

  • CopilotService.Organization.cs: added IsWorkerIdleForMonitor() and updated the MonitorAndSynthesizeAsync polling loop
  • OrchestrationRelaunchTests.cs (unit): 8 behavioral tests + 2 structural guards
  • OrchestrationRelaunchTests.cs (integration): 3 integration test stubs for CI

Testing

  • 3680 tests pass (0 failures, 0 skipped)

  • 10 new tests (8 behavioral tests of IsWorkerIdleForMonitor plus 2 structural guards on MonitorAndSynthesizeAsync):

    • Session not found → idle
    • Connected + processing → not idle
    • Connected + idle → idle
    • Lazy placeholder + active CLI → not idle (key fix test)
    • Lazy placeholder + idle CLI → idle
    • Lazy placeholder + no events → idle
    • Lazy placeholder + stale events → idle
    • Lazy placeholder + no session ID → idle
    • Structural: MonitorAndSynthesizeAsync uses IsWorkerIdleForMonitor
    • Structural: MonitorAndSynthesizeAsync waits for IsRestoring
  • Integration tests build successfully

  • Fixes #400: Workers stop when relaunch.sh rebuilds the app mid-orchestration

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • 192.0.2.1

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "192.0.2.1"

See Network Configuration for more information.

Generated by Agent Fix for issue #400 · ● 35.5M ·

…orkers

After app relaunch, MonitorAndSynthesizeAsync could see workers as idle
before InvokeOnUI had set IsProcessing=true (race between background
thread and UI thread dispatch). This caused the monitor to collect
empty/stale results and abort orchestration.

Fix:
- Add IsWorkerIdleForMonitor() that checks events.jsonl directly for
  lazy placeholder workers (Session == null), bypassing the InvokeOnUI
  race condition
- Wait for IsRestoring to become false before starting the poll loop
- Add 2s settle time for queued InvokeOnUI callbacks

Tests: 10 new tests (8 behavioral + 2 structural guards)

Fixes #400

Co-authored-by: copilot-agentic-workflow[bot] <224017+copilot-agentic-workflow[bot]@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Workers stop when relaunch.sh rebuilds the app mid-orchestration