Skip to content

sidecar sync: retry SSH on boot#341

Open
schurchleycci wants to merge 2 commits into
mainfrom
sidecar-ssh-retry
Open

sidecar sync: retry SSH on boot#341
schurchleycci wants to merge 2 commits into
mainfrom
sidecar-ssh-retry

Conversation

@schurchleycci
Copy link
Copy Markdown
Contributor

Summary

  • Adds openSessionWithRetry to sidecar.Sync, retrying on transient net.Error values (connection refused, timeout) for up to 90s with exponential backoff — so a freshly-created sidecar doesn't fail immediately while its SSH service is still starting
  • isTransientSSHError classifies retryable errors; permanent errors (auth failures, missing key) return immediately without retrying
  • Adds cenkalti/backoff/v4 as a direct dependency
  • Extracts hookEarlyExits in validate.go to fix a pre-existing gocyclo lint failure (newValidateCmd was at complexity 31, limit is 30)

This is the SSH-related subset of sidecar-sync-retry-fetch, without the git changes (remote origin/HEAD resolution).

Test plan

  • TestIsTransientSSHError covers transient vs permanent error classification
  • TestSync_NonApplyFailureReturnsImmediately continues to pass (no regression to existing sync behaviour)
  • task lint passes (0 issues)
  • task test passes (845 tests)

🤖 Generated with Claude Code

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk Assessment

Overall: Medium — The diff is small and scoped, but it changes a critical sidecar SSH sync path with a new 90s retry behavior and only helper-level tests for the retry decision.

Dimension Finding
Blast radius Touches shared sidecar sync behavior used by chunk sidecar sync and sidecar setup flows via internal/sidecar/sync.go:59; public flags and file formats are unchanged, but failure latency/output behavior can change for users.
Test coverage Concern: internal/sidecar/sync_whitebox_test.go:13 covers isTransientSSHError, but the actual openSessionWithRetry control flow in internal/sidecar/sync.go:211 is not tested for retry count, cancellation, status notification, or permanent-error short-circuiting. Focused local test run was blocked by missing Go 1.26 toolchain.
Breaking changes No concerns: no command names, flags, defaults, JSON/markdown output formats, or .chunk/ structure changed.
Security No significant concerns: no new token handling, shell execution, or file permissions; go.mod:21 adds github.com/cenkalti/backoff/v4 and the retry repeats the existing public-key registration API call.
Complexity Low-to-medium concern: internal/sidecar/sync.go:245 treats any wrapped net.Error as retryable, so CircleCI API transport timeouts can now wait/retry for up to 90s, not just sidecar SSH readiness. The validate.go change is a straightforward extraction and keeps the cmd/ to internal/ boundary intact.

Recommendation: add a small test seam around OpenSession or extract the retry operation so internal/sidecar/sync.go:211 is covered for transient retries, permanent errors, and context cancellation.

Open in Web View Automation 

Sent by Cursor Automation: chunk-cli risk assessment

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required

[internal/sidecar/sync.go:59] Retry wraps OpenSession, not the actual SSH connection

OpenSession only registers/selects the SSH key and returns session metadata; the actual WebSocket/SSH dial still happens later inside ExecOverSSH when syncWorkspace runs its first remote command. If the sidecar accepts the key but the SSH tunnel/handshake is still booting, sync will still fail immediately. Could the retry wrap the first SSH operation/dial path, or make session opening actually verify SSH readiness before returning?

[internal/sidecar/sync.go:247] Transient classification retries every net.Error

errors.As(err, &netErr) matches any net.Error, including permanent network failures such as bad DNS, unreachable hosts, or non-timeout transport errors, so those can now stall chunk sidecar sync for up to 90 seconds. Could this be narrowed to the intended boot-time cases, such as timeout/temporary errors and real connection-refused dial errors?

Open in Web View Automation 

Sent by Cursor Automation: chunk-cli PR review

schurchleycci and others added 2 commits May 20, 2026 15:05
Adds openSessionWithRetry to give newly-created sidecars time to finish
booting before their SSH service is ready. Retries on transient net.Error
values (connection refused, timeout) for up to 90s with exponential backoff;
permanent errors (auth failures, missing key) return immediately.

Also extracts hookEarlyExits helper in validate.go to bring newValidateCmd
cyclomatic complexity within the gocyclo limit (31 → 28).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move retry from OpenSession (CircleCI API call) to waitForSSHReady,
  which probes dialSSH directly — the actual WebSocket+SSH connection.
  This is the layer where connection-refused errors occur when the
  sidecar's SSH daemon is still starting.
- Narrow isTransientSSHError from "any net.Error" to Timeout() or
  ECONNREFUSED specifically, so DNS failures and unreachable-host errors
  don't stall chunk sidecar sync for up to 90s.
- Add TestWaitForSSHReady covering: immediate success, permanent error
  short-circuit (no notification), transient retry with single
  notification, and retry-until-success.
- Update TestIsTransientSSHError to use real syscall.ECONNREFUSED errors
  and add EHOSTUNREACH as a not-transient case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant