sidecar sync: retry SSH on boot#341
Conversation
There was a problem hiding this comment.
Risk Assessment
Overall: Medium — The diff is small and scoped, but it changes a critical sidecar SSH sync path with a new 90s retry behavior and only helper-level tests for the retry decision.
| Dimension | Finding |
|---|---|
| Blast radius | Touches shared sidecar sync behavior used by chunk sidecar sync and sidecar setup flows via internal/sidecar/sync.go:59; public flags and file formats are unchanged, but failure latency/output behavior can change for users. |
| Test coverage | Concern: internal/sidecar/sync_whitebox_test.go:13 covers isTransientSSHError, but the actual openSessionWithRetry control flow in internal/sidecar/sync.go:211 is not tested for retry count, cancellation, status notification, or permanent-error short-circuiting. Focused local test run was blocked by missing Go 1.26 toolchain. |
| Breaking changes | No concerns: no command names, flags, defaults, JSON/markdown output formats, or .chunk/ structure changed. |
| Security | No significant concerns: no new token handling, shell execution, or file permissions; go.mod:21 adds github.com/cenkalti/backoff/v4 and the retry repeats the existing public-key registration API call. |
| Complexity | Low-to-medium concern: internal/sidecar/sync.go:245 treats any wrapped net.Error as retryable, so CircleCI API transport timeouts can now wait/retry for up to 90s, not just sidecar SSH readiness. The validate.go change is a straightforward extraction and keeps the cmd/ to internal/ boundary intact. |
Recommendation: add a small test seam around
OpenSessionor extract the retry operation sointernal/sidecar/sync.go:211is covered for transient retries, permanent errors, and context cancellation.
Sent by Cursor Automation: chunk-cli risk assessment
There was a problem hiding this comment.
Required
[internal/sidecar/sync.go:59] Retry wraps OpenSession, not the actual SSH connection
OpenSession only registers/selects the SSH key and returns session metadata; the actual WebSocket/SSH dial still happens later inside ExecOverSSH when syncWorkspace runs its first remote command. If the sidecar accepts the key but the SSH tunnel/handshake is still booting, sync will still fail immediately. Could the retry wrap the first SSH operation/dial path, or make session opening actually verify SSH readiness before returning?
[internal/sidecar/sync.go:247] Transient classification retries every net.Error
errors.As(err, &netErr) matches any net.Error, including permanent network failures such as bad DNS, unreachable hosts, or non-timeout transport errors, so those can now stall chunk sidecar sync for up to 90 seconds. Could this be narrowed to the intended boot-time cases, such as timeout/temporary errors and real connection-refused dial errors?
Sent by Cursor Automation: chunk-cli PR review
Adds openSessionWithRetry to give newly-created sidecars time to finish booting before their SSH service is ready. Retries on transient net.Error values (connection refused, timeout) for up to 90s with exponential backoff; permanent errors (auth failures, missing key) return immediately. Also extracts hookEarlyExits helper in validate.go to bring newValidateCmd cyclomatic complexity within the gocyclo limit (31 → 28). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move retry from OpenSession (CircleCI API call) to waitForSSHReady, which probes dialSSH directly — the actual WebSocket+SSH connection. This is the layer where connection-refused errors occur when the sidecar's SSH daemon is still starting. - Narrow isTransientSSHError from "any net.Error" to Timeout() or ECONNREFUSED specifically, so DNS failures and unreachable-host errors don't stall chunk sidecar sync for up to 90s. - Add TestWaitForSSHReady covering: immediate success, permanent error short-circuit (no notification), transient retry with single notification, and retry-until-success. - Update TestIsTransientSSHError to use real syscall.ECONNREFUSED errors and add EHOSTUNREACH as a not-transient case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d6bc5b3 to
5a56aba
Compare


Summary
openSessionWithRetrytosidecar.Sync, retrying on transientnet.Errorvalues (connection refused, timeout) for up to 90s with exponential backoff — so a freshly-created sidecar doesn't fail immediately while its SSH service is still startingisTransientSSHErrorclassifies retryable errors; permanent errors (auth failures, missing key) return immediately without retryingcenkalti/backoff/v4as a direct dependencyhookEarlyExitsinvalidate.goto fix a pre-existinggocyclolint failure (newValidateCmdwas at complexity 31, limit is 30)This is the SSH-related subset of
sidecar-sync-retry-fetch, without the git changes (remoteorigin/HEADresolution).Test plan
TestIsTransientSSHErrorcovers transient vs permanent error classificationTestSync_NonApplyFailureReturnsImmediatelycontinues to pass (no regression to existing sync behaviour)task lintpasses (0 issues)task testpasses (845 tests)🤖 Generated with Claude Code