
libstore: send SSH ServerAlive keep-alives to remote stores by default#15620

Open
lovesegfault wants to merge 2 commits into master from hung-builders


Motivation

When a remote builder reboots, has sshd restarted, or otherwise drops off the network without the local kernel seeing a FIN, the ssh process spawned by the build hook blocks forever on a half-open TCP connection. Because the hook is registered with respectTimeouts = false, neither --max-silent-time nor --timeout will ever kill it, so the build slot is occupied indefinitely.

We hit this in practice: local ssh processes were still pointed at a builder that had no matching sshd-session on the remote side, and __build-remote was parked in read() on a dead pipe.

Context

SSHMaster::addCommonSSHOpts() previously passed no liveness-related options. This change passes -o ServerAliveInterval=30 -o ServerAliveCountMax=3 by default, so a dead peer is detected in roughly 90 seconds and the build fails cleanly instead of hanging.

The values are exposed as new per-store settings on ssh:// / ssh-ng://:

  • ssh-server-alive-interval (default 30, set to 0 to disable and defer to ssh_config)
  • ssh-server-alive-count-max (default 3)

NIX_SSHOPTS is emitted before these defaults, so it continues to take precedence (OpenSSH uses the first-obtained value for -o options).
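To make the ordering and the timing concrete, here is a minimal sketch of how the option list could be assembled. It is not the code from this PR: `KeepAliveSettings`, `buildSshOpts`, and `approxDetectionSeconds` are hypothetical names used only for illustration, and the user-supplied options are assumed to arrive already split into words.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the two new per-store settings.
struct KeepAliveSettings {
    unsigned interval = 30; // ssh-server-alive-interval; 0 disables both options
    unsigned countMax = 3;  // ssh-server-alive-count-max
};

// Assemble the ssh options: user-supplied options (e.g. from NIX_SSHOPTS)
// come first, the keep-alive defaults are appended afterwards. Because
// OpenSSH keeps the first value it obtains for an option, anything the
// user set wins over the defaults.
std::vector<std::string> buildSshOpts(const std::vector<std::string> & userOpts,
                                      const KeepAliveSettings & s)
{
    std::vector<std::string> args(userOpts);
    if (s.interval > 0) {
        args.push_back("-o");
        args.push_back("ServerAliveInterval=" + std::to_string(s.interval));
        args.push_back("-o");
        args.push_back("ServerAliveCountMax=" + std::to_string(s.countMax));
    }
    return args;
}

// Rough time until ssh gives up on a silent peer: one probe per interval,
// with the connection dropped after countMax unanswered probes.
unsigned approxDetectionSeconds(const KeepAliveSettings & s)
{
    return s.interval * s.countMax;
}
```

With the defaults this gives 30 × 3 = 90 seconds. With an interval of 0 nothing is appended, so ssh_config alone decides.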

The only existing workaround was setting NIX_SSHOPTS in the daemon's environment, which is awkward to discover and configure.



When a remote builder reboots or otherwise drops off the network without
closing the TCP connection, the local ssh process never sees an EOF and
the build hook blocks forever on a half-open pipe. Pass
`-o ServerAliveInterval=30 -o ServerAliveCountMax=3` so that ssh detects
a dead peer in roughly 90 seconds. The values are exposed as the new
`ssh-server-alive-interval` and `ssh-server-alive-count-max` store
settings (interval `0` disables them), and `NIX_SSHOPTS` continues to
take precedence.