lean_api, upstreams, poller, server: bound poll workers with SO_RCVTIMEO / SO_SNDTIMEO #14

Merged
ch4r10t33r merged 1 commit into main from fix/socket-timeouts-prevent-fd-leak
Apr 24, 2026

Conversation

@ch4r10t33r
Collaborator

Fixes #13.

Summary

  • Apply SO_RCVTIMEO / SO_SNDTIMEO on the HTTP request's underlying TCP socket after client.open() so blocking recv/send returns within request_timeout_ms instead of hanging on the kernel's TCP retransmit window (~15 min on Linux).
  • Thread request_timeout_ms through fetchSlots, fetchSlotFromSSZEndpoint, fetchJustifiedSlotFromJsonEndpoint and fetchForkChoice; update callers in poller.zig, upstreams.zig (via PollCtx) and server.zig.
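
As a rough illustration of the first bullet, the sketch below sets both options from a millisecond budget. The PR does this in Zig on the handle behind `client.open()`. C is used here only because `SO_RCVTIMEO` / `SO_SNDTIMEO` are plain POSIX socket options. The helper names `set_socket_timeouts` and `recv_times_out` are hypothetical, and a local `socketpair` stands in for an upstream that accepts the connection and then goes silent.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: bound blocking recv/send on fd to timeout_ms. */
static int set_socket_timeouts(int fd, unsigned timeout_ms) {
    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0) return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv) != 0) return -1;
    return 0;
}

/* Demo: returns 1 if recv() on a silent peer fails with a timeout
 * (EAGAIN / EWOULDBLOCK), 0 if it returned some other way, -1 on setup error. */
static int recv_times_out(unsigned timeout_ms) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    int timed_out = -1;
    if (set_socket_timeouts(sv[0], timeout_ms) == 0) {
        char buf[16];
        ssize_t n = recv(sv[0], buf, sizeof buf, 0); /* peer never writes */
        timed_out = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }
    close(sv[0]);
    close(sv[1]);
    return timed_out;
}
```

Calling `recv_times_out(100)` blocks for roughly 100 ms and then reports the timeout instead of hanging. The caller still has to treat the `EAGAIN` / `EWOULDBLOCK` result as a failed request.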

Why

Zig 0.14's std.http.Client does not expose connect/read timeouts — the @hasField guards in poller.zig / server.zig are no-ops on 0.14.1. To enforce a deadline, pollUpstreams spawns a worker per upstream and abandons stragglers via thread.detach() after request_timeout_ms. But detaching does not cancel the thread: if the peer accepted the TCP connection and then stopped responding (or half-closed with FIN), the worker stays blocked inside recv() with the socket still open. The process accumulates ESTABLISHED / CLOSE-WAIT sockets until RLIMIT_NOFILE is exhausted and the service becomes unreachable for both outbound and inbound connections.

Socket-level receive/send timeouts unblock the detached worker deterministically; its existing defer client.deinit() then closes the socket within the deadline.
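
The effect on an abandoned worker can be sketched as follows. This is not the project's code: `poll_worker` and `detached_worker_exits` are hypothetical names, a `socketpair` plays the unresponsive peer, and `close(fd)` stands in for the `defer client.deinit()` cleanup. The point is only that a detached thread with `SO_RCVTIMEO` set leaves `recv()` on its own and releases its descriptor.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static atomic_int worker_done;

/* Worker: blocks in recv() on a peer that never answers. */
static void *poll_worker(void *arg) {
    int fd = *(int *)arg;
    char buf[64];
    (void)recv(fd, buf, sizeof buf, 0); /* fails once SO_RCVTIMEO expires */
    close(fd);                          /* stands in for the deferred cleanup */
    atomic_store(&worker_done, 1);
    return NULL;
}

/* Returns 1 if the detached worker finished by itself, 0 if it was still
 * blocked after 10x the timeout, -1 on setup error. */
static int detached_worker_exits(unsigned timeout_ms) {
    static int sv[2]; /* static so the worker's pointer outlives this call */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0) return -1;

    atomic_store(&worker_done, 0);
    pthread_t t;
    if (pthread_create(&t, NULL, poll_worker, &sv[0]) != 0) return -1;
    pthread_detach(t); /* abandon the straggler; this does not cancel it */

    /* Poll the flag in 10 ms steps rather than joining the thread. */
    for (unsigned waited = 0;
         waited < 10 * timeout_ms && !atomic_load(&worker_done);
         waited += 10) {
        struct timespec nap = { 0, 10 * 1000000L };
        nanosleep(&nap, NULL);
    }
    int done = atomic_load(&worker_done);
    close(sv[1]);
    return done;
}
```

Without the `setsockopt` call the same worker would stay parked in `recv()` and keep `sv[0]` open, which is the leak pattern described above.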

Scope

Only addresses post-connect() hangs, which matched the observed incident (full detail in #13: 1021 socket fds with 960 ESTAB + 63 CLOSE-WAIT). Connect-phase black-holes (SYN with no SYN-ACK) still rely on the kernel's retransmit window; switching to non-blocking connect is left as a follow-up.
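
For reference, the follow-up mentioned above usually takes the shape below: put the socket in non-blocking mode, start `connect()`, and wait for writability with `poll()` under a deadline. This is a generic POSIX sketch, not part of this PR and not the planned implementation. `connect_with_timeout` and `loopback_listener` are hypothetical helpers, and the listener exists only so the example can be exercised locally.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect with a deadline. Returns a blocking-mode
 * fd on success, -1 on error or when the deadline passes. */
static int connect_with_timeout(const struct sockaddr *addr, socklen_t len,
                                int timeout_ms) {
    int fd = socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) goto fail;

    if (connect(fd, addr, len) != 0) {
        if (errno != EINPROGRESS) goto fail;
        struct pollfd p = { .fd = fd, .events = POLLOUT };
        if (poll(&p, 1, timeout_ms) != 1) goto fail; /* 0: deadline, -1: error */
        int err = 0;
        socklen_t elen = sizeof err;
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) != 0 || err != 0)
            goto fail; /* handshake finished, but with an error */
    }
    if (fcntl(fd, F_SETFL, flags) < 0) goto fail; /* restore blocking mode */
    return fd;
fail:
    close(fd);
    return -1;
}

/* Test aid: listen on an ephemeral 127.0.0.1 port and report its address. */
static int loopback_listener(struct sockaddr_in *out) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    memset(out, 0, sizeof *out);
    out->sin_family = AF_INET;
    out->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t len = sizeof *out;
    if (bind(fd, (struct sockaddr *)out, len) != 0 || listen(fd, 1) != 0 ||
        getsockname(fd, (struct sockaddr *)out, &len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A black-holed SYN would make `poll()` return 0 after `timeout_ms`, so the descriptor is closed at the deadline instead of lingering for the kernel's retry schedule.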

Test plan

  • zig fmt --check src
  • zig build
  • zig build test
  • Deploy to the tooling server and confirm ls /proc/$PID/fd | wc -l stays flat across upstream outages (previously grew monotonically).
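
The last check can also be done from inside a process. The sketch below is a Linux-only, in-process counterpart of `ls /proc/$PID/fd | wc -l`. `count_open_fds` and `fd_delta_after_socketpair` are hypothetical helpers, and the count includes the descriptor `opendir` itself holds while scanning.

```c
#include <dirent.h>
#include <sys/socket.h>
#include <unistd.h>

/* Count entries in /proc/self/fd (Linux only). Returns -1 on error. */
static int count_open_fds(void) {
    DIR *d = opendir("/proc/self/fd");
    if (!d) return -1;
    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] != '.') n++; /* skip "." and ".." */
    }
    closedir(d);
    return n;
}

/* Demo: how much the count grows while a socketpair is held open. */
static int fd_delta_after_socketpair(void) {
    int before = count_open_fds();
    int sv[2];
    if (before < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    int after = count_open_fds();
    close(sv[0]);
    close(sv[1]);
    return after - before;
}
```

Sampling such a counter before and after an upstream outage gives the same signal as the manual check: a flat value means abandoned workers are releasing their sockets.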

…MEO / SO_SNDTIMEO

Zig 0.14's std.http.Client does not expose connect/read timeouts, so the
existing detach-on-deadline logic in upstreams.pollUpstreams leaves the
worker's TCP socket open until the kernel's retransmit window elapses (or
the peer sends RST/FIN). On the devnet-4 tool server this produced a
steady accumulation of ESTABLISHED + CLOSE-WAIT sockets that eventually
exhausted RLIMIT_NOFILE=1024 and wedged the entire service.

Apply SO_RCVTIMEO and SO_SNDTIMEO directly on the Request's underlying
stream handle after client.open(). Blocking recv/send now returns within
request_timeout_ms regardless of peer behavior, so detached workers
reliably self-terminate and their `defer client.deinit()` closes the
socket within the configured deadline.

Plumb request_timeout_ms through fetchSlots, fetchSlotFromSSZEndpoint,
fetchJustifiedSlotFromJsonEndpoint and fetchForkChoice. Update all
callers in poller.zig, upstreams.zig (via PollCtx) and server.zig.

Refs #13.
ch4r10t33r merged commit 3c6fe09 into main on Apr 24, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

File descriptor leak crashes leanpoint under unresponsive upstreams