lean_api, upstreams, poller, server: bound poll workers with SO_RCVTIMEO / SO_SNDTIMEO #14
Merged
ch4r10t33r merged 1 commit into main on Apr 24, 2026
…MEO / SO_SNDTIMEO

Zig 0.14's std.http.Client does not expose connect/read timeouts, so the existing detach-on-deadline logic in upstreams.pollUpstreams leaves the worker's TCP socket open until the kernel's retransmit window elapses (or the peer sends RST/FIN). On the devnet-4 tool server this produced a steady accumulation of ESTABLISHED + CLOSE-WAIT sockets that eventually exhausted RLIMIT_NOFILE=1024 and wedged the entire service.

Apply SO_RCVTIMEO and SO_SNDTIMEO directly on the Request's underlying stream handle after client.open(). Blocking recv/send now returns within request_timeout_ms regardless of peer behavior, so detached workers reliably self-terminate and their `defer client.deinit()` closes the socket within the configured deadline.

Plumb request_timeout_ms through fetchSlots, fetchSlotFromSSZEndpoint, fetchJustifiedSlotFromJsonEndpoint and fetchForkChoice. Update all callers in poller.zig, upstreams.zig (via PollCtx) and server.zig.

Refs #13.
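As a rough illustration of the socket-level timeout described in the commit message, the sketch below is written in C against the POSIX socket API rather than Zig, since the PR's diff is not shown here. The helper names `set_socket_timeouts` and `recv_times_out_on_silent_peer` are hypothetical, and a `socketpair` stands in for an upstream that accepts the connection and then goes silent. In the PR itself the descriptor is the request's underlying stream handle obtained after `client.open()`.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: apply one deadline to blocking recv() and send() on fd.
 * Returns 0 on success, -1 with errno set on failure. */
int set_socket_timeouts(int fd, long timeout_ms)
{
    /* A zero timeval means "never time out", so reject it here. */
    assert(timeout_ms > 0);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv) != 0)
        return -1;
    return 0;
}

/* Hypothetical demo: a connected peer that never writes.
 * Returns 1 if recv() gave up with EAGAIN/EWOULDBLOCK, 0 otherwise. */
int recv_times_out_on_silent_peer(long timeout_ms)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;

    int timed_out = 0;
    if (set_socket_timeouts(sv[0], timeout_ms) == 0) {
        char buf[16];
        /* sv[1] stays open but silent, like a stalled upstream. */
        ssize_t n = recv(sv[0], buf, sizeof buf, 0);
        timed_out = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }

    /* Closing on every path mirrors the worker's deferred cleanup. */
    close(sv[0]);
    close(sv[1]);
    return timed_out;
}
```

Without the `setsockopt` calls, the `recv` in the demo would block for as long as the peer stays silent; with them, the call fails once the deadline passes and the function can close its descriptors.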
Fixes #13.
Summary
- Apply `SO_RCVTIMEO`/`SO_SNDTIMEO` on the HTTP request's underlying TCP socket after `client.open()` so blocking `recv`/`send` returns within `request_timeout_ms` instead of hanging on the kernel's TCP retransmit window (~15 min on Linux).
- Plumb `request_timeout_ms` through `fetchSlots`, `fetchSlotFromSSZEndpoint`, `fetchJustifiedSlotFromJsonEndpoint` and `fetchForkChoice`; update callers in `poller.zig`, `upstreams.zig` (via `PollCtx`) and `server.zig`.

Why
Zig 0.14's `std.http.Client` does not expose connect/read timeouts; the `@hasField` guards in `poller.zig`/`server.zig` are no-ops on 0.14.1. To enforce a deadline, `pollUpstreams` spawns a worker per upstream and abandons stragglers via `thread.detach()` after `request_timeout_ms`. But detaching does not cancel the thread: if the peer accepted the TCP connection and then stopped responding (or half-closed with `FIN`), the worker stays blocked inside `recv()` with the socket still open. The process accumulates `ESTABLISHED`/`CLOSE-WAIT` sockets until `RLIMIT_NOFILE` is exhausted and the service becomes unreachable for both outbound and inbound connections.

Socket-level receive/send timeouts unblock the detached worker deterministically; its existing `defer client.deinit()` then closes the socket within the deadline.

Scope
Only addresses post-`connect()` hangs, which matched the observed incident (full detail in #13: 1021 socket fds with 960 `ESTAB` + 63 `CLOSE-WAIT`). Connect-phase black-holes (SYN with no SYN-ACK) still rely on the kernel's retransmit window; switching to non-blocking connect is left as a follow-up.

Test plan
- `zig fmt --check src`
- `zig build`
- `zig build test`
- `ls /proc/$PID/fd | wc -l` stays flat across upstream outages (previously grew monotonically).
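The descriptor-count check in the test plan can also be done from inside a process. The following is a minimal C sketch, assuming a Linux `/proc` filesystem; `count_open_fds` is a hypothetical helper and is not part of this PR.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Hypothetical helper: count this process's open file descriptors by
 * listing /proc/self/fd (Linux only). Returns -1 if the directory
 * cannot be opened. */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dir);

    /* The listing includes the descriptor opendir() itself held, so it
     * is always at least 1; subtract it to report the caller's view. */
    assert(count > 0);
    return count - 1;
}
```

Sampling this value before and after a simulated upstream outage gives the same signal as the `ls | wc -l` check: a count that returns to its baseline suggests workers are closing their sockets, while a steadily rising count points to a leak.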