daemon: fix hang on SSH disconnect during remote builds #14667

Merged
Ericson2314 merged 1 commit into NixOS:master from Mic92:fix-remote-builder-hang on Nov 27, 2025

Conversation

@Mic92
Member

@Mic92 Mic92 commented Nov 27, 2025

When an SSH connection dies during a remote build, MonitorFdHup correctly detects the disconnect and calls triggerInterrupt(). However, without ReceiveInterrupts instantiated, no SIGUSR1 is sent to interrupt the blocking read() syscall. This causes the daemon to hang indefinitely while holding file locks, blocking subsequent builds.

The fix instantiates ReceiveInterrupts in processConnection(), which registers a callback to send SIGUSR1 to the current thread when triggerInterrupt() is called. This allows the blocking read() to return with EINTR, causing checkInterrupt() to throw and the daemon to exit cleanly.

This pattern is already used in ThreadPool::doWork() and SubstitutionGoal for the same purpose.
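
To make the mechanism concrete, here is a minimal sketch of the idea, not the actual Nix code. `InterruptGuard`, `triggerInterruptSketch()` and `runScenario()` are hypothetical stand-ins for `ReceiveInterrupts`, `triggerInterrupt()` and the daemon's blocking read; the pipe and the timings are assumptions chosen only so the demo terminates. It assumes a POSIX system and a SIGUSR1 handler installed without `SA_RESTART`.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <csignal>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <pthread.h>
#include <unistd.h>

// Hypothetical stand-ins for illustration; none of this is Nix's actual code.
static std::atomic<bool> interruptRequested{false};
static std::mutex callbackMutex;
static std::function<void()> interruptCallback;

// The handler does nothing; its only job is to make read() return with EINTR.
static void noopHandler(int) {}

// RAII guard: while it is alive, triggerInterruptSketch() sends SIGUSR1
// to the thread that constructed the guard.
struct InterruptGuard {
    InterruptGuard() {
        pthread_t self = pthread_self();
        std::lock_guard<std::mutex> lock(callbackMutex);
        interruptCallback = [self] { pthread_kill(self, SIGUSR1); };
    }
    ~InterruptGuard() {
        std::lock_guard<std::mutex> lock(callbackMutex);
        interruptCallback = nullptr;
    }
};

// Sets the flag and, if a guard is registered, signals its thread.
static void triggerInterruptSketch() {
    interruptRequested = true;
    std::lock_guard<std::mutex> lock(callbackMutex);
    if (interruptCallback) interruptCallback();
}

struct ReadResult {
    ssize_t bytes; // return value of read()
    int err;       // errno if read() failed, otherwise 0
};

// Blocks a thread in read() on an empty pipe, triggers an interrupt after
// 100 ms, and writes a fallback byte after a further 400 ms so that the
// unguarded case also terminates.
static ReadResult runScenario(bool withGuard) {
    struct sigaction sa {};
    sa.sa_handler = noopHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; // no SA_RESTART, so a blocked read() fails with EINTR
    int rc = sigaction(SIGUSR1, &sa, nullptr);
    assert(rc == 0);

    int fds[2];
    rc = pipe(fds);
    assert(rc == 0);
    (void) rc;

    interruptRequested = false;
    ReadResult result{0, 0};

    std::thread reader([&] {
        std::unique_ptr<InterruptGuard> guard(withGuard ? new InterruptGuard : nullptr);
        char c;
        ssize_t n = read(fds[0], &c, 1);
        result = ReadResult{n, n < 0 ? errno : 0};
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    triggerInterruptSketch();
    std::this_thread::sleep_for(std::chrono::milliseconds(400));
    ssize_t w = write(fds[1], "x", 1); // fallback so the reader never hangs
    (void) w;

    reader.join();
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling `runScenario(false)` models the old behavior: the flag is set, but nothing wakes the reader, so `read()` only returns once the fallback byte arrives. `runScenario(true)` models the fix: the registered callback signals the reader thread and `read()` fails with `EINTR`, at which point real code would check the interrupt flag and unwind. The sketch relies on the 100 ms delay to ensure the reader is already inside `read()` when the signal arrives.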


@Mic92 Mic92 requested a review from Ericson2314 as a code owner November 27, 2025 12:57
@Mic92
Member Author

Mic92 commented Nov 27, 2025

I was not able to reproduce the bug after deploying this fix on two moderately busy CI systems. I also wrote small C programs comparing the old and the new approach to verify the behavior of signal blocking combined with read().

@Ericson2314 Ericson2314 added this pull request to the merge queue Nov 27, 2025
Merged via the queue into NixOS:master with commit 11b0fcd Nov 27, 2025
16 checks passed
@Mic92 Mic92 deleted the fix-remote-builder-hang branch November 27, 2025 17:02
@edolstra edolstra mentioned this pull request Dec 9, 2025
