Reduce node handshake read timeout from 60s to 5s #13357

Open
JakeRadMSFT wants to merge 1 commit into dotnet:main from JakeRadMSFT:fix/handshake-read-timeout

Conversation

@JakeRadMSFT
Member

Problem

Idle nodes block in WaitForConnection, then read the handshake with a 60s timeout (ClientConnectTimeout). If a reuse probe connects but the handshake fails or stalls, the node is blocked for up to 60s before it can loop back and check the idle timeout. This means nodes can linger much longer than the configured idle timeout.
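To make the worst case concrete, here is a minimal Python model of the loop described above. It is a sketch, not MSBuild code: `simulate_node`, its parameters, and the 10-second idle timeout in the usage lines are hypothetical, and the model assumes the node only re-checks the idle deadline after a handshake read returns.

```python
def simulate_node(idle_timeout, handshake_timeout, probe_times):
    """Return the time (in seconds) at which the modeled node exits.

    probe_times: times at which a client connects but never completes
    the handshake (a stalled reuse probe). All names are illustrative.
    """
    now = 0.0
    for probe in sorted(probe_times):
        if probe < now:
            # Arrived while the node was blocked in a handshake read;
            # this simple model ignores such probes.
            continue
        if probe >= idle_timeout:
            # The idle deadline passes before this probe connects.
            break
        # The node accepts the connection, then blocks in the handshake
        # read until the read timeout expires.
        now = probe + handshake_timeout
    # The node exits at the idle deadline, or later if a handshake read
    # was still blocking when the deadline passed.
    return max(now, idle_timeout)


if __name__ == "__main__":
    # One stalled probe connects 1s before a hypothetical 10s idle deadline.
    print(simulate_node(10, 60, [9.0]))  # 60s handshake read timeout
    print(simulate_node(10, 5, [9.0]))   # 5s handshake read timeout
```

In this model the overshoot past the idle deadline is bounded by the handshake read timeout, which is why shrinking that timeout shrinks how long a node can linger.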

Fix

Reduce ClientConnectTimeout from 60s to 5s. A failed handshake now only blocks the node briefly, allowing it to promptly re-check whether it should exit.
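As an illustration of the general pattern (a bounded handshake read that gives up when the peer stalls), here is a small self-contained Python sketch. It uses a loopback TCP socket as a stand-in for the node's pipe, which is an assumption made for portability; `read_handshake`, `demo`, and the 0.2s timeout are hypothetical and are not part of MSBuild.

```python
import socket
import time


def read_handshake(conn, timeout_s):
    """Try to read one handshake byte; return None if the peer stalls or hangs up."""
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(1)
    except socket.timeout:
        # The peer connected but sent nothing within the timeout.
        return None
    return data or None


def demo(timeout_s=0.2):
    """Connect a silent 'probe' and measure how long the read blocks."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port on loopback
    server.listen(1)
    # The probe connects but never writes a handshake.
    probe = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    start = time.monotonic()
    result = read_handshake(conn, timeout_s)
    elapsed = time.monotonic() - start
    conn.close()
    probe.close()
    server.close()
    return result, elapsed


if __name__ == "__main__":
    result, elapsed = demo()
    print(f"handshake result: {result!r}, blocked for about {elapsed:.2f}s")
```

The caller is blocked for roughly the timeout and no longer, after which it is free to loop back and re-check whether it should exit.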

Also updates stale "wait a long time" comments to match the new timeout value.

Split from #13336 per reviewer request. @baronfel asked @rainersigwald about this change — the message processing pump is single-threaded, so a long block here prevents the node from servicing other work.

Copilot AI review requested due to automatic review settings March 10, 2026 06:44
Contributor

Copilot AI left a comment

Pull request overview

This PR reduces the node-side handshake read timeout so that an idle node is not blocked for long when a reuse probe connects but does not complete the handshake. The node can then re-check its idle/connection timeout logic promptly.

Changes:

  • Reduced ClientConnectTimeout from 60s to 5s in the out-of-proc node endpoint (NETCOREAPP2_1+).
  • Updated inline comments around handshake reads to match the shorter timeout.

Comment thread src/Shared/NodeEndpointOutOfProcBase.cs Outdated
Comment thread src/Shared/NodeEndpointOutOfProcBase.cs Outdated
Idle nodes block in WaitForConnection, then read the handshake with a 60s
timeout. If a reuse probe connects but the handshake fails or stalls,
the node is blocked for up to 60s before it can check the idle timeout
and exit. This means nodes can linger much longer than the configured
idle timeout.

Reducing to 5s means a failed handshake only blocks the node briefly,
allowing it to loop back and check the connection timeout promptly.

Also updates stale 'wait a long time' comments to match the new timeout.
@JakeRadMSFT JakeRadMSFT force-pushed the fix/handshake-read-timeout branch from 9c39855 to 5018488 Compare March 10, 2026 07:02
2 participants