Fix Unix/macOS node reuse bugs + reduce idle timeout#13336
Fix Unix/macOS node reuse bugs + reduce idle timeout#13336JakeRadMSFT wants to merge 1 commit intodotnet:mainfrom
Conversation
Four fixes that dramatically improve MSBuild node reuse on Unix/macOS: 1. SessionId = 0 on Unix (was getsid() which returns different values per terminal, preventing cross-terminal node reuse) 2. TimeoutForNodeReuse = 1000ms (was 0ms poll-only, too fast for sleeping nodes to respond) 3. ClientConnectTimeout = 5000ms (was 60000ms, blocking idle nodes from reaching their connection timeout check) 4. DefaultNodeConnectionTimeout = 30s (was 15 minutes, so idle nodes clean up promptly instead of lingering) Relates to dotnet#13334
There was a problem hiding this comment.
Pull request overview
This PR aims to restore effective out-of-proc node reuse on Unix/macOS by fixing handshake/session semantics and adjusting timeouts so reuse probes succeed and idle nodes exit much sooner.
Changes:
- Set handshake
SessionIdto0on non-Windows to enable cross-terminal node reuse. - Introduce a non-zero reuse connection timeout (1s) and reduce handshake-read connect timeout (5s) to avoid long blocking behavior.
- Reduce default node connection/idle timeout to 30s and add unit tests for the Unix handshake behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/Shared/NodeEndpointOutOfProcBase.cs | Reduces handshake read/connect timeout so failed probes don’t stall node endpoints. |
| src/Shared/CommunicationsUtilities.cs | Adjusts Unix SessionId semantics and lowers the default node connection timeout. |
| src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs | Adds a 1s timeout when attempting to reuse an existing node (instead of polling). |
| src/Build.UnitTests/BackEnd/UnixNodeReuseFixes_Tests.cs | Adds tests covering Unix SessionId behavior in handshake keys. |
| [Fact] | ||
| public void Handshake_OnUnix_SessionIdIsZero() | ||
| { | ||
| if (!NativeMethodsShared.IsUnixLike) | ||
| { | ||
| return; | ||
| } | ||
|
|
||
| // Two handshakes created from different contexts should have the same | ||
| // session ID (0) on Unix, enabling cross-terminal node reuse. | ||
| var h1 = new Handshake(HandshakeOptions.NodeReuse); | ||
| var h2 = new Handshake(HandshakeOptions.NodeReuse); | ||
|
|
||
| // Same handshake key means same session ID was used | ||
| h1.GetKey().ShouldBe(h2.GetKey()); | ||
| } |
There was a problem hiding this comment.
Handshake_OnUnix_SessionIdIsZero doesn't actually validate that SessionId is 0: both Handshake instances are created in the same process, so their keys would match even if SessionId were non-zero (pre-fix). Consider asserting handshake.RetrieveHandshakeComponents().SessionId == 0 (and/or remove the redundant key-equality check).
| if (!NativeMethodsShared.IsUnixLike) | ||
| { | ||
| return; | ||
| } |
There was a problem hiding this comment.
These tests use if (!NativeMethodsShared.IsUnixLike) return;, which makes them silently pass on non-Unix platforms. Prefer [UnixOnlyFact] (or another conditional test attribute) so the test is reported as skipped instead of passing without assertions.
| if (!NativeMethodsShared.IsUnixLike) | ||
| { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Same issue here: returning early when not on Unix causes a pass-without-testing on Windows. Use [UnixOnlyFact]/conditional attributes so the test runner correctly reports it as skipped.
There was a problem hiding this comment.
Agree with using OS-specific facts - we should have a full set available.
| int sessionId = 0; | ||
| if (includeSessionId) | ||
| { | ||
| using var currentProcess = Process.GetCurrentProcess(); | ||
| sessionId = currentProcess.SessionId; | ||
| if (NativeMethodsShared.IsWindows) | ||
| { | ||
| using var currentProcess = Process.GetCurrentProcess(); | ||
| sessionId = currentProcess.SessionId; | ||
| } | ||
| // On Unix, getsid() returns the session leader PID which differs per terminal, | ||
| // preventing cross-terminal node reuse. Use 0 since Unix doesn't need | ||
| // RDP-style session isolation. | ||
| } |
There was a problem hiding this comment.
This changes Handshake SessionId behavior only in src/Shared/CommunicationsUtilities.cs. There is a separate handshake implementation in src/MSBuildTaskHost/CommunicationsUtilities.cs that still uses EnvironmentUtilities.ProcessSessionId (getsid on Unix), so cross-terminal TaskHost reuse may remain broken unless it's updated similarly (or the scope is clarified).
There was a problem hiding this comment.
Agree that these should be unified in the same way.
| /// <summary> | ||
| /// The timeout to connect to a node. | ||
| /// </summary> | ||
| private const int DefaultNodeConnectionTimeout = 900 * 1000; // 15 minutes; enough time that a dev will typically do another build in this time | ||
| private const int DefaultNodeConnectionTimeout = 30 * 1000; // 30 seconds |
There was a problem hiding this comment.
DefaultNodeConnectionTimeout is reduced to 30s here, but src/MSBuildTaskHost/CommunicationsUtilities.cs still defaults to 15 minutes. If TaskHost processes also rely on this timeout for idle cleanup, they'll continue to linger much longer than worker nodes. Consider aligning the defaults (or documenting why TaskHost should remain different).
There was a problem hiding this comment.
This is a controversial change - please split it out into a separate PR from the pure bugfixes for argument discussion.
| /// <summary> | ||
| /// The amount of time to wait for the client to connect to the host. | ||
| /// Reduced from 60s to 5s so that failed reuse probes don't block idle nodes | ||
| /// from reaching their connection timeout check. | ||
| /// </summary> | ||
| private const int ClientConnectTimeout = 60000; | ||
| private const int ClientConnectTimeout = 5000; |
There was a problem hiding this comment.
After reducing this to 5s, a couple of callsites still describe this as "wait a long time" (for handshake reads). Consider updating the wording to avoid confusion when diagnosing timeouts.
There was a problem hiding this comment.
This is reasonable to want to make more responsive - @rainersigwald this appears to lock out nodes because the message processing pump is single-threaded and serial execution, yeah?
|
@JakeRadMSFT I'd really appreciate it if you could split this into individual bugfixes, because the review and refactorings for the different components are fairly distinct. |
|
Closing in favor of individual PRs per reviewer request:
|
Fix Unix/macOS Node Reuse Bugs + Reduce Idle Timeout
Relates to #13334
Problem
On macOS/Unix, MSBuild node reuse is effectively broken due to three bugs, causing every build to spawn fresh worker nodes. After builds finish, those nodes linger for 15 minutes consuming memory.
Fixes
sessionId = 0on Unix —getsid()returns the session leader PID which differs per terminal on Unix. MSBuild uses this in the handshake, making nodes from one terminal invisible to builds in another. Since Unix doesn't need RDP-style session isolation, we use 0.TimeoutForNodeReuse = 1000ms(was 0ms poll-only) — The previous behavior was a non-blocking poll that was too fast for sleeping nodes to wake and respond to the handshake.ClientConnectTimeout = 5000ms(was 60000ms) — Idle nodes waiting for work blocked for a full minute on each failed reuse probe, making them unresponsive.DefaultNodeConnectionTimeout = 30s(was 15 minutes) — Idle nodes now clean up in ~30 seconds instead of lingering for 15 minutes.Test Results
10 concurrent builds on a 12-core Mac:
Tests
2 new tests (
UnixNodeReuseFixes_Tests.cs):Changes
4 files changed, 74 insertions, 5 deletions:
CommunicationsUtilities.cs— sessionId fix + idle timeoutNodeEndpointOutOfProcBase.cs— ClientConnectTimeoutNodeProviderOutOfProcBase.cs— TimeoutForNodeReuseUnixNodeReuseFixes_Tests.cs— new tests