Fix Unix/macOS node reuse bugs + reduce idle timeout #13336

Closed
JakeRadMSFT wants to merge 1 commit into dotnet:main from JakeRadMSFT:fix/unix-node-reuse-bugs

@JakeRadMSFT
Member

Fix Unix/macOS Node Reuse Bugs + Reduce Idle Timeout

Relates to #13334

Problem

On macOS/Unix, MSBuild node reuse is effectively broken due to three bugs, causing every build to spawn fresh worker nodes. After builds finish, those nodes linger for 15 minutes consuming memory.

Fixes

  1. sessionId = 0 on Unix: getsid() returns the session leader PID, which differs per terminal on Unix. MSBuild uses this in the handshake, making nodes from one terminal invisible to builds in another. Since Unix doesn't need RDP-style session isolation, we use 0.

  2. TimeoutForNodeReuse = 1000ms (was 0ms poll-only) — The previous behavior was a non-blocking poll that was too fast for sleeping nodes to wake and respond to the handshake.

  3. ClientConnectTimeout = 5000ms (was 60000ms) — Idle nodes waiting for work blocked for a full minute on each failed reuse probe, making them unresponsive.

  4. DefaultNodeConnectionTimeout = 30s (was 15 minutes) — Idle nodes now clean up in ~30 seconds instead of lingering for 15 minutes.
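
To make the first fix concrete, here is a minimal sketch of why a per-terminal session ID hides nodes from each other. Everything in it is an assumption for illustration: `SessionIdDemo`, its methods, and the key format are hypothetical and are not MSBuild's actual `Handshake` implementation, and 4711 and 5822 are made-up `getsid()` values.

```csharp
using System;

// Hypothetical illustration only; MSBuild's real handshake has more components.
public static class SessionIdDemo
{
    // Nodes are matched by key, so any differing component makes a node invisible.
    public static string GetKey(int options, int salt, int sessionId)
        => $"{options} {salt} {sessionId}";

    // Behavior before the fix, as described above: the OS session ID is always used.
    public static int SessionIdBefore(int osSessionId) => osSessionId;

    // Behavior after the fix: only Windows keeps the OS session ID; Unix uses 0.
    public static int SessionIdAfter(bool isWindows, int osSessionId)
        => isWindows ? osSessionId : 0;

    // Compares the keys two terminals would compute with identical options and salt.
    public static bool KeysMatch(int sessionA, int sessionB)
        => GetKey(1, 42, sessionA) == GetKey(1, 42, sessionB);
}
```

With two terminals whose session IDs are 4711 and 5822, `KeysMatch(SessionIdBefore(4711), SessionIdBefore(5822))` is false, while `KeysMatch(SessionIdAfter(false, 4711), SessionIdAfter(false, 5822))` is true because both sides collapse to 0.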

Test Results

10 concurrent builds on a 12-core Mac:

| Metric | Before (15min timeout) | After (30s timeout) |
| --- | --- | --- |
| Idle nodes after builds finish | 110 (for 15+ minutes) | 110 → 0 in ~30s |
| Cross-terminal node reuse | Broken | Works |

Tests

2 new tests (UnixNodeReuseFixes_Tests.cs):

  • Handshake SessionId is 0 on Unix
  • Cross-terminal handshake key equality

Changes

4 files changed, 74 insertions, 5 deletions:

  • CommunicationsUtilities.cs — sessionId fix + idle timeout
  • NodeEndpointOutOfProcBase.cs — ClientConnectTimeout
  • NodeProviderOutOfProcBase.cs — TimeoutForNodeReuse
  • UnixNodeReuseFixes_Tests.cs — new tests

Four fixes that dramatically improve MSBuild node reuse on Unix/macOS:

1. SessionId = 0 on Unix (was getsid() which returns different values
   per terminal, preventing cross-terminal node reuse)
2. TimeoutForNodeReuse = 1000ms (was 0ms poll-only, too fast for
   sleeping nodes to respond)
3. ClientConnectTimeout = 5000ms (was 60000ms, blocking idle nodes
   from reaching their connection timeout check)
4. DefaultNodeConnectionTimeout = 30s (was 15 minutes, so idle nodes
   clean up promptly instead of lingering)

Relates to dotnet#13334
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to restore effective out-of-proc node reuse on Unix/macOS by fixing handshake/session semantics and adjusting timeouts so reuse probes succeed and idle nodes exit much sooner.

Changes:

  • Set handshake SessionId to 0 on non-Windows to enable cross-terminal node reuse.
  • Introduce a non-zero reuse connection timeout (1s) and reduce handshake-read connect timeout (5s) to avoid long blocking behavior.
  • Reduce default node connection/idle timeout to 30s and add unit tests for the Unix handshake behavior.
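
The difference between a zero and a non-zero reuse timeout can be sketched with the standard `NamedPipeClientStream.Connect(int)` API. This assumes the reuse probe ultimately comes down to a named-pipe connect; `ReuseProbeSketch` and `TryConnect` are hypothetical names, not code from this PR.

```csharp
using System;
using System.IO;
using System.IO.Pipes;

// Hypothetical helper for illustration; not part of MSBuild.
public static class ReuseProbeSketch
{
    // Tries to connect to a named pipe, giving up after timeoutMs milliseconds.
    // A small timeout leaves a server that is slow to respond little time to accept.
    public static bool TryConnect(string pipeName, int timeoutMs)
    {
        using var pipe = new NamedPipeClientStream(
            ".", pipeName, PipeDirection.InOut, PipeOptions.Asynchronous);
        try
        {
            pipe.Connect(timeoutMs);
            return true;
        }
        catch (TimeoutException)
        {
            // No server accepted the connection within the allotted time.
            return false;
        }
        catch (IOException)
        {
            // The pipe exists but could not be opened.
            return false;
        }
    }
}
```

Under that assumption, calling `TryConnect(name, 0)` corresponds to the old poll-only probe, and `TryConnect(name, 1000)` to the 1s timeout proposed here.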

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/Shared/NodeEndpointOutOfProcBase.cs | Reduces handshake read/connect timeout so failed probes don’t stall node endpoints. |
| src/Shared/CommunicationsUtilities.cs | Adjusts Unix SessionId semantics and lowers the default node connection timeout. |
| src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs | Adds a 1s timeout when attempting to reuse an existing node (instead of polling). |
| src/Build.UnitTests/BackEnd/UnixNodeReuseFixes_Tests.cs | Adds tests covering Unix SessionId behavior in handshake keys. |

Comment on lines +19 to +34
    [Fact]
    public void Handshake_OnUnix_SessionIdIsZero()
    {
        if (!NativeMethodsShared.IsUnixLike)
        {
            return;
        }

        // Two handshakes created from different contexts should have the same
        // session ID (0) on Unix, enabling cross-terminal node reuse.
        var h1 = new Handshake(HandshakeOptions.NodeReuse);
        var h2 = new Handshake(HandshakeOptions.NodeReuse);

        // Same handshake key means same session ID was used
        h1.GetKey().ShouldBe(h2.GetKey());
    }

Copilot AI Mar 9, 2026

Handshake_OnUnix_SessionIdIsZero doesn't actually validate that SessionId is 0: both Handshake instances are created in the same process, so their keys would match even if SessionId were non-zero (pre-fix). Consider asserting handshake.RetrieveHandshakeComponents().SessionId == 0 (and/or remove the redundant key-equality check).
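
A minimal sketch of why the key-equality assertion is vacuous, assuming (hypothetically) that the key is a pure function of the options and the process's session ID. `VacuousTestSketch` and its methods are illustrative names, not MSBuild APIs.

```csharp
using System;

// Hypothetical illustration only; MSBuild's real key has more components.
public static class VacuousTestSketch
{
    private static string Key(int options, int sessionId) => $"{options}:{sessionId}";

    // Mirrors the test under review: both handshakes are built in one process,
    // so they see the same session ID whatever its value is.
    public static bool KeyEqualityCheck(int processSessionId)
        => Key(1, processSessionId) == Key(1, processSessionId);

    // The stronger check suggested in the review: assert the value itself.
    public static bool SessionIdIsZeroCheck(int sessionIdUsed) => sessionIdUsed == 0;
}
```

For a pre-fix session ID such as 4711 (a made-up value), `KeyEqualityCheck(4711)` still returns true, so the test passes with the bug present, whereas `SessionIdIsZeroCheck(4711)` returns false and would catch it.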

Comment on lines +22 to +25
    if (!NativeMethodsShared.IsUnixLike)
    {
        return;
    }

Copilot AI Mar 9, 2026

These tests use if (!NativeMethodsShared.IsUnixLike) return;, which makes them silently pass on non-Unix platforms. Prefer [UnixOnlyFact] (or another conditional test attribute) so the test is reported as skipped instead of passing without assertions.
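
For readers unfamiliar with the pattern, a conditional fact attribute can be sketched as below. This assumes xUnit v2, where `FactAttribute.Skip` is settable; `UnixOnlyFactSketchAttribute` is a hypothetical name for illustration, and the repository's existing `[UnixOnlyFact]` should be used in practice.

```csharp
using System.Runtime.InteropServices;
using Xunit;

// Hypothetical attribute for illustration; prefer the repository's own [UnixOnlyFact].
public sealed class UnixOnlyFactSketchAttribute : FactAttribute
{
    public UnixOnlyFactSketchAttribute()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            // Setting Skip makes the runner report the test as skipped
            // instead of letting it pass without running any assertion.
            Skip = "Requires a Unix-like OS.";
        }
    }
}

public class SkipSketchTests
{
    [UnixOnlyFactSketch]
    public void RunsOnlyOnUnix()
    {
        // The body only executes where the attribute did not set Skip.
        Assert.False(RuntimeInformation.IsOSPlatform(OSPlatform.Windows));
    }
}
```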

Comment on lines +39 to +42
    if (!NativeMethodsShared.IsUnixLike)
    {
        return;
    }

Copilot AI Mar 9, 2026

Same issue here: returning early when not on Unix causes a pass-without-testing on Windows. Use [UnixOnlyFact]/conditional attributes so the test runner correctly reports it as skipped.

Member

Agree with using OS-specific facts - we should have a full set available.

Comment on lines 289 to 300
  int sessionId = 0;
  if (includeSessionId)
  {
-     using var currentProcess = Process.GetCurrentProcess();
-     sessionId = currentProcess.SessionId;
+     if (NativeMethodsShared.IsWindows)
+     {
+         using var currentProcess = Process.GetCurrentProcess();
+         sessionId = currentProcess.SessionId;
+     }
+     // On Unix, getsid() returns the session leader PID which differs per terminal,
+     // preventing cross-terminal node reuse. Use 0 since Unix doesn't need
+     // RDP-style session isolation.
  }

Copilot AI Mar 9, 2026

This changes Handshake SessionId behavior only in src/Shared/CommunicationsUtilities.cs. There is a separate handshake implementation in src/MSBuildTaskHost/CommunicationsUtilities.cs that still uses EnvironmentUtilities.ProcessSessionId (getsid on Unix), so cross-terminal TaskHost reuse may remain broken unless it's updated similarly (or the scope is clarified).

Member

Agree that these should be unified in the same way.

Comment on lines 415 to +418
  /// <summary>
  /// The timeout to connect to a node.
  /// </summary>
- private const int DefaultNodeConnectionTimeout = 900 * 1000; // 15 minutes; enough time that a dev will typically do another build in this time
+ private const int DefaultNodeConnectionTimeout = 30 * 1000; // 30 seconds

Copilot AI Mar 9, 2026

DefaultNodeConnectionTimeout is reduced to 30s here, but src/MSBuildTaskHost/CommunicationsUtilities.cs still defaults to 15 minutes. If TaskHost processes also rely on this timeout for idle cleanup, they'll continue to linger much longer than worker nodes. Consider aligning the defaults (or documenting why TaskHost should remain different).

Member

This is a controversial change - please split it out into a separate PR from the pure bugfixes so it can be discussed on its own.

Comment on lines 45 to +50
  /// <summary>
  /// The amount of time to wait for the client to connect to the host.
+ /// Reduced from 60s to 5s so that failed reuse probes don't block idle nodes
+ /// from reaching their connection timeout check.
  /// </summary>
- private const int ClientConnectTimeout = 60000;
+ private const int ClientConnectTimeout = 5000;

Copilot AI Mar 9, 2026

After reducing this to 5s, a couple of callsites still describe this as "wait a long time" (for handshake reads). Consider updating the wording to avoid confusion when diagnosing timeouts.

Member

Wanting this to be more responsive is reasonable - @rainersigwald this appears to lock out nodes because the message processing pump is single-threaded and executes serially, yeah?

@rainersigwald
Member

@JakeRadMSFT I'd really appreciate it if you could split this into individual bugfixes, because the review and refactorings for the different components are fairly distinct.

@JakeRadMSFT
Member Author

Closing in favor of individual PRs per reviewer request:
