Skip to content

Fix node reuse probe timeout: 0ms poll -> 1s wait#13355

Closed
JakeRadMSFT wants to merge 4 commits intodotnet:mainfrom
JakeRadMSFT:fix/node-reuse-timeout
Closed

Fix node reuse probe timeout: 0ms poll -> 1s wait#13355
JakeRadMSFT wants to merge 4 commits intodotnet:mainfrom
JakeRadMSFT:fix/node-reuse-timeout

Conversation

@JakeRadMSFT
Copy link
Copy Markdown
Member

Problem

The node reuse probe uses a 0ms timeout (poll mode) when attempting to connect to idle nodes. On Unix named pipes, a 0ms poll often fails because the idle node's accept loop hasn't re-entered WaitForConnection yet. This causes MSBuild to skip reusable nodes and spawn new ones instead.

Fix

Use a 1s timeout (TimeoutForNodeReuse) to give sleeping nodes enough time to wake and respond to the handshake probe. This is short enough to not significantly delay node acquisition but long enough for the kernel to deliver the connection.

Split from #13336 per reviewer request.

The reuse probe used a 0ms timeout (poll mode) when attempting to connect
to idle nodes. On Unix named pipes, a 0ms poll often fails because the
idle node's accept loop hasn't re-entered WaitForConnection yet. This
causes MSBuild to skip reusable nodes and spawn new ones instead.

Fix: Use a 1s timeout (TimeoutForNodeReuse) to give sleeping nodes
enough time to wake and respond to the handshake probe.
Copilot AI review requested due to automatic review settings March 10, 2026 06:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts MSBuild out-of-proc node reuse probing to avoid missing reusable idle nodes (notably on Unix/macOS) by waiting briefly for the node to accept a named-pipe connection, instead of using a non-blocking poll.

Changes:

  • Introduce TimeoutForNodeReuse = 1000ms for reuse probes.
  • Update node reuse connection attempts to use the new timeout instead of 0ms poll mode.

@rainersigwald rainersigwald self-assigned this Mar 10, 2026
@rainersigwald
Copy link
Copy Markdown
Member

Pushed a refactor to keep the timeout in the place where it was defined before.

@rainersigwald
Copy link
Copy Markdown
Member

This is short enough to not significantly delay node acquisition

I'd like to drill in on this. Here's an artificial test case:

<Project>
    <ItemGroup>
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=1" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=2" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=3" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=4" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=5" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=6" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=7" />
        <I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=8" />
    </ItemGroup>

    <Target Name="Go">
        <MSBuild Projects="@(I)" BuildInParallel="true" Targets="Inner" />
    </Target>

    <Target Name="Inner">
      <Exec Command="sleep 2" />
    </Target>
</Project>

So: 8 projects that take 2s each, you'd expect the build to take 2s and change, and it's pretty reliably 2.5s for me on this machine.

With this change you can see an added ~1s of latency in the loaded-machine case where worker nodes are busy, because each attempted node acquisition has to wait for the timeout.

killall MSBuild ; rm -r comm_first ; rm -r comm_second ; MSBUILDDEBUGPATH=$(pwd)/comm_first MSBUILDDEBUGCOMM=1 /home/raines/src/msbuild/artifacts/bin/bootstrap/core/dotnet msbuild parallelbuild.proj -clp:Summary -tl:off & sleep 0.2 && MSBUILDDEBUGPATH=$(pwd)/comm_second MSBUILDDEBUGCOMM=1 /home/raines/src/msbuild/artifacts/bin/bootstrap/core/dotnet msbuild parallelbuild.proj -clp:Summary -tl:off
[1] 178914
MSBuild version 18.6.0-dev-26160-01+b5ce005b8 for .NET

MSBuild version 18.6.0-dev-26160-01+b5ce005b8 for .NET


Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:02.53
[1]  + done       MSBUILDDEBUGPATH=$(pwd)/comm_first MSBUILDDEBUGCOMM=1  msbuild  -clp:Summary

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:03.57

And if you insert more busy worker nodes, you can stack this delay.

Since I know you're worried about the many-in-flight-builds case I'm not sure that increasing it so high is the right move.

@rainersigwald
Copy link
Copy Markdown
Member

Closing since it's not clear this is a good plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants