Fix node reuse probe timeout: 0ms poll -> 1s wait#13355
Fix node reuse probe timeout: 0ms poll -> 1s wait#13355JakeRadMSFT wants to merge 4 commits intodotnet:mainfrom
Conversation
The reuse probe used a 0ms timeout (poll mode) when attempting to connect to idle nodes. On Unix named pipes, a 0ms poll often fails because the idle node's accept loop hasn't re-entered WaitForConnection yet. This causes MSBuild to skip reusable nodes and spawn new ones instead. Fix: Use a 1s timeout (TimeoutForNodeReuse) to give sleeping nodes enough time to wake and respond to the handshake probe.
There was a problem hiding this comment.
Pull request overview
Adjusts MSBuild out-of-proc node reuse probing to avoid missing reusable idle nodes (notably on Unix/macOS) by waiting briefly for the node to accept a named-pipe connection, instead of using a non-blocking poll.
Changes:
- Introduce
TimeoutForNodeReuse = 1000msfor reuse probes. - Update node reuse connection attempts to use the new timeout instead of
0mspoll mode.
|
Pushed a refactor to keep the timeout in the place where it was defined before. |
I'd like to drill in on this. Here's an artificial test case: <Project>
<ItemGroup>
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=1" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=2" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=3" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=4" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=5" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=6" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=7" />
<I Include="$(MSBuildThisFileFullPath)" AdditionalProperties="InnerBuildId=8" />
</ItemGroup>
<Target Name="Go">
<MSBuild Projects="@(I)" BuildInParallel="true" Targets="Inner" />
</Target>
<Target Name="Inner">
<Exec Command="sleep 2" />
</Target>
</Project>So: 8 projects that take 2s each, you'd expect the build to take 2s and change, and it's pretty reliably 2.5s for me on this machine. With this change you can see an added ~1s of latency in the loaded-machine case where worker nodes are busy, because each attempted node acquisition has to wait for the timeout. ❯ killall MSBuild ; rm -r comm_first ; rm -r comm_second ; MSBUILDDEBUGPATH=$(pwd)/comm_first MSBUILDDEBUGCOMM=1 /home/raines/src/msbuild/artifacts/bin/bootstrap/core/dotnet msbuild parallelbuild.proj -clp:Summary -tl:off & sleep 0.2 && MSBUILDDEBUGPATH=$(pwd)/comm_second MSBUILDDEBUGCOMM=1 /home/raines/src/msbuild/artifacts/bin/bootstrap/core/dotnet msbuild parallelbuild.proj -clp:Summary -tl:off
[1] 178914
MSBuild version 18.6.0-dev-26160-01+b5ce005b8 for .NET
MSBuild version 18.6.0-dev-26160-01+b5ce005b8 for .NET
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:02.53
[1] + done MSBUILDDEBUGPATH=$(pwd)/comm_first MSBUILDDEBUGCOMM=1 msbuild -clp:Summary
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:03.57And if you insert more busy worker nodes, you can stack this delay. Since I know you're worried about the many-in-flight-builds case I'm not sure that increasing it so high is the right move. |
|
Closing since it's not clear this is a good plan. |
Problem
The node reuse probe uses a 0ms timeout (poll mode) when attempting to connect to idle nodes. On Unix named pipes, a 0ms poll often fails because the idle node's accept loop hasn't re-entered
WaitForConnectionyet. This causes MSBuild to skip reusable nodes and spawn new ones instead.Fix
Use a 1s timeout (
TimeoutForNodeReuse) to give sleeping nodes enough time to wake and respond to the handshake probe. This is short enough to not significantly delay node acquisition but long enough for the kernel to deliver the connection.Split from #13336 per reviewer request.