Build coordinator for system-wide MSBuild node management #13338
Closed
JakeRadMSFT wants to merge 6 commits into dotnet:main
Four fixes that dramatically improve MSBuild node reuse on Unix/macOS:
1. SessionId = 0 on Unix (was getsid(), which returns different values per terminal, preventing cross-terminal node reuse)
2. TimeoutForNodeReuse = 1000ms (was 0ms poll-only, too fast for sleeping nodes to respond)
3. ClientConnectTimeout = 5000ms (was 60000ms, blocking idle nodes from reaching their connection timeout check)
4. DefaultNodeConnectionTimeout = 30s (was 15 minutes, so idle nodes clean up promptly instead of lingering)
Relates to dotnet#13334
Move ComputeHash() from ServerNodeHandshake to base Handshake class
so both client and server handshakes can compute their hash for pipe
naming.
Add GetHashBasedPipeName() and FindNodesByHandshakeHash() to
NamedPipeUtil for O(1) discovery of compatible nodes on Unix by
listing /tmp/MSBuild-{hash}-* instead of probing all dotnet processes.
Update NodeEndpointOutOfProc to create hash-based pipe names on Unix.
Update NodeProviderOutOfProcBase to use hash-based discovery on Unix
and hash-based pipe names when connecting.
Includes 12 unit tests covering ComputeHash, GetHashBasedPipeName,
and FindNodesByHandshakeHash.
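The discovery idea in this commit can be illustrated with a minimal Python sketch. It is not the C# code from the PR: the function names mirror GetHashBasedPipeName and FindNodesByHandshakeHash only loosely, the pipe_dir parameter is hypothetical, and the process-id suffix is an assumption, since the commit message shows only the MSBuild-{hash}-* prefix.

```python
import glob
import os

PIPE_PREFIX = "MSBuild"  # prefix from the /tmp/MSBuild-{hash}-* pattern


def get_hash_based_pipe_name(handshake_hash: str, pid: int, pipe_dir: str = "/tmp") -> str:
    """Build a path of the form <pipe_dir>/MSBuild-<hash>-<pid>.

    ASSUMPTION: the suffix after the hash is the node's process id.
    """
    return os.path.join(pipe_dir, f"{PIPE_PREFIX}-{handshake_hash}-{pid}")


def find_nodes_by_handshake_hash(handshake_hash: str, pipe_dir: str = "/tmp") -> list[int]:
    """Return the pids of nodes whose pipe name carries the given hash.

    One directory listing replaces probing every candidate process.
    """
    prefix = f"{PIPE_PREFIX}-{handshake_hash}-"
    pattern = glob.escape(os.path.join(pipe_dir, prefix)) + "*"
    pids = []
    for path in glob.glob(pattern):
        suffix = os.path.basename(path)[len(prefix):]
        if suffix.isdigit():  # ignore entries that do not end in a pid
            pids.append(int(suffix))
    return sorted(pids)
```

Because the hash is part of the file name, a client only ever sees nodes whose handshake would be compatible, so no connection attempt is wasted on an incompatible node.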
The parent MSBuild uses hash-based pipe names on Unix via
TryConnectToProcess, but the task host child was still creating
pipes with the legacy MSBuild{pid} naming. This caused MSB4216
'Could not create or connect to a task host' errors on
Linux and macOS.
Apply the same hash-based pipe naming pattern used by
NodeEndpointOutOfProc to the task host endpoint.
4f072fe to 5391367
When TaskHostParameters specify architecture '*' (any), resolve it to the actual current architecture so parent and child compute identical HandshakeOptions and hash-based pipe names on Unix. This fixes TransientAndSidecarNodeCanCoexist and TaskHostLifecycle tests that were failing because the parent had no architecture bits in the handshake while the child (using TaskHostParameters.Empty) resolved to the current arch.
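Why the wildcard has to be resolved before hashing can be seen in a small Python model. This is a sketch under stated assumptions: the dictionary of options and the SHA-256 hash are stand-ins for MSBuild's HandshakeOptions and ComputeHash, and resolve_architecture and handshake_hash are hypothetical helper names.

```python
import hashlib
import platform


def resolve_architecture(requested: str) -> str:
    """Map the wildcard '*' to the architecture of the current machine."""
    return platform.machine() if requested == "*" else requested


def handshake_hash(options: dict) -> str:
    """Hash a canonical rendering of the options (stand-in for ComputeHash)."""
    canonical = "|".join(f"{key}={options[key]}" for key in sorted(options))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# The child always reports its real architecture.
child = {"node_type": "taskhost", "arch": platform.machine()}

# The parent was asked for "any" architecture.
parent_raw = {"node_type": "taskhost", "arch": "*"}
parent_fixed = dict(parent_raw, arch=resolve_architecture(parent_raw["arch"]))

# Unresolved: the hashes differ, so the parent looks for a pipe name
# that the child never creates.
mismatch = handshake_hash(parent_raw) != handshake_hash(child)

# Resolved: both sides derive the same hash and therefore the same pipe name.
match = handshake_hash(parent_fixed) == handshake_hash(child)
```

With hash-based pipe names, any input that differs between parent and child changes the name itself, which is why the mismatch surfaces as a connection failure rather than a rejected handshake.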
Add BuildCoordinator -- a long-lived process that manages node budget across concurrent MSBuild builds. It limits total worker node count, queues excess builds, and dynamically rebalances via heartbeat protocol. Add BuildCoordinatorClient for BuildManager integration. Add --coordinator CLI mode to XMake.cs. Add ShutdownExcessNodes to INodeManager for dynamic budget reduction. Includes 29 unit tests. Peak nodes reduced from 110 to 10.
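The budgeting behaviour described in this commit can be modelled in a few lines of Python. This is a toy sketch, not the BuildCoordinator from the PR: the even-split policy, the method names, and the in-memory state are all assumptions, and the real coordinator communicates with builds through a heartbeat protocol rather than direct calls.

```python
from collections import deque


class BuildCoordinator:
    """Toy model of a node-budget coordinator (policy is an assumption)."""

    def __init__(self, total_budget: int, max_active_builds: int):
        self.total_budget = total_budget
        self.max_active_builds = max_active_builds
        self.active: dict[str, int] = {}   # build id -> granted node count
        self.queued: deque[str] = deque()  # builds waiting for a slot

    def _rebalance(self) -> None:
        """Split the budget evenly across active builds.

        Every build gets at least 1 node, so the total can exceed the
        budget if there are more active builds than budgeted nodes.
        """
        if not self.active:
            return
        share, extra = divmod(self.total_budget, len(self.active))
        for index, build_id in enumerate(self.active):
            self.active[build_id] = max(1, share + (1 if index < extra else 0))

    def register(self, build_id: str) -> int:
        """Admit a build or queue it; returns granted nodes (0 if queued)."""
        if len(self.active) >= self.max_active_builds:
            self.queued.append(build_id)
            return 0
        self.active[build_id] = 0
        self._rebalance()
        return self.active[build_id]

    def heartbeat(self, build_id: str) -> int:
        """Return the build's current budget; 0 while it is still queued."""
        return self.active.get(build_id, 0)

    def unregister(self, build_id: str) -> None:
        """Remove a finished build and promote queued builds in FIFO order."""
        self.active.pop(build_id, None)
        if build_id in self.queued:
            self.queued.remove(build_id)
        while self.queued and len(self.active) < self.max_active_builds:
            self.active[self.queued.popleft()] = 0
        self._rebalance()
```

With total_budget=12 and max_active_builds=2, the first build is granted 12 nodes, drops to 6 when a second build registers, and a third build waits in the queue until one of the first two finishes. A build that learns of a smaller budget on its next heartbeat is the point at which something like ShutdownExcessNodes would trim its workers.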
5391367 to 8f5111a
Build Coordinator for System-Wide MSBuild Node Management
Part 3 of 3 — Builds on #13337 (hash-based pipe naming) and #13336 (bug fixes). Relates to #13334.
Problem
Running multiple dotnet build commands simultaneously (e.g. across worktrees) spawns excessive MSBuild worker nodes. 10 concurrent builds on a 12-core Mac produce 110 worker processes, most sitting idle, thrashing memory, and competing for CPU.

Even after fixing node reuse bugs (#13336) and adding fast node discovery (#13337), the core issue remains: MSBuild has no concept of system-wide node awareness. Each build independently asks for N nodes with no coordination.
Solution
Build Coordinator (dotnet msbuild --coordinator)
A lightweight long-lived process that manages node budgets across concurrent MSBuild instances:
ShutdownExcessNodes trimming workers mid-build when budget decreases
Test Results
Tested on a 12-core Apple Silicon Mac with 10 concurrent dotnet build invocations (2 batches of 5, 1 minute apart) across 5 worktrees.
Without Coordinator (baseline)
Peak: 110 nodes, 111 pipes. 50-65 idle during builds. ~28s staggered cleanup.
With Coordinator (budget=12, max-builds=2)
Peak: 10 nodes, 11 pipes. 0-7 idle. All 10 builds completed, nodes reused across batches.
Comparison
The coordinator achieves 91% fewer nodes while being slightly faster overall — individual builds get dedicated CPU cores instead of fighting 100+ contending workers. All 10 builds completed successfully through the coordinator's queuing and promotion system.
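The 91% figure follows directly from the two peak node counts reported above (110 without the coordinator, 10 with it):

```latex
\[
\frac{110 - 10}{110} = \frac{100}{110} \approx 0.909 \approx 91\%
\]
```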
Unit Tests
29 tests, all passing:
Changes
10 files changed:
BuildCoordinator.cs (688 lines), BuildCoordinatorClient.cs (217 lines), BuildCoordinator_Tests.cs
BuildManager.cs, INodeManager.cs, NodeManager.cs, NodeProviderOutOfProc.cs, TaskHostNodeManager.cs, Microsoft.Build.csproj, XMake.cs
#if NET guards (net472 unaffected)