Build coordinator for system-wide MSBuild node management #13338

Closed
JakeRadMSFT wants to merge 6 commits into dotnet:main from JakeRadMSFT:feature/build-coordinator

JakeRadMSFT (Member) commented Mar 9, 2026

Build Coordinator for System-Wide MSBuild Node Management

Part 3 of 3 — Builds on #13337 (hash-based pipe naming) and #13336 (bug fixes). Relates to #13334.

Problem

Running multiple dotnet build commands simultaneously (e.g. across worktrees) spawns excessive MSBuild worker nodes. Ten concurrent builds on a 12-core Mac produce 110 worker processes, most of them sitting idle, thrashing memory, and competing for CPU.

Even after fixing node reuse bugs (#13336) and adding fast node discovery (#13337), the core issue remains: MSBuild has no concept of system-wide node awareness. Each build independently asks for N nodes with no coordination.

Solution

Build Coordinator (dotnet msbuild --coordinator)

A lightweight long-lived process that manages node budgets across concurrent MSBuild instances:

  • Fair-share budgeting: total node budget divided equally among active builds
  • Epoch-based heartbeat-gated promotion: new builds are activated only after all existing builds have acknowledged their reduced budget via heartbeat, which prevents the total from temporarily exceeding the budget
  • Dynamic rebalancing: budgets adjust as builds start/finish, with ShutdownExcessNodes trimming workers mid-build when budget decreases
  • PID-aware staleness reaper: detects crashed builds by checking heartbeat freshness + process liveness
  • Pipe file watchdog: auto-recreates the coordinator pipe if deleted by cleanup scripts
  • Coordinator-aware lifecycle: when coordinator is active, nodes exit immediately on build completion instead of lingering
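
The first two bullets can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PR's C# implementation: the names Build, Coordinator, register, and heartbeat are hypothetical, and the equal integer split (with any remainder left unused) is an assumption about how "fair share" might be computed. It ignores max-builds queuing, unregistering, and the case where there are more builds than budget.

```python
from dataclasses import dataclass, field


@dataclass
class Build:
    build_id: str
    budget: int = 0        # worker nodes this build may use (0 = not yet active)
    acked_epoch: int = 0   # last budget epoch this build confirmed via heartbeat


@dataclass
class Coordinator:
    total_budget: int
    epoch: int = 0
    active: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def _share(self) -> int:
        # Equal split across active and pending builds; the remainder stays unused here.
        return self.total_budget // (len(self.active) + len(self.pending))

    def register(self, build_id: str) -> Build:
        build = Build(build_id)
        self.pending.append(build)
        self.epoch += 1                      # budgets are about to change
        for b in self.active.values():
            b.budget = self._share()         # shrink existing builds first
        self._promote_if_acked()             # immediate when nothing is active
        return build

    def heartbeat(self, build_id: str) -> int:
        build = self.active[build_id]
        build.acked_epoch = self.epoch       # the build has now seen its current budget
        self._promote_if_acked()
        return build.budget

    def _promote_if_acked(self) -> None:
        if not self.pending:
            return
        if any(b.acked_epoch != self.epoch for b in self.active.values()):
            return                           # a build may still be using its old, larger budget
        share = self._share()
        for b in self.pending:
            b.budget, b.acked_epoch = share, self.epoch
            self.active[b.build_id] = b
        self.pending.clear()


coordinator = Coordinator(total_budget=12)
first = coordinator.register("build-1")    # activated at once with the full budget
second = coordinator.register("build-2")   # waits until build-1 acknowledges its cut
print(first.budget, second.budget)         # 6 0
print(coordinator.heartbeat("build-1"))    # 6
print(second.budget)                       # 6
```

The point of the gate is visible in the second print: between registration and the heartbeat, the newcomer holds no budget, so the sum of budgets in use never exceeds the total.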

Test Results

Tested on a 12-core Apple Silicon Mac with 10 concurrent dotnet build invocations (2 batches of 5, 1 minute apart) across 5 worktrees.

Without Coordinator (baseline)

TIME      NODES  PIPES  BUILDS  IDLE  PEAK  +SPAWN  -DIED
19:27:22      0      1       5     0     0       0      0  ← batch 1 starts (5 builds)
19:27:26     44     32       5     7    44      44      0  ← nodes explode immediately
19:27:33     55     56       5     0    55      55      0  ← 55 nodes (11 per build)
19:28:10     55     56       5    20    55      55      0  ← 20 idle while building
19:28:22     55     56      10    26    55      55      0  ← batch 2 arrives
19:28:39    110    106      10    22   110     110      0  ← 110 nodes! (11 × 10)
19:29:24    110    111      10    59   110     110      0  ← 59 of 110 nodes idle
19:30:06    110    111      10    65   110     110      0  ← 65 idle, wasting resources
19:30:57    110    111       0   110   110     110      0  ← all done, 110 idle nodes
19:31:07     88     89       0    88   110     110     22  ← staggered cleanup begins
19:31:21     55     56       0    55   110     110     55
19:31:25      0      1       0     0   110     110    110  ← all dead after ~28s

Peak: 110 nodes, 111 pipes. 50-65 idle during builds. ~28s staggered cleanup.

With Coordinator (budget=12, max-builds=2)

TIME      NODES  PIPES  BUILDS  IDLE  PEAK  +SPAWN  -DIED
19:22:09      0      1       0     0     0       0      0 [COORD] ← coordinator listening
19:22:18      5      1       5     5     5       5      0 [COORD] ← batch 1 (5 builds), only 2 active
19:22:22     10     11       5     0    10      10      0 [COORD] ← capped at 10 nodes (budget=12)
19:22:47     10     11       5     7    10      10      0 [COORD] ← max 7 idle (not 60+)
19:23:07     10     11       3     5    10      10      0 [COORD] ← 2 done, next 2 promoted
19:23:18     10     11       8     0    10      10      0 [COORD] ← batch 2 arrives, queued
19:24:17     10     11       6     6    10      10      0 [COORD] ← steady state
19:25:34     10     11       1     5    10      10      0 [COORD] ← last build finishing
19:25:37     10     11       0    10    10      10      0 [COORD] ← all done
19:26:03      5      6       0     5    10      10      5 [COORD] ← nodes exiting
19:26:06      0      1       0     0    10      10     10 [COORD] ← clean

Peak: 10 nodes, 11 pipes. 0-7 idle. All 10 builds completed, nodes reused across batches.
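
The queuing visible in this run (max-builds=2, with the remaining builds waiting until earlier ones finish) can be sketched as below. This is an illustrative Python model, not the PR's code: BuildQueue and its methods are hypothetical names, first-in-first-out promotion order is an assumption, and the heartbeat gating described in the Solution section is left out to keep the example short.

```python
from collections import deque


class BuildQueue:
    """Admits at most max_builds builds at a time; the rest wait in arrival order."""

    def __init__(self, max_builds: int):
        self.max_builds = max_builds
        self.active = set()
        self.queued = deque()

    def register(self, build_id: str) -> bool:
        """Return True if the build is active immediately, False if it was queued."""
        if len(self.active) < self.max_builds:
            self.active.add(build_id)
            return True
        self.queued.append(build_id)
        return False

    def unregister(self, build_id: str) -> list:
        """Remove a finished (or cancelled) build and return the builds promoted as a result."""
        self.active.discard(build_id)
        if build_id in self.queued:
            self.queued.remove(build_id)
        promoted = []
        while self.queued and len(self.active) < self.max_builds:
            next_build = self.queued.popleft()
            self.active.add(next_build)
            promoted.append(next_build)
        return promoted


queue = BuildQueue(max_builds=2)
print([queue.register(f"build-{i}") for i in range(1, 6)])  # [True, True, False, False, False]
print(queue.unregister("build-1"))                          # ['build-3']
```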

Comparison

Metric                        Without Coordinator  With Coordinator  Improvement
Peak nodes                    110                  10                91% fewer
Peak pipes                    111                  11                90% fewer
Total nodes spawned           110                  10                91% fewer
Avg idle nodes during build   ~50-65               ~3-7              ~93% less waste
Nodes after all builds done   110 (for 28s)        10 (for 29s)      11× fewer to clean up
Total wall-clock (10 builds)  ~3:35                ~3:19             ~8% faster

The coordinator achieves 91% fewer nodes while being slightly faster overall — individual builds get dedicated CPU cores instead of fighting 100+ contending workers. All 10 builds completed successfully through the coordinator's queuing and promotion system.
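
The node and pipe percentages follow directly from the peak counts reported above. A small check, where percent_fewer is just a helper name chosen for this example:

```python
def percent_fewer(before: float, after: float) -> int:
    """Reduction from before to after, as a whole-number percentage."""
    return round(100 * (before - after) / before)


print(percent_fewer(110, 10))  # 91  (peak nodes)
print(percent_fewer(111, 11))  # 90  (peak pipes)
```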

Unit Tests

29 tests, all passing:

  • BuildCoordinator_Tests.cs (29 tests): coordinator protocol, fair-share budgeting, queuing and promotion, heartbeat budget adjustments, rebalancing on unregister, max concurrent build limits, status reporting, graceful shutdown

Changes

10 files changed:

  • 3 new files: BuildCoordinator.cs (688 lines), BuildCoordinatorClient.cs (217 lines), BuildCoordinator_Tests.cs
  • 7 modified files: BuildManager.cs, INodeManager.cs, NodeManager.cs, NodeProviderOutOfProc.cs, TaskHostNodeManager.cs, Microsoft.Build.csproj, XMake.cs
  • All coordinator code behind #if NET guards (net472 unaffected)

Four fixes that dramatically improve MSBuild node reuse on Unix/macOS:

1. SessionId = 0 on Unix (was getsid() which returns different values
   per terminal, preventing cross-terminal node reuse)
2. TimeoutForNodeReuse = 1000ms (was 0ms poll-only, too fast for
   sleeping nodes to respond)
3. ClientConnectTimeout = 5000ms (was 60000ms, blocking idle nodes
   from reaching their connection timeout check)
4. DefaultNodeConnectionTimeout = 30s (was 15 minutes, so idle nodes
   clean up promptly instead of lingering)

Relates to dotnet#13334

Move ComputeHash() from ServerNodeHandshake to base Handshake class
so both client and server handshakes can compute their hash for pipe
naming.

Add GetHashBasedPipeName() and FindNodesByHandshakeHash() to
NamedPipeUtil for O(1) discovery of compatible nodes on Unix by
listing /tmp/MSBuild-{hash}-* instead of probing all dotnet processes.

Update NodeEndpointOutOfProc to create hash-based pipe names on Unix.
Update NodeProviderOutOfProcBase to use hash-based discovery on Unix
and hash-based pipe names when connecting.
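
The discovery idea, listing pipe files whose names embed the handshake hash instead of probing every process, can be sketched as follows. This is a Python illustration under stated assumptions, not the NamedPipeUtil code: the function names merely echo GetHashBasedPipeName and FindNodesByHandshakeHash, the part of the name after the hash is treated as an opaque suffix because the commit message only gives the pattern MSBuild-{hash}-*, and ordinary files in a temporary directory stand in for pipes under /tmp.

```python
import glob
import os
import tempfile


def hash_based_pipe_name(pipe_dir: str, handshake_hash: str, suffix: str) -> str:
    """Build a pipe path of the form <pipe_dir>/MSBuild-<hash>-<suffix>."""
    return os.path.join(pipe_dir, f"MSBuild-{handshake_hash}-{suffix}")


def find_nodes_by_handshake_hash(pipe_dir: str, handshake_hash: str) -> list:
    """Return paths of all pipes advertising the given hash, using one directory listing."""
    pattern = os.path.join(
        glob.escape(pipe_dir), f"MSBuild-{glob.escape(handshake_hash)}-*"
    )
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as pipe_dir:
    # Two nodes share one handshake hash; a third is incompatible.
    for handshake_hash, suffix in [("abc123", "101"), ("abc123", "102"), ("ffff00", "103")]:
        open(hash_based_pipe_name(pipe_dir, handshake_hash, suffix), "w").close()

    matches = find_nodes_by_handshake_hash(pipe_dir, "abc123")
    print([os.path.basename(p) for p in matches])  # ['MSBuild-abc123-101', 'MSBuild-abc123-102']
```

Because compatibility is encoded in the file name, a client never has to connect to an incompatible node just to find out that the handshake fails.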

Includes 12 unit tests covering ComputeHash, GetHashBasedPipeName,
and FindNodesByHandshakeHash.

The parent MSBuild uses hash-based pipe names on Unix via
TryConnectToProcess, but the task host child was still creating
pipes with the legacy MSBuild{pid} naming. This caused MSB4216
'Could not create or connect to a task host' errors on
Linux and macOS.

Apply the same hash-based pipe naming pattern used by
NodeEndpointOutOfProc to the task host endpoint.

JakeRadMSFT force-pushed the feature/build-coordinator branch from 4f072fe to 5391367 on March 9, 2026 03:42

When TaskHostParameters specify architecture '*' (any), resolve it to the
actual current architecture so parent and child compute identical
HandshakeOptions and hash-based pipe names on Unix. This fixes
TransientAndSidecarNodeCanCoexist and TaskHostLifecycle tests that were
failing because the parent had no architecture bits in the handshake while
the child (using TaskHostParameters.Empty) resolved to the current arch.

Add BuildCoordinator -- a long-lived process that manages node budget
across concurrent MSBuild builds. It limits total worker node count,
queues excess builds, and dynamically rebalances via heartbeat protocol.

Add BuildCoordinatorClient for BuildManager integration.
Add --coordinator CLI mode to XMake.cs.
Add ShutdownExcessNodes to INodeManager for dynamic budget reduction.
Includes 29 unit tests. Peak nodes reduced from 110 to 10.

JakeRadMSFT force-pushed the feature/build-coordinator branch from 5391367 to 8f5111a on March 9, 2026 16:57