Fix thread safety issues in WorkerNodeTelemetryData #13413
AR-May merged 3 commits into dotnet:main
Conversation
Pull request overview
This PR addresses a concurrency bug in MSBuild’s internal worker-node telemetry aggregation when running in in-proc multi-threaded mode (/m /mt), where multiple RequestBuilder instances can concurrently mutate shared telemetry dictionaries and corrupt their state.
Changes:
- Switch telemetry reporting in `RequestBuilder.UpdateStatisticsPostBuild` from per-target/per-task updates to batching into a local `WorkerNodeTelemetryData` and merging once.
- Replace `ITelemetryForwarder.AddTask`/`AddTarget` with a single `MergeWorkerData` API and update the provider implementations accordingly.
- Add explicit locking around aggregation in the internal telemetry-consuming logger.
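The batch-then-merge pattern described above can be sketched as follows. This is an illustrative model only, not the actual MSBuild types: the dictionaries, `SketchForwarder`, and `GetCount` are assumptions standing in for `WorkerNodeTelemetryData` and the forwarder.

```csharp
using System;
using System.Collections.Generic;

// A builder accumulates counts in a local batch (zero contention), then
// merges once under the forwarder's lock, so per-item calls never touch
// shared state.
var forwarder = new SketchForwarder();

var local = new Dictionary<string, int>(); // per-RequestBuilder batch
void AddTarget(string name) =>
    local[name] = local.TryGetValue(name, out var n) ? n + 1 : 1;

AddTarget("Build");
AddTarget("Build");
AddTarget("Pack");

forwarder.MergeWorkerData(local); // single lock acquisition per request
Console.WriteLine(forwarder.GetCount("Build")); // 2

class SketchForwarder
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _shared = new();

    // One lock-protected entry point instead of per-item AddTarget/AddTask.
    public void MergeWorkerData(Dictionary<string, int> batch)
    {
        lock (_lock)
        {
            foreach (var (key, count) in batch)
                _shared[key] = _shared.TryGetValue(key, out var n) ? n + count : count;
        }
    }

    public int GetCount(string name)
    {
        lock (_lock) { return _shared.TryGetValue(name, out var n) ? n : 0; }
    }
}
```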
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds method documentation and minor refactor while preserving merge/aggregation behavior. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Replaces per-item update APIs with MergeWorkerData and exposes key creation helper for batching. |
| src/Build/TelemetryInfra/InternalTelemetryConsumingLogger.cs | Adds a lock around worker telemetry aggregation. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Updates forwarder contract to support batched merging. |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Implements local accumulation + single merge under a lock to prevent concurrent dictionary writes. |
Force-pushed from f2e0591 to 3e1fd80
Force-pushed from 3e1fd80 to 1a19fa3
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Fixes telemetry thread-safety and counter inflation in in-proc multithreaded (/m /mt) builds by batching telemetry per RequestBuilder, merging into a shared forwarder under a lock, and preventing repeated “send the whole buffer” behavior during engine shutdown.
Changes:
- Accumulate task/target telemetry in a per-`RequestBuilder` `WorkerNodeTelemetryData` and merge once into the shared forwarder.
- Make `TelemetryForwarder` thread-safe and change finalization to "swap-and-send" to avoid Nx duplication across multiple `BuildRequestEngine` finalizers.
- Add unit tests for `WorkerNodeTelemetryData.IsEmpty` and forwarder reset behavior after `FinalizeProcessing`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds IsEmpty and improves merge/add method clarity used by the forwarder swap-and-send logic. |
| src/Framework/Telemetry/TaskOrTargetTelemetryKey.cs | Introduces a helper factory (Create) to centralize key construction used by RequestBuilder. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Adds locking, batch merge entrypoint, and swap-and-send finalization to prevent races and duplicate sends. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Replaces per-item APIs with a batch merge API (MergeWorkerData). |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Switches to batch-then-merge telemetry collection to remove dictionary contention. |
| src/Build.UnitTests/Telemetry/Telemetry_Tests.cs | Adds tests for the new reset/empty behavior and forwarder finalization semantics. |
This reverts commit dfe5370.
PR dotnet#13413 fixed thread-safety by creating a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request, populating it, then merging under a lock. Each dictionary grows through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node across multiple nodes, this totals ~200MB for large solutions like OrchardCore (233 projects).

Fix: restore AddTarget/AddTask directly on ITelemetryForwarder with internal per-call locking in TelemetryForwarder. This eliminates all per-request dictionary allocations while preserving thread safety and the N-times duplication fix (swap-and-send FinalizeProcessing). Lock overhead is negligible: ~20ns uncontended x ~200 calls = 4us/request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
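The restored per-call-locking design can be sketched like this. Illustrative names only (`LockingForwarder`, `GetCount` are assumptions, not the real MSBuild API); it shows why no per-request dictionaries are needed: every builder writes straight into the one shared dictionary, serialized by a short lock.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// AddTarget locks per call, so concurrent builders cannot corrupt the
// shared dictionary and no per-request batch dictionaries are allocated.
var forwarder = new LockingForwarder();

// Simulate several builders reporting targets concurrently.
Parallel.For(0, 4, _ =>
{
    for (int i = 0; i < 250; i++)
        forwarder.AddTarget("Build");
});

Console.WriteLine(forwarder.GetCount("Build")); // 1000

class LockingForwarder
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _targets = new();

    public void AddTarget(string name)
    {
        lock (_lock) // ~20ns uncontended, per the commit message
            _targets[name] = _targets.TryGetValue(name, out var n) ? n + 1 : 1;
    }

    public int GetCount(string name)
    {
        lock (_lock) { return _targets.TryGetValue(name, out var n) ? n : 0; }
    }
}
```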
Eliminates ~200MB allocation regression from PR #13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR #13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes because the pattern was applied unconditionally.

Fix: each BuildRequestEngine now owns its own WorkerNodeTelemetryData on the NodeLoggingContext. RequestBuilder writes directly to it with per-call locking (for concurrent builders within one engine). The engine sends the data at CleanupForBuild and resets for the next build.

This eliminates:
- Per-request dictionary allocations (zero new dictionaries per request)
- Singleton contention (no shared TelemetryForwarderProvider state)
- N-times duplication (each engine sends only its own data)

The InternalTelemetryConsumingLogger already merges WorkerNodeTelemetry EventArgs from all engines, so the data flow is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
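A quick back-of-envelope check of the figures quoted above. The node count is an assumption (the commit message only says "multiple nodes"; 7 worker nodes would match, e.g., an 8-core /m build):

```csharp
using System;

const double perRequestKB = 30;   // resize churn per request (from the commit message)
const int requestsPerNode = 1000; // approximate requests per node
const int nodes = 7;              // assumed node count

double totalMB = perRequestKB * requestsPerNode * nodes / 1024;
Console.WriteLine($"{totalMB:F0} MB"); // 205 MB, consistent with "~200MB"
```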
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via TelemetryForwarderProvider.CreateForwarder) and passes it to builders through NodeLoggingContext. The forwarder encapsulates the data, the lock, and the send logic:
- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

This eliminates per-request allocations, singleton contention, and N-times duplication (each engine sends only its own data).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryCollector (via TelemetryCollectorProvider.CreateCollector) and passes it to builders through NodeLoggingContext. The collector encapsulates the data, the lock, and the send logic:
- TelemetryCollector owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryCollectorProvider is the factory, not a data-holding singleton

Renamed from Forwarder to Collector to reflect what it actually does (collects telemetry, does not forward — forwarding is done by the logging infrastructure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Each BuildRequestEngine creates its own ITelemetryForwarder (via TelemetryForwarderProvider.CreateForwarder) and passes it to builders through NodeLoggingContext:
- TelemetryForwarder owns WorkerNodeTelemetryData (no locking needed - the engine's one-active-builder-at-a-time invariant guarantees single-threaded access, see dotnet#13531)
- RequestBuilder calls AddTarget/AddTask directly
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
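The final per-engine shape can be sketched like this. Illustrative names only (`TelemetrySketchProvider`, `SketchForwarder`, and the plain dictionaries stand in for the real `TelemetryForwarderProvider`, `TelemetryForwarder`, and `WorkerNodeTelemetryData`): the provider is a pure factory, each engine owns its forwarder, and `FinalizeProcessing` hands back only that engine's data, so N engines no longer duplicate each other's counts.

```csharp
using System;
using System.Collections.Generic;

// Two engines, each with its own forwarder created by the factory.
var engine1 = TelemetrySketchProvider.CreateForwarder();
var engine2 = TelemetrySketchProvider.CreateForwarder();

engine1.AddTarget("Build");
engine2.AddTarget("Build");
engine2.AddTarget("Pack");

// Stand-in for the logger that merges telemetry from all engines.
var merged = new Dictionary<string, int>();
foreach (var forwarder in new[] { engine1, engine2 })
    foreach (var (key, count) in forwarder.FinalizeProcessing())
        merged[key] = merged.TryGetValue(key, out var n) ? n + count : count;

Console.WriteLine(merged["Build"]); // 2: each engine sent only its own data

static class TelemetrySketchProvider
{
    // Factory only; holds no telemetry state itself.
    public static SketchForwarder CreateForwarder() => new();
}

class SketchForwarder
{
    private Dictionary<string, int> _data = new();

    public void AddTarget(string name) =>
        _data[name] = _data.TryGetValue(name, out var n) ? n + 1 : 1;

    // Swap-and-send: hand back the buffer and start fresh.
    public Dictionary<string, int> FinalizeProcessing()
    {
        var sent = _data;
        _data = new Dictionary<string, int>();
        return sent;
    }
}
```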
Related to #12867
Context
Fixing two bugs in telemetry infrastructure when using /m /mt (in-proc multithreaded) mode:
1. Thread-safety crash: all in-proc nodes share a single TelemetryForwarderProvider singleton. Multiple RequestBuilder instances run on dedicated threads and call AddTask/AddTarget concurrently on the same WorkerNodeTelemetryData dictionary fields, causing race conditions and dictionary corruption.
2. Nx telemetry duplication: In /m /mt mode, N BuildRequestEngine instances share one TelemetryForwarderProvider singleton. Each engine calls FinalizeProcessing on shutdown, sending the entire accumulated data each time. The InternalTelemetryConsumingLogger merges all N copies, inflating every counter N times.
Reproduction: 20+ non-SDK .NET Framework library projects + 1 exe referencing all of them, built with MSBuild.exe Repro.sln /m /mt.
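The duplication bug can be demonstrated with a small model (illustrative only, not the real MSBuild types): when N engines each send the same shared buffer on shutdown and the consuming logger sums every copy, every counter is inflated N times.

```csharp
using System;
using System.Collections.Generic;

// One shared buffer holding the true counts for the whole build.
var shared = new Dictionary<string, int> { ["Build"] = 100 };

int engines = 3; // N BuildRequestEngine instances
var loggerTotal = new Dictionary<string, int>();

for (int e = 0; e < engines; e++)          // each engine finalizes...
    foreach (var (key, count) in shared)   // ...sending the whole buffer again
        loggerTotal[key] = loggerTotal.TryGetValue(key, out var n) ? n + count : count;

Console.WriteLine(loggerTotal["Build"]); // 300: the true count of 100, inflated 3x
```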
Changes Made
Fix
Batch-then-merge in RequestBuilder: Each RequestBuilder now accumulates task/target telemetry into a local WorkerNodeTelemetryData instance (zero contention), then merges once into the shared state via TelemetryForwarder.MergeWorkerData().
Thread-safe TelemetryForwarder: Added an internal lock protecting both MergeWorkerData and FinalizeProcessing. The forwarder is a singleton shared across BuildRequestEngine instances in /m /mt mode, so concurrent access is expected.
Swap-and-send in FinalizeProcessing: Instead of sending the same accumulated data on every call, FinalizeProcessing atomically swaps the internal data with a fresh empty instance under the lock, then sends only if non-empty. This ensures each accumulated batch is sent exactly once (no Nx duplication and no data loss).
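The swap-and-send behavior described above can be sketched as follows (illustrative names; a plain dictionary stands in for `WorkerNodeTelemetryData`, and the `Count == 0` check stands in for `IsEmpty`):

```csharp
using System;
using System.Collections.Generic;

var forwarder = new SwapAndSendForwarder();
forwarder.AddTarget("Build");

var first = forwarder.FinalizeProcessing();  // sends the batch
var second = forwarder.FinalizeProcessing(); // nothing left to send

Console.WriteLine(first?["Build"]); // 1
Console.WriteLine(second is null);  // True

class SwapAndSendForwarder
{
    private readonly object _lock = new();
    private Dictionary<string, int> _data = new();

    public void AddTarget(string name)
    {
        lock (_lock)
            _data[name] = _data.TryGetValue(name, out var n) ? n + 1 : 1;
    }

    // Swap the buffer for a fresh instance under the lock, then send only
    // if non-empty; repeated calls cannot resend the same data.
    public Dictionary<string, int>? FinalizeProcessing()
    {
        Dictionary<string, int> snapshot;
        lock (_lock)
        {
            snapshot = _data;
            _data = new Dictionary<string, int>();
        }
        return snapshot.Count == 0 ? null : snapshot;
    }
}
```

Data added after a finalize call lands in the fresh instance and is picked up by the next send, which is why the swap avoids both duplication and data loss.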
Testing
Locally tested that the issue is gone on a repro project.
Unit tests