Fix thread safety issues in WorkerNodeTelemetryData #13413
AR-May merged 3 commits into dotnet:main
Conversation
Pull request overview
This PR addresses a concurrency bug in MSBuild’s internal worker-node telemetry aggregation when running in in-proc multi-threaded mode (/m /mt), where multiple RequestBuilder instances can concurrently mutate shared telemetry dictionaries and corrupt their state.
Changes:
- Switch telemetry reporting in `RequestBuilder.UpdateStatisticsPostBuild` from per-target/per-task updates to batching into a local `WorkerNodeTelemetryData` and merging once.
- Replace `ITelemetryForwarder.AddTask`/`AddTarget` with a single `MergeWorkerData` API and update the provider implementations accordingly.
- Add explicit locking around aggregation in the internal telemetry-consuming logger.
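The batch-then-merge pattern described above can be sketched as follows. This is an illustrative model only, not the actual MSBuild types: the dictionaries, `SketchForwarder`, and `GetCount` are assumptions standing in for `WorkerNodeTelemetryData` and the forwarder.

```csharp
using System;
using System.Collections.Generic;

// A builder accumulates counts in a local batch (zero contention), then
// merges once under the forwarder's lock, so per-item calls never touch
// shared state.
var forwarder = new SketchForwarder();

var local = new Dictionary<string, int>(); // per-RequestBuilder batch
void AddTarget(string name) =>
    local[name] = local.TryGetValue(name, out var n) ? n + 1 : 1;

AddTarget("Build");
AddTarget("Build");
AddTarget("Pack");

forwarder.MergeWorkerData(local); // single lock acquisition per request
Console.WriteLine(forwarder.GetCount("Build")); // 2

class SketchForwarder
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _shared = new();

    // One lock-protected entry point instead of per-item AddTarget/AddTask.
    public void MergeWorkerData(Dictionary<string, int> batch)
    {
        lock (_lock)
        {
            foreach (var (key, count) in batch)
                _shared[key] = _shared.TryGetValue(key, out var n) ? n + count : count;
        }
    }

    public int GetCount(string name)
    {
        lock (_lock) { return _shared.TryGetValue(name, out var n) ? n : 0; }
    }
}
```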
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds method documentation and minor refactor while preserving merge/aggregation behavior. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Replaces per-item update APIs with MergeWorkerData and exposes key creation helper for batching. |
| src/Build/TelemetryInfra/InternalTelemetryConsumingLogger.cs | Adds a lock around worker telemetry aggregation. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Updates forwarder contract to support batched merging. |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Implements local accumulation + single merge under a lock to prevent concurrent dictionary writes. |
Force-pushed from f2e0591 to 3e1fd80
Force-pushed from 3e1fd80 to 1a19fa3
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Fixes telemetry thread-safety and counter inflation in in-proc multithreaded (/m /mt) builds by batching telemetry per RequestBuilder, merging into a shared forwarder under a lock, and preventing repeated “send the whole buffer” behavior during engine shutdown.
Changes:
- Accumulate task/target telemetry in a per-`RequestBuilder` `WorkerNodeTelemetryData` and merge once into the shared forwarder.
- Make `TelemetryForwarder` thread-safe and change finalization to "swap-and-send" to avoid Nx duplication across multiple `BuildRequestEngine` finalizers.
- Add unit tests for `WorkerNodeTelemetryData.IsEmpty` and forwarder reset behavior after `FinalizeProcessing`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds IsEmpty and improves merge/add method clarity used by the forwarder swap-and-send logic. |
| src/Framework/Telemetry/TaskOrTargetTelemetryKey.cs | Introduces a helper factory (Create) to centralize key construction used by RequestBuilder. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Adds locking, batch merge entrypoint, and swap-and-send finalization to prevent races and duplicate sends. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Replaces per-item APIs with a batch merge API (MergeWorkerData). |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Switches to batch-then-merge telemetry collection to remove dictionary contention. |
| src/Build.UnitTests/Telemetry/Telemetry_Tests.cs | Adds tests for the new reset/empty behavior and forwarder finalization semantics. |
This reverts commit dfe5370.
PR dotnet#13413 fixed thread-safety by creating a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request, populating it, then merging under a lock. Each dictionary grows through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node across multiple nodes, this totals ~200MB for large solutions like OrchardCore (233 projects).

Fix: restore AddTarget/AddTask directly on ITelemetryForwarder with internal per-call locking in TelemetryForwarder. This eliminates all per-request dictionary allocations while preserving thread safety and the N-times duplication fix (swap-and-send FinalizeProcessing). Lock overhead is negligible: ~20ns uncontended x ~200 calls = 4us/request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
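The restored per-call-locking design can be sketched like this. Illustrative names only (`LockingForwarder`, `GetCount` are assumptions, not the real MSBuild API); it shows why no per-request dictionaries are needed: every builder writes straight into the one shared dictionary, serialized by a short lock.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// AddTarget locks per call, so concurrent builders cannot corrupt the
// shared dictionary and no per-request batch dictionaries are allocated.
var forwarder = new LockingForwarder();

// Simulate several builders reporting targets concurrently.
Parallel.For(0, 4, _ =>
{
    for (int i = 0; i < 250; i++)
        forwarder.AddTarget("Build");
});

Console.WriteLine(forwarder.GetCount("Build")); // 1000

class LockingForwarder
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _targets = new();

    public void AddTarget(string name)
    {
        lock (_lock) // ~20ns uncontended, per the commit message
            _targets[name] = _targets.TryGetValue(name, out var n) ? n + 1 : 1;
    }

    public int GetCount(string name)
    {
        lock (_lock) { return _targets.TryGetValue(name, out var n) ? n : 0; }
    }
}
```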
Eliminates ~200MB allocation regression from PR #13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR #13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes because the pattern was applied unconditionally.

Fix: each BuildRequestEngine now owns its own WorkerNodeTelemetryData on the NodeLoggingContext. RequestBuilder writes directly to it with per-call locking (for concurrent builders within one engine). The engine sends the data at CleanupForBuild and resets for the next build.

This eliminates:
- Per-request dictionary allocations (zero new dictionaries per request)
- Singleton contention (no shared TelemetryForwarderProvider state)
- N-times duplication (each engine sends only its own data)

The InternalTelemetryConsumingLogger already merges WorkerNodeTelemetry EventArgs from all engines, so the data flow is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
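A quick back-of-envelope check of the figures quoted above. The node count is an assumption (the commit message only says "multiple nodes"; 7 worker nodes would match, e.g., an 8-core /m build):

```csharp
using System;

const double perRequestKB = 30;   // resize churn per request (from the commit message)
const int requestsPerNode = 1000; // approximate requests per node
const int nodes = 7;              // assumed node count

double totalMB = perRequestKB * requestsPerNode * nodes / 1024;
Console.WriteLine($"{totalMB:F0} MB"); // 205 MB, consistent with "~200MB"
```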
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via TelemetryForwarderProvider.CreateForwarder) and passes it to builders through NodeLoggingContext. The forwarder encapsulates the data, the lock, and the send logic:
- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

This eliminates per-request allocations, singleton contention, and N-times duplication (each engine sends only its own data).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2 Dictionary<> instances) per build request to fix thread-safety. Each dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests per node x multiple nodes = ~200MB for OrchardCore (233 projects). The regression affected both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryCollector (via TelemetryCollectorProvider.CreateCollector) and passes it to builders through NodeLoggingContext. The collector encapsulates the data, the lock, and the send logic:
- TelemetryCollector owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryCollectorProvider is the factory, not a data-holding singleton

Renamed from Forwarder to Collector to reflect what it actually does (collects telemetry, does not forward — forwarding is done by the logging infrastructure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing per-request WorkerNodeTelemetryData dictionary allocations entirely.

Each BuildRequestEngine creates its own ITelemetryForwarder (via TelemetryForwarderProvider.CreateForwarder) and passes it to builders through NodeLoggingContext:
- TelemetryForwarder owns WorkerNodeTelemetryData (no locking needed - the engine's one-active-builder-at-a-time invariant guarantees single-threaded access, see dotnet#13531)
- RequestBuilder calls AddTarget/AddTask directly
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
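The final per-engine shape can be sketched like this. Illustrative names only (`TelemetrySketchProvider`, `SketchForwarder`, and the plain dictionaries stand in for the real `TelemetryForwarderProvider`, `TelemetryForwarder`, and `WorkerNodeTelemetryData`): the provider is a pure factory, each engine owns its forwarder, and `FinalizeProcessing` hands back only that engine's data, so N engines no longer duplicate each other's counts.

```csharp
using System;
using System.Collections.Generic;

// Two engines, each with its own forwarder created by the factory.
var engine1 = TelemetrySketchProvider.CreateForwarder();
var engine2 = TelemetrySketchProvider.CreateForwarder();

engine1.AddTarget("Build");
engine2.AddTarget("Build");
engine2.AddTarget("Pack");

// Stand-in for the logger that merges telemetry from all engines.
var merged = new Dictionary<string, int>();
foreach (var forwarder in new[] { engine1, engine2 })
    foreach (var (key, count) in forwarder.FinalizeProcessing())
        merged[key] = merged.TryGetValue(key, out var n) ? n + count : count;

Console.WriteLine(merged["Build"]); // 2: each engine sent only its own data

static class TelemetrySketchProvider
{
    // Factory only; holds no telemetry state itself.
    public static SketchForwarder CreateForwarder() => new();
}

class SketchForwarder
{
    private Dictionary<string, int> _data = new();

    public void AddTarget(string name) =>
        _data[name] = _data.TryGetValue(name, out var n) ? n + 1 : 1;

    // Swap-and-send: hand back the buffer and start fresh.
    public Dictionary<string, int> FinalizeProcessing()
    {
        var sent = _data;
        _data = new Dictionary<string, int>();
        return sent;
    }
}
```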
Related to #12867
Context
Fixing two bugs in telemetry infrastructure when using /m /mt (in-proc multithreaded) mode:
1. Thread-safety crash: all in-proc nodes share a single TelemetryForwarderProvider singleton. Multiple RequestBuilder instances run on dedicated threads and call AddTask/AddTarget concurrently on the same WorkerNodeTelemetryData dictionary fields, causing race conditions and dictionary corruption.
2. Nx telemetry duplication: In /m /mt mode, N BuildRequestEngine instances share one TelemetryForwarderProvider singleton. Each engine calls FinalizeProcessing on shutdown, sending the entire accumulated data each time. The InternalTelemetryConsumingLogger merges all N copies, inflating every counter N times.
Reproduction: 20+ non-SDK .NET Framework library projects + 1 exe referencing all of them, built with MSBuild.exe Repro.sln /m /mt.
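The duplication bug can be demonstrated with a small model (illustrative only, not the real MSBuild types): when N engines each send the same shared buffer on shutdown and the consuming logger sums every copy, every counter is inflated N times.

```csharp
using System;
using System.Collections.Generic;

// One shared buffer holding the true counts for the whole build.
var shared = new Dictionary<string, int> { ["Build"] = 100 };

int engines = 3; // N BuildRequestEngine instances
var loggerTotal = new Dictionary<string, int>();

for (int e = 0; e < engines; e++)          // each engine finalizes...
    foreach (var (key, count) in shared)   // ...sending the whole buffer again
        loggerTotal[key] = loggerTotal.TryGetValue(key, out var n) ? n + count : count;

Console.WriteLine(loggerTotal["Build"]); // 300: the true count of 100, inflated 3x
```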
Changes Made
Fix
Batch-then-merge in RequestBuilder: Each RequestBuilder now accumulates task/target telemetry into a local WorkerNodeTelemetryData instance (zero contention), then merges once into the shared state via TelemetryForwarder.MergeWorkerData().
Thread-safe TelemetryForwarder: Added an internal lock protecting both MergeWorkerData and FinalizeProcessing. The forwarder is a singleton shared across BuildRequestEngine instances in /m /mt mode, so concurrent access is expected.
Swap-and-send in FinalizeProcessing: Instead of sending the same accumulated data on every call, FinalizeProcessing atomically swaps the internal data with a fresh empty instance under the lock, then sends only if non-empty. This ensures each accumulated batch is sent exactly once (no Nx duplication and no data loss).
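The swap-and-send behavior described above can be sketched as follows (illustrative names; a plain dictionary stands in for `WorkerNodeTelemetryData`, and the `Count == 0` check stands in for `IsEmpty`):

```csharp
using System;
using System.Collections.Generic;

var forwarder = new SwapAndSendForwarder();
forwarder.AddTarget("Build");

var first = forwarder.FinalizeProcessing();  // sends the batch
var second = forwarder.FinalizeProcessing(); // nothing left to send

Console.WriteLine(first?["Build"]); // 1
Console.WriteLine(second is null);  // True

class SwapAndSendForwarder
{
    private readonly object _lock = new();
    private Dictionary<string, int> _data = new();

    public void AddTarget(string name)
    {
        lock (_lock)
            _data[name] = _data.TryGetValue(name, out var n) ? n + 1 : 1;
    }

    // Swap the buffer for a fresh instance under the lock, then send only
    // if non-empty; repeated calls cannot resend the same data.
    public Dictionary<string, int>? FinalizeProcessing()
    {
        Dictionary<string, int> snapshot;
        lock (_lock)
        {
            snapshot = _data;
            _data = new Dictionary<string, int>();
        }
        return snapshot.Count == 0 ? null : snapshot;
    }
}
```

Data added after a finalize call lands in the fresh instance and is picked up by the next send, which is why the swap avoids both duplication and data loss.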
Testing
Locally tested that the issue is gone on a repro project.
Unit tests