Skip to content

Fix thread safety issues in WorkerNodeTelemetryData#13413

Merged
AR-May merged 3 commits intodotnet:mainfrom
AR-May:make-telemetry-thread-safe
Mar 20, 2026
Merged

Fix thread safety issues in WorkerNodeTelemetryData#13413
AR-May merged 3 commits intodotnet:mainfrom
AR-May:make-telemetry-thread-safe

Conversation

@AR-May
Copy link
Copy Markdown
Member

@AR-May AR-May commented Mar 18, 2026

Related to #12867

Context

Fixing two bugs in telemetry infrastructure when using /m /mt (in-proc multithreaded) mode:

Thread-safety crash: all in-proc nodes share a single TelemetryForwarderProvider singleton. Multiple RequestBuilder instances run on dedicated threads and call AddTask/AddTarget concurrently on the same WorkerNodeTelemetryData dictionary fields, causing race conditions and dictionary corruption.

Nx telemetry duplication: In /m /mt mode, N BuildRequestEngine instances share one TelemetryForwarderProvider singleton. Each engine calls FinalizeProcessing on shutdown, sending the entire accumulated data each time. The InternalTelemetryConsumingLogger merges all N copies, inflating every counter N times.

Reproduction: 20+ non-SDK .NET Framework library projects + 1 exe referencing all of them, built with MSBuild.exe Repro.sln /m /mt.

Changes Made

Fix

Batch-then-merge in RequestBuilder: Each RequestBuilder now accumulates task/target telemetry into a local WorkerNodeTelemetryData instance (zero contention), then merges once into the shared state via elemetryForwarder.MergeWorkerData().

Thread-safe TelemetryForwarder: Added an internal lock protecting both MergeWorkerData and FinalizeProcessing. The forwarder is a singleton shared across BuildRequestEngine instances in /m /mt mode, so concurrent access is expected.

Swap-and-send in FinalizeProcessing: Instead of sending the same accumulated data on every call, FinalizeProcessing atomically swaps the internal data with a fresh empty instance under the lock, then sends only if non-empty. This ensures:

  • First engine to finalize sends all data accumulated so far
  • Subsequent engines find empty data and skip sending (no duplication)
  • Late merges from other engines go into the new instance and are sent by the next FinalizeProcessing call (no
    data loss)

Testing

Locally tested that the issue is gone on a repro project.
Unit tests

Copilot AI review requested due to automatic review settings March 18, 2026 17:28
@AR-May AR-May marked this pull request as draft March 18, 2026 17:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a concurrency bug in MSBuild’s internal worker-node telemetry aggregation when running in in-proc multi-threaded mode (/m /mt), where multiple RequestBuilder instances can concurrently mutate shared telemetry dictionaries and corrupt their state.

Changes:

  • Switch telemetry reporting in RequestBuilder.UpdateStatisticsPostBuild from per-target/per-task updates to batching into a local WorkerNodeTelemetryData and merging once.
  • Replace ITelemetryForwarder.AddTask/AddTarget with a single MergeWorkerData API and update the provider implementations accordingly.
  • Add explicit locking around aggregation in the internal telemetry-consuming logger.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/Framework/Telemetry/WorkerNodeTelemetryData.cs Adds method documentation and minor refactor while preserving merge/aggregation behavior.
src/Build/TelemetryInfra/TelemetryForwarderProvider.cs Replaces per-item update APIs with MergeWorkerData and exposes key creation helper for batching.
src/Build/TelemetryInfra/InternalTelemetryConsumingLogger.cs Adds a lock around worker telemetry aggregation.
src/Build/TelemetryInfra/ITelemetryForwarder.cs Updates forwarder contract to support batched merging.
src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs Implements local accumulation + single merge under a lock to prevent concurrent dictionary writes.

Comment thread src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs Outdated
Comment thread src/Build/TelemetryInfra/TelemetryForwarderProvider.cs
Comment thread src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs Outdated
@AR-May AR-May force-pushed the make-telemetry-thread-safe branch 2 times, most recently from f2e0591 to 3e1fd80 Compare March 19, 2026 13:25
@AR-May AR-May force-pushed the make-telemetry-thread-safe branch from 3e1fd80 to 1a19fa3 Compare March 19, 2026 13:30
@AR-May AR-May marked this pull request as ready for review March 19, 2026 13:32
@AR-May
Copy link
Copy Markdown
Member Author

AR-May commented Mar 19, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@AR-May AR-May requested a review from Copilot March 19, 2026 15:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes telemetry thread-safety and counter inflation in in-proc multithreaded (/m /mt) builds by batching telemetry per RequestBuilder, merging into a shared forwarder under a lock, and preventing repeated “send the whole buffer” behavior during engine shutdown.

Changes:

  • Accumulate task/target telemetry in a per-RequestBuilder WorkerNodeTelemetryData and merge once into the shared forwarder.
  • Make TelemetryForwarder thread-safe and change finalization to “swap-and-send” to avoid Nx duplication across multiple BuildRequestEngine finalizers.
  • Add unit tests for WorkerNodeTelemetryData.IsEmpty and forwarder reset behavior after FinalizeProcessing.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/Framework/Telemetry/WorkerNodeTelemetryData.cs Adds IsEmpty and improves merge/add method clarity used by the forwarder swap-and-send logic.
src/Framework/Telemetry/TaskOrTargetTelemetryKey.cs Introduces a helper factory (Create) to centralize key construction used by RequestBuilder.
src/Build/TelemetryInfra/TelemetryForwarderProvider.cs Adds locking, batch merge entrypoint, and swap-and-send finalization to prevent races and duplicate sends.
src/Build/TelemetryInfra/ITelemetryForwarder.cs Replaces per-item APIs with a batch merge API (MergeWorkerData).
src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs Switches to batch-then-merge telemetry collection to remove dictionary contention.
src/Build.UnitTests/Telemetry/Telemetry_Tests.cs Adds tests for the new reset/empty behavior and forwarder finalization semantics.

Comment thread src/Build/TelemetryInfra/TelemetryForwarderProvider.cs
Comment thread src/Build.UnitTests/Telemetry/Telemetry_Tests.cs
Comment thread src/Framework/Telemetry/TaskOrTargetTelemetryKey.cs Outdated
Comment thread src/Build/TelemetryInfra/ITelemetryForwarder.cs
Comment thread src/Framework/Telemetry/WorkerNodeTelemetryData.cs
@AR-May AR-May enabled auto-merge (squash) March 20, 2026 13:48
@AR-May AR-May merged commit dfe5370 into dotnet:main Mar 20, 2026
10 checks passed
@AR-May AR-May self-assigned this Mar 20, 2026
JanProvaznik added a commit that referenced this pull request Mar 31, 2026
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 9, 2026
PR dotnet#13413 fixed thread-safety by creating a local WorkerNodeTelemetryData
(with 2 Dictionary<> instances) per build request, populating it, then
merging under a lock. Each dictionary grows through 6-7 resize steps,
creating ~30KB of Entry[]/Int32[] churn per request. With ~1000 requests
per node across multiple nodes, this totals ~200MB for large solutions
like OrchardCore (233 projects).

Fix: restore AddTarget/AddTask directly on ITelemetryForwarder with
internal per-call locking in TelemetryForwarder. This eliminates all
per-request dictionary allocations while preserving thread safety and
the N-times duplication fix (swap-and-send FinalizeProcessing).

Lock overhead is negligible: ~20ns uncontended x ~200 calls = 4us/request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit that referenced this pull request Apr 9, 2026
Eliminates ~200MB allocation regression from PR #13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR #13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes because the pattern was applied unconditionally.

Fix: each BuildRequestEngine now owns its own WorkerNodeTelemetryData
on the NodeLoggingContext. RequestBuilder writes directly to it with
per-call locking (for concurrent builders within one engine). The
engine sends the data at CleanupForBuild and resets for the next build.

This eliminates:
- Per-request dictionary allocations (zero new dictionaries per request)
- Singleton contention (no shared TelemetryForwarderProvider state)
- N-times duplication (each engine sends only its own data)

The InternalTelemetryConsumingLogger already merges WorkerNodeTelemetry
EventArgs from all engines, so the data flow is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 10, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext. The forwarder encapsulates the data, the
lock, and the send logic:

- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

This eliminates per-request allocations, singleton contention, and
N-times duplication (each engine sends only its own data).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 10, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryCollector (via
TelemetryCollectorProvider.CreateCollector) and passes it to builders
through NodeLoggingContext. The collector encapsulates the data, the
lock, and the send logic:

- TelemetryCollector owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryCollectorProvider is the factory, not a data-holding singleton

Renamed from Forwarder to Collector to reflect what it actually does
(collects telemetry, does not forward — forwarding is done by the
logging infrastructure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 10, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext. The forwarder encapsulates the data, the
lock, and the send logic:

- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

This eliminates per-request allocations, singleton contention, and
N-times duplication (each engine sends only its own data).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 10, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext. The forwarder encapsulates the data, the
lock, and the send logic:

- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 10, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Root cause: PR dotnet#13413 created a local WorkerNodeTelemetryData (with 2
Dictionary<> instances) per build request to fix thread-safety. Each
dictionary grew through 6-7 resize steps, creating ~30KB of Entry[]/
Int32[] churn per request. With ~1000 requests per node x multiple
nodes = ~200MB for OrchardCore (233 projects). The regression affected
both MT and non-MT modes.

Fix: each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext. The forwarder encapsulates the data, the
lock, and the send logic:

- TelemetryForwarder owns WorkerNodeTelemetryData + internal lock
- RequestBuilder calls AddTarget/AddTask (thread-safe, no raw locking)
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 13, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext:

- TelemetryForwarder owns WorkerNodeTelemetryData (no locking needed -
  the engine's one-active-builder-at-a-time invariant guarantees
  single-threaded access, see dotnet#13531)
- RequestBuilder calls AddTarget/AddTask directly
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 13, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext:

- TelemetryForwarder owns WorkerNodeTelemetryData (no locking needed -
  the engine's one-active-builder-at-a-time invariant guarantees
  single-threaded access, see dotnet#13531)
- RequestBuilder calls AddTarget/AddTask directly
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik added a commit to JanProvaznik/msbuild that referenced this pull request Apr 13, 2026
Eliminates ~200MB allocation regression from PR dotnet#13413 by removing
per-request WorkerNodeTelemetryData dictionary allocations entirely.

Each BuildRequestEngine creates its own ITelemetryForwarder (via
TelemetryForwarderProvider.CreateForwarder) and passes it to builders
through NodeLoggingContext:

- TelemetryForwarder owns WorkerNodeTelemetryData (no locking needed -
  the engine's one-active-builder-at-a-time invariant guarantees
  single-threaded access, see dotnet#13531)
- RequestBuilder calls AddTarget/AddTask directly
- BuildRequestEngine calls FinalizeProcessing at CleanupForBuild
- TelemetryForwarderProvider is the factory, not a data-holding singleton

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants