Fix race in BackgroundService exception aggregation during Host shutdown #125590
danmoseley wants to merge 3 commits into dotnet:main
TryExecuteBackgroundServiceAsync tasks were fire-and-forget, creating a race where Host.StopAsync could read _backgroundServiceExceptions before the monitoring tasks had added their exceptions. When multiple BackgroundServices fault, this caused some exceptions to be silently lost.

The fix stores the monitoring tasks and awaits them (with shutdown timeout) in StopAsync before reading the exception list.

Fix dotnet#125589

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @dotnet/area-extensions-hosting
Pull request overview
This PR updates the internal Host shutdown path to avoid missing BackgroundService failures due to a race between BackgroundService.StopAsync and the host’s background-service monitoring continuation.
Changes:
- Track background-service monitoring tasks instead of fire-and-forget.
- During `StopAsync`, wait for monitoring tasks to finish recording exceptions before reading and rethrowing them.
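
To make the race in this overview concrete, here is a minimal, self-contained sketch. It is not the actual Host code: `MonitorRaceSketch`, `MonitorAsync`, and the `gate` task are hypothetical names introduced for illustration. The gate stands in for thread-pool scheduling delay so that the outcome does not depend on timing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class MonitorRaceSketch
{
    // Hypothetical stand-in for the host's monitoring continuation: it awaits the
    // service task and records any exception. The gate (an assumption of this sketch)
    // models the continuation not having been scheduled yet.
    private static async Task MonitorAsync(Task executeTask, List<Exception> recorded, Task gate)
    {
        try
        {
            await executeTask.ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            await gate.ConfigureAwait(false);
            lock (recorded) { recorded.Add(ex); }
        }
    }

    // Returns the exception count seen without awaiting the monitor (Before)
    // and the count seen after awaiting it (After).
    public static async Task<(int Before, int After)> RunAsync()
    {
        var recorded = new List<Exception>();
        var gate = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        Task executeTask = Task.FromException(new InvalidOperationException("service failed"));

        Task monitor = MonitorAsync(executeTask, recorded, gate.Task);

        // Fire-and-forget behavior: the service task has already faulted, so the
        // stop path reads the list right away and misses the exception.
        int before;
        lock (recorded) { before = recorded.Count; }

        // Tracked behavior: let the monitor finish recording, then read the list.
        gate.SetResult(true);
        await monitor.ConfigureAwait(false);
        int after;
        lock (recorded) { after = recorded.Count; }

        return (before, after);
    }
}
```

In this sketch `RunAsync()` returns `Before == 0` and `After == 1`: the exception is only visible once the stop path has waited for the monitoring task.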
Use LazyInitializer.EnsureInitialized + lock, matching the existing pattern used for _backgroundServiceExceptions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates Microsoft.Extensions.Hosting’s internal Host implementation to better surface BackgroundService failures during shutdown by tracking the background-service monitoring tasks and (best-effort) waiting for them to finish before aggregating background exceptions in StopAsync.
Changes:
- Track `TryExecuteBackgroundServiceAsync(...)` monitor tasks for each `BackgroundService` started by the host.
- During `StopAsync`, wait for these monitor tasks to complete (or for shutdown cancellation) before reading `_backgroundServiceExceptions`, reducing a race where exceptions could be missed.
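
The two changes above can be sketched together as a small helper. This is an illustration under stated assumptions, not the code in `Host.cs`: the `MonitorTracker` class, its `_monitorTasks` field, and both method names are hypothetical. It combines the lazy list creation mentioned in the commit message (`LazyInitializer.EnsureInitialized` plus a lock) with a best-effort wait that ends early if the shutdown token fires.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; names are illustrative and do not mirror the real Host members.
public sealed class MonitorTracker
{
    private List<Task>? _monitorTasks; // created on first use

    // Remember a monitoring task instead of discarding it.
    public void Track(Task monitorTask)
    {
        List<Task> tasks = LazyInitializer.EnsureInitialized(ref _monitorTasks, () => new List<Task>());
        lock (tasks) { tasks.Add(monitorTask); }
    }

    // Best-effort wait: returns true if every tracked task completed,
    // false if the cancellation token fired first. Never throws on cancellation.
    public async Task<bool> WaitForMonitorsAsync(CancellationToken cancellationToken)
    {
        List<Task>? tasks = Volatile.Read(ref _monitorTasks);
        if (tasks is null)
        {
            return true; // nothing was tracked
        }

        Task all;
        lock (tasks) { all = Task.WhenAll(tasks.ToArray()); }

        var tcs = new TaskCompletionSource<object?>(TaskCreationOptions.RunContinuationsAsynchronously);
        using (cancellationToken.Register(s => ((TaskCompletionSource<object?>)s!).TrySetCanceled(), tcs))
        {
            Task finished = await Task.WhenAny(all, tcs.Task).ConfigureAwait(false);
            return ReferenceEquals(finished, all);
        }
    }
}
```

Returning a flag rather than throwing keeps the wait best-effort: a caller can still read whatever exceptions were recorded when shutdown times out.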
/// A BackgroundService that overrides <see cref="ExecuteTask"/> to return a separately
/// controlled task. The internal _executeTask (used by BackgroundService.StopAsync) completes
/// normally on cancellation, but the overridden ExecuteTask (monitored by
/// TryExecuteBackgroundServiceAsync) faults 200ms after StopAsync, deterministically
Can you explain "deterministically"? What if the other task/thread was not scheduled for much more than 200ms?
Pull request overview
This PR addresses a shutdown-time race in Microsoft.Extensions.Hosting where background-service monitoring exceptions could be missed because the fire-and-forget monitoring task hadn’t yet recorded its exception when Host.StopAsync aggregated exceptions.
Changes:
- Track background-service monitoring tasks created by `TryExecuteBackgroundServiceAsync`.
- During `Host.StopAsync`, wait for background-service monitoring tasks to finish recording exceptions before aggregating and throwing.
- Add a regression test that deterministically reproduces the lost-exception window using an overridden `ExecuteTask`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs` | Stores background-service monitoring tasks and waits (with cancellation support) for them during shutdown before reading exception state. |
| `src/libraries/Microsoft.Extensions.Hosting/tests/UnitTests/BackgroundServiceExceptionTests.cs` | Adds a regression test and a specialized BackgroundService to reproduce the exception-recording race. |
if (_backgroundServiceTasks is not null)
{
    Task bgMonitoringTasks = Task.WhenAll(_backgroundServiceTasks);
    var tcs = new TaskCompletionSource<object?>(TaskCreationOptions.RunContinuationsAsynchronously);
    using (cancellationToken.Register(s => ((TaskCompletionSource<object?>)s!).TrySetCanceled(), tcs))
    {
        await Task.WhenAny(bgMonitoringTasks, tcs.Task).ConfigureAwait(false);
    }
Fixes #125589

Problem
When multiple `BackgroundService` instances fault with `BackgroundServiceExceptionBehavior.StopHost`, some exceptions can be silently lost.

In real workloads, multiple `BackgroundService`s commonly fail together, for example when a shared dependency like a database or message broker goes down. With this bug, only one of those failures is reported; the rest are silently dropped. This makes production incidents harder to diagnose: operators see that one service failed but have no indication that others also failed, leading to incomplete root-cause analysis and potentially missing the actual source of the problem.

The `BackgroundServiceExceptionTests.BackgroundService_MultipleExceptions_ThrowsAggregateException` test is flaky because of this (observed on osx-arm64 Debug).

Root Cause
In `StartAsync`, `TryExecuteBackgroundServiceAsync` is fire-and-forget (`_ =`). This method awaits the service's `ExecuteTask` and adds any exception to `_backgroundServiceExceptions`. During `StopAsync`, `BackgroundService.StopAsync` also awaits the same `ExecuteTask`. When the task faults, both continuations are scheduled on the thread pool. If the `StopAsync` continuation runs first, `Host.StopAsync` proceeds to read `_backgroundServiceExceptions` before the monitoring task has added its exception.

Fix
Store the `TryExecuteBackgroundServiceAsync` tasks and await them in `StopAsync` (respecting the shutdown timeout) before reading the exception list.

Verification
The original failure was only observed on macOS arm64 and could not be reproduced directly on Windows. However, injecting a 500ms `Task.Delay` into `TryExecuteBackgroundServiceAsync` deterministically simulates the thread-pool scheduling that causes the race, providing high-confidence verification on any platform: