Fix CI flakiness: MSB4216 task host failures, dotnet-watch hangs, NuGet errors #53424

Merged
mmitche merged 44 commits into main from fix/ci-flakiness/main
Apr 3, 2026

Member

@mmitche mmitche commented Mar 12, 2026

Summary

This PR addresses intermittent CI failures identified through systematic analysis of recent builds.

Remaining Fixes (after merge from main)

Root Cause 1 - MSB4216 task host failures on macOS Helix

Impact: Tests using NuGet package tasks with TaskHostFactory fail intermittently on macOS.
Fix: Export DOTNET_HOST_PATH in RunTestsOnHelix.sh and RunTestsOnHelix.cmd.
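
A minimal sketch of what exporting DOTNET_HOST_PATH from a Helix entrypoint script could look like. The `DOTNET_ROOT` variable, its fallback directory, and the function name are assumptions made for illustration; the actual lines added to RunTestsOnHelix.sh are not shown in this description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: point DOTNET_HOST_PATH at the dotnet executable so
# that child processes started by the build can locate the host.
# DOTNET_ROOT is assumed to name the SDK install directory.
set_dotnet_host_path() {
    export DOTNET_HOST_PATH="${DOTNET_ROOT:-$HOME/.dotnet}/dotnet"
}

set_dotnet_host_path
echo "DOTNET_HOST_PATH=$DOTNET_HOST_PATH"
```

The .cmd counterpart would need the equivalent `set` statement; it is omitted here.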

Root Cause 2 - GZipCompress file lock races

Impact: Parallel.For in GZipCompress races with antivirus/file indexer.
Fix: Added retry with exponential backoff for file access.
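
The shape of a retry with exponential backoff can be sketched as follows. This is a shell illustration of the technique only, not the C# code in GZipCompress; the function name, the `RETRY_BASE_DELAY` variable, and the attempt counts are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, retrying on failure and doubling the
# delay between attempts. RETRY_BASE_DELAY is the first delay in seconds.
retry_with_backoff() {
    local max_attempts="$1"; shift
    local delay="${RETRY_BASE_DELAY:-1}"
    local attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        "$@" && return 0
        # Give up once the attempt budget is spent.
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# A command that succeeds on the first try never sleeps.
retry_with_backoff 3 true && echo "succeeded"
```

Doubling the delay gives a transient lock holder, such as an antivirus scan, progressively more time to release the file while keeping the common case fast.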

Root Cause 3 - Noisy NuGet source removal errors

Impact: Helix test setup produces confusing errors removing non-existent NuGet sources.
Fix: Suppress errors from dotnet nuget remove source commands.
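
One way to make the removal tolerant of missing sources in the shell script is sketched below. The helper name and the source name in the usage comment are hypothetical; the exact redirection used in the PR is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a NuGet source by name, discarding stderr and
# ignoring the non-zero exit code reported when the source does not exist.
remove_nuget_source_quietly() {
    dotnet nuget remove source "$1" 2>/dev/null || true
}

# Usage (the source name is illustrative):
#   remove_nuget_source_quietly "some-internal-feed"
```

The trade-off is that real failures of the command are hidden as well, so this suits best-effort cleanup steps only.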

Root Cause 4 - DefaultRequestDispatcherTest timing issues

Impact: Test makes timing assumptions that fail under load.
Fix: Increased timeouts for CI environment.

Root Cause 5 - Missing runtimeconfig.json in test assets

Impact: Test tool projects fail to locate runtime configuration.
Fix: Added MSBuild target in test assets Directory.Build.targets.

Changes Removed

The dotnet-watch Aspire race condition fix has been removed from this PR — it was superseded by @tmat's proper fix in #53271 (now merged to main).

Validation

Starting fresh validation run after merge from main. Target: 25 consecutive passing builds.

Copilot AI review requested due to automatic review settings March 12, 2026 20:04
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens Helix/CI test execution for Hot Reload / dotnet-watch scenarios by reducing hang risk and improving the reliability of Helix environment setup.

Changes:

  • Replace “effectively infinite” DCP/Aspire timeouts used by watch-based tests with bounded (5-minute) values to prevent multi-hour Helix hangs.
  • Improve test process cleanup by closing stdin prior to termination and adding a bounded wait for process exit.
  • Update Helix test entrypoint scripts to set DOTNET_HOST_PATH and make NuGet source removal resilient when sources are absent.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/Microsoft.DotNet.HotReload.Test.Utilities/WatchableApp.cs | Switches DCP/Aspire-related environment timeouts to bounded values for CI. |
| test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs | Adjusts disposal/termination behavior to reduce hangs during cleanup. |
| build/RunTestsOnHelix.sh | Sets DOTNET_HOST_PATH and makes dotnet nuget remove source tolerant of missing sources. |
| build/RunTestsOnHelix.cmd | Sets DOTNET_HOST_PATH and suppresses errors when removing non-existent NuGet sources. |

Comment on lines +255 to +257
catch (OperationCanceledException)
{
Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");

Copilot AI Mar 12, 2026

If the process still hasn’t exited after the 30s timeout, DisposeAsync only logs and continues. That can leave a runaway dotnet-watch process (or child processes) running in the Helix work item, causing resource leaks and cross-test interference while also hiding the failure signal. Consider failing the test/cleanup in this case (or at least making the cleanup path retry/force termination and surface the problem), rather than just logging and proceeding to dispose the Process handle.

Suggested change
- catch (OperationCanceledException)
- {
- Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");
+ catch (OperationCanceledException ex)
+ {
+ Logger.Log($"Process {Id} did not exit within 30 seconds after Kill()");
+ throw new TimeoutException($"Process {Id} did not exit within 30 seconds after Kill().", ex);

Comment on lines +232 to +236
try
{
Process.StandardInput.Close();
}
catch

Copilot AI Mar 12, 2026

The broad catch around closing StandardInput/Kill() swallows all exceptions without logging. Since this code was added to address a platform-specific hang, swallowing the exception makes it hard to diagnose when stdin can’t be closed (or why). Consider at least logging the exception details (or narrowing the caught exception types) so cleanup failures are actionable in CI logs.

@mmitche mmitche force-pushed the fix/ci-flakiness/main branch from 83d044f to ae5112f Compare March 12, 2026 20:21
@mmitche mmitche requested a review from a team as a code owner March 12, 2026 20:21
@mmitche mmitche force-pushed the fix/ci-flakiness/main branch 2 times, most recently from a35fa75 to d9f4358 Compare March 12, 2026 23:54
@mmitche mmitche requested review from a team and tmat as code owners March 12, 2026 23:54
@mmitche mmitche force-pushed the fix/ci-flakiness/main branch from d9f4358 to 9949b40 Compare March 13, 2026 01:26
@mmitche mmitche requested a review from a team as a code owner March 13, 2026 01:26
@mmitche mmitche force-pushed the fix/ci-flakiness/main branch from 9949b40 to ed04d76 Compare March 13, 2026 02:28
_isDisposed = true;

// wait for all in-flight process initialization to complete:
// If no session initialization is in-flight (_pendingSessionInitializationCount == 0),
Member

Not entirely correct either. Ok to merge, I'll follow up with better fix.

Member Author

@tmat This is an automatic fix by the AI for flakiness. Don't merge this yet. Once it has gotten 25 passing runs, we'll take a second pass over it to smooth it out.

Member

Yeah, I figured. It looks very AI-like. It's great that it found the issue. I'll work on a better fix over the weekend.

Member

tmat commented Mar 13, 2026

watch changes lgtm

@mmitche mmitche force-pushed the fix/ci-flakiness/main branch 3 times, most recently from 2eb12cb to bcae687 Compare March 13, 2026 12:43
Member Author

mmitche commented Mar 13, 2026

🎯 Milestone: 5 Consecutive Passes

Validation Results

| Build | Jobs | Result |
| --- | --- | --- |
| 1333280 | 18/18 | ✅ Passed |
| 1333378 | 18/18 | ✅ Passed |
| 1333489 | 18/18 | ✅ Passed |
| 1333567 | 18/18 | ✅ Passed |
| 1333595 | 18/18 | ✅ Passed |

Root Causes Fixed

  1. MSB4216 TaskHostFactory — DOTNET_HOST_PATH not set in Helix scripts
  2. dotnet-watch Aspire hang — Per-operation timeout was inheriting the 2-hour Helix work-item timeout instead of being capped at 5 minutes. Also fixed semaphore deadlock in AspireServiceFactory.DisposeAsync, added stdin close before process kill, and set DCP timeout environment variables.
  3. GZipCompress file lock — Parallel.For races with antivirus/file indexer; added retry with exponential backoff
  4. NuGet source removal noise — Suppressed stderr from removing non-existent NuGet sources
  5. Missing runtimeconfig.json — Added MSBuild target to include runtimeconfig.json in NuGet packages for test tool projects

Baseline vs Current

  • Before: ~57% failure rate (17/30 builds failed on main)
  • After: 5/5 consecutive perfect builds (100% pass rate so far)

Continuing validation toward 25 consecutive passes target.

@mmitche mmitche force-pushed the fix/ci-flakiness/main branch from bcae687 to 1644c42 Compare March 13, 2026 12:45
Comment thread build/SetupHelixEnvironment.cmd Outdated
Comment thread build/SetupHelixEnvironment.cmd Outdated
Member Author

mmitche commented Mar 13, 2026

@akoeplinger This is an automated run. Let it go for a while; don't merge or approve. I'll have it go through another pass and also review it at the end, once it thinks it has a stable run.

@mmitche mmitche force-pushed the fix/ci-flakiness/main branch 7 times, most recently from 4cb1ca9 to ea8fad3 Compare March 14, 2026 00:12
mmitche and others added 11 commits March 20, 2026 07:12
On Linux, Console.ReadKey() blocks indefinitely when stdin is inherited
from a parent test process. Aspire launcher processes (server, resources)
each create a PhysicalConsole that starts a LongRunning task calling
Console.ReadKey(), which never unblocks when the test process kills
the child. Check Console.IsInputRedirected before starting the keyboard
listener to avoid the hang.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolved conflict: kept main's file-scoped namespace cleanup and numpad
key handling, applied our Console.IsInputRedirected guard to prevent
Aspire test hangs from Console.ReadKey() on inherited stdin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@marcpopMSFT marcpopMSFT self-requested a review March 24, 2026 20:23
public PhysicalConsole(TestFlags testFlags)
{
Console.OutputEncoding = Encoding.UTF8;
_ = testFlags.HasFlag(TestFlags.ReadKeyFromStdin) ? ListenToStandardInputAsync() : ListenToConsoleKeyPressAsync();
Member

@tmat to review this change.

Member

LGTM

@@ -20,6 +20,10 @@ public class GZipCompress : Task
[Required]
public string OutputDirectory { get; set; }

Member

@dotnet/aspnet-blazor-eng to review the changes to this file.

Member

LGTM

@marcpopMSFT
Member

Test changes look reasonable to me. Pinged a few folks on the product changes as they are in areas outside of my knowledge.

@mmitche seems like it didn't get 25 passing runs in a row though and wasn't continuing to try to fix tests. Thoughts on next steps we should take here?

Member Author

mmitche commented Mar 26, 2026

> Test changes look reasonable to me. Pinged a few folks on the product changes as they are in areas outside of my knowledge.
>
> @mmitche seems like it didn't get 25 passing runs in a row though and wasn't continuing to try to fix tests. Thoughts on next steps we should take here?

Yeah, it looks like it gave up and decided that the additional failures were not its fault. Well, duh. I think we should get this PR in, trimmed down to the fixes we think are good to go, and then I want to take a different approach with the next round of experimentation.

Member Author

mmitche commented Mar 26, 2026

@marcpopMSFT What I'd say is that anything we're unsure about here, let's just revert. We can then let the agent party on the results.

Comment thread test/TestAssets/Directory.Build.targets Outdated
@@ -20,6 +20,10 @@ public class GZipCompress : Task
[Required]
public string OutputDirectory { get; set; }

Member

LGTM

@mmitche mmitche merged commit 96683a0 into main Apr 3, 2026
25 checks passed
@mmitche mmitche deleted the fix/ci-flakiness/main branch April 3, 2026 15:55

8 participants