Fix FileSystemWatcher test flakiness on Windows and re-enable 86 disabled tests by danmoseley · Pull Request #125818 · dotnet/runtime

danmoseley · 2026-03-20T05:06:29Z

Categories of failures

The FSW test failures on Windows fall into three categories:

Timeout too short for callback latency — Under thread pool pressure (typical in CI), ReadDirectoryChangesW callbacks arrive late. The 1000ms WaitForExpectedEventTimeout expires before the event is delivered. This is the dominant failure mode: "<EventType> event did not occur as expected".
Stale state between retries — Some tests create files/directories in action() but never clean up. When ExpectEvent retries (re-creating the watcher and re-running the action), the second attempt fails because the file already exists, or the watcher fires on leftover state rather than the fresh action.
ExpectNoEvent using wrong timeout — ExpectNoEvent was hardcoded to use WaitForExpectedEventTimeout (1000ms) instead of WaitForUnexpectedEventTimeout (500ms), making negative tests take 2x longer than intended and potentially masking timing issues.

Fixes in this PR (and why each)

No product changes. All changes are in the test utility (FileSystemWatcherTest.cs) and individual test files.

Progressive retry timeouts — ExpectEvent now multiplies WaitForExpectedEventTimeout by the attempt number (1x, 2x, 3x). This directly addresses category 1: if the first attempt's 1000ms is too short under load, the second attempt waits 2000ms and the third waits 3000ms, giving the callback time to arrive without penalizing the common fast case.
Increased SubsequentExpectedWait (10ms to 500ms) — Tests that check multiple event types (e.g., Changed + Created) waited only 10ms for the second event type after catching the first. Under load, the second event arrives later. 500ms provides adequate margin.
Settling delay after enabling watcher (50ms Thread.Sleep) — Added in ExecuteAndVerifyEvents, ExpectEvents, and TryErrorEvent after setting EnableRaisingEvents = true and before running the test action. While Windows registers the watch synchronously, this guards against edge cases on all platforms.
Cleanup in Create event tests — File.Create.cs and Directory.Create.cs tests now delete the created file/directory in a finally block, so retries start from a clean state (category 2).
ExpectNoEvent timeout fix — Changed to use WaitForUnexpectedEventTimeout as intended (category 3).
Re-enabled 86 tests previously disabled by 25 [ActiveIssue] annotations (11 class-level, 14 method-level) referencing Failing test due to no detected IO events in 'System.IO.Tests.FileSystemWatcherTest.ExecuteAndVerifyEvents' #103584 on Windows. Two tests (File_Move_With_Set_Environment_Variable, File_Move_With_Unset_Environment_Variable) are disabled against a new issue for a product-level bug unrelated to timing.

Validation steps

Tested on Intel i9-14900K (24 cores / 32 threads), 64 GB RAM, Windows 11. Tests run on both NTFS and ReFS (dev drive).

Controlled callback delay (proving the fix addresses the root cause):

Injected a temporary busy-spin delay into ReadDirectoryChangesCallback to deterministically reproduce the late-event-delivery condition that occurs under thread pool starvation in CI
At 900ms delay, baseline (no fixes): 114 event failures across 20 runs (100% failure rate)
At 900ms delay, with fixes: 0 event failures across 20 runs

Stress testing:

Concurrent stress generator running alongside tests: 32 CPU-saturating workers (one per logical processor) doing tight math loops, 8 IO workers (rapid file create/write/delete cycles with 1-64KB random data), and continuous thread pool flooding (~500 short work items queued every 10ms) to simulate thread pool contention
No matter how much stress, could not reproduce failures even without fixes - this local machine is just too fast compared to a CI machine.
Under this load, ran 20 sequential iterations on each filesystem with fixes: 0 flaky failures on NTFS, 0 on ReFS
3 parallel test instances running simultaneously under stress with fixes: 0 flaky failures

Unchanged

Linux/macOS — Separate issues exist (FileSystemWatcher SymbolicLink tests failing after inotify refactor (#117148) #125737 Linux inotify race after FileSystemWatcher.Linux: use a single inotify instance and refactor watch tracking. #117148 refactor; System.IO.Tests.File_Move_Tests.File_Move_Multiple_From_Watched_To_Unwatched_Mac test failure #125018 macOS duplicate FSEvents). These have different root causes and are not addressed here.
Product code — No changes to FileSystemWatcher implementation. The callback delivery latency is fundamentally an OS/thread-pool scheduling issue, not a product bug. The fix is making tests tolerant of realistic delivery variance.
Issue FileSystemWatcher_DirectorySymbolicLink_TargetsFile_Fails #53366 (SymbolicLink / InternalBufferSize tests) — Remains disabled on all platforms; this is a separate product-level issue.

Fixes #103630
Fixes #103584

- Increase SubsequentExpectedWait from 10ms to 500ms so subsequent event-type checks in ExecuteAndVerifyEvents tolerate delayed delivery under thread pool starvation. - Add 50ms settling delay after EnableRaisingEvents=true in ExecuteAndVerifyEvents, ExpectEvents, and TryErrorEvent to allow OS-specific async startup to complete before the test action runs. - Implement progressive timeout on retries in ExpectEvent and TryErrorEvent: attempt 1 uses the base timeout (1000ms), attempt 2 uses 2x, attempt 3 uses 3x. This is the key fix: under thread pool starvation the ReadDirectoryChangesW callback can be delayed beyond the base timeout, but a fixed retry with the same timeout just fails again. Progressive timeout gives later attempts enough headroom. Validated with artificial callback delay injection at 900ms: - Baseline (no fixes): 114 event failures across 20 runs (100% failure rate) - With fixes: 0 failures across 20 runs on both NTFS and ReFS Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The MultipleFilters and ModifyFiltersConcurrentWithEvents Create tests pass cleanup: null to ExpectEvent, but the action (File.Create / Directory.CreateDirectory) is not idempotent: on retry the file/dir already exists, so no Created event fires and the retry always fails. The corresponding Delete tests already have proper cleanup (re-creating the deleted item between retries). Apply the same pattern in reverse: delete the created item between retries so the next attempt gets a fresh Created event. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- WaitForExpectedEventTimeout: 1000ms -> 2000ms. With progressive retry this gives 2000/4000/6000ms across 3 attempts, enough headroom for thread pool starvation even on single-core CI machines. - WaitForExpectedEventTimeout_NoRetry: 3000ms -> 5000ms. Tests using the simple ExpectEvent(WaitHandle, string) overload have no retry loop, so a generous single timeout is needed. - ExpectEvents() collection timeout: 5s -> 10s. This method has no retry mechanism and is used by Directory/File Move multi-event tests. These increases only affect the failure path (WaitOne returns immediately when the event arrives). Happy-path execution time is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove [ActiveIssue] annotations for dotnet#103584 from 14 test files (22 annotations total, including class-level disabling of entire test classes like Directory_Create_Tests, File_Create_Tests, etc.). Also remove [ActiveIssue] for dotnet#53366 on FileSystemWatcher_DirectorySymbolicLink_TargetsFile_Fails — the issue has had zero hits since 2021 and the test is Windows-only. The robustness improvements in the preceding commits (progressive retry timeouts, increased base timeouts, retry idempotency cleanup) should prevent the flakiness that originally motivated disabling these tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ExpectNoEvent defaulted to WaitForExpectedEventTimeout (2000ms) but negative tests should use the shorter WaitForUnexpectedEventTimeout (150ms). The codebase already defines this constant for exactly this purpose but it was never wired up. Using the long timeout slows down every test that verifies an event does NOT occur. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-03-20T05:07:40Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR targets Windows flakiness in System.IO.FileSystem.Watcher tests by increasing tolerance to callback latency and retry-related stale state, and then re-enables a large set of tests previously disabled on Windows.

Changes:

Adjusts FileSystemWatcherTest timeouts/retry behavior (including progressive timeouts) and fixes ExpectNoEvent default timeout usage.
Adds a small post-EnableRaisingEvents settling delay to reduce timing sensitivity across platforms.
Re-enables many tests previously disabled via [ActiveIssue], and adds cleanup in some create-related tests to keep retries idempotent.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs	Updates core test helper timeouts/retry logic and adds settling delays.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.unit.cs	Removes Windows `[ActiveIssue]` suppressions and adds cleanup for create-related retry safety.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.cs	Re-enables a previously disabled init/resume test on Windows.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.SymbolicLink.cs	Re-maps Windows `ActiveIssue` annotations for specific symlink tests.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.InternalBufferSize.cs	Removes Windows `[ActiveIssue]` suppression for internal buffer size tests.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.NotifyFilter.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Move.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Delete.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Create.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Changed.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.NotifyFilter.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Move.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Delete.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Create.cs	Removes Windows `[ActiveIssue]` suppression at the class level.
src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Changed.cs	Removes Windows `[ActiveIssue]` suppression at the class level.

Comments suppressed due to low confidence (1)

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs:286

cleanup() is executed after the assertion. If the assertion fails (i.e., an unexpected event did occur), cleanup won’t run, which can leave state behind for the outer retry wrapper and lead to follow-on failures. Run cleanup in a finally so it executes regardless of assertion outcome.

            bool result = ExecuteAndVerifyEvents(watcher, unExpectedEvents, action, false, expectedPath == null ? null : new string[] { expectedPath }, timeout);
            Assert.False(result, "Expected Event occurred");

            if (cleanup != null)
                cleanup();

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs

- Fix comment to say 'linearly' instead of 'double' for progressive timeout - Wrap cleanup in try/finally so it runs even if ExecuteAndVerifyEvents throws Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs

- Increase WaitForExpectedEventTimeout to 5s, SubsequentExpectedWait to 2s, WaitForExpectedEventTimeout_NoRetry to 15s, WaitForUnexpectedEventTimeout to 500ms (all WaitOne-based, so zero happy-path cost) - Remove Thread.Sleep(50) settle delays (no startup race on any platform) - Add WaitUnlessCancelled helper: sleeps via CancellationToken.WaitHandle.WaitOne and throws on cancellation, preparing for xunit v3 migration - Wrap ExpectNoEvent cleanup in try/finally for exception safety Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ness

danmoseley · 2026-03-20T21:29:38Z

Note

This comment was generated with Copilot assistance.

Thanks for the feedback. Updated in the latest commit:

Timeout sizing: Increased WaitForExpectedEventTimeout to 5s (from 1s), SubsequentExpectedWait to 2s (from 10ms), WaitForExpectedEventTimeout_NoRetry to 15s (from 3s), WaitForUnexpectedEventTimeout to 500ms (from 150ms). These are all WaitOne-based, so they return immediately on signal — zero happy-path cost. The timeout is only the max wait. Of course, increasing potential time spent retrying is not useful if it could potentially cause the test fixture to time out as a whole, trading some failed tests for a failed test job. Budget-wise: worst case for the largest serial class (File.Move, 15 tests) is ~8 min with all retries exhausted. Other classes run in parallel via xunit, so total wall time depends on when File.Move gets scheduled and how many cores are available — should fit comfortably in 15 min, but worth watching CI.

Removed Thread.Sleep(50): There's no startup race on any platform — EnableRaisingEvents completes synchronously on Windows (Monitor() → ReadDirectoryChangesW), and inotify/FSEvents buffer on Linux/macOS. The 50ms wasn't making first attempts pass; when delivery is delayed due to thread pool starvation, 50ms doesn't help — the progressive WaitOne timeouts (5s/10s/15s) handle that.

Cancellation prep for xunit v3: Added a WaitUnlessCancelled(ms) helper that sleeps via CancellationToken.WaitHandle.WaitOne and throws OperationCanceledException on cancellation. Today uses CancellationToken.None (no-op). When the xunit v3 migration lands (#125019), changing one local to TestContext.Current.CancellationToken makes all retry sleeps bail out immediately on runner shutdown.

Also: Wrapped ExpectNoEvent cleanup in try/finally (was missing, unlike ExpectEvent).

danmoseley · 2026-03-20T21:30:11Z

Note rebased on main to test attribution!

Note
This comment was generated with Copilot assistance.

danmoseley · 2026-03-20T21:31:11Z

I could take out all the cancelation token stuff and make a note to myself to come back in a month when xunit v3 is in. But I figured it was better to put it in now, even though it's a no-op, and then go add the "switch on" in the xunit PR itself so when there's merged there's no follow up.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Copilot · 2026-03-20T21:34:47Z

src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.unit.cs

+                    ExpectEvent(watcher, WatcherChangeTypes.Created, () => Directory.CreateDirectory(directoryOne), cleanup: () => Directory.Delete(directoryOne), expectedPath: directoryOne);
+                    ExpectEvent(watcher, WatcherChangeTypes.Created, () => Directory.CreateDirectory(directoryTwo), cleanup: () => Directory.Delete(directoryTwo), expectedPath: directoryTwo);


The new cleanup lambdas for directoryOne/directoryTwo call Directory.Delete(...) unconditionally. If the directory creation fails (or the directory is already missing for any reason), this will throw during cleanup and can mask the real failure/retry signal. Make these cleanups resilient (e.g., if (Directory.Exists(...)) Directory.Delete(...)).

Copilot · 2026-03-20T21:34:48Z

src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.InternalBufferSize.cs

    public class InternalBufferSizeTests : FileSystemWatcherTest
    {


The PR description says the SymbolicLink/InternalBufferSize tests tied to issue #53366 “remain disabled”, but this file removes the ActiveIssue attribute from InternalBufferSizeTests. Please either update the description to match the intent (re-enabled) or keep the appropriate ActiveIssue annotations if these are still expected to be disabled on some platforms.

Copilot · 2026-03-20T21:34:48Z

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs

+        // Waits for the specified duration unless the test runner signals cancellation,
+        // in which case it throws OperationCanceledException to fail the test immediately.
+        private static void WaitUnlessCancelled(int milliseconds)
+        {
+            // Replace with TestContext.Current.CancellationToken when migrated to xunit v3 (#125019).
+            CancellationToken token = CancellationToken.None;
+            token.WaitHandle.WaitOne(milliseconds);
+            token.ThrowIfCancellationRequested();
+        }


WaitUnlessCancelled currently uses CancellationToken.None, so it can never observe runner cancellation and will behave like an unconditional wait. The comment/docstring claims it will fail early on cancellation; either wire this up to a real test cancellation token (if available) or adjust the comment to reflect the current behavior to avoid misleading future maintainers.

Copilot · 2026-03-20T21:34:48Z

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs

                action();
-                eventsOccurred.WaitOne(new TimeSpan(0, 0, 5));
+                eventsOccurred.WaitOne(new TimeSpan(0, 0, 15));
            }


ExpectEvents now waits a hardcoded 15 seconds (new TimeSpan(0,0,15)). To keep timeouts consistent and easier to tune, prefer using an existing constant (e.g., WaitForExpectedEventTimeout_NoRetry) or a named TimeSpan constant rather than an inline magic value.

danmoseley and others added 5 commits March 19, 2026 23:05

Copilot AI review requested due to automatic review settings March 20, 2026 05:06

github-actions bot added the area-System.IO label Mar 20, 2026

dotnet-policy-service bot assigned danmoseley Mar 20, 2026

danmoseley requested a review from adamsitnik March 20, 2026 05:07

Copilot started reviewing on behalf of danmoseley March 20, 2026 05:08 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

stephentoub reviewed Mar 20, 2026

View reviewed changes

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs Outdated Show resolved Hide resolved

stephentoub reviewed Mar 20, 2026

View reviewed changes

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs Outdated Show resolved Hide resolved

danmoseley and others added 2 commits March 20, 2026 15:15

Merge remote-tracking branch 'upstream/main' into fix/fsw-test-robust…

1cfb39e

…ness

Copilot AI review requested due to automatic review settings March 20, 2026 21:29

Copilot started reviewing on behalf of danmoseley March 20, 2026 21:30 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

		ExpectEvent(watcher, WatcherChangeTypes.Created, () => Directory.CreateDirectory(directoryOne), cleanup: () => Directory.Delete(directoryOne), expectedPath: directoryOne);
		ExpectEvent(watcher, WatcherChangeTypes.Created, () => Directory.CreateDirectory(directoryTwo), cleanup: () => Directory.Delete(directoryTwo), expectedPath: directoryTwo);

		public class InternalBufferSizeTests : FileSystemWatcherTest
		{

Conversation

danmoseley commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Categories of failures

Fixes in this PR (and why each)

Validation steps

Unchanged

Uh oh!

dotnet-policy-service bot commented Mar 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danmoseley commented Mar 20, 2026

Uh oh!

danmoseley commented Mar 20, 2026

Uh oh!

danmoseley commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danmoseley commented Mar 20, 2026 •

edited

Loading

danmoseley commented Mar 20, 2026 •

edited

Loading