Investigate directory enumeration performance #8396

@KirillOsenkov

I noticed we're using the standard Directory.EnumerateFiles() to enumerate files for globs. It's not very efficient, and it also runs the risk of throwing when it hits directories or files it can't access.

Sample first-chance exception:

System.UnauthorizedAccessException: Access to the path 'C:\Documents and Settings' is denied.
   at void System.IO.__Error.WinIOError(int errorCode, string maybeFullPath)
   at void System.IO.FileSystemEnumerableIterator<TSource>.CommonInit()
   at new System.IO.FileSystemEnumerableIterator<TSource>(string path, string originalUserPath, string searchPattern, SearchOption searchOption, SearchResultHandler<TSource> resultHandler, bool checkHost)
   at IEnumerable<string> System.IO.Directory.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.ManagedFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.MSBuildOnWindowsFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.CachingFileSystemWrapper.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFiles(IFileSystem fileSystem, string path, string filespec, string projectDirectory, bool stripProjectDirectory)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFileSystemEntries(IFileSystem fileSystem, FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory) => { } x 2
   at TValue System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue>.GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity type, string path, string pattern, string directory, bool stripProjectDirectory) => { }
   at IEnumerable<string> Microsoft.Build.Shared.FileMatcher.GetFilesForStep(RecursiveStepResult stepResult, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)+(string subdir) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForEachWorker<TSource, TLocal>(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body, Action<TSource, ParallelLoopState> bodyWithState, Action<TSource, ParallelLoopState, long> bodyWithStateAndIndex, Func<TSource, ParallelLoopState, TLocal, TLocal> bodyWithStateAndLocal, Func<TSource, ParallelLoopState, long, TLocal, TLocal> bodyWithEverything, Func<TLocal> localInit, Action<TLocal> localFinally)+(int i) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForWorker<TLocal>(int fromInclusive, int toExclusive, ParallelOptions parallelOptions, Action<int> body, Action<int, ParallelLoopState> bodyWithState, Func<int, ParallelLoopState, TLocal, TLocal> bodyWithLocal, Func<TLocal> localInit, Action<TLocal> localFinally)+() => { }
   at void System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at void System.Threading.Tasks.Task.ExecuteSelfReplicating(Task root)+() => { }
   at void System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at bool System.Threading.ThreadPoolWorkQueue.Dispatch()
   at bool System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

I'm also seeing the same exception for C:\Config.Msi when we accidentally enumerate the whole drive because some property is empty and the glob ends up starting with a \.
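To make the empty-property case concrete, here is a minimal sketch of a check that would flag such a glob before it is expanded. `GlobRootCheck`, `StartsAtDriveRoot` and the `sourceDir` variable are hypothetical names for illustration, not existing MSBuild code, and the check assumes Windows path semantics, where a single leading separator means "root of the current drive".

```csharp
using System;

static class GlobRootCheck
{
    // Hypothetical helper (not part of MSBuild): true when the glob begins
    // with a single directory separator, i.e. it is rooted at the current
    // drive instead of being relative to the project directory.
    // UNC paths (\\server\share) start with two separators and are not flagged.
    public static bool StartsAtDriveRoot(string glob)
    {
        if (string.IsNullOrEmpty(glob) || !IsSeparator(glob[0]))
        {
            return false;
        }

        return glob.Length == 1 || !IsSeparator(glob[1]);
    }

    private static bool IsSeparator(char c) => c == '\\' || c == '/';

    public static void Demo()
    {
        // Stands in for an MSBuild property that evaluated to an empty string.
        string sourceDir = "";
        string glob = sourceDir + @"\**\*.cs";

        // The glob is now "\**\*.cs", which walks the whole drive.
        Console.WriteLine(StartsAtDriveRoot(glob)); // True
    }
}
```

Whether such a glob should produce a warning, an error, or be left alone is a separate design question; the sketch only shows how cheaply the situation can be detected.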

I've had success calling the Win32 API directly, in parallel, to reduce allocations, achieving up to 2x the speed and 0.5x the allocations:
https://github.com/KirillOsenkov/Benchmarks/blob/8556f92c07b9a3d211a7e72b776c324aff7e24b7/src/Tests/DirectoryEnumeration.cs#L12-L15

Also it seems that this approach doesn't run into exceptions when trying to access inaccessible directories, unlike the BCL one.
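For comparison, newer .NET runtimes let the managed API skip inaccessible directories as well. Below is a minimal sketch, assuming the code runs on .NET Core 2.1 or later, where `EnumerationOptions` exists. It is not available on .NET Framework, which is what the stack above comes from. `QuietEnumeration` and `EnumerateFilesSkippingInaccessible` are made-up names for illustration. I haven't measured this against the Win32 approach.

```csharp
using System.Collections.Generic;
using System.IO;

static class QuietEnumeration
{
    // Hypothetical helper (not MSBuild code): recursive enumeration that asks
    // the runtime to skip subdirectories and files it is denied access to.
    public static IEnumerable<string> EnumerateFilesSkippingInaccessible(string root, string pattern)
    {
        var options = new EnumerationOptions
        {
            RecurseSubdirectories = true,
            IgnoreInaccessible = true,
            // The default skips Hidden | System entries; 0 keeps them, which is
            // closer to what the SearchOption-based overloads return.
            AttributesToSkip = 0,
        };

        return Directory.EnumerateFiles(root, pattern, options);
    }
}
```

Usage would look like `QuietEnumeration.EnumerateFilesSkippingInaccessible(@"C:\src", "*.cs")`. The path is only an example.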

Feel free to experiment with this benchmark, steal the source, try on real-world builds, see if you can tune it further, submit PRs if you can make it even faster ;)

The first place I would try this is in FileMatcher (see the stack above). Also, looking at the stack, I'd measure getting rid of the ConcurrentDictionary and trying a simple collection with a lock around it. I often get much better results with a simple lock around simple collections.
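As a rough illustration of the "simple collection with a lock" idea, here is a minimal sketch of a cache that could stand in for the ConcurrentDictionary GetOrAdd call seen in the stack. `LockedCache` is a hypothetical type, not existing MSBuild code, and whether it is actually faster would need measuring.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical replacement for a ConcurrentDictionary-based cache:
// a plain Dictionary guarded by a single lock.
sealed class LockedCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _entries = new Dictionary<TKey, TValue>();
    private readonly object _gate = new object();

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
    {
        lock (_gate)
        {
            if (!_entries.TryGetValue(key, out TValue value))
            {
                // The factory runs while the lock is held, so it is invoked
                // at most once per key, but other callers wait meanwhile.
                value = valueFactory(key);
                _entries.Add(key, value);
            }

            return value;
        }
    }
}
```

One trade-off to keep in mind: ConcurrentDictionary.GetOrAdd calls the factory outside its internal locks and may call it more than once for the same key, while this sketch serializes all factory calls. If the factory is the directory enumeration itself, that would reduce parallelism, so a real experiment might compute the value outside the lock and only lock for the lookup and the insert.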

I'm noticing we do have a ManagedFileSystem abstraction, so I guess we can try replacing the implementation in a single place and see if it can make our builds faster wholesale.

One potential concern is that the parallelism in the new method does a lot of thrashing, so I'm not sure how this performs on an HDD. But then again, do we care about HDDs anymore?
