Compute hashes in parallel in GetFileHash #5303
Conversation
ladipro
left a comment
Hi @pranavkm, in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound, I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios.
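For background on the overcommitment concern: the usual knob for keeping intra-task parallelism from competing with MSBuild's own scheduling is ParallelOptions.MaxDegreeOfParallelism. The snippet below is purely illustrative and is not what this PR does; the cap value and program structure are made-up examples of the mitigation being hinted at.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

class BoundedHashingSketch
{
    static void Main(string[] args)
    {
        string[] files = args; // stand-in for the task's Files input

        var options = new ParallelOptions
        {
            // Hypothetical cap so a single task cannot oversubscribe the machine.
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount / 2)
        };

        var hashes = new string[files.Length];
        Parallel.For(0, files.Length, options, i =>
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(files[i]))
            {
                hashes[i] = BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
            }
        });

        foreach (var hash in hashes)
        {
            Console.WriteLine(hash);
        }
    }
}
```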
src/Tasks/FileIO/GetFileHash.cs
Outdated
- var hash = ComputeHash(Algorithm, file.ItemSpec);
+ var file = Files[index];
+ var hash = ComputeHash(algorithmFactory, file.ItemSpec);
  file.SetMetadata("FileHashAlgorithm", Algorithm);
This call doesn't appear to be thread safe. You should do it under a lock or defer setting the metadata to after the parallel loop.
There isn't any new concurrency introduced inside the body of the loop. The body of the Parallel.For is invoked exactly once for each index, so you wouldn't have multiple threads processing the same ITaskItem concurrently. The only way this would be problematic is if ITaskItem[] does not allow calling methods on different items concurrently, but I'm not sure how you'd have that.
I was thinking about duplicate items in the Files array but haven't succeeded in reproducing this. We may always wrap items with unique instances, in which case this would indeed be safe. But it sure feels like an unnecessarily brittle invariant to rely on.
It wouldn't be incorrect to take out duplicates, right?
Should be fairly easy to extract the array change if you think it's necessary. Let me know and I'll update accordingly.
Team triage: Ladi has successfully scared Rainer about this. Please do extract the SetMetadata outside the Parallel.For. To avoid problems with duplicates, probably store in an array and go by index rather than a dictionary of Include strings or something.
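A minimal sketch of what that triage suggestion could look like, assuming the task's existing `Files`, `Algorithm`, `algorithmFactory`, `encoding`, `ComputeHash`, and `EncodeHash` members (the output metadata name below is illustrative): hashes are collected into an index-keyed array inside the parallel loop, and all `SetMetadata` calls happen sequentially afterwards, so even duplicate ITaskItem instances are never touched from two threads.

```csharp
// Sketch only: compute hashes in parallel, mutate items sequentially afterwards.
var encodedHashes = new string[Files.Length];

Parallel.For(0, Files.Length, index =>
{
    // Only reads and writes to the local array inside the loop body.
    var hash = ComputeHash(algorithmFactory, Files[index].ItemSpec);
    encodedHashes[index] = EncodeHash(encoding, hash);
});

// All ITaskItem mutation happens on the calling thread, keyed by index,
// so duplicate Include values cannot collide.
for (int index = 0; index < Files.Length; index++)
{
    Files[index].SetMetadata("FileHashAlgorithm", Algorithm);
    Files[index].SetMetadata("FileHash", encodedHashes[index]);
}
```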
pranavkm
left a comment
in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound, …
I haven't had a chance to profile the invocation of the task, and I have very little familiarity with profiling I/O-bound work. Do you have some suggestions on how I could try this?
I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios
How would one test this?
As for testing I/O vs. CPU, I'd try building a simple project that just computes hashes of large files and comparing the cold (freshly booted machine) and hot (project already built a couple of times, files are cached) build times. Thanks!
(Pending performance checks.)

I've been a little busy this week; I'll try to get some numbers over the weekend.

Updated. @ladipro I tried your suggestion for cold versus warm builds, but I see very little variance in the numbers once a project gets built. I tested this on a spinning disk (not an SSD); my guess is that I/O is cached after the initial read. Some more numbers from my local tests in case they're interesting:

File does not exist
- Before (100 files, 4 KB each): File at start: / File at end:
- After (100 files, 4 KB each): File at start: / File at end:

All files exist (5 files, 4 KB each)
- Before: 42 ms, GetFileHash, 1 call
- After: 43 ms, GetFileHash, 1 call

All files exist (100 files, 4 KB each)
- Before: 49 ms, GetFileHash, 1 call
- After: 46 ms, GetFileHash, 1 call

All files exist (5 files, 200 KB each)
- Before: 47 ms, GetFileHash, 1 call
- After: 44 ms, GetFileHash, 1 call

All files exist (100 files, 200 KB each)
- Before: 132 ms, GetFileHash, 1 call
- After: 61 ms, GetFileHash, 1 call

I tested this on an MSBuild built locally in Release configuration, with and without my changes. Using the MSBuild in the SDK gives very different results (including a quicker start time); I suspect that's because it has been ngened.
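For anyone wanting to reproduce numbers along these lines, the test inputs are easy to generate. This little helper is hypothetical (it is not part of the PR or its test project); it just writes N files of a given size for GetFileHash to chew on.

```csharp
using System;
using System.IO;

class HashBenchFixture
{
    // Hypothetical usage: HashBenchFixture <outputDir> <fileCount> <fileSizeBytes>
    static void Main(string[] args)
    {
        string dir = args[0];
        int count = int.Parse(args[1]);
        int size = int.Parse(args[2]);

        Directory.CreateDirectory(dir);
        var payload = new byte[size];
        new Random(42).NextBytes(payload); // deterministic content across runs

        for (int i = 0; i < count; i++)
        {
            File.WriteAllBytes(Path.Combine(dir, $"input{i}.bin"), payload);
        }
    }
}
```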
src/Tasks/FileIO/GetFileHash.cs
Outdated
var hash = ComputeHash(algorithmFactory, file.ItemSpec);
var encodedHash = EncodeHash(encoding, hash);
…
lock (file)
There was a problem hiding this comment.
nit: Locking on public objects should generally be avoided.
There was a problem hiding this comment.
Fixed to use a dedicated write lock.
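To illustrate the nit: the usual pattern is a private field that only the task itself can lock on, rather than the caller-visible ITaskItem. This is only a sketch of that idea; the field name is made up and the committed code may differ in detail.

```csharp
// Sketch: a lock object owned by the task, never exposed to callers.
private readonly object _hashWriteLock = new object();

// ...inside the Parallel.For body, instead of lock (file):
lock (_hashWriteLock)
{
    file.SetMetadata("FileHashAlgorithm", Algorithm);
    file.SetMetadata("FileHash", encodedHash);
}
```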
Co-Authored-By: Rainer Sigwald <raines@microsoft.com>
Thank you!
Based on investigations here: dotnet/aspnetcore#21021.
Hashes produced by GetFileHash can be calculated in parallel, which scales better for larger numbers of files. Some before-and-after numbers with this change:
[Before/after benchmark captures for 10 files and for 100 files]
In addition, this PR has a few more cleanups, as suggested by @bartonjs in dotnet/aspnetcore#21021 (comment).
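Since the diff passes an `algorithmFactory` into `ComputeHash`, a reasonable reading is that every iteration creates its own HashAlgorithm instance (HashAlgorithm objects are not safe to share across threads). The helper below is only a sketch of that shape; the method name and signature are assumptions, not necessarily the PR's exact code.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHashingSketch
{
    // A factory per call means each thread hashes with its own instance.
    internal static byte[] ComputeHash(Func<HashAlgorithm> algorithmFactory, string filePath)
    {
        using (var algorithm = algorithmFactory())
        using (var stream = File.OpenRead(filePath))
        {
            return algorithm.ComputeHash(stream);
        }
    }
}
```

Called as, for example, `ComputeHash(SHA256.Create, item.ItemSpec)`, which keeps per-thread hashing state isolated without any locking.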