Compute hashes in parallel in GetFileHash #5303
Conversation
ladipro
left a comment
Hi @pranavkm, in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound, I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios.
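For background on the overcommitment concern: the usual knob for keeping intra-task parallelism from competing with MSBuild's own scheduling is ParallelOptions.MaxDegreeOfParallelism. The snippet below is purely illustrative and is not what this PR does; the cap value and program structure are made-up examples of the mitigation being hinted at.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

class BoundedHashingSketch
{
    static void Main(string[] args)
    {
        string[] files = args; // stand-in for the task's Files input

        var options = new ParallelOptions
        {
            // Hypothetical cap so a single task cannot oversubscribe the machine.
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount / 2)
        };

        var hashes = new string[files.Length];
        Parallel.For(0, files.Length, options, i =>
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(files[i]))
            {
                hashes[i] = BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
            }
        });

        foreach (var hash in hashes)
        {
            Console.WriteLine(hash);
        }
    }
}
```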
src/Tasks/FileIO/GetFileHash.cs
Outdated
- var hash = ComputeHash(Algorithm, file.ItemSpec);
+ var file = Files[index];
+ var hash = ComputeHash(algorithmFactory, file.ItemSpec);
  file.SetMetadata("FileHashAlgorithm", Algorithm);
This call doesn't appear to be thread safe. You should do it under a lock or defer setting the metadata to after the parallel loop.
There isn't any new concurrency introduced inside the body of the loop. The body of the Parallel.For is invoked exactly once for each index, so you wouldn't have multiple threads processing the same ITaskItem concurrently. The only way this would be problematic is if ITaskItem[] does not allow calling methods on different items concurrently, but I'm not sure how you'd have that.
I was thinking about duplicate items in the Files array but haven't succeeded in reproducing this. We may always wrap items with unique instances, in which case this would indeed be safe. But it sure feels like an unnecessarily brittle invariant to rely on.
It wouldn't be incorrect to take out duplicates, right?
Should be fairly easy to extract the array change if you think it's necessary. Let me know and I'll update accordingly.
Team triage: Ladi has successfully scared Rainer about this. Please do extract the SetMetadata outside the Parallel.For. To avoid problems with duplicates, probably store in an array and go by index rather than a dictionary of Include strings or something.
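A minimal sketch of what that triage suggestion could look like, assuming the task's existing `Files`, `Algorithm`, `algorithmFactory`, `encoding`, `ComputeHash`, and `EncodeHash` members (the output metadata name below is illustrative): hashes are collected into an index-keyed array inside the parallel loop, and all `SetMetadata` calls happen sequentially afterwards, so even duplicate ITaskItem instances are never touched from two threads.

```csharp
// Sketch only: compute hashes in parallel, mutate items sequentially afterwards.
var encodedHashes = new string[Files.Length];

Parallel.For(0, Files.Length, index =>
{
    // Only reads and writes to the local array inside the loop body.
    var hash = ComputeHash(algorithmFactory, Files[index].ItemSpec);
    encodedHashes[index] = EncodeHash(encoding, hash);
});

// All ITaskItem mutation happens on the calling thread, keyed by index,
// so duplicate Include values cannot collide.
for (int index = 0; index < Files.Length; index++)
{
    Files[index].SetMetadata("FileHashAlgorithm", Algorithm);
    Files[index].SetMetadata("FileHash", encodedHashes[index]);
}
```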
pranavkm
left a comment
in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound, …
I haven't had a chance to profile the invocation of the task, and I have very little familiarity with profiling I/O-bound work. Do you have some suggestions on how I could try this?
I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios
How would one test this?
As for testing I/O vs. CPU, I'd try building a simple project that just computes hashes of large files and comparing the cold (freshly booted machine) and hot (project already built a couple of times, files are cached) build times. Thanks!
(Pending performance checks.)

I've been a little busy this week; I'll try to get some numbers over the weekend.

Updated. @ladipro I tried your suggestion for cold versus warm builds, but I see very little variance in the numbers once a project gets built. I tested this on a spinning disk (not an SSD); my guess is that I/O is cached after the initial read. Some more numbers from my local tests in case they're interesting:

File does not exist
- Before (100 files, 4 KB each): File at start: / File at end:
- After (100 files, 4 KB each): File at start: / File at end:

All files exist (5 files, 4 KB each)
- Before: 42 ms, GetFileHash, 1 call
- After: 43 ms, GetFileHash, 1 call

All files exist (100 files, 4 KB each)
- Before: 49 ms, GetFileHash, 1 call
- After: 46 ms, GetFileHash, 1 call

All files exist (5 files, 200 KB each)
- Before: 47 ms, GetFileHash, 1 call
- After: 44 ms, GetFileHash, 1 call

All files exist (100 files, 200 KB each)
- Before: 132 ms, GetFileHash, 1 call
- After: 61 ms, GetFileHash, 1 call

I tested this on an MSBuild built locally in Release configuration, with and without my changes. Using the MSBuild in the SDK gives very different results (including a quicker start time); I suspect that's because it has been ngened.
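For anyone wanting to reproduce numbers along these lines, the test inputs are easy to generate. This little helper is hypothetical (it is not part of the PR or its test project); it just writes N files of a given size for GetFileHash to chew on.

```csharp
using System;
using System.IO;

class HashBenchFixture
{
    // Hypothetical usage: HashBenchFixture <outputDir> <fileCount> <fileSizeBytes>
    static void Main(string[] args)
    {
        string dir = args[0];
        int count = int.Parse(args[1]);
        int size = int.Parse(args[2]);

        Directory.CreateDirectory(dir);
        var payload = new byte[size];
        new Random(42).NextBytes(payload); // deterministic content across runs

        for (int i = 0; i < count; i++)
        {
            File.WriteAllBytes(Path.Combine(dir, $"input{i}.bin"), payload);
        }
    }
}
```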
src/Tasks/FileIO/GetFileHash.cs
Outdated
var hash = ComputeHash(algorithmFactory, file.ItemSpec);
var encodedHash = EncodeHash(encoding, hash);
…
lock (file)
There was a problem hiding this comment.
nit: Locking on public objects should generally be avoided.
There was a problem hiding this comment.
Fixed to use a dedicated write lock.
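To illustrate the nit: the usual pattern is a private field that only the task itself can lock on, rather than the caller-visible ITaskItem. This is only a sketch of that idea; the field name is made up and the committed code may differ in detail.

```csharp
// Sketch: a lock object owned by the task, never exposed to callers.
private readonly object _hashWriteLock = new object();

// ...inside the Parallel.For body, instead of lock (file):
lock (_hashWriteLock)
{
    file.SetMetadata("FileHashAlgorithm", Algorithm);
    file.SetMetadata("FileHash", encodedHash);
}
```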
Co-Authored-By: Rainer Sigwald <raines@microsoft.com>
Thank you!
Based on investigations here: dotnet/aspnetcore#21021.
Hashes produced by GetFileHash can be calculated in parallel, which scales better for larger numbers of files. Some before-and-after numbers with this change:
[Before/after benchmark captures for 10 files and for 100 files]
In addition, this PR has a few more cleanups, as suggested by @bartonjs in dotnet/aspnetcore#21021 (comment).
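Since the diff passes an `algorithmFactory` into `ComputeHash`, a reasonable reading is that every iteration creates its own HashAlgorithm instance (HashAlgorithm objects are not safe to share across threads). The helper below is only a sketch of that shape; the method name and signature are assumptions, not necessarily the PR's exact code.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHashingSketch
{
    // A factory per call means each thread hashes with its own instance.
    internal static byte[] ComputeHash(Func<HashAlgorithm> algorithmFactory, string filePath)
    {
        using (var algorithm = algorithmFactory())
        using (var stream = File.OpenRead(filePath))
        {
            return algorithm.ComputeHash(stream);
        }
    }
}
```

Called as, for example, `ComputeHash(SHA256.Create, item.ItemSpec)`, which keeps per-thread hashing state isolated without any locking.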