Compute hashes in parallel in GetFileHash #5303

Merged: Forgind merged 4 commits into dotnet:master from pranavkm:prkrishn/parallel-hash on Apr 29, 2020

Conversation

@pranavkm (Contributor)

Based on investigations here: dotnet/aspnetcore#21021.

Hashes produced by GetFileHash can be calculated in parallel. This scales better for larger numbers of files. Some before-and-after numbers with this change:

10 files: before/after benchmark screenshots (images omitted)

100 files: before/after benchmark screenshots (images omitted)

In addition, this PR includes a few more cleanups, as suggested by @bartonjs in dotnet/aspnetcore#21021 (comment)
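The task itself is C#, but the shape of the change is easy to sketch. Below is a minimal Python illustration of the same pattern: hashing a list of files on a thread pool so the work scales with the number of files. The function names here are illustrative, not MSBuild's API.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def compute_hash(path, algorithm="sha256"):
    # Hash the file contents in chunks (names are illustrative, not MSBuild's API).
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def compute_hashes_parallel(paths, algorithm="sha256"):
    # Roughly analogous to Parallel.For over indices: each path is
    # processed exactly once, and map() returns results in input order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: compute_hash(p, algorithm), paths))
```

Because `map` preserves input order, the result list lines up by index with the input, the same property an index-based Parallel.For gives the C# task.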

@pranavkm pranavkm changed the title Compute hashes in parallel Compute hashes in parallel in GetFileHash Apr 21, 2020
@ladipro (Member) left a comment

Hi @pranavkm, in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound, I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios.

var hash = ComputeHash(Algorithm, file.ItemSpec);
var file = Files[index];
var hash = ComputeHash(algorithmFactory, file.ItemSpec);
file.SetMetadata("FileHashAlgorithm", Algorithm);
Member:

This call doesn't appear to be thread-safe. You should either do it under a lock or defer setting the metadata until after the parallel loop.

Contributor Author:

There isn't any new concurrency introduced inside the body of the loop. The body of the Parallel.For is invoked exactly once for each index, so you wouldn't have multiple threads processing the same ITaskItem concurrently. The only way this would be problematic is if ITaskItem[] did not allow calling methods on different items concurrently, but I'm not sure how you'd end up with that.

Member:

I was thinking about duplicate items in the Files array, but I haven't succeeded in reproducing this. We may always wrap items with unique instances, in which case this would indeed be safe. But it sure feels like an unnecessarily brittle invariant to rely on.

Contributor:

It wouldn't be incorrect to take out duplicates, right?

Contributor Author:

It should be fairly easy to move the mutation out of the loop if you think it's necessary. Let me know and I'll update accordingly.

@rainersigwald (Member) commented Apr 24, 2020:

Team triage: Ladi has successfully scared Rainer about this. Please do extract the SetMetadata outside the Parallel.For. To avoid problems with duplicates, probably store in an array and go by index rather than a dictionary of Include strings or something.
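The structure the triage comment asks for can be sketched as follows. This is an illustrative Python stand-in for the C# Parallel.For, with dicts standing in for ITaskItem and hypothetical metadata keys: hash in parallel into a results list indexed like the input, then set metadata sequentially after the loop, so duplicate items in Files cannot race.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_items(items, algorithm="sha256"):
    """Compute hashes in parallel, but mutate items only on the calling
    thread afterwards. `items` is a list of dicts standing in for
    ITaskItem (illustrative, not MSBuild's real type)."""
    def compute(path):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            h.update(f.read())
        return h.hexdigest()

    # Parallel phase: results are stored by index, so even duplicate
    # items in the input are harmless.
    with ThreadPoolExecutor() as pool:
        hashes = list(pool.map(lambda item: compute(item["path"]), items))

    # Sequential phase: the equivalent of SetMetadata, done after the loop.
    for item, digest in zip(items, hashes):
        item["FileHash"] = digest
        item["FileHashAlgorithm"] = algorithm
```

The design point is that only the read-heavy hashing runs concurrently; all mutation happens single-threaded, so no invariant about item uniqueness is needed.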

@pranavkm (Contributor Author) left a comment

in your benchmark what portion of the time went into waiting for I/O versus actual hash computation? If the files were cached in memory and the task was CPU-bound…

I haven't had a chance to profile the invocation of the task, and I have very little familiarity with profiling I/O-bound work. Do you have suggestions on how I could try this?

I'm worried that this may interfere with higher level MSBuild scheduling, lead to overcommitment, and even degrade the performance in some scenarios

How would one test this?


@ladipro (Member) commented Apr 21, 2020

As for testing I/O vs. CPU, I'd try building a simple project that just computes hashes of large files and compare the cold (freshly booted machine) and hot (project already built a couple of times, files are cached) build times. Thanks!

@Forgind (Contributor) commented Apr 23, 2020

(Pending performance checks.)

@pranavkm (Contributor Author)

I've been a little busy this week; I'll try to get some numbers over the weekend.

@pranavkm (Contributor Author)

Updated. @ladipro, I tried your suggestion for cold versus warm builds, but I see very little variance in the numbers once a project has been built. I tested this on a spinning disk (not an SSD); my guess is that I/O is cached after an initial read. Some more numbers from my local tests in case they're interesting:

File does not exist (100 files, 4 KB each)

Before: file at start, 47 ms; file at end, 48 ms
After: file at start, 50 ms; file at end, 55 ms

All files exist (times are for a single GetFileHash call)

5 files, 4 KB each: before 42 ms, after 43 ms
100 files, 4 KB each: before 49 ms, after 46 ms
5 files, 200 KB each: before 47 ms, after 44 ms
100 files, 200 KB each: before 132 ms, after 61 ms

I tested this on an MSBuild built locally in the Release configuration, with and without my changes. Using the MSBuild shipped in the SDK gives very different results (including a quicker start); I suspect that's because it's been NGen'd.

@ladipro (Member) left a comment

Thank you for the numbers! I'd probably still extract the calls outside the parallel loop instead of locking. The way it's written now you're locking on a public object so you never know who else may be holding it.

var hash = ComputeHash(algorithmFactory, file.ItemSpec);
var encodedHash = EncodeHash(encoding, hash);

lock (file)
Member:

nit: Locking on public objects should generally be avoided.

@pranavkm (Contributor Author) commented Apr 28, 2020:

Fixed to use a dedicated write lock.
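For reference, the alternative that was adopted, a dedicated private lock rather than locking on the item itself, looks like this as a Python sketch. `threading.Lock` stands in for a private C# lock object; the names and dict-based items are illustrative, not MSBuild's API.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor

def hash_items_locked(items, algorithm="sha256"):
    # A private lock, analogous to a dedicated C# lock object: since it
    # is local to this function, no outside caller can contend on it the
    # way they could on a publicly visible object.
    write_lock = threading.Lock()

    def work(item):
        h = hashlib.new(algorithm)
        with open(item["path"], "rb") as f:
            h.update(f.read())
        digest = h.hexdigest()
        with write_lock:  # serialize only the metadata writes
            item["FileHash"] = digest
            item["FileHashAlgorithm"] = algorithm

    with ThreadPoolExecutor() as pool:
        list(pool.map(work, items))
```

The hashing itself still runs in parallel; only the short metadata writes are serialized, and because the lock is private, nothing else in the process can hold it.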

@ladipro (Member) left a comment

Thank you!

@Forgind Forgind merged commit 2cee6d0 into dotnet:master Apr 29, 2020
@Forgind (Contributor) commented Apr 29, 2020

Thank you!

Forgind added a commit that referenced this pull request May 4, 2020
* LOC CHECKIN | microsoft/msbuild vs16.6 | 20200420 (#5299)

* Final branding_16.6 (#5273)

* merge

* Enable handshake timeout on Mono (#5318)

I am seeing consistent test hangs on macOS Mono. A node is getting into a bad state and is failing to respond to handshakes. While I do not yet understand the root cause, it is clear that having a timeout on the handshake operation mitigates the issue. The test is now failing but does not hang, saving developer time as well as test pipeline resources.

* Revert "Reverted and rebuilt for localizable strings." (#5246)

This adds back support for logging an error when a task returns false
without logging an error. This was originally added in #4940 but was
reverted because of multiple difficulties.

* Changed error code

* Add escape hatch

* Fix typo

* Filtering for duplicate content files with referencing projects (#4931)

* Updating content filtering based on content copying changes

* Add a flag that is enabled by default on Core; otherwise disabled by default.

* Adding Shutdown Message to Worker Nodes (#5262)

* Changed where Trace is being called and removed old functionality.

* Updating .NET Core branding to .NET (#5286)

* Override default Arcade Xunit configuration (#5302)

* Update Directory.Build.targets

* prevent arcade from injecting its own xunit file

* Don't log @(ReferencePath) at the end of ImplicitlyExpandDesignTimeFacades (#5317)

The actual ItemGroups inside the target already do a good job of logging exactly what items were added and/or removed and in what order and with what metadata.

Emitting an extra low-pri message which is unstructured here just adds noise, slows the builds down, wastes binlog space and is otherwise redundant.

* Compute hashes in parallel in GetFileHash (#5303)

* Compute hashes in parallel. This scales better for larger number of files.

* Use a dedicated write lock

* Allow disabling logging of task parameters and item metadata (#5268)

This enables fine-grained control over whether:

 * to log each parameter (whether input or output)
 * or whether to log item metadata for each ITaskItem[] parameter.

When LogTaskInputs is set the default behavior is still to log all parameters and all item metadata for ITaskItem[] parameters. Since this is very verbose and hurts performance without adding any useful information it is valuable to be able to turn this logging off in certain situations.

This approach allows controlling logging via setting simple properties or environment variables.

I've identified the specific tasks and parameters that we want to restrict logging for that would give us the most gains without losing any significant useful info:

https://github.com/KirillOsenkov/MSBuildStructuredLog/wiki/Task-Parameter-Logging

* Update dependencies from https://github.com/dotnet/arcade build 20200430.5 (#5325)

- Microsoft.DotNet.Arcade.Sdk: 1.0.0-beta.20221.2 -> 1.0.0-beta.20230.5

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>

* Use environment variable for handshake Resolves #4961 (#5196)

* Use environment variable for handshake Resolves #4961

* Combine means of hashing

* Support transitive project references in static graph (#5326)

Transitive project references are a thing now. Added support to static graph, so that buildxl and qb can avoid adding the transitive refs.
This PR is independent of #5222. Ideally, review this one first, as QB has a stronger dependency on this PR than on #5222.

Design

- transitive references are opt-in, per project evaluation
- once a project opts-in, transitivity is applied for all ProjectReference items
- a project opts in by setting the property AddTransitiveProjectReferencesInStaticGraph to true. The SDK does this automatically in Microsoft.Managed.After.Targets.
- interaction with crosstargeting: transitive refs are added only to inner builds, not the outer builds. This mimics vanilla msbuild.

Co-authored-by: Rainer Sigwald <raines@microsoft.com>

Co-authored-by: Martin Chromecek (Moravia IT) <v-chmart@microsoft.com>
Co-authored-by: Ladi Prosek <laprosek@microsoft.com>
Co-authored-by: Sarah Oslund <sfoslund@microsoft.com>
Co-authored-by: Ben Villalobos <villalobosb93@gmail.com>
Co-authored-by: Mihai Codoban <codobanmihai@gmail.com>
Co-authored-by: Kirill Osenkov <KirillOsenkov@users.noreply.github.com>
Co-authored-by: Pranav K <prkrishn@hotmail.com>
Co-authored-by: dotnet-maestro[bot] <42748379+dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: Rainer Sigwald <raines@microsoft.com>