Improve RemoveDependenciesFromEntryIfMissing #5392

Merged
rainersigwald merged 11 commits into dotnet:master from benvillalobos:5180-removedependenciesfromentryifmissing on Jun 24, 2020
@benvillalobos benvillalobos commented Jun 2, 2020

Fixes #5180

The fix is a straightforward cache of files that have been detected to exist already, preventing duplicate file system checks on files we already know exist.

There will be a future issue and PR addressing ConstructDependencyTable mentioned in a comment on the original issue.

Testing on my machine shows an improvement in RemoveDependenciesFromEntryIfMissing of ~850ms down to ~25ms.

We still want to run the same code on files we know exist, but we don't need to check for the file every time.
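As an illustration of the idea described above, here is a minimal sketch of a positive-only existence cache. It is not the code from this PR: `ExistenceCacheSketch`, `KeepExisting`, and the `fileExists` delegate (standing in for `FileUtilities.FileExistsNoThrow`) are hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;

public static class ExistenceCacheSketch
{
    // Returns the dependencies that exist. Each path found to exist is probed
    // on the file system only once; later occurrences are answered from the set.
    // `fileExists` is an assumed stand-in for FileUtilities.FileExistsNoThrow.
    public static List<string> KeepExisting(IEnumerable<string> dependencies, Func<string, bool> fileExists)
    {
        var knownToExist = new HashSet<string>();
        var kept = new List<string>();

        foreach (string file in dependencies)
        {
            // Short-circuit: a cache hit skips the file system probe entirely.
            if (knownToExist.Contains(file) || fileExists(file))
            {
                knownToExist.Add(file); // no-op when the path is already cached
                kept.Add(file);
            }
        }

        return kept;
    }
}
```

Because only positive results are stored, a path that was missing is probed again the next time it appears, so a file created in the meantime is still picked up.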
Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
  // If we are ignoring missing files, then only record those that exist
- if (FileUtilities.FileExistsNoThrow(file))
+ // Cache the files as we find them to save time (On^2), at the expense of storing data O(n).
+ if (fileCache.Contains(file) || FileUtilities.FileExistsNoThrow(file))
Member

Did you see the caching logic in FileExistsNoThrow, opt-in behind an environment variable? I'm assuming that we are not ready to turn it on yet, is that correct?

Contributor

The only reason I see not to have it on is that the values could become stale. This captures only half of that, so if a file is added after it was first determined to not exist, this would still be able to pick it up, whereas the other cache would not. It sounds better to go with the other cache to me, but since it's behind an escape hatch, I'm guessing someone actually did that at some point? Do we know if they added a file or deleted a file and whether their case generalizes?

Member Author

I actually didn't catch this, whoops! I'm not sure why it's behind an escape hatch, but it's worth investigating.

Nathan raises a good question. I had the same question that I forget the answer to (cc: @rainersigwald ). What should be done about stale values? Does it even matter here?

Member

That escape hatch is for a different thing.

In general, it's not safe to cache file existence checks in MSBuild: we're a build engine! The whole point is to create files! So knowing that a file didn't exist N milliseconds ago is not generally useful: was that just the "should I run this target?" check, and now we're doing copy-if-exists after creating it?

The escape hatch was added for fast-evaluation scenarios--if you pinkie-swear to not do any operations that could create files as part of the "build"/evaluation you're doing, you can avoid many checks that would be required for correctness without that constraint. The internal CloudBuild system uses that to quickly use heuristics to predict project inputs and outputs.

Contributor

@Forgind Forgind Jun 4, 2020

Is this also used for Clean? For a simple build, you wouldn't be removing a file unless you replace it with a more up-to-date version of itself, and that would make this cache perfect. With Clean it might not be, if the cache lasts for any noticeable amount of time.

Member

> For a simple build, you wouldn't be removing a file unless you replace it with a more up-to-date version of itself, and that would make this cache perfect.

I don't think that's true in practice. There are plenty of remove/delete operations in a normal build, especially in the face of errors.

// Cache of last write times
private readonly ConcurrentDictionary<string, DateTime> _lastWriteTimeCache = new ConcurrentDictionary<string, DateTime>(StringComparer.Ordinal);
// Cache of files that have been checked and exist.
private HashSet<string> fileCache = new HashSet<string>();
Member

What's the lifetime of this class? Does it exist only for a single computation (so the lifetime of the whole cache is short, and we don't have to worry about invalidating it) or for longer (and there might be stale cached data)?

Member Author

(Notes from our Teams call about this): It looks to exist within task execution. To be safe, we decided to keep the scope of the change local to the function rather than global to the class. This slightly increases the time taken because the cache resets between calls, but this way we don't need to worry about using stale data.
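The scoping decision described in these notes can be sketched as follows. This is an illustration under assumptions, not the merged code: `DependencyPruner`, `RemoveMissing`, and `Probes` are hypothetical names, and the injected `fileExists` delegate stands in for the real existence check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the tracked-files class; only the cache scoping matters here.
public sealed class DependencyPruner
{
    private readonly Func<string, bool> _fileExists;

    // Counts real existence probes so the cost of the narrower scope is visible.
    public int Probes { get; private set; }

    public DependencyPruner(Func<string, bool> fileExists) => _fileExists = fileExists;

    public List<string> RemoveMissing(IReadOnlyList<string> dependencies)
    {
        // Local cache: created per call, so nothing learned here outlives the call.
        var knownToExist = new HashSet<string>();
        var kept = new List<string>();

        foreach (string file in dependencies)
        {
            bool exists = knownToExist.Contains(file);
            if (!exists)
            {
                Probes++;
                exists = _fileExists(file);
            }

            if (exists)
            {
                knownToExist.Add(file);
                kept.Add(file);
            }
        }

        return kept;
    }
}
```

With the cache held in a local, a file deleted between two calls is re-probed on the second call rather than reported from stale data. The cost is that every call starts with an empty cache.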

Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
// Cache of last write times
private readonly ConcurrentDictionary<string, DateTime> _lastWriteTimeCache = new ConcurrentDictionary<string, DateTime>(StringComparer.Ordinal);
// Cache of files that have been checked and exist.
private HashSet<string> fileCache = new HashSet<string>();
Member

What threading guarantees do we provide on the call you're changing? Does this need to be a concurrent collection?

Member

For what it's worth, the method is already updating a non-thread-safe collection, DependencyTable, so it looks like it doesn't have threading guarantees, and using a regular collection for the cache should be fine.

Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
To store whether a file that has been previously checked existed or not. That way we skip file checks on files that we already know don't exist.
Also slight changes to try to match code between canonicaltracked input and output files.cs
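A cache that records both outcomes, as this commit message describes, might look like the sketch below. It assumes a `Dictionary<string, bool>` keyed by path; `BoolExistenceCache`, `ExistsCached`, and the `fileExists` delegate are hypothetical names for illustration, not identifiers from the PR.

```csharp
using System;
using System.Collections.Generic;

public static class BoolExistenceCache
{
    // Returns whether `file` exists, consulting `cache` first.
    // A single TryGetValue lookup covers both "known to exist" and "known to be missing".
    public static bool ExistsCached(string file, Dictionary<string, bool> cache, Func<string, bool> fileExists)
    {
        if (!cache.TryGetValue(file, out bool exists))
        {
            // First time this path is seen: probe once and remember the answer.
            exists = fileExists(file);
            cache[file] = exists;
        }

        return exists;
    }
}
```

Caching negative results is only safe while nothing can create the file during the cache's lifetime, which is why the scope of the cache matters in the discussion above.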
Contributor

@Forgind Forgind left a comment

Looking good! Couple small optimizations.

Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
Comment on lines +743 to +745
rootingMarker = correspondingOutputs != null
? FileTracker.FormatRootingMarker(source[sourceIndex], correspondingOutputs[sourceIndex])
: FileTracker.FormatRootingMarker(source[sourceIndex]);
Contributor

Suggested change
- rootingMarker = correspondingOutputs != null
-     ? FileTracker.FormatRootingMarker(source[sourceIndex], correspondingOutputs[sourceIndex])
-     : FileTracker.FormatRootingMarker(source[sourceIndex]);
+ rootingMarker = FileTracker.FormatRootingMarker(source[sourceIndex], correspondingOutputs?[sourceIndex]);

Member Author

I can't take this suggestion; it causes a NullReferenceException. FormatRootingMarker is capable of handling the null array case, but this suggestion passes an array with one null element. In this case I'd rather keep the original.

Contributor

Wait, what? This should never be passing an array, just null or the value of the array. Did you miss the '?'?

Member Author

I think you may be mistaken as to which overload gets called here. There are two overloads that are very similar:
public static string FormatRootingMarker(ITaskItem source, ITaskItem output) and
public static string FormatRootingMarker(ITaskItem[] sources, ITaskItem[] outputs)

The suggested change calls the first overload, which packs both items into arrays like so:

public static string FormatRootingMarker(ITaskItem source, ITaskItem output) => FormatRootingMarker(new[] { source }, new[] { output });

Then calls the second overload which eventually does this:

// So we don't have to deal with null checks.
outputs = outputs ?? Array.Empty<ITaskItem>();

var rootSources = new List<string>(sources.Length + outputs.Length);

foreach (ITaskItem source in sources)
{
    rootSources.Add(FileUtilities.NormalizePath(source.ItemSpec).ToUpperInvariant());
}

foreach (ITaskItem output in outputs)
{
    rootSources.Add(FileUtilities.NormalizePath(output.ItemSpec).ToUpperInvariant());
}

Notice that it does a null check on the array, not on the items within, so the second foreach loop will break because it tries to get the ItemSpec from a null item.
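The overload behavior described above can be reproduced in isolation. The sketch below uses hypothetical types (`IItem`, `Item`, `MarkerSketch.Format`) in place of `ITaskItem` and `FileTracker.FormatRootingMarker`; the single-argument overload passing a null array is an assumption made for the example, and path normalization is omitted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for ITaskItem and a task item implementation.
public interface IItem { string ItemSpec { get; } }

public sealed class Item : IItem
{
    public Item(string spec) => ItemSpec = spec;
    public string ItemSpec { get; }
}

public static class MarkerSketch
{
    // Assumed single-source overload: "no outputs" is expressed as a null array.
    public static string Format(IItem source) => Format(new[] { source }, null);

    // Item overload: wraps each argument in a one-element array, even when output is null.
    public static string Format(IItem source, IItem output) => Format(new[] { source }, new[] { output });

    public static string Format(IItem[] sources, IItem[] outputs)
    {
        // Guards against a null array only, not against null elements.
        outputs = outputs ?? Array.Empty<IItem>();

        var specs = new List<string>(sources.Length + outputs.Length);
        foreach (IItem s in sources) specs.Add(s.ItemSpec.ToUpperInvariant());
        foreach (IItem o in outputs) specs.Add(o.ItemSpec.ToUpperInvariant()); // throws if o is null
        return string.Join("|", specs);
    }
}
```

With `IItem[] outputs = null`, the expression `outputs?[0]` has type `IItem` and evaluates to null, so `Format(src, outputs?[0])` binds to the two-item overload and ends up iterating over an array containing one null element. The original ternary avoids this because its null branch calls the single-argument overload, which never builds such an array.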

Contributor

Well, I understand why the change could be problematic, but now I don't understand why your version would work. In any case, it's the same as it had been before, and the change wasn't really an optimization anyway, so I'm not going to block on this.

- if (FileUtilities.FileExistsNoThrow(file))
+ // Record whether or not each file exists and cache it.
+ // We do this to save time (On^2), at the expense of data O(n).
+ bool inFileCache = fileCache.ContainsKey(file);
Contributor

Same as above.

Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs Outdated
Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedOutputFiles.cs Outdated
Correctly caching whether or not the file exists in the case that we hadn't already cached it.
Member

@ladipro ladipro left a comment

The changes look good, thank you. Are the perf numbers in the description still valid after reducing the scope of the cache?

Comment thread src/Utilities/TrackedDependencies/CanonicalTrackedInputFiles.cs
@benvillalobos
Member Author

The perf numbers are now closer to 30~35ms, which isn't surprising.

@Forgind
Contributor

Forgind commented Jun 11, 2020

> The perf numbers are now closer to 30~35ms, which isn't surprising.

Still very nice improvement!

@rainersigwald rainersigwald added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Jun 12, 2020
@rainersigwald rainersigwald added this to the MSBuild 16.7 Preview 4 milestone Jun 12, 2020
@benvillalobos
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@rainersigwald rainersigwald merged commit 3bd6463 into dotnet:master Jun 24, 2020
Forgind added a commit to Forgind/msbuild that referenced this pull request Jun 29, 2020
Second part

Part 1 of not checking byte

parent d86a1e168bdf295aa777d47ee1a4b988b8913889
author Nathan Mytelka <Forgind@users.noreply.github.com> 1591730709 -0700
committer Nathan Mytelka <Forgind@users.noreply.github.com> 1593461702 -0700

Remove outdated comment

Part 2 of unmasking first byte

Part 3

Part 4

Part 5

Part 6

Reenabled administrator privilege and cleanup

Add test

Improve RemoveDependenciesFromEntryIfMissing (dotnet#5392)

Fixes dotnet#5180

The fix is a straightforward cache of files that have been detected to exist already, preventing duplicate file system checks on files we already know exist.

There will be a future issue and PR addressing ConstructDependencyTable mentioned in a comment on the original issue.

Testing on my machine shows an improvement in RemoveDependenciesFromEntryIfMissing of ~850ms down to 30~35ms.

Spruce up ObjectModelHelpers assertions

These fired while I was writing a new test but didn't have much useful information.

Keep AssertItems from throwing an ArgumentOutOfRangeException on mismatched lengths,
and give a clue or two about mismatched lengths in AssertItemHasMetadata.

Regression tests for dotnet#5445

Ensure that Update and Remove operations done at evaluation time that use
item functions pay attention to the item function and don't apply to all
items of the same type.

Respect item functions in lazy Update/Remove

Fixes dotnet#5445 by checking to see if an item function is invoked (the captured
expression has subcaptures) before optimizing operations on same-item
captures.

Rename short-circuit-lazy-item-update check method

The question this method answers is 'can I just remove/update every item in the group, or do I need to expand the value to match against existing items?'

Renamed it for a bit more clarity there.

Log CurrentUICulture in binlog (dotnet#5426)

This will be useful to be able to open localized binlogs. If we know the culture of the log we can fetch the right resources from the MSBuild .dlls.

Properly parse version

Part 7
Forgind added a commit that referenced this pull request Jul 22, 2020
* First step

* Moved user check

* Moved Handshake

* Fixed build

* Refactoring

* Move fire byte calculation into loop

* Save before committing 😃

* Catch uncaught exception

* PR feedback

* Correct off-by-one error
Forgind added a commit that referenced this pull request Jul 27, 2020
* First step

* Make handshake version explicit

* PR comments
Development

Successfully merging this pull request may close these issues.

Improve RemoveDependenciesFromEntryIfMissing() when there is a lot of input files.

4 participants