Skip to content

Add GetFileHash and VerifyFileHash tasks#3398

Merged
cdmihai merged 3 commits intodotnet:masterfrom
natemcmaster:filehash-tasks
Jun 13, 2018
Merged

Add GetFileHash and VerifyFileHash tasks#3398
cdmihai merged 3 commits intodotnet:masterfrom
natemcmaster:filehash-tasks

Conversation

@natemcmaster
Copy link
Copy Markdown
Contributor

Add tasks for computing the file hash and verifying an expected file hash using SHA256, SHA384, or SHA512. These tasks were originally part of aspnet/buildtools, then moved to dotnet/arcade. I've polished them up and added better tests/logging. I'm not set on the API names so let me know if there is a better MSBuild convention for the task and property names.

Resolves #3369

cc @jeffkl

Copy link
Copy Markdown
Contributor

@jeffkl jeffkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, just a few minor comments.

Comment thread src/Tasks/FileIO/HashHelper.cs Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just move HashHelper's stuff to GetFileHash and call it from VerifyFileHash.

Comment thread src/Tasks/FileIO/HashHelper.cs Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I doubt this will be instantiated a lot, I think I'd prefer to see these be some sort of Lazy so that we don't end up with N instances of these classes when you might only be using one. Lazy<T> still allocates, perhaps a factory method with a switch? Is there a benefit of having them as static and would it be better to just instantiate them when needed?

Comment thread src/Tasks/FileIO/VerifyFileHash.cs Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is to return !Log.HasLoggedErrors;

Comment thread src/Tasks/Resources/Strings.resx Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-MSB3954: failed to compute file hash '{0}'. File not found.
+MSB3954: Failed to compute hash for file '{0}' because it does not exist or is inaccessible.

@natemcmaster
Copy link
Copy Markdown
Contributor Author

Updated with PR feedback.

Copy link
Copy Markdown
Contributor

@AndyGerlicher AndyGerlicher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks for contributing this!

@natemcmaster
Copy link
Copy Markdown
Contributor Author

@dotnet-bot test this please

Add tasks for computing the file hash and verifying an expected file hash using SHA256, SHA384, or SHA512.
@natemcmaster
Copy link
Copy Markdown
Contributor Author

Microsoft.Build.UnitTests.GetSDKReferenceFiles_Tests.GetSDKReferenceFilesTestFixture.VerifyReferencesLoggedX64

Test failure seems unrelated to my change. https://ci2.dot.net/job/Microsoft_msbuild/job/master/job/innerloop_Windows_NT_Full_prtest/597/testReport/junit/Microsoft.Build.UnitTests.GetSDKReferenceFiles_Tests/GetSDKReferenceFilesTestFixture/VerifyReferencesLoggedX64/

@dotnet-bot test Windows_NT Build for Full please

Comment thread src/Tasks/FileIO/GetFileHash.cs Outdated
public string Algorithm { get; set; } = DefaultFileHashAlgorithm;

/// <summary>
/// The metadata name where the hash is store in each item. File hash is in hex. Defaults to "FileHash".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "where the hash is stored in each item"

Comment thread src/Tasks/FileIO/GetFileHash.cs Outdated
if (!File.Exists(file.ItemSpec))
{
Log.LogErrorFromResources("FileHash.FileNotFound", file.ItemSpec);
continue;
Copy link
Copy Markdown
Contributor

@cdmihai cdmihai Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth continuing if a file does not exist?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, good point. We could run a separate loop first to check File.Exists before ever computing a hash.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Two loops might regress spinning disks a bit, but hopefully it's an ever decreasing problem. :)

Comment thread src/Tasks/FileIO/GetFileHash.cs Outdated

var hash = ComputeHash(Algorithm, file.ItemSpec);
file.SetMetadata("FileHashAlgoritm", Algorithm);
file.SetMetadata(MetadataName, ConversionUtilities.ConvertByteArrayToHex(hash));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the user be able to choose which of the two representations to use? I assume that they would only use one of them, so the other becomes wasted memory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better if we changed the tasks to support a property such as UseBase64Encoding="true" instead? That would allow us to operating in hex or base64 mode and avoid wasting memory.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an input property called "HashEncoding" with two valid options, hex and base64? And then single metadata / output would be either one or the other.

switch (algorithmName.ToUpperInvariant())
{
case "SHA256":
return SHA256.Create();
Copy link
Copy Markdown
Contributor

@cdmihai cdmihai Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these objects be reused for the entire task execution or do they carry state and need to be created for each ComputeHash call?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't task instances recreated?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but hash algorithm objects are created for each file to hash. Probably not a problem unless thousands of files get hashed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC HashAlgorithm objects carry state so need to be recreated for each file they are hashing.

@cdmihai
Copy link
Copy Markdown
Contributor

cdmihai commented Jun 12, 2018

@jeffkl @rainersigwald @natemcmaster
Are these tasks expensive enough to have them yield and reacquire the engine?

@jeffkl
Copy link
Copy Markdown
Contributor

jeffkl commented Jun 12, 2018

Are these tasks expensive enough to have them yield and reacquire the engine?

I'd imagine it depends on the size of the file. I don't think it would be used to hash huge files, but perhaps it could check the file size and then yield?

@natemcmaster
Copy link
Copy Markdown
Contributor Author

Are these tasks expensive enough to have them yield and reacquire the engine?

I guess it depends how far you want to go to optimize for hashing large files and/or large numbers of files. I'd have to explore the scenrios more since I'm not sure what the balance is between IO/CPU time. Do we already have a good way to run benchmarks?

@cdmihai
Copy link
Copy Markdown
Contributor

cdmihai commented Jun 12, 2018

Do we already have a good way to run benchmarks?

It's not easy yet to add arbitrary tests. Probably easiest way would be to create github repos where hashing / checking hook up in the design time build target graph. Then you could create a couple of repos with increasing file size / number of files hashed, run the design time build test against them, and that will give you a csv with times and memory.

@cdmihai
Copy link
Copy Markdown
Contributor

cdmihai commented Jun 12, 2018

But maybe it's only worth doing this when we actually observe slow run times. :)

…users to select hex or base64 encoding of file hashes
@natemcmaster
Copy link
Copy Markdown
Contributor Author

I've updated with PR feedback.

Comment thread src/Tasks/FileIO/VerifyFileHash.cs Outdated
Log.LogErrorFromResources("VerifyFileHash.HashMismatch", File, Algorithm, HashBase64, actualHash);
return false;
}
var actualHash = encoding == Tasks.HashEncoding.Hex
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use the same EncodeHash function from GetFileHash. Perhaps it could be an extension method of HashEncoding? It would mean that if another encoding method were added it would be less places to change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

}
}

internal static bool TryParseHashEncoding(string value, out HashEncoding encoding)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason you are using switch with all the hardcoded text strings here rather than just return Enum.TryParse<HashEncoding>(value, true, out encoding);

I am thinking about ensuring that if new encoding methods are added in the future, that it doesn't require changes in lots of places.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason. Enum.TryParse is better so switched to that.

@natemcmaster
Copy link
Copy Markdown
Contributor Author

Linux failures appear to be unrelated to my changes. Is this a known issue?

"/mnt/j/w/Microsoft_msbuild/master/innerloop_Ubuntu16.04_CoreCLR_prtest/artifacts/Debug/bootstrap/netcoreapp2.1/MSBuild/MSBuild.dll". Beginning retry 1 in 1000ms. The process cannot access the file '/mnt/j/w/Microsoft_msbuild/master/innerloop_Ubuntu16.04_CoreCLR_prtest/artifacts/Debug/bin/MSBuild.Bootstrap/publish/MSBuild.dll' because it is being used by another process. [/mnt/j/w/Microsoft_msbuild/master/innerloop_Ubuntu16.04_CoreCLR_prtest/src/MSBuild.Bootstrap/MSBuild.Bootstrap.csproj]

@dotnet-bot test Ubuntu16.04 Build for CoreCLR please

@jeffkl
Copy link
Copy Markdown
Contributor

jeffkl commented Jun 13, 2018

Linux failures appear to be unrelated to my changes. Is this a known issue?

I'll take a look if any fail on your latest commit

@natemcmaster
Copy link
Copy Markdown
Contributor Author

natemcmaster commented Jun 13, 2018

It must have just been flakiness. Re-running Ubuntu passed. But if you want to investigate that failure more, here's a link: https://ci2.dot.net/job/Microsoft_msbuild/job/master/job/innerloop_Ubuntu16.04_CoreCLR_prtest/619/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants