Skip to content

[WIP] Support for alternative hashes (#1676)#2022

Closed
shcheklein wants to merge 20 commits into
treeverse:masterfrom
shcheklein:fix-1676
Closed

[WIP] Support for alternative hashes (#1676)#2022
shcheklein wants to merge 20 commits into
treeverse:masterfrom
shcheklein:fix-1676

Conversation

@shcheklein
Copy link
Copy Markdown
Contributor

  • [x ] Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fixes #1676

Based on a fork by @vyloy .

Copy link
Copy Markdown
Contributor Author

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1920 - address comments from the previous PR

Comment thread dvc/config.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply:

return bool(modchecksum.checksum_types_from_str(...))

?

Comment thread dvc/output/base.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be:

return next((self.info[t] for t in self.remote.checksum_types() if t in self.info), None)
# or
return next(filter(bool, map(self.info.get, self.remote.checksum_types())), None) 

:)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

return self.info[self.checksum]

Comment thread dvc/remote/base.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this properties as above?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is that an error if we have more than one here? If yes than I suggest adding an assert.

Comment thread dvc/remote/base.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a decorator to abstract away caching?

Comment thread dvc/remote/base.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to call .keys() just types = set(checksum_info). Actually Python has optimized code path to make sets from dicts. intersection is computed twice.

Comment thread dvc/remote/base.py Outdated
Copy link
Copy Markdown
Contributor

@Suor Suor May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this addition_check only purpose to pass there self.changed_cache? If so won't it be more clear to simply use bool parameter?

Comment thread dvc/utils/checksum.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this go to appropriate modules and just stay contained there? This will eliminate one level of indirection - these constants.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will go to the appropriate modules for sure 👍

Comment thread dvc/remote/local/__init__.py Outdated
Copy link
Copy Markdown
Contributor

@Suor Suor May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like file_md5/file_checksum second result is never used. Maybe now is a good time to drop it and return only hex.

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just:

SCHEMA.update({
    Optional(PARAM_CMD): Or(str, None),
    ...
})

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ors instead of any() here will simplify code and provide short-circuit behavior.

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if d.get(k)}

# Or if you don't expect falsy values in `d`:
checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if k in d}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or using funcy:

from funcy import project
checksum = project(d, modchecksum.CHECKSUM_MAP)

Funcy is my lib :)

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r = {k: v for k, v in self.checksum.items() if v} 

Or using funcy:

from funcy import compact
r = compact(self.checksum)

Comment thread dvc/stage.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole thing is merging of two dicts with a filter:

from  itertools import chain

d1 = self.checksum
d2 = {...}
return {k: v for k, v in chain(d1.items(), d2.items()) if v}

# or with funcy
from funcy import merge, compact

d1 = self.checksum
d2 = {...}
return compact(merge(d1, d2))

Comment thread dvc/utils/checksum.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be

if not set(checksum_types) <= set(supported_types):
    return None

Comment thread dvc/utils/checksum.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good place to use list comprehension.

Comment thread tests/unit/remote/hash/test_hash.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like functional tests not unit ones. Also could be easily written in new pytest style using dvc fixture:

def test_compatibility(dvc):
    ret = main(...)
    assert ret == 0

    # ....

@shcheklein
Copy link
Copy Markdown
Contributor Author

@Suor thanks for reviewing this. It was started by an external contributor and am trying to get it done, I don't expect it to be fast though. I'll ping you again when it's in a better shape and some fundamental issues are solved. I'll address your comments along the way, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alterative hashings for data

3 participants