Since md5 is sensitive to the order and format of the data, simple changes to the schema (e.g. swapping two columns) or to the type of a column (e.g. integer to float) lead to new hash values and duplicated datasets, even though the underlying data is the same. There are some alternatives that attempt to address this, such as UNF (http://guides.dataverse.org/en/latest/developers/unf/index.html).
It would be great to be able to specify an alternative hash function in DVC, and in particular to provide a user-defined one.
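
To illustrate the problem, here is a minimal sketch in plain Python. It is not UNF and not an existing DVC API: `md5_of_text` and `normalized_csv_hash` are hypothetical names, and the normalization rules (sort columns by name, parse numeric values as floats) are assumptions chosen only to show the idea of a user-defined hash.

```python
import csv
import hashlib
import io


def md5_of_text(text: str) -> str:
    """Plain md5 over the raw bytes: sensitive to column order and formatting."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def normalized_csv_hash(text: str) -> str:
    """Hypothetical user-defined hash (illustration only, not UNF or DVC).

    Ignores column order and the integer/float spelling of numeric values.
    """
    digest = hashlib.sha256()
    for row in csv.DictReader(io.StringIO(text)):
        # Visit columns by name so swapping columns does not change the hash.
        for name in sorted(row):
            value = row[name]
            try:
                # Normalize numbers so "1" and "1.0" hash identically.
                value = repr(float(value))
            except ValueError:
                pass  # non-numeric values are hashed as-is
            digest.update(f"{name}={value}\x1f".encode("utf-8"))
        digest.update(b"\x1e")  # row separator
    return digest.hexdigest()


a = "id,score\n1,2\n3,4\n"
b = "score,id\n2.0,1\n4.0,3\n"  # columns swapped, integers written as floats

print(md5_of_text(a) == md5_of_text(b))
print(normalized_csv_hash(a) == normalized_csv_hash(b))
```

With md5 the two files are treated as different datasets, while the normalized hash treats them as the same one. Row order still matters in this sketch; whether it should is exactly the kind of decision a user-defined function would let each project make.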