
4813 allow duplicate files#6924

Merged
kcondon merged 83 commits into develop from 4813-allow-duplicate-files on Aug 3, 2020

Conversation

@sekmiller
Contributor

@sekmiller sekmiller commented May 20, 2020

What this PR does / why we need it:
This will allow users to upload multiple files with the same checksum value to a given dataset. On upload, whether via the interface or the API, the user is warned that a duplicate file exists in the dataset (along with the current path/label of the duplicate file). If the upload was via the UI, the user can then directly delete the newly uploaded file.

Which issue(s) this PR closes:
#4813 - allow files with the same MD5/checksum to exist in the same dataset

Closes #4813
Closes #6468

Special notes for your reviewer:
Really wanted to take a stick of dynamite to AddReplaceFileHelper, but ended up working with it as it exists. Also fixed an issue with the editFileMetadata API where, if you weren't updating the file's label, you'd get a duplicate file name error. This was causing a failure in the testForceReplaceAndUpdate test.

Suggestions on how to test this:
- Various scenarios uploading a duplicate file, including replace
- Document outlining upload use cases and expected messaging

Does this PR introduce a user interface change?:
Introduces a popup on upload of a duplicate file, which warns the user and allows them to immediately delete the newly uploaded file.

Is there a release notes update needed for this change?:
We could note that duplicate files within a dataset are now allowed as a new feature.

Additional documentation:

@coveralls

coveralls commented May 20, 2020

Coverage Status

Coverage decreased (-0.04%) to 19.562% when pulling 57ab613 on 4813-allow-duplicate-files into a6f580f on develop.

@sekmiller sekmiller removed their assignment May 20, 2020
@scolapasta scolapasta assigned scolapasta and sekmiller and unassigned scolapasta May 28, 2020
@sekmiller
Contributor Author

I updated the messaging based on discussion during the Design Meeting.

@TaniaSchlatter TaniaSchlatter removed their assignment Jul 29, 2020
Contributor

@scolapasta scolapasta left a comment


@sekmiller I added some comments; please review and let me know what you think.

// on *new* datafiles, that haven't been saved in the database yet;
// but it should never be the case in the context of this API)
// -- L.A. Mar. 2020
//SEK 5/2020 - we can't use checksum because it
Contributor

Rather than leaving (and adding to) these comments, at this point we can probably remove them all (from the "not sure.." on). Feel free to check with @landreev in case he feels there's any reason these should stay?

Contributor

did you check if we can just remove?


- Files with the same checksum can be included in a dataset, even if the files are in the same directory.
- Files with the same filename can be included in a dataset as long as the files are in different directories.
- If a user attempts to add a file to a directory where a file already exists with that directory/filename combination, Dataverse will adjust the file path and names by adding "-1" or "-2" as applicable. This change will be visible in the list of files being uploaded.
Contributor

Is it worth being more explicit here about upload vs. edit? I.e., related to the next bullet about changing the directory: you can also change the name after upload. (So the suggestion is to add that there, and change "add" to "upload" here.)

Contributor Author

I made a change to the doc here; please see whether it helps to add some clarity.
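The "-1"/"-2" path-and-name adjustment described in the doc text above could look roughly like this sketch. This is hypothetical illustration, not the actual Dataverse implementation: a numeric suffix is inserted before the extension and incremented until the name no longer collides.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "-1"/"-2" rule: when a directory/filename
// combination already exists, append "-1", "-2", ... before the extension
// until the candidate name is unique within that directory.
public class DuplicateNameResolver {

    public static String resolve(String fileName, Set<String> existingNames) {
        if (!existingNames.contains(fileName)) {
            return fileName;
        }
        int dot = fileName.lastIndexOf('.');
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        String ext = (dot > 0) ? fileName.substring(dot) : "";
        int counter = 1;
        String candidate = base + "-" + counter + ext;
        while (existingNames.contains(candidate)) {
            counter++;
            candidate = base + "-" + counter + ext;
        }
        return candidate;
    }
}
```

So uploading `file2.txt` into a directory that already holds `file2.txt` and `file2-1.txt` would yield `file2-2.txt`, matching the behavior the doc describes as visible in the list of files being uploaded.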


}

public void deleteMarkedAsDuplicateFiles() {
Contributor

Does this need to be a separate method? Or could it call the delete-files method after setting the selected files from the files with the same checksum? Or, maybe better yet: have a private delete method that takes a list, have this method call it with the list of files with the same checksum, and have the other (regular) delete call the private method with the list of selected files. (You also have to be careful to differentiate between deleting newly uploaded vs. pre-existing files, but I think in the end it would be cleaner, and less duplicated, code.) Let me know what you think of this suggestion.

Contributor Author

Makes sense. I'll take a look. I know we have different success messages that were introduced yesterday, but it might not be too bad to consolidate.

Contributor Author

Was able to use only one method for delete. Ran into a weird situation, a real edge case: if you uploaded a dupe, allowed it to stay, uploaded it again, then decided once is enough and deleted the new one, it would also delete the first dupe that was "ok". So I fixed that, allowing the first dupe to stay.

Contributor

Turns out the one-method approach (checking whether a file is selected to determine which delete is wanted) is problematic, so we do still need two public methods. BUT they just need to "collect" the files and then call a private delete, which still centralizes the code. @sekmiller's aware of the change needed, and will move to QA once that is all set.
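The design settled on above (two public methods that only collect files, feeding one private delete) can be sketched roughly as follows. All class and field names here are illustrative, not Dataverse's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the agreed refactoring: each public entry point
// only decides WHICH files to remove; the removal itself is centralized
// in a single private method.
public class UploadedFilesPage {

    static class FileMetadata {
        final String label;
        boolean markedAsDuplicate;
        boolean selected;
        FileMetadata(String label) { this.label = label; }
    }

    final List<FileMetadata> newFiles = new ArrayList<>();

    // Regular delete: removes whatever the user selected.
    public void deleteSelectedFiles() {
        deleteFiles(newFiles.stream()
                .filter(fm -> fm.selected)
                .collect(Collectors.toList()));
    }

    // Duplicate-popup delete: removes only the newly uploaded duplicates.
    public void deleteMarkedAsDuplicateFiles() {
        deleteFiles(newFiles.stream()
                .filter(fm -> fm.markedAsDuplicate)
                .collect(Collectors.toList()));
    }

    // One private method does the actual removal for both paths.
    private void deleteFiles(List<FileMetadata> toDelete) {
        newFiles.removeAll(toDelete);
    }
}
```

Keeping the two public methods but funneling them through one private delete avoids the edge case described above, since each caller builds its own explicit list instead of inferring intent from selection state.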

@Transient
private boolean markedAsDuplicate;

public boolean isMarkedAsDuplicate() {
Contributor

Is this for checksums? If so, do we need it, since we already have it on the DataFile object?

Contributor Author

We actually do need it because there's a separate list of file metadata that feeds the uploaded files table, and if the user deletes the duplicates we have to remove them there too. I was also using it to write the success message, but since we're no longer using names there, I did some code simplification.

Contributor

But can't you call fileMetaData.datafile.isMarkedAsDuplicate()? or something like that?
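The delegation being suggested here, as a minimal sketch with illustrative class names (the JPA @Transient annotation from the diff is omitted so the sketch compiles without a persistence dependency):

```java
// Hypothetical sketch of the suggestion: instead of adding a second
// markedAsDuplicate flag to FileMetadata, the metadata object delegates
// to the flag already carried by its DataFile.
public class DuplicateFlagSketch {

    static class DataFile {
        // In the actual diff this field is @Transient, i.e. not persisted.
        private boolean markedAsDuplicate;
        public boolean isMarkedAsDuplicate() { return markedAsDuplicate; }
        public void setMarkedAsDuplicate(boolean b) { markedAsDuplicate = b; }
    }

    static class FileMetadata {
        private final DataFile dataFile;
        FileMetadata(DataFile df) { this.dataFile = df; }
        // Delegation instead of a duplicated flag:
        public boolean isMarkedAsDuplicate() { return dataFile.isMarkedAsDuplicate(); }
    }
}
```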

@sekmiller sekmiller removed their assignment Jul 30, 2020
@sekmiller sekmiller removed their assignment Jul 30, 2020
@scolapasta scolapasta removed their assignment Jul 30, 2020
@kcondon kcondon self-assigned this Jul 31, 2020
@kcondon
Contributor

kcondon commented Jul 31, 2020

So, I tested all the rules and use cases as best I could. I think they all work, except on upload when paths are involved for duplicates that have not yet been saved.

  1. Upload file2.txt twice, edit paths to be c,d. Cannot save, says duplicate filenames.
    This works if we upload a zip with dupe filenames in different paths.

@kcondon kcondon assigned sekmiller and unassigned kcondon Jul 31, 2020
@sekmiller
Contributor Author

I was able to upload duplicate files, edit their paths, and successfully save them.
(Screenshot attached: Screen Shot 2020-08-03 at 9 45 58 AM)



Development

Successfully merging this pull request may close these issues.

- File Replace - "...same file" validation error msg not displayed correctly
- File Upload - allow files with same MD5 (or other checksum) in a dataset

7 participants