Description
What steps does it take to reproduce the issue?
Using the DVUploader, upload a large file (> 10GB) to Dataverse (local file-system, not S3).
When does this issue occur?
Every time I upload a file to local storage.
Which page(s) does it occur on?
I'm exclusively using DVUploader.
What happens?
On our DV host, the file first lands in /tmp. It is then copied to /usr/local/dv-temp/temp, at which point it is unzipped into that same directory (we double-zip). At this point there are three copies of the file on local storage. Next, iostat shows a long-running read operation at roughly 30 MB/s, which I believe corresponds to DV calculating an MD5 checksum. Running the md5sum Linux utility on the same file proceeds at about 500 MB/s, so there is decent available IO performance on our DV host. Looking at the code (https://github.com/IQSS/dataverse/blob/1435dcca970ee524ec32506f1d8d50c81026fe86/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java), it appears that the checksum is calculated 1 KB at a time, without buffered IO, which could explain the suboptimal performance I think I am seeing.
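For comparison, reading in larger chunks usually brings `MessageDigest` throughput much closer to `md5sum`. A minimal sketch of the buffered approach; the class name, 64 KiB buffer size, and demo file are illustrative assumptions, not Dataverse's actual code:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class ChecksumSketch {

    // Compute an MD5 hex digest, feeding the digest in large chunks.
    // 64 KiB is an illustrative buffer size, not a value taken from FileUtil.java.
    static String md5Hex(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Demo on a small temp file; on a 10 GB file the buffer size dominates.
        Path p = Files.createTempFile("checksum-demo", ".bin");
        Files.write(p, "hello world".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = new BufferedInputStream(new FileInputStream(p.toFile()))) {
            System.out.println(md5Hex(in));
        }
        Files.delete(p);
    }
}
```

Wrapping the stream in `BufferedInputStream` and/or passing a multi-KB array to `MessageDigest.update` avoids paying the per-call overhead once per kilobyte of a multi-gigabyte file.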
To whom does it occur (all users, curators, superusers)?
Anyone performing an upload.
What did you expect to happen?
I would expect the MD5 checksum calculation performance to be similar to the performance achieved by the md5sum Linux utility.
Which version of Dataverse are you using?
5.10.1
Any related open or closed issues to this bug report?
I don't believe so.