Checksum (MD5) calculation for local upload is slow #9166

@jgara

Description

What steps does it take to reproduce the issue?
Using the DVUploader, upload a large file (> 10GB) to Dataverse (local file-system, not S3).

  • When does this issue occur?
    Every time I upload a file to local storage.

  • Which page(s) does it occur on?
    I'm exclusively using DVUploader.

  • What happens?
    On our DV host, the file first lands in /tmp. It is then copied to /usr/local/dv-temp/temp, where it is unzipped into that same directory (we double-zip), so at that point there are three copies of the file on local storage. Next, iostat shows a long-running ~30MB/s read operation, which I believe corresponds to DV calculating an MD5 checksum. Running the md5sum Linux utility on the same file proceeds at 500MB/s, so there is decent available performance on our DV host. Looking at the code (https://github.com/IQSS/dataverse/blob/1435dcca970ee524ec32506f1d8d50c81026fe86/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java), it appears that the checksum is calculated 1KB at a time, without buffered IO, which could explain the suboptimal performance I am seeing.

  • To whom does it occur (all users, curators, superusers)?
    Anyone performing an upload.
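For comparison, here is a minimal sketch of what a buffered checksum pass could look like. This is not the actual FileUtil code; the class and method names (ChecksumSketch, md5Hex) and the 64KB buffer size are my own illustrative choices. The idea is simply to wrap the file stream in a BufferedInputStream and feed the digest in large chunks rather than 1KB unbuffered reads:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumSketch {

    // Hypothetical sketch: compute an MD5 hex digest by reading the file
    // in 64KB chunks through a BufferedInputStream, instead of 1KB
    // unbuffered reads.
    public static String md5Hex(String path)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(path), 64 * 1024)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Checksum a small temp file to demonstrate usage.
        java.nio.file.Path p = java.nio.file.Files.createTempFile("demo", ".bin");
        java.nio.file.Files.write(p, "hello".getBytes());
        System.out.println(md5Hex(p.toString()));
    }
}
```

With a chunk size in that range, throughput should be limited by the disk and the digest implementation rather than per-read overhead, which is roughly what the md5sum utility achieves.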

  • What did you expect to happen?
    I would expect the MD5 checksum calculation performance to be similar to the performance achieved by the md5sum Linux utility.

Which version of Dataverse are you using?
5.10.1

Any related open or closed issues to this bug report?
I don't believe so.
