
DCM S3 #4703

@pameyer

Description


The DCM (Data Capture Module, for big data upload) currently integrates with Dataverse under the assumption of POSIX storage (and ssh+rsync data transfer); however, using object stores (AWS S3, OpenStack Swift, etc.) is becoming more common.

Open technical design questions:

  • S3 (and Swift?) are sometimes considered transfer protocols in addition to storage protocols. Should an S3 DCM support these as data transfer protocols, or only as storage?
  • The DCM design assumes that calculating client-side checksums without direct user intervention is essential to "big data depositions", and that these checksums should be propagated from deposition through publication (and data file replication, etc.). How do other disciplines view this trade-off (implementation complexity vs. data integrity)?
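To make the checksum assumption above concrete, a minimal sketch of a client-side checksum helper (function name and defaults are illustrative, not part of the DCM):

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a checksum of a local file in streaming fashion, so
    large data files never need to fit in memory. The algorithm name
    is whatever hashlib supports (md5, sha256, ...)."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # read fixed-size chunks until EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Running something like this on the client before transfer is what lets the same digest travel with the file from deposition to publication, independent of whether the backing store is POSIX or an object store.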

The lowest-complexity way of implementing this would be a second DCM implementation (or a configuration option for the current DCM), changing only how data files are transferred from the temporary upload location to Dataverse-accessible storage (i.e., an internal copy from temporary POSIX storage to the DV S3/Swift dataset buckets), while keeping the existing ssh+rsync transfer and client-side checksums. A moderately higher-complexity approach would involve changes to the existing client-side checksum and data transfer mechanisms to support non-Unix operating systems.
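The lowest-complexity option above only swaps the final copy step. A rough sketch of that step, with the destination abstracted behind a callable so the same walk works for the existing POSIX copy or an S3 upload (function and parameter names are hypothetical, not existing DCM code):

```python
import os

def transfer_dataset(tmp_dir, upload):
    """Walk the DCM temporary upload directory and hand each file to an
    upload(local_path, relative_key) callable. For the existing POSIX
    path this could wrap shutil.copy; for S3 it could wrap boto3's
    s3.upload_file. Returns the list of keys transferred."""
    keys = []
    for root, _dirs, files in os.walk(tmp_dir):
        for name in files:
            local = os.path.join(root, name)
            # key preserves the dataset-relative layout inside the bucket
            key = os.path.relpath(local, tmp_dir)
            upload(local, key)
            keys.append(key)
    return keys
```

With boto3, the callable might be something like `lambda path, key: s3.upload_file(path, bucket, key)` (bucket name per dataset, as suggested above); everything upstream of this step, including ssh+rsync and client-side checksums, would stay unchanged.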
