Description
The DCM (data capture module, for big data upload) integration with Dataverse currently assumes POSIX storage (and ssh+rsync data transfer); however, object stores (AWS S3, OpenStack Swift, etc.) are becoming more common.
Open technical design questions:
- S3 (and Swift?) are sometimes considered transfer protocols in addition to storage protocols. Should an S3 DCM support these as data transfer protocols, or only as storage?
- DCM design assumes that having client side checksums calculated without direct user intervention is essential to "big data depositions", and that these checksums should be propagated from deposition to publication (and data file replication, etc). How do other disciplines view this trade-off (implementation complexity vs data integrity)?
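The "client-side checksums without direct user intervention" assumption above amounts to hashing each file locally before upload so the value can travel with the deposition through publication. A minimal sketch of that piece (the function name and chunked-read approach are illustrative, not the DCM's actual code):

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a checksum of a local file in fixed-size chunks,
    so large deposits never need to be held in memory at once.
    The resulting hex digest is what would be propagated from
    deposition through to publication and replication."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The implementation-complexity cost in the trade-off above comes less from this hashing step than from arranging for it to run on the depositor's machine automatically, across operating systems.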
The lowest-complexity way of implementing this would be a second DCM implementation (or a configuration option for the current DCM), changing only how data files are transferred from the temporary upload location to the Dataverse-accessible storage (i.e., an internal copy from temporary POSIX storage to DV S3/Swift dataset buckets) while keeping the existing ssh+rsync transfer and client-side checksums. A moderately higher-complexity option would involve changes to the existing approach for client-side checksums and data transfers to support non-Unix OS systems.
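The lowest-complexity option reduces to a walk of the temporary upload directory followed by a per-file upload call. A sketch under stated assumptions: the function and key layout here are hypothetical, and the injected `upload_fn` stands in for a real object-store client call (e.g., boto3's `S3.Client.upload_file` for S3), so the same copy logic could target S3, Swift, or a test double:

```python
import os

def transfer_to_object_store(upload_fn, temp_dir, bucket, prefix=""):
    """Copy every file under the DCM temporary upload directory
    into an object store by calling upload_fn(local_path, bucket, key).
    Keys mirror the relative layout of temp_dir under an optional
    prefix (e.g., a dataset identifier). Returns the keys written."""
    keys = []
    for root, _dirs, files in os.walk(temp_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            key = os.path.join(prefix, os.path.relpath(local_path, temp_dir))
            upload_fn(local_path, bucket, key)
            keys.append(key)
    return keys
```

Because the client-side checksums were already computed before this step, the internal copy only has to preserve them as metadata alongside each object, which is why this option leaves the existing ssh+rsync path untouched.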