[CDV] Filenames in underlying storage should be human readable

https://github.com/IQSS/dataverse/issues/2909#issuecomment-298446655 affirmed that in Cloud Dataverse a filename in the underlying storage (Swift) would be a "filesystem name", which is unique, but also not human-readable.

At the time, we were satisfied with non-pretty names in Swift for three reasons: Swift lacks a true rename operation, there were worries about uniqueness, and the Dataverse download API preserves meaningful filenames anyway.

Two specific scenarios where pretty filenames are wanted/needed:

- BigData compute task globbing across multiple files, e.g. input to a Spark job is swift://container/*.csv
- Ben Lewis has pointed out the value of optimal/detailed naming in the GeoMesa context; we need an easy way to identify time chunks, so a meaningful file listing would be helpful
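
To make the globbing scenario concrete, here is a minimal sketch of why a pattern like `*.csv` finds nothing when objects carry opaque storage names. Both listings and the `glob_objects` helper are hypothetical, invented for illustration; they are not real CDV object names or part of any Dataverse or Swift API. Python's `fnmatch` only approximates Hadoop-style glob semantics.

```python
from fnmatch import fnmatchcase

# Hypothetical listings of one Swift container (names invented for illustration).
# Opaque, unique "filesystem names" as currently stored:
opaque_listing = [
    "15c2a8f3b7e-9d41a0c6e2f1",
    "15c2a8f3b80-1b7e44d09a3c",
    "15c2a8f3b91-77aa02c4d5e6",
]
# The same container if human-readable filenames were kept:
readable_listing = [
    "trips_2017-01.csv",
    "trips_2017-02.csv",
    "README.txt",
]


def glob_objects(names, pattern):
    """Return the object names matching a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# With opaque names, the pattern has no extension to match against.
print(glob_objects(opaque_listing, "*.csv"))
# With readable names, the same pattern selects only the raw data files.
print(glob_objects(readable_listing, "*.csv"))
```

With readable names, the same pattern also excludes bundled non-data files such as the hypothetical `README.txt`.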

More generally, the relevant scenarios can be summarized as _any time someone or some service uses the Swift API to download files_. We are currently dreaming up more applications (besides Hadoop/Spark via Sahara) which would prefetch files from the Swift endpoint for the user to play with using compute. In the current state of CDV, the user wouldn't be able to tell what's going on, since they would receive a whole bunch of randomly named files (anything bundled with the dataset, not just raw data) with no way to tell what's what.

It is worth noting that these concerns are especially relevant for larger datasets, where direct access through the Swift API instead of the Dataverse API is crucial.

This discussion also ties into a larger discussion about how dataset versioning is reflected on the Swift side of things.



[CDV] Filenames in underlying storage should be human readable #4041
