[CDV] Filenames in underlying storage should be human readable #4041

@jeremyfreudberg

#2909 (comment) affirmed that in Cloud Dataverse a filename in the underlying storage (Swift) would be a "filesystem name", which is unique, but also not human-readable.

The lack of a true rename operation in Swift, worries about uniqueness, and the fact that the Dataverse download API preserves meaningful filenames anyway meant that at the time we were satisfied with the solution of non-pretty names in Swift.

Two specific scenarios where pretty filenames are wanted/needed:

  • BigData compute task globbing across multiple files, e.g. input to a Spark job is swift://container/*.csv
  • Ben Lewis has pointed out the value of optimal/detailed naming in the GeoMesa context; we need an easy way to identify time chunks, so a meaningful file listing would be helpful
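To make the globbing scenario concrete, here is a minimal sketch in Python of why opaque storage names defeat a pattern like *.csv while human-readable names do not. Everything in it is an assumption for illustration: the listing, the opaque names, the labels, and the helpers readable_names and glob_names are hypothetical and are not part of Dataverse or Swift. The "-1" suffix used to keep names unique is just one possible policy, not a proposal for how CDV should do it.

```python
from fnmatch import fnmatchcase

# Hypothetical container listing: (opaque storage name, original file label).
# Both columns are made up for illustration.
LISTING = [
    ("15e1a2b3c4d-9f8e7d6c5b4a", "trips_2017-01.csv"),
    ("15e1a2b3c4e-0a1b2c3d4e5f", "trips_2017-02.csv"),
    ("15e1a2b3c4f-1b2c3d4e5f6a", "README.txt"),
    ("15e1a2b3c50-2c3d4e5f6a7b", "trips_2017-01.csv"),  # duplicate label
]


def readable_names(listing):
    """Map each opaque storage name to a unique human-readable name.

    A label that is already taken gets a numeric suffix before its
    extension (an assumed policy, chosen only to keep names unique).
    """
    used = set()
    result = {}
    for storage_name, label in listing:
        stem, dot, ext = label.rpartition(".")
        if not dot:
            # No extension: rpartition puts the whole label in ext.
            stem, ext = label, ""
        candidate, n = label, 1
        while candidate in used:
            candidate = f"{stem}-{n}{dot}{ext}"
            n += 1
        used.add(candidate)
        result[storage_name] = candidate
    return result


def glob_names(names, pattern):
    """Return the names matching a shell-style pattern, sorted."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


names = readable_names(LISTING)

# Globbing the opaque names finds nothing useful.
print(glob_names(names.keys(), "*.csv"))

# Globbing the readable names selects only the CSV files.
print(glob_names(names.values(), "*.csv"))
```

The same mapping would also give a client listing the container a way to tell raw data from the other files bundled with the dataset.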

More generally, the relevant scenarios can be summarized as any time someone or some service uses the Swift API to download files. We are currently dreaming up more applications (besides Hadoop/Spark via Sahara) that would prefetch files from the Swift endpoint for the user to play with using compute. In the current state of CDV, the user wouldn't be able to tell what's going on, since they would receive a whole bunch of randomly named files (anything bundled with the dataset, not just raw data) with no way to tell what's what.

Worth noting that these concerns are especially relevant for larger datasets, where direct access through the Swift API instead of the Dataverse API is crucial.

This discussion also ties into a larger discussion about how dataset versioning is reflected on the Swift side of things.
