Storing very large image datasets on S3 #623

@AndreiBarsan

Problem description

Hi Zarr Team!

We are interested in storing a large ML dataset of images on S3. The dataset will be over 30 TB, likely at least 50 TB. It consists mostly of images, which must be stored compressed (WebP) to save space. We can't use a regular N x H x W array, since storing decoded pixels would make the dataset an order of magnitude larger. The workload is mostly ML training, so images will be read in random order most of the time.

We are particularly interested in Zarr's ability to read parts of a dataset directly from S3, which, as I understand it, is non-trivial with other formats such as HDF5.

Because each image compresses to a different number of bytes, we end up with a ragged array.
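For illustration, here is a minimal sketch of the ragged layout described above: variable-length encoded images packed into one flat byte buffer plus an offsets table. This is plain Python, not Zarr's API. The names `pack_ragged` and `read_sample` are hypothetical, and the byte strings are stand-ins for real WebP payloads.

```python
from itertools import accumulate


def pack_ragged(blobs):
    """Concatenate variable-length byte strings and record where each one starts."""
    # offsets[i] is the start of blob i; offsets[-1] is the total length.
    offsets = [0, *accumulate(len(b) for b in blobs)]
    return b"".join(blobs), offsets


def read_sample(data, offsets, i):
    """Return the i-th blob by slicing between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]]


# Stand-ins for WebP-encoded images of different sizes (not real image data).
encoded = [b"\x01" * 5, b"\x02" * 12, b"\x03" * 7]
data, offsets = pack_ragged(encoded)
```

The offsets table is what makes random access possible: each row is located by two integers rather than by a fixed stride.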

I have a couple of questions about this setup:

  1. Is it possible to index this dataset with pretty paths, as an HDF5 file would allow (basically one array per sample), or is the only option one giant ragged array whose rows can't be accessed as paths?
  2. Does Zarr always store chunks as individual files (or S3 objects), or is a chunk just a conceptual unit used when reading data? I ask because on S3 we'd like users to be able to read individual samples (ML training, so random access) without downloading unnecessary data, while also avoiding millions of objects in the bucket for storage-cost and performance reasons (S3 best practices discourage large numbers of small files). Can Zarr do partial reads from within a chunk?
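To make the trade-off in question 2 concrete, here is a rough sketch of the access pattern we have in mind: many samples packed into a few large objects, with an index that lets a reader fetch one sample by byte range. It does not use Zarr or any S3 client. The dict `store` stands in for a bucket, `ranged_get` stands in for an HTTP GET with a `Range: bytes=start-end` header, and all function names are hypothetical.

```python
def build_shards(blobs, samples_per_shard):
    """Pack samples into a few large objects and index each sample's byte range."""
    store = {}   # object key -> bytes; stand-in for an S3 bucket
    index = []   # sample i -> (object key, start byte, length)
    for first in range(0, len(blobs), samples_per_shard):
        key = f"shard-{first // samples_per_shard:05d}"
        group = blobs[first:first + samples_per_shard]
        start = 0
        for blob in group:
            index.append((key, start, len(blob)))
            start += len(blob)
        store[key] = b"".join(group)
    return store, index


def ranged_get(store, key, start, length):
    """Stand-in for a ranged read of one object (only the requested bytes are returned)."""
    return store[key][start:start + length]


def read_sample(store, index, i):
    """Fetch a single sample without reading the rest of its shard."""
    key, start, length = index[i]
    return ranged_get(store, key, start, length)


# Ten fake samples of lengths 1..10 bytes, four samples per object.
blobs = [bytes([i]) * (i + 1) for i in range(10)]
store, index = build_shards(blobs, samples_per_shard=4)
```

With this layout the number of objects is the sample count divided by `samples_per_shard` (rounded up), while each random read still transfers only one sample's bytes.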

I am using the latest version of Zarr, v2.4.0.

Thank you, and please let me know if I can help by providing additional information.
