Consolidated metadata was a new feature in Zarr v2.3, which was released over two years ago (March 22, 2019).
Since then, I have used consolidated=True every time I've written or opened a Zarr store. As far as I can tell, this is almost always a good idea:
- With local storage, it usually doesn't really matter. You spend a bit of time writing the consolidated metadata and have one extra file on disk, but the overhead is typically negligible.
- With cloud object stores or network filesystems, it can matter a great deal. Without consolidated metadata, these systems can be unusably slow for opening datasets. Cloud storage is of course the main use case for Zarr. If you're using a local disk, you might as well stick with single files such as netCDF.
I wonder if consolidated metadata is mature enough now that we could consider switching the default behavior in Xarray. From my perspective, this is a big "gotcha" for getting good performance with Zarr. More than one of my colleagues has been unimpressed with the performance of Zarr until they learned to set consolidated=True.
I would suggest doing this in a way that is almost entirely backwards compatible, with only a minor performance cost for reading non-consolidated datasets:
- to_zarr() switches the default to consolidated=True. consolidate_metadata() will thus be called by default.
- open_zarr() switches the default to consolidated=None, which means "Try reading consolidated metadata, and fall back to non-consolidated if that fails." This will be slightly slower for non-consolidated metadata due to the extra file lookup, but given that opening with non-consolidated metadata already requires a moderately large number of file lookups, I doubt anyone will notice the difference.
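To make the cost argument concrete, here is a rough sketch of the proposed consolidated=None read path. It is not how Xarray or Zarr actually implement this: CountingStore, make_store and open_metadata are hypothetical helpers, the store is a plain in-memory dict, and directory listing is assumed to cost a single request. Only the key names (.zmetadata, .zgroup, .zarray, .zattrs) follow the Zarr v2 layout.

```python
import json


class CountingStore(dict):
    """Hypothetical dict-backed key-value store that counts read requests."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        # Count every lookup, including ones that fail with KeyError.
        self.reads += 1
        return super().__getitem__(key)


def make_store(variables, consolidated):
    """Build a toy store with per-variable metadata keys (illustrative only)."""
    meta = {".zgroup": {"zarr_format": 2}}
    for name in variables:
        meta[f"{name}/.zarray"] = {"shape": [10]}
        meta[f"{name}/.zattrs"] = {}
    store = CountingStore()
    for key, value in meta.items():
        store[key] = json.dumps(value)
    if consolidated:
        # One extra key holding a copy of all metadata.
        store[".zmetadata"] = json.dumps({"metadata": meta})
    return store


def open_metadata(store, consolidated=None):
    """Read metadata; consolidated=None means try consolidated, then fall back."""
    if consolidated is None or consolidated:
        try:
            return json.loads(store[".zmetadata"])["metadata"]
        except KeyError:
            if consolidated:
                raise  # consolidated=True: fail loudly, no fallback
    # Non-consolidated path: one listing request (assumed), then one read per key.
    store.reads += 1
    meta = {}
    for key in list(store.keys()):
        if key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs"):
            meta[key] = json.loads(store[key])
    return meta


names = ["a", "b", "c", "d", "e"]

plain = make_store(names, consolidated=False)
open_metadata(plain, consolidated=False)
print("non-consolidated store, consolidated=False:", plain.reads)

fallback = make_store(names, consolidated=False)
open_metadata(fallback, consolidated=None)
print("non-consolidated store, consolidated=None: ", fallback.reads)

fast = make_store(names, consolidated=True)
open_metadata(fast, consolidated=None)
print("consolidated store, consolidated=None:     ", fast.reads)
```

In this toy model the fallback adds exactly one failed lookup on top of a read count that already grows with the number of variables, while a consolidated store needs a single read regardless of size. Real request counts depend on the store and on the Zarr implementation.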
CC @rabernat