The topic of the output format, already discussed in #60, came up again during the eko presentation.
The request (@Zaharid) was for a more standard format and, at the same time, for splitting the metadata from the actual data.
We would like to satisfy the first request (yaml was chosen precisely because it is broadly supported), and we have nothing against the second.
Nevertheless, when you combine these requests with our strict requirements, i.e.:
- we want to store a multidimensional array (rank 4 or rank 5)
- we want to store it in as minimal a way as possible

the range of options becomes particularly restrictive.
The proposal was to use a broadly supported format such as Apache Parquet, which is very common in the big data community.
This and the other database-derived formats are not suitable for our task, since they are optimized for tabular data and are thus intrinsically two-dimensional. Moreover, some of the key features of Parquet are being appendable, readable in chunks, and columnar, and we would not benefit from any of them.
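To make the dimensionality mismatch concrete, here is a small sketch with a rank-4 numpy array. The shape and the name `operator` are made up for illustration and are not eko's actual layout. Fitting the array into a two-dimensional table means flattening it, and the original shape then has to be carried along separately in order to restore it.

```python
import numpy as np

# Hypothetical rank-4 array; the shape is illustrative only.
shape = (2, 3, 2, 3)
operator = np.arange(np.prod(shape), dtype=float).reshape(shape)

# A tabular format can hold at most a 2D view, e.g. one row per pair
# of leading indices and one column per pair of trailing indices.
table = operator.reshape(shape[0] * shape[1], shape[2] * shape[3])

# The table alone no longer knows the original rank: the shape has to
# be stored elsewhere and applied again on reading.
restored = table.reshape(shape)
assert np.array_equal(restored, operator)
```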
The available formats for multidimensional data that are broadly supported by the community (especially in science) are:
- NetCDF, which is a general format but has an especially good Python library for managing its in-memory counterpart (i.e. xarray, closely connected to numpy and inspired by pandas)
- HDF5, on which the former is based, with its own Python API

The first one is more specialized and preferable in general, but we don't need it either: it supports a great many features, while our goal is just to store a bare array of floats.
That's why our proposal is simply to use the `.npy` format from the numpy library, which has a very simple Python API (i.e. the `numpy.save` function and its partner `numpy.load`), and to compress it ourselves (using lz4, as is done for pineappl grids).
There is also an implementation of the API in C++ (or rather, a couple of them), consisting of a very small codebase.
Many languages can interface directly with Python (like Julia), and some explicitly support numpy through their own libraries (like `ndarray`), so we would back the numpy solution: it is very flexible and, at the same time, the minimal thing required.
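A minimal sketch of the proposed round trip, assuming the `lz4` Python package (its `lz4.frame` module) is what would be used for compression. The helper names `dump_array` and `load_array`, the array shape, and the fallback to the standard-library `zlib` when `lz4` is not installed are all illustrative assumptions, not part of eko.

```python
import io

import numpy as np

try:
    import lz4.frame as codec  # assumed compressor, as proposed above
except ImportError:
    import zlib as codec  # stand-in so the sketch still runs without lz4


def dump_array(array: np.ndarray) -> bytes:
    """Serialize an array to .npy bytes and compress them."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    return codec.compress(buffer.getvalue())


def load_array(blob: bytes) -> np.ndarray:
    """Decompress the bytes and read the .npy content back."""
    buffer = io.BytesIO(codec.decompress(blob))
    return np.load(buffer, allow_pickle=False)


# Hypothetical rank-4 array of floats.
operator = np.random.default_rng(0).random((2, 5, 2, 5))
blob = dump_array(operator)
restored = load_array(blob)
assert restored.shape == operator.shape
assert np.array_equal(restored, operator)
```

The `.npy` header already records the dtype and the shape, so no extra bookkeeping is needed to recover the rank of the array on loading.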