Output format revamped

The topic of the output format, already discussed in #60, came up again during the eko presentation.

The request was to have a more standard format, and at the same time to split the metadata from the actual data (@zaharid).

We would like to accomplish the first one (the choice of yaml was meant to provide a broadly supported format), and we don't dislike the second.
Nevertheless, when you combine these requests with our strict requirements, i.e.:
- we want to store a multidimensional array (rank 4 or rank 5)
- we want to store it in an as-minimal-as-possible way

the range of options becomes particularly restrictive.
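As a rough illustration of how the two requests could fit together, here is a minimal sketch that keeps the metadata in a small text file and the bare array in a separate `.npy` file. The file names (`metadata.json`, `operator.npy`), the helper names, the axis labels and the array shape are all hypothetical, and JSON is used only to keep the sketch within the standard library; it is not a statement about the final layout.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def write_output(folder, metadata, operator):
    """Write metadata and array side by side (hypothetical layout)."""
    folder = Path(folder)
    (folder / "metadata.json").write_text(json.dumps(metadata, indent=2))
    np.save(folder / "operator.npy", operator, allow_pickle=False)


def read_output(folder):
    """Read back the pair written by `write_output`."""
    folder = Path(folder)
    metadata = json.loads((folder / "metadata.json").read_text())
    operator = np.load(folder / "operator.npy", allow_pickle=False)
    return metadata, operator


# A rank-5 array of floats with made-up dimensions.
operator = np.random.default_rng(0).random((2, 3, 4, 3, 4))
metadata = {
    "axes": ["axis_0", "axis_1", "axis_2", "axis_3", "axis_4"],  # placeholder labels
    "shape": list(operator.shape),
}

with tempfile.TemporaryDirectory() as tmp:
    write_output(tmp, metadata, operator)
    loaded_metadata, loaded_operator = read_output(tmp)
```

The metadata stays human-readable, while the array file contains nothing but the header and the raw floats.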

The proposal was to use a broadly supported format like [Apache Parquet](https://parquet.apache.org/), which is very common in the big data community.
Parquet and the other database-derived formats are not suitable for our task, since they are optimized for tabular data and are therefore intrinsically two-dimensional. Moreover, some of the key features of Parquet are being appendable, readable in chunks, and columnar, and we would not benefit from any of them.
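To make the mismatch concrete, the sketch below flattens a small rank-4 array into the "long" table a tabular format would need: one row per element, with one index column per dimension plus the value. The shape is arbitrary and chosen only for illustration.

```python
import numpy as np

# A small rank-4 array of floats (made-up dimensions).
arr = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

# One index column per dimension, one row per element (C order).
index_columns = np.indices(arr.shape).reshape(arr.ndim, -1).T

# The tabular representation: 4 index columns + 1 value column.
table = np.column_stack([index_columns, arr.ravel()])

# Going back requires re-interpreting the index columns.
rebuilt = np.empty(arr.shape)
idx = table[:, :-1].astype(int)
rebuilt[tuple(idx.T)] = table[:, -1]
```

Here the table holds five numbers per stored value, four of which only encode the position that a multidimensional format keeps implicitly in the shape.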

The available formats for multidimensional data that are broadly supported by the community (especially in science) are:
- [NetCDF](https://www.unidata.ucar.edu/software/netcdf/), which is a general format but has an especially good Python library for managing the in-memory counterpart (i.e. [xarray](http://xarray.pydata.org/en/stable/), closely connected to numpy and inspired by pandas)
- [HDF5](https://www.hdfgroup.org/), on which the former is based, with its own [Python API](http://www.h5py.org/)

The first one is more specialized and preferable in general, but we don't need it either: it supports many features, while our goal is just to store a bare array of floats.

That's why our proposal is simply to use the [`.npy` format](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format) from the numpy library, which has a very simple API in Python (i.e. the [`numpy.save`](https://numpy.org/doc/stable/reference/generated/numpy.save.html#numpy.save) function and its partner [`numpy.load`](https://numpy.org/doc/stable/reference/generated/numpy.load.html#numpy.load)), and to compress it ourselves (using lz4, as is done for pineappl grids).
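A minimal sketch of this proposal could look as follows, assuming the `lz4` Python package is available and using its `lz4.frame.compress` / `lz4.frame.decompress` functions. The helper names `dump_array` and `load_array` are hypothetical, and the `gzip` fallback is there only so the sketch still runs where `lz4` is not installed.

```python
import io

import numpy as np

try:
    import lz4.frame as codec  # assumption: the `lz4` package is installed
except ImportError:
    import gzip as codec  # stdlib stand-in with the same compress/decompress calls


def dump_array(arr):
    """Serialize an array to `.npy` bytes and compress them."""
    buffer = io.BytesIO()
    np.save(buffer, arr, allow_pickle=False)
    return codec.compress(buffer.getvalue())


def load_array(blob):
    """Decompress the bytes and parse them back as `.npy`."""
    return np.load(io.BytesIO(codec.decompress(blob)), allow_pickle=False)


# Round trip on a rank-5 array of floats (made-up dimensions).
operator = np.random.default_rng(0).random((2, 3, 4, 3, 4))
blob = dump_array(operator)
restored = load_array(blob)
```

The compressed `blob` can then be written to disk as-is; shape and dtype travel inside the `.npy` header, so nothing else is needed to reconstruct the array.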

There is also an implementation of an [API in C++](https://github.com/llohse/libnpy) (or rather a [couple of them](https://github.com/rogersce/cnpy)), each consisting of a very small codebase.
Many languages can interface directly with Python (like Julia), and some explicitly support numpy with their own libraries (like [`ndarray`](https://crates.io/crates/ndarray)), so we favor the numpy solution: it is very flexible and, at the same time, the minimal thing required.
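To give an idea of why readers in other languages can stay so small, here is a sketch of a hand-written `.npy` parser that follows the documented layout (magic string, version bytes, header length, a Python-literal header, then the raw data). The function name `read_npy_by_hand` is hypothetical, and the sketch only handles plain numeric arrays, not object arrays or structured dtypes.

```python
import ast
import io
import struct

import numpy as np


def read_npy_by_hand(raw):
    """Parse `.npy` bytes without calling `numpy.load` (illustrative only)."""
    if raw[:6] != b"\x93NUMPY":
        raise ValueError("not a .npy payload")
    major = raw[6]
    if major == 1:
        # Version 1.0 stores the header length as a 2-byte little-endian int.
        (header_len,) = struct.unpack("<H", raw[8:10])
        start = 10
    else:
        # Later versions use a 4-byte little-endian int.
        (header_len,) = struct.unpack("<I", raw[8:12])
        start = 12
    header_text = raw[start:start + header_len].decode("latin1").strip()
    header = ast.literal_eval(header_text)
    data = np.frombuffer(
        raw, dtype=np.dtype(header["descr"]), offset=start + header_len
    )
    order = "F" if header["fortran_order"] else "C"
    return data.reshape(header["shape"], order=order)


# Compare against what numpy itself writes.
original = np.arange(24, dtype=float).reshape(2, 3, 4)
buffer = io.BytesIO()
np.save(buffer, original)
parsed = read_npy_by_hand(buffer.getvalue())
```

Everything beyond the short header is the array's raw memory, which is what keeps third-party implementations compact.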

Output format revamped #76
