Try out the parquet format for some tables #449

@Zaharid

We currently write everything marked with @table to tab-delimited CSV files. This has a number of problems:

  • The format does not contain enough information to restore the table by itself. Instead we need to write inconvenient and buggy code in tableloader.py.
  • Parsing CSV is very slow compared to sensible binary formats.
  • The files are huge compared to sensibly compressed binary formats.
  • There is no built in functionality to store or load metadata, so we cannot easily implement sensible checks without loading the whole file.

These things make it very inconvenient to work with theory covariance matrices, for example.

CSV does have the advantage that you can open it with any text editor or spreadsheet software, but that is not such a frequent use case.

We should instead start depending on binary formats. The one that seems best for our needs is parquet: it is supported directly by pandas, as well as by a number of other tools, and it has answers to all of the problems above. It is probably what we are going to use for fktables as well (see #404).

At minimum we would need something similar to reportengine.table but writing parquet files. How that should interact with the existing table decorator is unclear to me. One option would be to just write parquet always, but that would break a number of things (such as the as analysis code) unless some compatibility layer was added. In any case I do think that the covariance matrix files should be smaller and faster (and smarter).