We need an efficient way to serialize Pandas DataFrames. As of #606 we can now customize serialization beyond just pickle. There are a number of different data regimes here, including pure numeric data, highly compressible numeric data like time series, text data with repeats, categoricals, long text data, etc. We care both about fast encoding and about fast compression for larger results. We also want something with minimal overhead on small dataframes (important for shuffling).
Several options come to mind:
- Pickle + lz4: what we use now. Actually not that bad: pickle deduplicates repeated text, handles numeric data without much overhead, etc. It's a shame that we can't compress each type separately. See the first sketch after this list.
- NumPy serialization + blosc on numeric blocks + pickle or msgpack + lz4 on text: similar to what we do in partd. A custom approach like this can get significant speed boosts in some cases (second sketch below).
- Arrow: we could give Arrow a shot. I suspect that it handles encoding intelligently. It would be nice if it provided different compressions per column, but we could probably push a bit upstream if this went into use (third sketch below).
- Other thoughts?
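For concreteness, here is a minimal sketch of the first option, assuming the `lz4` package's frame API; `serialize`/`deserialize` are illustrative names, not an existing interface:

```python
# Minimal sketch of the current approach: pickle the whole frame, then
# compress the single resulting byte string with lz4.
import pickle

import lz4.frame  # assumes the `lz4` package with its frame API
import pandas as pd


def serialize(df: pd.DataFrame) -> bytes:
    return lz4.frame.compress(pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL))


def deserialize(payload: bytes) -> pd.DataFrame:
    return pickle.loads(lz4.frame.decompress(payload))
```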
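And a hypothetical sketch of the second option, splitting columns by dtype so numeric data goes through blosc while everything else goes through pickle + lz4. This is not partd's actual code, just the shape of the idea (assumes the `python-blosc` package):

```python
# Hypothetical per-type scheme: blosc for numeric columns, pickle + lz4 for
# the rest. Column names and the index ride along so we can reassemble.
import pickle

import blosc  # assumes python-blosc; pack_array round-trips NumPy arrays
import lz4.frame
import pandas as pd


def serialize(df: pd.DataFrame) -> dict:
    numeric = df.select_dtypes(include="number")
    other = df.drop(columns=numeric.columns)
    return {
        "columns": pickle.dumps(df.columns),
        "index": pickle.dumps(df.index),
        "numeric": {c: blosc.pack_array(numeric[c].to_numpy()) for c in numeric.columns},
        "other": lz4.frame.compress(pickle.dumps(other)),
    }


def deserialize(payload: dict) -> pd.DataFrame:
    index = pickle.loads(payload["index"])
    numeric = pd.DataFrame(
        {c: blosc.unpack_array(b) for c, b in payload["numeric"].items()}, index=index
    )
    other = pickle.loads(lz4.frame.decompress(payload["other"]))
    # Restore the original column order.
    return pd.concat([numeric, other], axis=1)[pickle.loads(payload["columns"])]
```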
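The Arrow option could look something like the following, assuming pyarrow's IPC stream format; per-column compression is not shown since it would need the upstream support mentioned above:

```python
# Sketch of an Arrow round trip via the IPC stream format.
import pandas as pd
import pyarrow as pa


def serialize(df: pd.DataFrame) -> bytes:
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def deserialize(payload: bytes) -> pd.DataFrame:
    return pa.ipc.open_stream(payload).read_all().to_pandas()
```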
It would be good to make this decision with a benchmark in hand (a rough harness follows below). Opinions on solutions we should consider are welcome, as are benchmarks representative of the data people care about.
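As a starting point, a harness along these lines could time any of the candidates above against the regimes mentioned at the top; `bench` and the sample frames are hypothetical, and the size report assumes `serialize` returns bytes:

```python
# Rough benchmark harness: time a serialize/deserialize pair on a sample
# frame and report per-call latency and payload size.
import time

import numpy as np
import pandas as pd


def bench(name, serialize, deserialize, df, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        payload = serialize(df)
    encode = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        deserialize(payload)
    decode = (time.perf_counter() - start) / repeats

    print(f"{name}: encode {encode:.4f}s, decode {decode:.4f}s, "
          f"{len(payload)} bytes")


# Representative regimes from the discussion above.
frames = {
    "numeric": pd.DataFrame(np.random.random((1_000_000, 4))),
    "timeseries": pd.DataFrame(
        {"x": np.cumsum(np.random.random(1_000_000))},
        index=pd.date_range("2000-01-01", periods=1_000_000, freq="s"),
    ),
    "text": pd.DataFrame({"s": np.random.choice(["apple", "banana"], 1_000_000)}),
}
```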