We need an efficient way to serialize Pandas DataFrames. As of #606 we can now customize serialization beyond just pickle. There are a number of different data regimes here, including pure numeric data, highly compressible numeric data like time series, text data with repeats, categoricals, long text data, etc. We care both about fast encoding and about fast compression for larger results. We also want something with minimal overhead on small dataframes (important for shuffling).
Several options come to mind:
- Pickle + lz4: what we use now. Actually not that bad: pickle deduplicates repeated text, handles numeric data without much overhead, etc. It's a shame that we can't compress each type separately. See the first sketch after this list.
- NumPy serialization + blosc on numeric blocks + pickle or msgpack + lz4 on text: similar to what we do in partd. A custom approach like this can get significant speed boosts in some cases (second sketch below).
- Arrow: we could give Arrow a shot. I suspect that it handles encoding intelligently. It would be nice if it provided different compressions per column, but we could probably push a bit upstream if this went into use (third sketch below).
- Other thoughts?
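For concreteness, here is a minimal sketch of the first option, assuming the `lz4` package's frame API; `serialize`/`deserialize` are illustrative names, not an existing interface:

```python
# Minimal sketch of the current approach: pickle the whole frame, then
# compress the single resulting byte string with lz4.
import pickle

import lz4.frame  # assumes the `lz4` package with its frame API
import pandas as pd


def serialize(df: pd.DataFrame) -> bytes:
    return lz4.frame.compress(pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL))


def deserialize(payload: bytes) -> pd.DataFrame:
    return pickle.loads(lz4.frame.decompress(payload))
```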
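And a hypothetical sketch of the second option, splitting columns by dtype so numeric data goes through blosc while everything else goes through pickle + lz4. This is not partd's actual code, just the shape of the idea (assumes the `python-blosc` package):

```python
# Hypothetical per-type scheme: blosc for numeric columns, pickle + lz4 for
# the rest. Column names and the index ride along so we can reassemble.
import pickle

import blosc  # assumes python-blosc; pack_array round-trips NumPy arrays
import lz4.frame
import pandas as pd


def serialize(df: pd.DataFrame) -> dict:
    numeric = df.select_dtypes(include="number")
    other = df.drop(columns=numeric.columns)
    return {
        "columns": pickle.dumps(df.columns),
        "index": pickle.dumps(df.index),
        "numeric": {c: blosc.pack_array(numeric[c].to_numpy()) for c in numeric.columns},
        "other": lz4.frame.compress(pickle.dumps(other)),
    }


def deserialize(payload: dict) -> pd.DataFrame:
    index = pickle.loads(payload["index"])
    numeric = pd.DataFrame(
        {c: blosc.unpack_array(b) for c, b in payload["numeric"].items()}, index=index
    )
    other = pickle.loads(lz4.frame.decompress(payload["other"]))
    # Restore the original column order.
    return pd.concat([numeric, other], axis=1)[pickle.loads(payload["columns"])]
```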
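The Arrow option could look something like the following, assuming pyarrow's IPC stream format; per-column compression is not shown since it would need the upstream support mentioned above:

```python
# Sketch of an Arrow round trip via the IPC stream format.
import pandas as pd
import pyarrow as pa


def serialize(df: pd.DataFrame) -> bytes:
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def deserialize(payload: bytes) -> pd.DataFrame:
    return pa.ipc.open_stream(payload).read_all().to_pandas()
```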
It would be good to make this decision with a benchmark in hand (a rough harness follows below). Opinions on solutions we should consider are welcome, as are benchmarks representative of the data people care about.
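As a starting point, a harness along these lines could time any of the candidates above against the regimes mentioned at the top; `bench` and the sample frames are hypothetical, and the size report assumes `serialize` returns bytes:

```python
# Rough benchmark harness: time a serialize/deserialize pair on a sample
# frame and report per-call latency and payload size.
import time

import numpy as np
import pandas as pd


def bench(name, serialize, deserialize, df, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        payload = serialize(df)
    encode = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        deserialize(payload)
    decode = (time.perf_counter() - start) / repeats

    print(f"{name}: encode {encode:.4f}s, decode {decode:.4f}s, "
          f"{len(payload)} bytes")


# Representative regimes from the discussion above.
frames = {
    "numeric": pd.DataFrame(np.random.random((1_000_000, 4))),
    "timeseries": pd.DataFrame(
        {"x": np.cumsum(np.random.random(1_000_000))},
        index=pd.date_range("2000-01-01", periods=1_000_000, freq="s"),
    ),
    "text": pd.DataFrame({"s": np.random.choice(["apple", "banana"], 1_000_000)}),
}
```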