ARROW-1784: [Python] Enable zero-copy serialization, deserialization of pandas.DataFrame via components #1390
Conversation
Change-Id: I7c2d71e10e8fb84c0606b62bbc537d5603b04766
Change-Id: I40d43b447d5336a2653c227cdbf6327121538ac0
Most importantly for consumers like Dask, whenever there is an internal block where a copy can be avoided, it is avoided. This prevents excess memory use on serialization (no additional copies) and on receive (no copies).
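A minimal sketch of the consumer-side pattern this enables. Hedged: it assumes the component APIs from ARROW-1783 (`pyarrow.serialize(...).to_components()` and `pyarrow.deserialize_components()`) and that the default serialization context handles DataFrames once this patch lands.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': np.arange(10, dtype=np.float64),
                   'b': np.arange(10, dtype=np.int64)})

# Decompose into metadata plus a list of pa.Buffer objects. A system
# like Dask can ship the buffers individually rather than
# materializing one big contiguous blob.
components = pa.serialize(df).to_components()

# On the receiving side, reconstruct; for numeric blocks this wraps
# the received buffers without copying.
result = pa.deserialize_components(components)
assert result.equals(df)
```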
Thank you for putting this together. I look forward to trying this out with Dask and seeing if it relieves the memory pressure we're seeing when sending dataframes. What does the current dev-build process look like? I think I read that you all had set up nightly builds on the twosigma channel?
This is to be expected, right?
That's surprisingly nice. Do you have a sense for what is going on here? 100ms in copying memory?
def make_datetimetz(tz):

def dataframe_to_serialized_dict(frame):
@jreback let me know if I missed anything on these functions
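For readers following along, here is a rough sketch (my own illustration, not the code in this PR) of the shape of the decomposition that `dataframe_to_serialized_dict` performs: each internal block becomes a dict of its ndarray plus placement, with categoricals split into codes and dictionary, matching the `item['block']` / `item['placement']` / `item['dictionary']` keys used on the read side below.

```python
import pandas as pd

def dataframe_to_serialized_dict_sketch(frame):
    # Walk the BlockManager (frame._data in pandas of this era) and
    # record each block's ndarray and column placement, so each block
    # can be serialized individually without consolidating or copying.
    blocks = []
    for block in frame._data.blocks:
        item = {'placement': block.mgr_locs.as_array}
        if pd.api.types.is_categorical_dtype(block.dtype):
            # Categoricals serialize as integer codes plus a dictionary.
            item['block'] = block.values.codes
            item['dictionary'] = block.values.categories
        else:
            item['block'] = block.values
        blocks.append(item)
    return {'index': frame.index, 'columns': frame.columns,
            'blocks': blocks}
```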
yes, as soon as this is merged, it should show up in the next nightly https://anaconda.org/twosigma/pyarrow/files. Though we are having a small problem with the version numbers in the nightlies (https://issues.apache.org/jira/browse/ARROW-1881) that needs to get fixed in the next day or two (cc @xhochy)
Yes, it's a nice confirmation that pandas definitely is not making any unexpected memory copies (it can be quite zealous about copying stuff)
Yes, I think this is strictly from copying the internal numeric ndarrays. The memory use versus pickle will also be lower by the total pickled footprint of the numeric arrays that are being copied.
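A quick, hedged way to reproduce this kind of comparison (numbers will vary by machine; assumes `pa.serialize` / `pa.deserialize` handle DataFrames via the default serialization context):

```python
import pickle
import time

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(np.random.randn(1000000, 10))

start = time.time()
buf = pa.serialize(df).to_buffer()
print('arrow serialize:   %.4fs' % (time.time() - start))

start = time.time()
pa.deserialize(buf)
print('arrow deserialize: %.4fs' % (time.time() - start))

start = time.time()
pickled = pickle.dumps(df)
print('pickle dumps:      %.4fs' % (time.time() - start))

start = time.time()
pickle.loads(pickled)
print('pickle loads:      %.4fs' % (time.time() - start))
```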
Shouldn't …
jreback left a comment
lgtm. Might want to test round-trip of Period and Interval as well; they are serialized as object currently (the Index types are extension dtypes, though).
In [62]: pd.DataFrame({'period': pd.period_range('2013', periods=3, freq='M'),
    ...:               'interval': pd.interval_range(1, 4)})
Out[62]:
  interval   period
0   (1, 2]  2013-01
1   (2, 3]  2013-02
2   (3, 4]  2013-03
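A sketch of the round-trip test being suggested. Hedged: since these columns are object dtype today, they should survive via the pickle fallback rather than a zero-copy path, and the `assert_frame_equal` import path is the pandas-0.20-era one.

```python
import pandas as pd
import pyarrow as pa
from pandas.util.testing import assert_frame_equal

df = pd.DataFrame({'period': pd.period_range('2013', periods=3, freq='M'),
                   'interval': pd.interval_range(1, 4)})

result = pa.deserialize(pa.serialize(df).to_buffer())
assert_frame_equal(result, df)
```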
python/pyarrow/pandas_compat.py
block_arr = item['block']
placement = item['placement']
if 'dictionary' in item:
    cat = pd.Categorical(block_arr,
should be .from_codes, as fastpath= is going to be deprecated soon
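Sketched out, the suggested change (with hypothetical codes/dictionary values standing in for the deserialized data):

```python
import numpy as np
import pandas as pd

codes = np.array([0, 1, 0, 2], dtype=np.int8)   # hypothetical serialized codes
dictionary = ['a', 'b', 'c']                    # hypothetical item['dictionary']

# Rebuild from integer codes directly instead of passing fastpath=True
# to the pd.Categorical constructor, which is slated for deprecation.
cat = pd.Categorical.from_codes(codes, categories=dictionary, ordered=False)
```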
@pitrou the internal conversion functions could / should be exposed in pandas |
…alization Change-Id: Idcc0172f2f0c5189f64ed28fc535e67d3d71009e
Done, and added docs. Will merge once the build passes.
Change-Id: I6e173a39d4c508382c383164ecf0cebabfcc6059
Seems there is some problem with the manylinux1 build, will dig in.
…pinned at 0.20.1 Change-Id: I55740c93b729b2f800834107cfe7b09c152d23a2


This patch adds a serialization path for pandas.DataFrame (and Series) that decomposes the internal BlockManager into a dictionary structure that can be serialized to the zero-copy component representation from ARROW-1783, and then reconstructed similarly.
The impact is that when a DataFrame contains no data that requires pickling, the reconstruction is zero-copy. I will post some benchmarks to illustrate this; the performance improvements are remarkable, nearly a 1000x speedup on a large DataFrame.
As follow-up work, we will need more efficient serialization of the different pandas Index types; we should create a new JIRA for this.
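To make the zero-copy condition concrete, a small hedged example: a numeric-only frame should take the fully zero-copy path, while any object column (strings here) still falls back to pickling that block.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Numeric blocks only: reconstruction can wrap the received buffers.
numeric = pd.DataFrame(np.random.randn(1000, 4), columns=list('abcd'))

# An object (string) column still goes through the pickle fallback,
# so this frame does not get the fully zero-copy reconstruction.
mixed = numeric.assign(label=['row%d' % i for i in range(1000)])

for df in (numeric, mixed):
    result = pa.deserialize(pa.serialize(df).to_buffer())
    assert result.equals(df)
```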