Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Dec 5, 2017

This patch adds a serialization path for pandas.DataFrame (and Series) that decomposes the internal BlockManager into a dictionary structure that can be serialized to the zero-copy component representation from ARROW-1783, and then reconstructed similarly.

The impact of this is that when a DataFrame has no data that requires pickling, the reconstruction is zero-copy. I will post some benchmarks to illustrate the impact of this. The performance improvements are pretty remarkable, nearly 1000x speedup on a large DataFrame.

As some follow-up work, we will need to do more efficient serialization of the different pandas Index types. We should create a new JIRA for this

wesm added 2 commits December 4, 2017 20:34
Change-Id: I7c2d71e10e8fb84c0606b62bbc537d5603b04766
Change-Id: I40d43b447d5336a2653c227cdbf6327121538ac0
@wesm
Copy link
Member Author

wesm commented Dec 5, 2017

Here's an example of a DataFrame that zero-copies. Serialization time goes from 200ms to 160 microseconds. Deserialization time from 56ms to about the same. This serialization code path is going to Arrow representation as an intermediary -- vanilla pickle is 126ms in, 60ms out.

serialize_no_strings

@wesm
Copy link
Member Author

wesm commented Dec 5, 2017

Here's the same thing with a bunch of strings.

  • serialize with Arrow table as intermediary: 1.64s in, 1.44s out
  • serialize using pickle: 623ms in, 489ms out
  • serialize using component method: 554ms in, 408ms out

serialize_with_strings

@wesm
Copy link
Member Author

wesm commented Dec 5, 2017

Most importantly for consumers like Dask, whenever there is an internal block where a copy can be avoided, it is avoided. This will avoid excess memory use on serialization (no additional copies) and extra memory use on receive (no copies)

@mrocklin
Copy link

mrocklin commented Dec 5, 2017

Thank you for putting this together. I look forward to trying this out with Dask and seeing if it relieves the memory pressure we're seeing when sending dataframes. What does the current dev-build process look like? I think I read that you all had set up nightly builds on the twosigma channel?

The impact of this is that when a DataFrame has no data that requires pickling, the reconstruction is zero-copy. I will post some benchmarks to illustrate the impact of this. The performance improvements are pretty remarkable, nearly 1000x speedup on a large DataFrame.

This is to be expected, right?

serialize with Arrow table as intermediary: 1.64s in, 1.44s out
serialize using pickle: 623ms in, 489ms out
serialize using component method: 554ms in, 408ms out

That's surprisingly nice. Do you have a sense for what is going on here? 100ms in copying memory?


def make_datetimetz(tz):

def dataframe_to_serialized_dict(frame):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback let me know if I missed anything on these functions

@wesm
Copy link
Member Author

wesm commented Dec 5, 2017

I think I read that you all had set up nightly builds on the twosigma channel?

yes, as soon as this is merged, it should show up in the next nightly https://anaconda.org/twosigma/pyarrow/files. Though we are having a small problem with the version numbers in the nightlies (https://issues.apache.org/jira/browse/ARROW-1881) that needs to get fixed in the next day or two (cc @xhochy)

This is to be expected, right?

Yes, it's a nice confirmation that pandas definitely is not making any unexpected memory copies (it can be quite zealous about copying stuff)

That's surprisingly nice. Do you have a sense for what is going on here? 100ms in copying memory?

Yes, I think this is strictly from copying the internal numeric ndarrays. The memory use vs. pickle will also be less by whatever the total pickled footprint of those numeric arrays that are being copied

@pitrou
Copy link
Member

pitrou commented Dec 5, 2017

Shouldn't dataframe_to_serialized_dict and serialized_dict_to_dataframe actually be exposed by Pandas? They seem generally useful (and touch internal details of dataframes).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. might want to test round-trip of Period and Intervals as well; they are serialized as object currently (the Index types are an extension dtype though).

In [62]: pd.DataFrame({'period': pd.period_range('2013', periods=3, freq='M'), 'interval': pd.interval_range(1, 4)})
Out[62]: 
  interval  period
0   (1, 2] 2013-01
1   (2, 3] 2013-02
2   (3, 4] 2013-03

block_arr = item['block']
placement = item['placement']
if 'dictionary' in item:
cat = pd.Categorical(block_arr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be .from_codes as going to deprecate fastpath= soon

@jreback
Copy link
Contributor

jreback commented Dec 5, 2017

@pitrou the internal conversion functions could / should be exposed in pandas
but should also live here until pyarrow drops support for < 0.22 (a while)

…alization

Change-Id: Idcc0172f2f0c5189f64ed28fc535e67d3d71009e
@wesm
Copy link
Member Author

wesm commented Dec 5, 2017

Done, and added docs. Will merge once the build passes

Change-Id: I6e173a39d4c508382c383164ecf0cebabfcc6059
@wesm
Copy link
Member Author

wesm commented Dec 6, 2017

Seems there is some problem with the manylinux1 build, will dig in

…pinned at 0.20.1

Change-Id: I55740c93b729b2f800834107cfe7b09c152d23a2
@wesm
Copy link
Member Author

wesm commented Dec 6, 2017

@jreback seems there is some pickling issue with IntervalIndex in pandas 0.20.x, I was that something changed or fixed in 0.21? See 21adbe7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants