Description
If a large array is sliced and pickled, it seems the full buffer is serialized. This leads to excessive memory usage and data transfer when using multiprocessing or dask.
>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
700004
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165
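As far as I understand, the slice is zero-copy: it keeps references to the parent's full buffers and only adjusts the offset and length, which would explain the pickle size. A quick check of the total buffer size behind the one-element slice (assuming Array.buffers() returns the shared, untrimmed buffers) illustrates this:
>>> sliced = ar.slice(10, 1)
>>> sum(buf.size for buf in sliced.buffers() if buf is not None)  # roughly ar.nbytes, not one element's worth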
With NumPy, for instance, pickling a slice only serializes the sliced data:
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> import pickle
>>> len(pickle.dumps(ar_np[10:11]))
165
I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
Is there a workaround for this? For instance, copying an Arrow array to get rid of the offset and trimming the buffers?
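A sketch of one possible workaround (my assumption, not an official compaction API): rebuild the array so its buffers contain only the sliced values before pickling, either by round-tripping through Python objects, or via pa.concat_arrays if that copies the data (which may depend on the pyarrow version).
>>> sliced = ar.slice(10, 1)
>>> compact = pa.array(sliced.to_pylist(), type=sliced.type)  # copies only the sliced values
>>> len(pickle.dumps(compact))  # should be small, comparable to the NumPy case
>>> compact2 = pa.concat_arrays([sliced])  # assumption: concatenation rebuilds trimmed, contiguous buffers
>>> len(pickle.dumps(compact2))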
Reporter: Maarten Breddels / @maartenbreddels
Assignee: Clark Zinzow
Related issues:
Note: This issue was originally created as ARROW-10739. Please see the migration documentation for further details.