
[Python] Pickling a sliced array serializes all the buffers #26685

@asfimport

Description


If a large array is sliced and pickled, the full underlying buffer is serialized, which leads to excessive memory usage and data transfer when using multiprocessing or dask.

>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
700004
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165

NumPy, by contrast, only serializes the sliced portion:
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> import pickle
>>> len(pickle.dumps(ar_np[10:11]))
165

I think this makes sense if you know Arrow internals, but it is unexpected from a user's perspective.

Is there a workaround for this? For instance, copying the Arrow array to get rid of the offset and trim the buffers?
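
One possible workaround, as a sketch: pa.concat_arrays copies its inputs into freshly allocated buffers, so passing the slice through it should yield a compact array whose pickle is roughly the size of the sliced data (the exact size here is an assumption, not verified output):

>>> import pyarrow as pa
>>> import pickle
>>> ar = pa.array(['foo'] * 100_000)
>>> sliced = ar.slice(10, 1)
>>> compact = pa.concat_arrays([sliced])   # copies into new, trimmed buffers
>>> len(pickle.dumps(compact))             # should be small, on the order of the single value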

Reporter: Maarten Breddels / @maartenbreddels
Assignee: Clark Zinzow

Note: This issue was originally created as ARROW-10739. Please see the migration documentation for further details.
