[Python] Pretty printing very large ChunkedArray objects can use unbounded memory #20692

Description

In working on ARROW-2970, I have the following dataset:

import numpy as np
import pyarrow as pa

# One 1-byte element followed by 2048 elements of 1 MiB each (~2 GiB of binary data)
values = [b'x'] + [
    b'x' * (1 << 20)
] * 2 * (1 << 10)

arr = np.array(values)
arrow_arr = pa.array(arr)

The resulting arrow_arr is a ChunkedArray with 129 chunks, whose elements are each 1 MB of binary data. The repr of this object is over 600 MB:

In [10]: rep = repr(arrow_arr)

In [11]: len(rep)
Out[11]: 637536258

There are probably a number of failsafes we can implement to avoid badness in these pathological cases. They may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this.
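
One possible failsafe, sketched below, is to bound the preview rather than the data: truncate each element's printed value and cap the number of elements shown, so the output size stays small regardless of the array size. This is only an illustrative sketch, not pyarrow's actual repr code; bounded_preview and its parameters are hypothetical names.

def bounded_preview(chunked_arr, max_elements=10, max_bytes_per_element=32):
    # Hypothetical helper: build a size-bounded text preview of a
    # pyarrow.ChunkedArray so the string stays small even for huge inputs.
    lines = []
    shown = 0
    for chunk in chunked_arr.iterchunks():
        for value in chunk:
            if shown >= max_elements:
                lines.append("  ...")
                return "\n".join(lines)
            raw = value.as_py()
            if isinstance(raw, bytes) and len(raw) > max_bytes_per_element:
                # Show only a prefix of long binary values, plus the true length.
                lines.append("  %r ... (%d bytes total)"
                             % (raw[:max_bytes_per_element], len(raw)))
            else:
                lines.append("  %r" % (raw,))
            shown += 1
    return "\n".join(lines)

With the dataset above, print(bounded_preview(arrow_arr)) emits about ten short lines instead of a 600 MB string.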

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-4099. Please see the migration documentation for further details.
