Description
When reading record batches via IPC, Arrow generally constructs each batch as a single allocation, with each column in the batch composed of slices of that allocation. This doesn't play well with to_pandas(self_destruct=True): even though Arrow releases its references to each column as the conversion proceeds, those references are just slices of a larger allocation, so no memory is actually freed until the end of the conversion, defeating the point.
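For illustration, here is a small sketch (the stream path "data.arrows" is a placeholder) that inspects the buffers behind each column of one IPC batch; the buffer addresses all land inside the same parent allocation, which is why dropping a single column's reference cannot return memory to the allocator:

```python
import pyarrow as pa

# Read one batch from an IPC stream and dump the address/size of each
# column's buffers. All of them point into one contiguous allocation.
reader = pa.ipc.open_stream("data.arrows")
batch = reader.read_next_batch()

for name, col in zip(batch.schema.names, batch.columns):
    for buf in col.buffers():
        if buf is not None:
            print(name, hex(buf.address), buf.size)
```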
Reallocating the batches via pa.concat_arrays avoids this, but requires a copy. Additionally, it's unclear whether pa.concat_arrays is suitable for this purpose. It would be convenient if the record batch readers could (at least in some cases) provide suitably allocated batches (this may be hard; e.g. in Flight, the batches are ultimately based on memory allocated by gRPC). If that's not possible, then we should at least either guarantee that concat_arrays truly returns a copy, or provide an explicit way to copy arrays.
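A minimal sketch of that workaround, assuming pa.concat_arrays on a single-element list does copy (which is exactly the guarantee this issue asks for): rebuild each column so it owns its own allocation, then convert with self_destruct.

```python
import pyarrow as pa

def reallocate_batch(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Rebuild each column via concat_arrays so it (hopefully) owns its own
    # allocation instead of slicing the batch-wide IPC buffer. Whether this
    # is guaranteed to copy is the open question in this issue.
    columns = [pa.concat_arrays([col]) for col in batch.columns]
    return pa.RecordBatch.from_arrays(columns, schema=batch.schema)

reader = pa.ipc.open_stream("data.arrows")  # placeholder source
table = pa.Table.from_batches([reallocate_batch(b) for b in reader])

# With per-column allocations, self_destruct can actually release memory
# column by column as the pandas conversion proceeds.
df = table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
del table  # drop the remaining Arrow-side reference
```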
This came up when trying to integrate self_destruct into PySpark (see SPARK-32953 / apache/spark#29818).
Reporter: David Li / @lidavidm
Related issues:
- [C++][Python] Create utils for deep-copying an Array/ArrayData (is related to)
Note: This issue was originally created as ARROW-10670. Please see the migration documentation for further details.