
[Python] pyarrow.csv.read_csv hangs + eats all RAM #22212

Description

I have a quite sparse dataset in CSV format: a wide table with only a few rows but many (32k) columns. Total size is ~540 KB.

When I read the dataset using pyarrow.csv.read_csv, the process hangs, gradually consumes all available memory, and is eventually killed.

More details on the conditions follow. The script to run and all mentioned files are in the attachments.

  1. sample_32769_cols.csv is the dataset that suffers from the problem.

  2. sample_32768_cols.csv is the dataset that DOES NOT suffer from the problem and is read in under 400 ms on my machine. It is the same dataset minus ONE final column. That last column is no different from the others and contains only empty values.

Why exactly this one column makes the difference between proper execution and a hanging failure that looks like a memory leak, I have no idea.
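For readers without the attachments, a dataset of the same shape can be sketched with only the standard library. This is an assumption-laden stand-in, not the original sample: the exact values, row count, and file name differ; only the column count (32769, one past the working 32768-column file) and the sparsity mirror the report.

```python
import csv
import io

# Build a wide, sparse CSV in memory: a handful of rows but ~32k columns,
# almost all values empty. The column count mirrors the failing attachment;
# the actual cell contents are an assumption for illustration.
NUM_COLS = 32769  # one more column than the file that reads fine
NUM_ROWS = 3

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["col%d" % i for i in range(NUM_COLS)])
for r in range(NUM_ROWS):
    row = [""] * NUM_COLS
    row[r] = "x"  # sparse: a single non-empty cell per row
    writer.writerow(row)

data = buf.getvalue()
print(len(data.splitlines()))  # header + NUM_ROWS data rows -> 4

# Writing `data` to a file and reading it back is how the hang was
# triggered on the reporter's setup, e.g.:
#   import pyarrow.csv as pc
#   table = pc.read_csv("sample_32769_cols.csv")
```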

I have created a flame graph for case (1) to support the resolution of this issue (graph.svg).


Environment: Ubuntu Xenial, Python 2.7
Reporter: Bogdan Klichuk
Assignee: Micah Kornfield / @emkornfield

Original Issue Attachments: sample_32769_cols.csv, sample_32768_cols.csv, graph.svg

Note: This issue was originally created as ARROW-5791. Please see the migration documentation for further details.
