
Segmentation fault when opening a large single-column CSV with a small block_size via pyarrow.csv.open_csv() #38878

@jiale0402

Description


Describe the usage question you have. Please include as many useful details as possible.

platform: Ubuntu 23.04 (Lunar Lobster)
pyarrow version: pyarrow 14.0.1, pyarrow-hotfix 0.5
python version: Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux

I have a very large single-column CSV file (about 63 million rows). I was hoping to create a lazy file streamer that reads one entry from the CSV file at a time. Since each entry in my file is exactly 12 characters long, I tried setting block_size to 13 (+1 for the \n) with the pyarrow.csv.open_csv function:
import pyarrow as pa
import pyarrow.csv as csv

c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True, column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)
This code works as expected, but when I change the skip_rows_after_names parameter to 8300, I start to get segmentation faults inside open_csv. How do I fix this (or am I using it wrong)? I want to be able to read only a portion of the file (e.g. from row 98885 to row 111200).
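
For reference, here is a rough sketch of the windowed read I am trying to achieve, leaving block_size at its default instead of 13; read_row_range is just a name I made up for illustration:

import pyarrow as pa
import pyarrow.csv as csv

# Skip ahead to `start`, then slice the streamed record batches until
# stop - start rows have been collected. block_size stays at its default.
def read_row_range(path, start, stop):
    c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
    r_options = csv.ReadOptions(skip_rows_after_names=start, column_names=["dne"])
    reader = csv.open_csv(path, convert_options=c_options, read_options=r_options)
    remaining = stop - start
    batches = []
    for batch in reader:
        take = min(batch.num_rows, remaining)
        batches.append(batch.slice(0, take))
        remaining -= take
        if remaining == 0:
            break
    return pa.Table.from_batches(batches, schema=reader.schema)

# e.g. table = read_row_range("feature_0.csv", 98885, 111200)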

I was able to reproduce this error on another computer with the exact same platform and versions. The file was created with:

import random

with open(f"feature_{i}.csv", "w+") as f:
    for i in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")
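
Since each generated row is fixed-width (12 characters plus '\n' = 13 bytes), an arbitrary row range can also be pulled out with plain byte seeks, bypassing the CSV reader entirely; this is only a sketch, and it assumes the fixed-width layout above holds for every row:

import pyarrow as pa

# Row n starts at byte offset 13 * n, assuming every row is exactly
# 12 characters plus a trailing newline.
def read_rows_by_seek(path, start, stop, row_bytes=13):
    with open(path, "rb") as f:
        f.seek(start * row_bytes)
        raw = f.read((stop - start) * row_bytes)
    values = [float(line) for line in raw.decode().splitlines()]
    return pa.array(values, type=pa.float32())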

Component(s)

Python
