Description
Platform: Ubuntu 23.04 (Lunar Lobster)
PyArrow version: pyarrow 14.0.1, pyarrow-hotfix 0.5
Python version: Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux
I have a very large single-column CSV file (about 63 million rows). I was hoping to create a lazy file streamer that reads one entry from the CSV file at a time. I know each entry in my file is 12 characters long, so I tried setting the block size to 13 (+1 for \n) with the pyarrow.csv.open_csv function.
import pyarrow as pa
import pyarrow.csv as csv

# Each row is 12 chars plus "\n", so block_size=13 should give one row per block.
# "file" is the path to the CSV described above.
c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)
This code functions as expected, but when I change the skip_rows_after_names parameter of the read options to 8300, I start getting segmentation faults inside the open_csv function. How do I fix this (or am I using it wrong)? I want to be able to use only a portion of the file (e.g., from row 98885 to row 111200), as sketched below.
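In case it helps clarify the goal, here is a minimal workaround sketch (not a fix for the crash itself): keep the default block_size and slice the desired row range out of the streamed batches. The "file" variable is a placeholder for the CSV path, and the row bounds are the ones from the question.

import pyarrow as pa
import pyarrow.csv as csv

start, stop = 98885, 111200  # desired row range (from the question)

c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
r_options = csv.ReadOptions(column_names=["dne"])  # default block_size

rows_seen = 0
chunks = []
reader = csv.open_csv(file, convert_options=c_options, read_options=r_options)
for batch in reader:  # streams one RecordBatch at a time
    batch_start = rows_seen
    rows_seen += batch.num_rows
    if rows_seen <= start:  # batch ends before the range begins
        continue
    lo = max(start - batch_start, 0)
    hi = min(stop - batch_start, batch.num_rows)
    chunks.append(batch.slice(lo, hi - lo))
    if rows_seen >= stop:  # range fully covered
        break

subset = pa.Table.from_batches(chunks)  # rows start..stop as one table

This avoids the tiny block_size entirely, at the cost of reading (and discarding) the rows before the range instead of skipping them.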
I was able to reproduce this error on another computer with the exact same platform and versions. The file was created with:
import random

# "i" and FILE_LEN (the total row count, about 63 million) come from the surrounding script
with open(f"feature_{i}.csv", "w+") as f:
    for i in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")
Component(s)
Python