Description
My assumption: the problem is caused by a large object column of strings up to 27 characters long, so the column holds well over 2 GB of string data in total (a chunking issue).
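A quick sanity check of that assumption (just a sketch: "df" stands for the original in-memory DataFrame described under Dataset below, and the column name "category" is a placeholder, not from the report) is to sum the string lengths in the object column and compare them against the 32-bit BinaryArray offset limit:

# character count is a close proxy for UTF-8 byte count with ASCII category names
total_bytes = df["category"].str.len().sum()

# a single BinaryArray chunk uses 32-bit offsets, so it tops out just under 2 GiB
print(total_bytes, total_bytes > 2**31 - 1)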
Code
import pyarrow.parquet as pq
import pandas as pd

# reading the file directly with pyarrow
basket_plateau = pq.read_table("basket_plateau.parquet")

# reading the same file through pandas (which uses pyarrow underneath)
basket_plateau = pd.read_parquet("basket_plateau.parquet")
Error produced
ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
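One possible workaround (a sketch, not a confirmed fix) is to read the file row group by row group, so that no single BinaryArray chunk has to hold the whole column; this assumes the file contains more than one row group and that no single group's string data exceeds the limit on its own:

import pyarrow.parquet as pq
import pandas as pd

pf = pq.ParquetFile("basket_plateau.parquet")

# convert each row group to pandas separately, then stitch the frames together
frames = [pf.read_row_group(i).to_pandas() for i in range(pf.num_row_groups)]
basket_plateau = pd.concat(frames, ignore_index=True)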
Dataset
- Pandas DataFrame (pandas=0.23.1=py36h637b7d7_0)
- 2.7 billion records, 4 columns (int64 / object / datetime64 / float64)
- approx. 90 GB in memory
- example values in the object column: "Fresh Vegetables", "Alcohol Beers", ... (think food retail categories)
History of the bug:
- was using an older version of pyarrow
- tried writing the dataset to disk (Parquet) and failed
- stumbled on https://issues.apache.org/jira/browse/ARROW-2227
- upgraded to 0.10
- tried writing the dataset to disk (Parquet) and succeeded (a rewrite sketch with smaller row groups follows this list)
- tried reading the dataset back and failed
- looks like a similar case to: https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
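Since writing succeeded on 0.10 but reading back did not, rewriting the file with smaller row groups may help, so that each group's share of the object column stays well under the limit. This is only a sketch of the idea: "df" stands for the original in-memory DataFrame, and the row_group_size value is an arbitrary assumption:

import pyarrow as pa
import pyarrow.parquet as pq

# df is the original 2.7-billion-row DataFrame described above
table = pa.Table.from_pandas(df)

# smaller row groups keep each group's string data comfortably under ~2 GiB
pq.write_table(table, "basket_plateau.parquet", row_group_size=50_000_000)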
Environment: pandas=0.23.1=py36h637b7d7_0
pyarrow==0.10.0
Reporter: Frédérique Vanneste
Related issues:
Externally tracked issue: #2485
Note: This issue was originally created as ARROW-3139. Please see the migration documentation for further details.