Description
My assumption: the problem is caused by a large object column of strings up to 27 characters long, so the column holds well over 2 GB of string data in total (a chunking issue).
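A quick sanity check of that assumption (just a sketch: "df" stands for the original in-memory DataFrame described under Dataset below, and the column name "category" is a placeholder, not from the report) is to sum the string lengths in the object column and compare them against the 32-bit BinaryArray offset limit:

# character count is a close proxy for UTF-8 byte count with ASCII category names
total_bytes = df["category"].str.len().sum()

# a single BinaryArray chunk uses 32-bit offsets, so it tops out just under 2 GiB
print(total_bytes, total_bytes > 2**31 - 1)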
Code
import pyarrow.parquet as pq
import pandas as pd

# reading the file directly with pyarrow
basket_plateau = pq.read_table("basket_plateau.parquet")

# reading the same file through pandas (which uses pyarrow underneath)
basket_plateau = pd.read_parquet("basket_plateau.parquet")
Error produced
ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
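One possible workaround (a sketch, not a confirmed fix) is to read the file row group by row group, so that no single BinaryArray chunk has to hold the whole column; this assumes the file contains more than one row group and that no single group's string data exceeds the limit on its own:

import pyarrow.parquet as pq
import pandas as pd

pf = pq.ParquetFile("basket_plateau.parquet")

# convert each row group to pandas separately, then stitch the frames together
frames = [pf.read_row_group(i).to_pandas() for i in range(pf.num_row_groups)]
basket_plateau = pd.concat(frames, ignore_index=True)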
Dataset
- Pandas DataFrame (pandas=0.23.1=py36h637b7d7_0)
- 2.7 billion records, 4 columns (int64 / object / datetime64 / float64)
- approx. 90 GB in memory
- example values in the object column: "Fresh Vegetables", "Alcohol Beers", ... (think food retail categories)
History of the bug:
- was using an older version of pyarrow
- tried writing the dataset to disk (Parquet) and failed
- stumbled on https://issues.apache.org/jira/browse/ARROW-2227
- upgraded to 0.10
- tried writing the dataset to disk (Parquet) and succeeded (a rewrite sketch with smaller row groups follows this list)
- tried reading the dataset back and failed
- looks like a similar case to: https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
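Since writing succeeded on 0.10 but reading back did not, rewriting the file with smaller row groups may help, so that each group's share of the object column stays well under the limit. This is only a sketch of the idea: "df" stands for the original in-memory DataFrame, and the row_group_size value is an arbitrary assumption:

import pyarrow as pa
import pyarrow.parquet as pq

# df is the original 2.7-billion-row DataFrame described above
table = pa.Table.from_pandas(df)

# smaller row groups keep each group's string data comfortably under ~2 GiB
pq.write_table(table, "basket_plateau.parquet", row_group_size=50_000_000)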
Environment: pandas=0.23.1=py36h637b7d7_0
pyarrow==0.10.0
Reporter: Frédérique Vanneste
Related issues:
Externally tracked issue: #2485
Note: This issue was originally created as ARROW-3139. Please see the migration documentation for further details.