Description
When reading a Parquet file containing binary data larger than 2 GiB, we get an ArrowIOError because the read path does not produce chunked arrays. Reading each row group individually and then concatenating the tables works, however.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # Raises:
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain
    #   more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
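For anyone hitting this before a fix lands, the row-group workaround from the reproduction can be wrapped in a small helper. This is only a sketch of the approach described above, not part of the original report: the function name read_parquet_by_row_group is made up here, and the snippet relies only on ParquetFile.read_row_group, ParquetFile.num_row_groups, and pyarrow.concat_tables.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def read_parquet_by_row_group(path):
    """Hypothetical helper: read a Parquet file one row group at a time and
    concatenate the pieces, so no single BinaryArray has to hold more than
    2**31 - 1 bytes."""
    pf = pq.ParquetFile(path)
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    # concat_tables keeps each row group as its own chunk in the resulting
    # table, which is what lets this avoid the capacity error.
    return pa.concat_tables(tables)

# Usage against the file written by the reproduction above:
# t3 = read_parquet_by_row_group('demo.parquet')
{code}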
Reporter: Left Screen
Assignee: Ben Kietzman / @bkietz
Related issues:
- [Python] Error with errno 22 when loading 3.6 GB Parquet file (is duplicated by)
- [Python] ArrowIOError: Arrow error: Capacity error during read (is duplicated by)
- [Python] Table.from_pandas does not create chunked_arrays. (relates to)
- [Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs (is related to)
- [C++] Add chunked builder classes (is related to)
PRs and other links:
- GitHub Pull Request #3171
- GitHub Pull Request #4695
- GitHub Pull Request #5312
- Apache Arrow Issue 1677
Note: This issue was originally created as ARROW-3762. Please see the migration documentation for further details.