
[C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray #20081

@asfimport

Description

When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError because the read does not produce chunked arrays. Reading each row group individually and then concatenating the tables works, however.

     
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # Fails with:
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain
    # more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note there are 32 row groups, not 2 as suggested by
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
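
The 32 row groups observed above presumably come from the writer splitting each written table into several row groups by default rather than writing one row group per write_table call. A minimal sketch follows, assuming pyarrow's row_group_size parameter to ParquetWriter.write_table and using a deliberately reduced array size so it runs quickly; it is not part of the original report. It shows how to pin the row group count explicitly and confirms that the per-row-group workaround yields a chunked column instead of one oversized BinaryArray.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Reduced size so this sketch runs quickly; the original report used 2**30 rows.
x = pa.array(list('1' * 2**20))
t = pa.Table.from_arrays([x], ['x'])

demo = 'demo_small.parquet'  # hypothetical file name for this sketch
writer = pq.ParquetWriter(demo, t.schema)
for i in range(2):
    # An explicit row_group_size caps rows per row group, so each
    # write_table call here should produce exactly one row group.
    writer.write_table(t, row_group_size=len(t))
writer.close()

pf = pq.ParquetFile(demo)
print(pf.num_row_groups)  # expected: 2

# Per-row-group workaround: concat_tables keeps each row group's data in its
# own chunk, so no single BinaryArray has to hold all of the bytes.
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
t3 = pa.concat_tables(tables)
print(t3.column('x').num_chunks)  # typically one chunk per row group
{code}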

Reporter: Left Screen
Assignee: Ben Kietzman / @bkietz

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-3762. Please see the migration documentation for further details.
