Description
When reading a Parquet file containing binary data larger than 2 GiB, we get an ArrowIOError because the read path does not produce chunked arrays. Reading each row group individually and then concatenating the tables works, however.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # Raises:
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain
    #   more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
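For anyone hitting this before a fix lands, the row-group workaround from the reproduction can be wrapped in a small helper. This is only a sketch of the approach described above, not part of the original report: the function name read_parquet_by_row_group is made up here, and the snippet relies only on ParquetFile.read_row_group, ParquetFile.num_row_groups, and pyarrow.concat_tables.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def read_parquet_by_row_group(path):
    """Hypothetical helper: read a Parquet file one row group at a time and
    concatenate the pieces, so no single BinaryArray has to hold more than
    2**31 - 1 bytes."""
    pf = pq.ParquetFile(path)
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    # concat_tables keeps each row group as its own chunk in the resulting
    # table, which is what lets this avoid the capacity error.
    return pa.concat_tables(tables)

# Usage against the file written by the reproduction above:
# t3 = read_parquet_by_row_group('demo.parquet')
{code}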
Reporter: Left Screen
Assignee: Ben Kietzman / @bkietz
Related issues:
- [Python] Error with errno 22 when loading 3.6 GB Parquet file (is duplicated by)
- [Python] ArrowIOError: Arrow error: Capacity error during read (is duplicated by)
- [Python] Table.from_pandas does not create chunked_arrays. (relates to)
- [Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs (is related to)
- [C++] Add chunked builder classes (is related to)
PRs and other links:
- GitHub Pull Request #3171
- GitHub Pull Request #4695
- GitHub Pull Request #5312
- Apache Arrow Issue 1677
Note: This issue was originally created as ARROW-3762. Please see the migration documentation for further details.