Closed
Description
Environment
- pyarrow version: 0.7.1 - 0.8.0
- Python version: 3.6
- Operating System: Windows 8-10
Details
- Describe what you were trying to get done.
Save a large Parquet file containing a categorical column.
- What commands did you run to trigger this issue? If you can provide a Minimal, Complete, and Verifiable example this will help us understand the issue.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def test_bytes_exceed_2gb_write():
    """Based on test from #867. Shows OK behavior for `str` column."""
    val = 'x' * (1 << 20)
    df = pd.DataFrame({
        'strings': np.array([val] * 4000, dtype=object)
    })
    table = pa.Table.from_pandas(df)
    assert table[0].data.num_chunks == 2
    pq.write_table(table, 'foo.parquet')


def test_categorical_exceed_2gb_write():
    """Based on test from #867. Shows Crash for `categorical` column."""
    val = 'x' * (1 << 20)
    df = pd.DataFrame({
        'strings': np.array([val] * 4000, dtype=object)
    })
    df['strings'] = df['strings'].astype('category')
    table = pa.Table.from_pandas(df)
    assert table[0].data.num_chunks == 1  # Only 1 block is being generated
    pq.write_table(table, 'foo.parquet')  # Fails here


test_bytes_exceed_2gb_write()
test_categorical_exceed_2gb_write()
- If there was a crash, please include the traceback here.
Traceback (most recent call last):
  File "C:\Users\some_user\Desktop\bug_categorical.py", line 39, in <module>
    test_categorical_exceed_2gb_write()
  File "C:\Users\some_user\Desktop\bug_categorical.py", line 35, in test_categorical_exceed_2gb_write
    pq.write_table(table, 'foo.parquet') # Fails here
  File "C:\Anaconda\lib\site-packages\pyarrow\parquet.py", line 944, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "C:\Anaconda\lib\site-packages\pyarrow\parquet.py", line 297, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "_parquet.pyx", line 930, in pyarrow._parquet.ParquetWriter.write_table
  File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483648

Similar to https://issues.apache.org/jira/browse/ARROW-1167 and also #1673.
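
For anyone checking the numbers: the byte counts in the error message line up with the test data, and the arithmetic can be sketched in plain Python without pyarrow. The helper `plan_chunks` and the constant names below are hypothetical and only illustrate the idea; the greedy splitting is an assumption about how the chunking might work, not pyarrow's actual implementation. The only figures taken from the report are the 2147483646-byte limit, the 1 MiB string size, and the 4000 rows.

```python
# Byte limit quoted in the ArrowInvalid message (taken from the traceback).
BINARY_ARRAY_LIMIT = 2_147_483_646

VALUE_SIZE = 1 << 20  # each test string is 1 MiB of 'x'
NUM_ROWS = 4000       # number of rows in the test DataFrame


def plan_chunks(value_sizes, limit=BINARY_ARRAY_LIMIT):
    """Hypothetical helper: greedily group consecutive values so that the
    total byte size of each group stays within `limit`.

    Returns a list of (start, stop) row ranges, stop exclusive.
    """
    chunks = []
    start = 0
    running = 0
    for i, size in enumerate(value_sizes):
        if size > limit:
            raise ValueError(f"value {i} alone exceeds the limit")
        if running + size > limit:
            # Adding this value would overflow: close the current chunk.
            chunks.append((start, i))
            start = i
            running = 0
        running += size
    if start < len(value_sizes):
        chunks.append((start, len(value_sizes)))
    return chunks


total_bytes = VALUE_SIZE * NUM_ROWS                      # all string data
max_values_per_chunk = BINARY_ARRAY_LIMIT // VALUE_SIZE  # values that fit
overflow_bytes = (max_values_per_chunk + 1) * VALUE_SIZE # one value too many
chunks = plan_chunks([VALUE_SIZE] * NUM_ROWS)

print(total_bytes, max_values_per_chunk, overflow_bytes, chunks)
```

Under these assumptions the 4000 rows split into two ranges, which agrees with `num_chunks == 2` in the `str` test. The "have 2147483648" in the error equals 2048 values of 1 MiB each, one value past what fits under the limit. That would be consistent with the writer expanding the categorical column into plain strings in a single array without splitting it, though that reading is an inference from the numbers and not something the traceback confirms.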