When saving a pandas DataFrame to Parquet, a categorical column whose categories are boolean is saved as a plain boolean column.
This is a problem because, when the Parquet file is read back, I expect the column to still be categorical, but it no longer is.
Reproducible example:
import pandas as pd
import pyarrow
import pyarrow.parquet

# Create a dataframe with a boolean column, then convert it to categorical
df = pd.DataFrame({'a': [True, True, False, True, False]})
df['a'] = df['a'].astype('category')

# Convert to an Arrow Table and save it to disk
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, 'test.parquet')

# Reload the data and convert back to pandas
table_rel = pyarrow.parquet.read_table('test.parquet')
df_rel = table_rel.to_pandas()

The Arrow Table variable correctly converts the column to an Arrow DICTIONARY type:
>>> df['a']
0 True
1 True
2 False
3 True
4 False
Name: a, dtype: category
Categories (2, object): [False, True]
>>>
>>> table
pyarrow.Table
a: dictionary<values=bool, indices=int8, ordered=0>
However, the reloaded column is now a regular boolean:
>>> table_rel
pyarrow.Table
a: bool
>>>
>>> df_rel['a']
0 True
1 True
2 False
3 True
4 False
Name: a, dtype: bool
I would have expected the column to be read back as categorical.
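For now I can work around this by re-casting the column after reading. This is only a minimal sketch, and it assumes I know up front which columns were categorical and what their categories were (here the column 'a' and the file 'test.parquet' from the example above):

import pandas as pd
import pyarrow.parquet

# Read the file back; the categorical dtype is lost at this point
df_rel = pyarrow.parquet.read_table('test.parquet').to_pandas()

# Manually restore the categorical dtype, assuming the original categories are known
df_rel['a'] = df_rel['a'].astype(pd.CategoricalDtype(categories=[False, True]))
print(df_rel['a'].dtype)  # category

This obviously does not scale to DataFrames where the categorical columns are not known in advance, which is why I would expect the round trip to preserve the dtype on its own.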
Reporter: Joao Moreira
Related issues:
- [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY (is blocked by)
- [Python] Consistent handling of categoricals (is related to)
Note: This issue was originally created as ARROW-13342. Please see the migration documentation for further details.