[Python] Categorical boolean column saved as regular boolean in parquet #29017

@asfimport

Description

When a pandas DataFrame is saved to Parquet, a categorical column whose categories are boolean is written as a regular boolean column.

This is a problem because I expect the column to still be categorical when the Parquet file is read back.

 
Reproducible example:

import pandas as pd
import pyarrow
import pyarrow.parquet  # explicit import so that pyarrow.parquet is available below

# Create dataframe with boolean column that is then converted to categorical
df = pd.DataFrame({'a': [True, True, False, True, False]})
df['a'] = df['a'].astype('category')

# Convert to arrow Table and save to disk
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, 'test.parquet')

# Reload data and convert back to pandas
table_rel = pyarrow.parquet.read_table('test.parquet')
df_rel = table_rel.to_pandas()

The Arrow table correctly represents the column as an Arrow dictionary type:


>>> df['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: category
Categories (2, object): [False, True]
>>>
>>> table
pyarrow.Table
a: dictionary<values=bool, indices=int8, ordered=0>
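
For readers unfamiliar with the `dictionary<values=bool, indices=int8>` type shown above, here is a rough pure-Python sketch of what dictionary encoding means: the distinct values are stored once, and each row holds a small integer index into them. The helper names `dictionary_encode` and `dictionary_decode` are made up for illustration and are not pyarrow's API. The dictionary here is built in order of first appearance, which is an assumption of this sketch (pandas, for instance, lists the categories as `[False, True]`).

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices); illustrative helper only."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index into dictionary
    indices = []      # one index per row
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and indices."""
    return [dictionary[i] for i in indices]


values = [True, True, False, True, False]
dictionary, indices = dictionary_encode(values)
roundtrip = dictionary_decode(dictionary, indices)
```

Decoding the indices gives back the plain boolean values, which is all that survives in the reloaded table below.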

However, the reloaded column is now a regular boolean:


>>> table_rel
pyarrow.Table
a: bool
>>>
>>> df_rel['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: bool

I would have expected the column to be read back as categorical.
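
Until the round trip preserves the type, one possible workaround is to cast the affected columns back to categorical after loading. The following is only a sketch: `restore_categoricals` is a hypothetical helper (not part of pandas or pyarrow), the caller has to know which columns were categorical, and the DataFrame here is a stand-in for `df_rel` so the snippet runs without a Parquet file.

```python
import pandas as pd


def restore_categoricals(df, columns):
    """Return a copy of df with the named columns cast to category dtype.

    Hypothetical helper for illustration; it does not change how
    pyarrow writes or reads the file.
    """
    out = df.copy()
    for name in columns:
        out[name] = out[name].astype('category')
    return out


# Stand-in for the reloaded frame from the example (plain bool column)
df_rel = pd.DataFrame({'a': [True, True, False, True, False]})
df_fixed = restore_categoricals(df_rel, ['a'])
```

This restores the dtype but does not address the underlying behavior reported here, and any custom category order or unused categories from the original column would not be recovered.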

Reporter: Joao Moreira

Note: This issue was originally created as ARROW-13342. Please see the migration documentation for further details.
