[Python] Consistent handling of categoricals #27067

What is the current state of categoricals with pyarrow? The `categories` parameter mentioned in [this GitHub issue](https://github.com/apache/arrow/issues/1688) no longer seems to be accepted by `pd.read_parquet`. When reading with pyarrow, `int` categoricals do not come back with their categorical dtype, while `str` categoricals do, unless the file was written by fastparquet.

Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:

```python
import os
import pandas as pd


fname = '/tmp/tst'


data = {
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0,1])),
    'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
}
df = pd.DataFrame(data)


for write in ['fastparquet', 'pyarrow']:
    for read in ['fastparquet', 'pyarrow']:
        if os.path.exists(fname):
            os.remove(fname)
        df.to_parquet(fname, engine=write, compression=None)
        df_read = pd.read_parquet(fname, engine=read)


        print()
        print('write:', write, 'read:', read)
        for t in data.keys():
            print(t, df[t].dtype == df_read[t].dtype)
```
Output:

```

write: fastparquet read: fastparquet
int True
str True
write: fastparquet read: pyarrow
int False
str False
write: pyarrow read: fastparquet
int True
str True
write: pyarrow read: pyarrow
int False
str True
```
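
Until the round trip is consistent, one possible workaround is to cast the affected columns back to their original categorical dtypes after reading. The following is a minimal sketch, assuming the original dtypes are still available at read time. `restore_categoricals` is a hypothetical helper, not part of pandas or pyarrow, and `df_read` here is a hand-built stand-in for a frame whose categorical dtypes were lost on read, not an actual parquet read.

```python
import pandas as pd


def restore_categoricals(df_read, dtypes):
    """Cast columns back to the given categorical dtypes (hypothetical helper)."""
    cat_dtypes = {
        col: dtype
        for col, dtype in dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and col in df_read.columns
    }
    # astype with a dict only touches the listed columns.
    return df_read.astype(cat_dtypes)


# Original frame with categorical columns, as in the script above.
df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for a read that dropped the categorical dtypes (assumption).
df_read = pd.DataFrame({'int': [0, 1] * 3, 'str': ['foo', 'bar'] * 3})

restored = restore_categoricals(df_read, df.dtypes.to_dict())
for col in df.columns:
    print(col, df[col].dtype == restored[col].dtype)
```

This only papers over the difference on the pandas side. It does not change what either engine writes to or reads from the file.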

**Reporter**: [Chris Roat](https://issues.apache.org/jira/browse/ARROW-11157)
#### Related issues:
- [[Python] Categorical boolean column saved as regular boolean in parquet](https://github.com/apache/arrow/issues/29017) (relates to)

<sub>**Note**: *This issue was originally created as [ARROW-11157](https://issues.apache.org/jira/browse/ARROW-11157). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
