Description
Parquet supports "dictionary encoding" of column data in a manner very similar to pandas Categoricals. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or Arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.
If the data does not have the pandas metadata, then this guarantee does not hold: we cannot assume either that the whole column is dictionary-encoded or that the dictionary labels are the same throughout. In this case, the current behaviour is fine.
(please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at dask/fastparquet#374 as a feature that is useful in fastparquet)
Reporter: Martin Durant / @martindurant
Assignee: Wes McKinney / @wesm
Related issues:
- [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (relates to)
- [Python] CategoricalIndex is lost after reading back (is related to)
- [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size (is related to)
- [C++] Provide public API to access dictionary-encoded indices and values (is related to)
- [C++] Persist original type metadata from Arrow schemas (is related to)
- [Python] Support reading Parquet binary/string columns directly as DictionaryArray (is related to)
- [Python] Pandas categorical type doesn't survive a round-trip through parquet (is related to)
- [Python] Column metadata is not saved or loaded in parquet (supersedes)
- [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter (depends upon)
PRs and other links:
Note: This issue was originally created as ARROW-3246. Please see the migration documentation for further details.