Description
Requires PARQUET-1324 and probably quite a bit of extra work
Properly implementing this will require dictionary normalization across row groups. When reading a new row group, a fast path should compare the row group's dictionary with the prior dictionary and skip remapping when they match. The implementation also needs to handle the case where a column chunk "fell back" to PLAIN encoding mid-stream.
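The normalization described above can be sketched in plain Python (this is an illustrative model, not Arrow's actual C++ implementation; the function names and the list/dict representation of dictionaries are assumptions for the sketch):

```python
def normalize_dict_chunk(indices, chunk_dict, unified, pos):
    """Remap one row group's dictionary indices into the unified dictionary.

    indices:    dictionary-encoded indices for this column chunk
    chunk_dict: this chunk's dictionary-page values
    unified:    accumulated unified dictionary (mutated in place)
    pos:        value -> index lookup for `unified` (mutated in place)
    """
    if chunk_dict == unified:
        # Fast path: dictionary is identical to the prior one,
        # so the indices are already valid as-is.
        return indices
    # Slow path: extend the unified dictionary and remap indices.
    remap = []
    for value in chunk_dict:
        if value not in pos:
            pos[value] = len(unified)
            unified.append(value)
        remap.append(pos[value])
    return [remap[i] for i in indices]


def normalize_plain_chunk(values, unified, pos):
    """Handle a chunk that fell back to PLAIN encoding mid-stream by
    dictionary-encoding its raw values against the unified dictionary."""
    out = []
    for value in values:
        if value not in pos:
            pos[value] = len(unified)
            unified.append(value)
        out.append(pos[value])
    return out
```

For example, a second row group whose dictionary is `["b", "c"]` against a unified dictionary `["a", "b"]` takes the slow path: `"b"` maps to index 1, `"c"` is appended at index 2, and the chunk's indices are remapped accordingly.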
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Related issues:
- [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY (relates to)
- [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray (relates to)
- [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (relates to)
- [Python][Parquet] direct reading/writing of pandas categoricals in parquet (relates to)
- [Python] CategoricalIndex is lost after reading back (relates to)
- [C++] Move "dictionary" member from DictionaryType to ArrayData to allow for changing dictionaries between Array chunks (depends upon)
- [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (depends upon)
- [C++] Implement alternative DictionaryBuilder that always yields int32 indices (depends upon)
- [C++] Reorganize parquet/arrow/reader.cc, remove code duplication, improve readability (depends upon)
- [C++][Parquet] Build logical schema tree mapping Arrow fields to Parquet schema levels (depends upon)
PRs and other links:
Note: This issue was originally created as ARROW-3325. Please see the migration documentation for further details.