Description
Requires PARQUET-1324 and probably quite a bit of extra work
Properly implementing this will require dictionary normalization across row groups. When reading a new row group, a fast path should compare the row group's dictionary with the prior dictionary and skip remapping when they match. The implementation also needs to handle the case where a column chunk "fell back" to PLAIN encoding mid-stream.
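The normalization described above can be sketched in plain Python (this is an illustrative model, not Arrow's actual C++ implementation; the function names and the list/dict representation of dictionaries are assumptions for the sketch):

```python
def normalize_dict_chunk(indices, chunk_dict, unified, pos):
    """Remap one row group's dictionary indices into the unified dictionary.

    indices:    dictionary-encoded indices for this column chunk
    chunk_dict: this chunk's dictionary-page values
    unified:    accumulated unified dictionary (mutated in place)
    pos:        value -> index lookup for `unified` (mutated in place)
    """
    if chunk_dict == unified:
        # Fast path: dictionary is identical to the prior one,
        # so the indices are already valid as-is.
        return indices
    # Slow path: extend the unified dictionary and remap indices.
    remap = []
    for value in chunk_dict:
        if value not in pos:
            pos[value] = len(unified)
            unified.append(value)
        remap.append(pos[value])
    return [remap[i] for i in indices]


def normalize_plain_chunk(values, unified, pos):
    """Handle a chunk that fell back to PLAIN encoding mid-stream by
    dictionary-encoding its raw values against the unified dictionary."""
    out = []
    for value in values:
        if value not in pos:
            pos[value] = len(unified)
            unified.append(value)
        out.append(pos[value])
    return out
```

For example, a second row group whose dictionary is `["b", "c"]` against a unified dictionary `["a", "b"]` takes the slow path: `"b"` maps to index 1, `"c"` is appended at index 2, and the chunk's indices are remapped accordingly.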
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Related issues:
- [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY (relates to)
- [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray (relates to)
- [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (relates to)
- [Python][Parquet] direct reading/writing of pandas categoricals in parquet (relates to)
- [Python] CategoricalIndex is lost after reading back (relates to)
- [C++] Move "dictionary" member from DictionaryType to ArrayData to allow for changing dictionaries between Array chunks (depends upon)
- [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (depends upon)
- [C++] Implement alternative DictionaryBuilder that always yields int32 indices (depends upon)
- [C++] Reorganize parquet/arrow/reader.cc, remove code duplication, improve readability (depends upon)
- [C++][Parquet] Build logical schema tree mapping Arrow fields to Parquet schema levels (depends upon)
PRs and other links:
Note: This issue was originally created as ARROW-3325. Please see the migration documentation for further details.