Skip to content

[C++][Dataset] Optionally encode partition field values as dictionary type #24808

@asfimport

Description

@asfimport

In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns.

In the new Dataset API, we now use a plain type (integer or string when inferred). But, you can already manually specify that the partition keys should be dictionary type by specifying the partitioning schema (in Partitioning passed to the dataset factory).

Since using dictionary type can be more efficient (since partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.

See also #6303 (comment)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-8647. Please see the migration documentation for further details.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions