Description
When specifying an explicit schema for the Partitioning, and when using a dictionary type, the materialization of the partition keys goes wrong: you don't get an error, but you get columns with all nulls.
Python example:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
```

When reading with discovery, all is fine:
```python
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
values: double
bar: string
foo: int32
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
     values bar  foo
0  2.505903   a    0
1 -1.760135   a    0
```

But when specifying the partition columns to be of dictionary type with an explicit HivePartitioning, you get no error but all null values:
```python
>>> partitioning = ds.HivePartitioning(pa.schema([
...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
...     ("bar", pa.dictionary(pa.int32(), pa.string()))
... ]))
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
     values foo  bar
0  2.505903 NaN  NaN
1 -1.760135 NaN  NaN
```

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz
PRs and other links:
Note: This issue was originally created as ARROW-8088. Please see the migration documentation for further details.