Description
When specifying an explicit schema for the Partitioning, and when using a dictionary type, the materialization of the partition keys goes wrong: you don't get an error, but you get columns with all nulls.
Python example:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
```

When reading with discovery, all is fine:
```python
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
values: double
bar: string
foo: int32
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
     values bar  foo
0  2.505903   a    0
1 -1.760135   a    0
```

But when specifying the partition columns to be of dictionary type with an explicit HivePartitioning, you get no error but all null values:
```python
>>> partitioning = ds.HivePartitioning(pa.schema([
...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
...     ("bar", pa.dictionary(pa.int32(), pa.string()))
... ]))
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
     values foo  bar
0  2.505903 NaN  NaN
1 -1.760135 NaN  NaN
```

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz
PRs and other links:
Note: This issue was originally created as ARROW-8088. Please see the migration documentation for further details.