ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type #7536
Conversation
We could use just int32() dictionary indices and call it a day?
I think there's value in finding the smallest index type possible; we expect partition fields to have few unique values in most cases.
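For illustration, a hypothetical helper (these names are made up, not Arrow API) showing what picking the smallest index type could look like:

import pyarrow as pa

def smallest_index_type(num_uniques):
    # Pick the narrowest signed integer type whose index range covers
    # num_uniques dictionary values (int8 can index up to 128 values, etc.).
    if num_uniques <= 1 << 7:
        return pa.int8()
    if num_uniques <= 1 << 15:
        return pa.int16()
    if num_uniques <= 1 << 31:
        return pa.int32()
    return pa.int64()

# A partition field with a handful of unique values would get int8 indices:
assert smallest_index_type(3) == pa.int8()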
Actually, on reflection: I'm not sure it's worthwhile to check the count of unique values at all. In any given batch a virtual column would be materialized with a single-item dictionary, so it seems preferable to remove the check.
Currently for the ParquetDataset, it also simply uses int32 for the indices. Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, you create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?). To illustrate with a small dataset with two partition values ("A" and "B"):

In [1]: import pyarrow.dataset as ds
In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)
In [10]: fragment = list(dataset.get_fragments())[0]
In [11]: fragment.to_table(schema=dataset.schema)
Out[11]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int8, ordered=0>
# only A included
In [13]: fragment.to_table(schema=dataset.schema).column("part")
Out[13]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
[
-- dictionary:
[
"A"
]
-- indices:
[
0,
0
]
]
In [15]: import pyarrow.parquet as pq
In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")
In [19]: piece = dataset2.pieces[0]
In [25]: piece.read(partitions=dataset2.partitions)
Out[25]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int32, ordered=0>
# both A and B included
In [26]: piece.read(partitions=dataset2.partitions).column("part")
Out[26]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
[
-- dictionary:
[
"A",
"B"
]
-- indices:
[
0,
0
]
]

I think for this to be valuable (e.g. in the context of dask, or for pandas when reading only a part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that during the discovery in the Factory, we could keep track of all unique values of a partition field?)
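For illustration, a minimal sketch of that idea (a hypothetical helper, not the actual Factory code), assuming a hive-style directory layout:

import os
import pyarrow as pa

def collect_partition_values(root, field):
    # Walk the dataset directory and collect every value seen for the given
    # key=value partition field, so all fragments can share one dictionary.
    values = set()
    for _, dirnames, _ in os.walk(root):
        for d in dirnames:
            key, sep, value = d.partition("=")
            if sep and key == field:
                values.add(value)
    return pa.array(sorted(values))

# e.g. collect_partition_values("test_partitioned/", "part") -> ["A", "B"]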
@jorisvandenbossche okay, I'll extend the key value
I'm also of the opinion that we should stick with int32_t. That's what Parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictionaryType, etc. I suspect the short-term net effect of this will be uncovering index type issues we have lying around.
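For reference, int32 is indeed what pyarrow already produces when dictionary-encoding an array:

import pyarrow as pa

# dictionary_encode() defaults to int32 indices, matching the proposal above.
arr = pa.array(["A", "B", "A"]).dictionary_encode()
print(arr.type)  # dictionary<values=string, indices=int32, ordered=0>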
Int32 indices are now used regardless of the dictionary size.
@bkietz thanks for the update ensuring all unique values end up in the dictionary! Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608

Further, a usability issue: this now creates partition expressions that use a dictionary type, which means that doing something like comparing the partition field to a plain string in a filter doesn't work out of the box. It might also be an option to keep the partition expressions in the plain value type.
I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).

Opened https://issues.apache.org/jira/browse/ARROW-9345 to track this.
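To make the usability issue concrete, a sketch of the kind of filter affected, reusing the test_partitioned dataset from above (whether the comparison sees through to the dictionary's value type is what ARROW-9345 covers):

import pyarrow.dataset as ds

part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)

# "part" is dictionary-typed, but the filter compares against a plain string;
# this relies on the comparison handling the dictionary's value type.
table = dataset.to_table(filter=ds.field("part") == "A")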
A maximum cardinality for the inferred dictionary can be specified:
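For example, a sketch using the max_partition_dictionary_size parameter as it appears earlier in this thread (later revisions of the PR may have renamed or removed it):

import pyarrow.dataset as ds

# Infer a dictionary type for the partition field only if it has at most
# 16 unique values; -1 means no limit, as used earlier in this thread
# (assumed semantics: fall back to a plain string field if exceeded).
part = ds.HivePartitioning.discover(max_partition_dictionary_size=16)
dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)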