Description
Related to ARROW-8647, see comment at #7536 (comment)
When a dictionary type is used for the partition fields, the resulting partition expressions now also use a dictionary type. As a result, filtering on the partition field with a plain string expression, such as dataset.to_table(filter=ds.field("part") == "A"), doesn't work. This limits the usability of the option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
In [11]: fragment = list(dataset.get_fragments())[0]
In [12]: fragment.partition_expression
Out[12]:
<pyarrow.dataset.Expression (part == [
"A",
"B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>
In [13]: dataset.to_table(filter=ds.field("part") == "A")
...
ArrowNotImplementedError: cast from string
It might be an option to have the partition_expression use the dictionary's value type instead of the dictionary type? Or alternatively, as @fsaintjacques proposed, ensure that any comparison involving the dict type also works with the "effective" logical type (the value type of the dict).
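To make the second option concrete, here is a minimal pure-Python sketch of comparing a dictionary-encoded value against a plain string by its decoded value. It is only an illustration of the intended semantics, not how Arrow implements or would implement it: `DictScalar` and `logical_equal` are hypothetical names invented for this example and are not part of pyarrow.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DictScalar:
    """Hypothetical stand-in for a dictionary-encoded scalar (not a pyarrow class)."""

    dictionary: Sequence[str]  # the distinct values, e.g. ("A", "B")
    index: int                 # position of this scalar's value in `dictionary`

    @property
    def value(self) -> str:
        # Decode to the "effective" logical value (the dictionary's value type).
        return self.dictionary[self.index]


def logical_equal(left, right) -> bool:
    """Compare by decoded value, whether or not an operand is dictionary-encoded."""
    if isinstance(left, DictScalar):
        left = left.value
    if isinstance(right, DictScalar):
        right = right.value
    return left == right


# Mirrors the partition expression from the session: part == ["A", "B"][0]
partition_value = DictScalar(dictionary=("A", "B"), index=0)

# A plain string can be compared directly, with no cast to the dictionary type.
matches_a = logical_equal(partition_value, "A")
matches_b = logical_equal(partition_value, "B")
```

With this kind of rule, a filter written against the plain string "A" would match the fragment above without the user having to build a dictionary-typed scalar.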
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz
PRs and other links:
Note: This issue was originally created as ARROW-9345. Please see the migration documentation for further details.