ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type #7536
Conversation
We could use just int32() dictionary indices and call it a day?
I think there's value in finding the smallest index type possible; we expect partition fields to have few unique values in most cases.
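For illustration, a hypothetical helper (these names are made up, not Arrow API) showing what picking the smallest index type could look like:

import pyarrow as pa

def smallest_index_type(num_uniques):
    # Pick the narrowest signed integer type whose index range covers
    # num_uniques dictionary values (int8 can index up to 128 values, etc.).
    if num_uniques <= 1 << 7:
        return pa.int8()
    if num_uniques <= 1 << 15:
        return pa.int16()
    if num_uniques <= 1 << 31:
        return pa.int32()
    return pa.int64()

# A partition field with a handful of unique values would get int8 indices:
assert smallest_index_type(3) == pa.int8()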
Actually, on reflection: I'm not sure it's worthwhile to check the count of unique values at all. In any given batch a virtual column would be materialized with a single-item dictionary, so it seems preferable to remove the check.
Currently for the ParquetDataset, it also simply uses int32 for the indices. Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, you create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?). To illustrate with a small dataset with two partition values ("A" and "B"):

In [1]: import pyarrow.dataset as ds
In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)
In [10]: fragment = list(dataset.get_fragments())[0]
In [11]: fragment.to_table(schema=dataset.schema)
Out[11]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int8, ordered=0>
# only A included
In [13]: fragment.to_table(schema=dataset.schema).column("part")
Out[13]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
[
-- dictionary:
[
"A"
]
-- indices:
[
0,
0
]
]
In [15]: import pyarrow.parquet as pq
In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")
In [19]: piece = dataset2.pieces[0]
In [25]: piece.read(partitions=dataset2.partitions)
Out[25]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int32, ordered=0>
# both A and B included
In [26]: piece.read(partitions=dataset2.partitions).column("part")
Out[26]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
[
-- dictionary:
[
"A",
"B"
]
-- indices:
[
0,
0
]
]

I think for this to be valuable (e.g. in the context of dask, or for pandas when reading only a part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that during the discovery in the Factory, we could keep track of all unique values of a partition field?)
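For illustration, a minimal sketch of that idea (a hypothetical helper, not the actual Factory code), assuming a hive-style directory layout:

import os
import pyarrow as pa

def collect_partition_values(root, field):
    # Walk the dataset directory and collect every value seen for the given
    # key=value partition field, so all fragments can share one dictionary.
    values = set()
    for _, dirnames, _ in os.walk(root):
        for d in dirnames:
            key, sep, value = d.partition("=")
            if sep and key == field:
                values.add(value)
    return pa.array(sorted(values))

# e.g. collect_partition_values("test_partitioned/", "part") -> ["A", "B"]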
@jorisvandenbossche okay, I'll extend the key value
I'm also of the opinion that we should stick with int32_t. That's what Parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictionaryType, etc. I suspect the short-term net effect of this will be uncovering index type issues we have lying around.
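For reference, int32 is indeed what pyarrow already produces when dictionary-encoding an array:

import pyarrow as pa

# dictionary_encode() defaults to int32 indices, matching the proposal above.
arr = pa.array(["A", "B", "A"]).dictionary_encode()
print(arr.type)  # dictionary<values=string, indices=int32, ordered=0>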
Int32 indices are now used regardless of the dictionary size.
@bkietz thanks for the update ensuring all unique values end up in the dictionary! Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608

Further, a usability issue: this now creates partition expressions that use a dictionary type, which means that doing something like comparing the partition field to a plain string in a filter doesn't work out of the box. It might also be an option to keep the partition expressions in the plain value type.
I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).

Opened https://issues.apache.org/jira/browse/ARROW-9345 to track this.
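To make the usability issue concrete, a sketch of the kind of filter affected, reusing the test_partitioned dataset from above (whether the comparison sees through to the dictionary's value type is what ARROW-9345 covers):

import pyarrow.dataset as ds

part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)

# "part" is dictionary-typed, but the filter compares against a plain string;
# this relies on the comparison handling the dictionary's value type.
table = dataset.to_table(filter=ds.field("part") == "A")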
A maximum cardinality for the inferred dictionary can be specified:
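For example, a sketch using the max_partition_dictionary_size parameter as it appears earlier in this thread (later revisions of the PR may have renamed or removed it):

import pyarrow.dataset as ds

# Infer a dictionary type for the partition field only if it has at most
# 16 unique values; -1 means no limit, as used earlier in this thread
# (assumed semantics: fall back to a plain string field if exceeded).
part = ds.HivePartitioning.discover(max_partition_dictionary_size=16)
dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)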