Conversation

@bkietz (Member) commented Jun 24, 2020

A maximum cardinality for the inferred dictionary can be specified:

| max | result |
| --- | --- |
| 0 (default) | All string partition fields will be inferred as utf8 |
| None | All string partition fields will be inferred as dictionary(, utf8) |
| 64 | If a field has 64 or fewer distinct partition values, it will be inferred as dictionary(int8, utf8), otherwise as utf8 |
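The rule in the table above can be sketched as a small helper. This is a hypothetical illustration of the inference logic, not the actual C++ implementation, and the function name `inferred_type` is made up:

```python
def inferred_type(max_cardinality, n_unique):
    """Sketch: pick the inferred type for a string partition field,
    given the configured maximum and the observed unique-value count."""
    if max_cardinality == 0:
        # dictionary inference disabled
        return "utf8"
    if max_cardinality is not None and n_unique > max_cardinality:
        # too many distinct values: fall back to plain strings
        return "utf8"
    # otherwise choose the smallest signed index type that fits
    for bits in (8, 16, 32, 64):
        if n_unique <= 2 ** (bits - 1) - 1:
            return f"dictionary(int{bits}, utf8)"
    return "utf8"
```

For example, with `max_cardinality=64` a field with 64 distinct values is inferred as `dictionary(int8, utf8)`, while one with 65 falls back to `utf8`.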


@wesm (Member) commented Jun 25, 2020

We could use just int32() dictionary indices and call it a day?

@bkietz (Member, Author) commented Jun 25, 2020

I think there's value in finding the smallest index type possible; we expect partition fields to have few unique values in most cases.

@bkietz (Member, Author) commented Jun 25, 2020

Actually, on reflection: I'm not sure it's worthwhile to check the count of unique values at all. In any given batch a virtual column would be materialized with a single-item dictionary, so int8 should always suffice (unless we want to always support concatenation of a materialized table's chunks, though even in that case we could promote the index type on concatenation...).

Currently it seems preferable to remove max_partition_dictionary_size in favor of a boolean flag and always infer dictionary<indices=int8, values=utf8>.

@jorisvandenbossche ?

@jorisvandenbossche (Member)

Currently for the ParquetDataset, it also simply uses int32 for the indices.

Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, we create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?).
In the Python ParquetDataset, however, we create a DictionaryArray with all occurring values of that partition field (including those from the other fragments/pieces).

To illustrate with a small dataset with part=A and part=B directories:

In [1]: import pyarrow.dataset as ds

In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)   

In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part) 

In [10]: fragment = list(dataset.get_fragments())[0] 

In [11]: fragment.to_table(schema=dataset.schema) 
Out[11]: 
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int8, ordered=0>

# only A included
In [13]: fragment.to_table(schema=dataset.schema).column("part")  
Out[13]: 
<pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
[

  -- dictionary:
    [
      "A"
    ]
  -- indices:
    [
      0,
      0
    ]
]

In [15]: import pyarrow.parquet as pq  

In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  

In [19]: piece = dataset2.pieces[0] 

In [25]: piece.read(partitions=dataset2.partitions) 
Out[25]: 
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int32, ordered=0>

# both A and B included
In [26]: piece.read(partitions=dataset2.partitions).column("part")   
Out[26]: 
<pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
[

  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0
    ]
]

I think for this to be valuable (e.g. in the context of dask, or for pandas when reading only part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that during discovery in the Factory we could keep track of all unique values of a partition field?)

@bkietz (Member, Author) commented Jun 25, 2020

@jorisvandenbossche okay, I'll extend the key value Partitionings to maintain dictionaries of all unique values of a field.

@fsaintjacques (Contributor)

I'm also of the opinion that we should stick with int32_t. That's what Parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictType, etc. I suspect the short-term net effect of this is uncovering index type issues we have lying around.

@bkietz (Member, Author) commented Jun 29, 2020

Int32 indices are now used regardless of the dictionary size.

@jorisvandenbossche (Member)

@bkietz thanks for the update ensuring all uniques as dictionary values!

Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608

Further, a usability issue: this now creates partition expressions that use a dictionary type, which means that doing something like dataset.to_table(filter=ds.field("part") == "A") to filter on the partition field with a plain string expression doesn't work, limiting the usability of this option (and even with the new Python scalar machinery, it would not be easy to construct the correct expression):

In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  

In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)

In [11]: fragment = list(dataset.get_fragments())[0]   

In [12]: fragment.partition_expression  
Out[12]: 
<pyarrow.dataset.Expression (part == [
  "A",
  "B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>

In [13]: dataset.to_table(filter=ds.field("part") == "A") 
...
ArrowNotImplementedError: cast from string

It might also be an option to have the partition_expression use the dictionary's value type instead of the dictionary type?

@fsaintjacques (Contributor)

I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).

@jorisvandenbossche (Member)

> I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).

Opened https://issues.apache.org/jira/browse/ARROW-9345 to track this.
