ARROW-11260: [C++][Dataset] Don't require dictionaries when specifying explicit partition schema #9677
Conversation
bkietz
left a comment
Thanks for doing this! Some minor comments
bkietz
left a comment
LGTM
jorisvandenbossche
left a comment
Thanks! Only looked at the cython/python part, which looks good.
We still need to update the partitioning() function/docstring in dataset.py, I think.
Now that the discover method also takes a schema, it's not fully clear what the difference is between DirectoryPartitioning(schema) and DirectoryPartitioning.discover(schema). The first requires passing the dictionaries as well, I suppose (in the case of dictionary fields)? If so, that should be documented in the docstring, and we might want to update the partitioning() function to use the discover() method when a schema is provided but no dictionaries are.
…g explicit partition schema
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
I tried updating
What you currently pushed here breaks other tests? Maybe we could also check whether the schema has any dictionary type?
Yup, I realized that right after I pushed.
We could; we'd have to do that recursively, right, in case of a nested dictionary? (…though is that even handled?) It also doesn't help with the fact that we need a Partitioning, not a PartitioningFactory, when we want to write data, so the auto-detection might be a little too magical…
…ying explicit partition schema
I don't think we can parse nested types from the file paths? From a user point of view, having to specify
Hmm, yes, that complicates things. When writing, you don't need to specify the dictionaries, but you do still need the actual Partitioning and not the factory. So returning the factory when the schema has a dictionary type and no dictionaries are passed would then fail when writing… The current API, mixing reading/writing and the full object/the factory, makes it a bit complex…
We could also split out read_partitioning and write_partitioning functions, perhaps, or add a similar flag, and accept the API break.
I think the current PR (with the
+1, merging
The API here is a little different than before, but it allows you to supply a partition schema with dictionary fields without having to fill in the dictionaries themselves. I opted to do this as part of the factory to avoid letting partitionings be 'half constructed', and because the factory already has the necessary logic. As a bonus, this also lets you check the inferred types against the declared ones.
I also fixed a bug when using non-int32 dictionary indices: a cast was missing.