ARROW-11260: [C++][Dataset] Don't require dictionaries when specifying explicit partition schema #9677
Conversation
bkietz
left a comment
Thanks for doing this! Some minor comments
bkietz
left a comment
LGTM
jorisvandenbossche
left a comment
Thanks! Only looked at the cython/python part, which looks good.
We still need to update the partitioning() function/docstring in dataset.py, I think.
Now that the discover method also takes a schema, it's not fully clear what the difference is between DirectoryPartitioning(schema) and DirectoryPartitioning.discover(schema). The first requires passing the dictionaries as well, I suppose (in the case of dictionary fields)? If so, that should be documented in the docstring, and we might want to update the partitioning() function to use the discover() method when a schema is provided but no dictionaries are.
…g explicit partition schema
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
I tried updating
What you currently pushed here breaks other tests? Maybe we could also check whether the schema has any dictionary type?
Yup, I realized that right after I pushed.
We could; we'd have to do that recursively, right, in case of a nested dictionary? (…though is that even handled?) It also doesn't help with the fact that we need a Partitioning, not a PartitioningFactory, when we want to write data, so the auto-detection might be a little too magical…
…ying explicit partition schema
I don't think we can parse nested types from the file paths? From a user point of view, having to specify
Hmm, yes, that complicates things. When writing, you don't need to specify the dictionaries, but you do still need the actual Partitioning and not the factory. So returning the factory when the schema has a dictionary type and no dictionaries are passed would then fail when writing… The current API, mixing reading/writing and the full object/the factory, makes it a bit complex…
We could also split out read_partitioning and write_partitioning functions, perhaps, or add a similar flag, and accept the API break.
I think the current PR (with the
+1, merging
The API here is a little different than before, but it allows you to supply a partition schema with dictionary fields without having to fill in the dictionaries themselves. I opted to do this as part of the factory to avoid letting partitionings be 'half constructed', and because the factory already has the necessary logic. As a bonus, this also lets you check the inferred types against the declared ones.
I also fixed a bug when using non-int32 dictionary indices: a cast was missing.