ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning #7608
455f6dc to
9f0a90b
The approach here is to also determine field_names_ in HivePartitioningFactory after inspecting (for DirectoryPartitioningFactory, those field names are passed to the constructor), so that we can then trim the schema and have the dictionaries match the order of the schema.
However, thinking about it now: there might still be a problem if the user specifies the full dataset schema, in which case no inspection happens. So we might need to think of a better solution.
(I should also add some C++ tests)
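To make the idea above concrete, here is a minimal standalone sketch of it. It does not use the Arrow API: SketchHiveFactory, ParseHivePath, and the string-based "schema" and "dictionary" types are all hypothetical stand-ins. It only illustrates recording the field names during inspection and then emitting the dictionaries in the order of the trimmed schema.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins (not Arrow types): a schema is an ordered list of
// field names, and a dictionary is the list of distinct values for one key.
using FieldNames = std::vector<std::string>;
using Dictionary = std::vector<std::string>;

// Split a Hive-style path such as "year=2020/month=07" into key/value pairs.
// Segments without '=' are ignored.
std::vector<std::pair<std::string, std::string>> ParseHivePath(const std::string& path) {
  std::vector<std::pair<std::string, std::string>> out;
  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    const std::string segment = path.substr(start, end - start);
    const std::size_t eq = segment.find('=');
    if (eq != std::string::npos) {
      out.emplace_back(segment.substr(0, eq), segment.substr(eq + 1));
    }
    start = end + 1;
  }
  return out;
}

class SketchHiveFactory {
 public:
  // Record keys in first-seen order and collect the distinct values per key.
  void Inspect(const std::vector<std::string>& paths) {
    for (const auto& path : paths) {
      for (const auto& kv : ParseHivePath(path)) {
        auto it = dictionaries_.find(kv.first);
        if (it == dictionaries_.end()) {
          field_names_.push_back(kv.first);
          it = dictionaries_.emplace(kv.first, Dictionary{}).first;
        }
        Dictionary& dict = it->second;
        if (std::find(dict.begin(), dict.end(), kv.second) == dict.end()) {
          dict.push_back(kv.second);
        }
      }
    }
  }

  // Trim the schema to the inspected partition fields and return the
  // dictionaries in the order of that trimmed schema.
  std::vector<std::pair<std::string, Dictionary>> Finish(const FieldNames& schema) const {
    std::vector<std::pair<std::string, Dictionary>> out;
    for (const auto& name : schema) {
      if (std::find(field_names_.begin(), field_names_.end(), name) != field_names_.end()) {
        out.emplace_back(name, dictionaries_.at(name));
      }
    }
    return out;
  }

  const FieldNames& field_names() const { return field_names_; }

 private:
  FieldNames field_names_;
  std::map<std::string, Dictionary> dictionaries_;
};
```

With paths "year=2020/month=07" and "year=2021/month=07" and a schema listing value, month, year, this sketch returns the month dictionary first and the year dictionary second, following the schema rather than the order in which the keys were discovered.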
cpp/src/arrow/dataset/partition.cc
Outdated
I should probably guard here against the case where field_names_ has not yet been updated (if Finish is called without Inspect having been called), e.g. by checking for an empty vector?
Absolutely, the first line of this method should just call
auto field_names = FieldNames();
and replace occurrences of the private member.
There is no FieldNames() method on the PartitioningFactory. Only the impl has one, and that is not accessible here, which is why I added the field_names_ private member to store those names.
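One possible shape for the guard discussed in this thread, as a standalone sketch only. TrimToPartitionFields is a hypothetical helper, not an Arrow function, and the fallback for an empty field_names (return the schema untrimmed) is an assumption, since the thread does not settle what should happen when Finish runs without Inspect.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): keep only the schema fields that
// were recorded as partition keys, preserving the order of the schema.
std::vector<std::string> TrimToPartitionFields(const std::vector<std::string>& schema,
                                               const std::vector<std::string>& field_names) {
  // Assumed guard: an empty field_names means inspection never ran, so there
  // is nothing to trim against and the schema is returned unchanged.
  if (field_names.empty()) return schema;

  std::vector<std::string> trimmed;
  for (const auto& name : schema) {
    if (std::find(field_names.begin(), field_names.end(), name) != field_names.end()) {
      trimmed.push_back(name);
    }
  }
  return trimmed;
}
```

The empty-vector check keeps the no-inspection path from silently trimming every field away. Whether returning the full schema is the right fallback depends on how the rest of the factory treats non-partition fields.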
…coding for HivePartioning
9f0a90b to
81aecfa
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
wesm left a comment
+1