-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16302: [C++] Null values in partitioning field for FilenamePartitioning #12977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/arrow/dataset/discovery.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should make it the responsibility of the partitioning implementation to strip the filename, instead of hardcoding an exception. (After all, what if the user wants to define a custom filename-based partitioning scheme?) Or we could pass both the path and the filename separately to partitioning->Parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it's like passing the entire info.path() in the Parse() method and then various partitioning implementations will do it in the way they want? I think we can then strip the prefix/filename in the ParseKeys method of each of the partitioning modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we can still strip the prefix, but we can let the partitioning handle the rest.
And in that case it might make sense to split the remainder of the path and the filename for the partitioning implementation too and just pass both as arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We should still strip the prefix since presumably we don't want partitioning to depend on the prefix itself.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may pass both the filename and the path(other than the prefix) to the Parse() method, but the filename is only required for FilenamePartitioning I believe, and the path will be required by the Directory & Hive Partitioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. From the point of view of datasets however it doesn't matter that some implementations only need some of the info.
CC @westonpace for thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the latest change, I modified the StripPrefixAndFilename() method to return a PartitionPathFormat object which will contain both the directory and filename prefix and then passing that to the Parse() method which now expects both the directory and filename-prefix.
We can modify the Parse() method as well to accept an object of PartitionPathFormat that way it will be symmetrical to the Format() method. But then, we need to implement similar changes to PyArrow, and I believe then we have to define an object of PartitionPathFormat first to use the partitioning.parse() method in PyArrow.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the default parameter values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though yes, it would be better if we could pass const PartitionPathFormat& instead. FWIW I don't think we have to expose it to Python. Or we can just make a namedtuple on the Python side, it doesn't have to be a C++ class wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Removed the default parameter
- Modified the Parse method to use a
PartitionPathFormatobject as an argument. As for the PyArrow interface, modified theparse()method to accept just the two strings (directory & prefix) and then uses a cppclass object to form the PartitionPathFormat object which is then passed into the internalParse()method. Is this a good approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't we passing both components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, can we cover this with a test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we are using the PartitionPathFormat as the argument, the Parse() method will work accordingly as per the partitioning mode. Any particular test you want here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making sure that a filename partitioning works properly in this path, basically, since before it seems like it would have failed since the filename was being omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a round-trip test in PyArrow to check whether the partitions are read correctly. Do we need any tests other than that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't we still want the filename in the path in case the partitioning factory is a filename PartitioningFactory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Can we cover this with a test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still seems off, but I can't figure out how to hit this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do the same changes in the Inspect() method which currently accepts a path. Instead of passing a vector of strings, we can then pass a vector of PartitionPathFormat object, and then the Inspect methods of individual partitioning modes will use either the directory or the filename accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That probably makes the most sense, but we might want to split out that refactoring separately, and also make sure that we can hit this path in a unit test in the first place (I was trying this morning but couldn't) as I don't want to expand the scope of this PR too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Inspect() methods now accept a PartitionPathFormat object as an argument with the latest commit. I have modified the tests and SplitFilenameAndPrefix() methods to return forms of PartitionPathFormat object as required. The Inspect() methods then extracts the directory or prefix accordingly whichever is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes in the Inspect() method, I think the R build is failing, I am trying to investigate on fixing it, but not very sure about the R implementation.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not for this PR but I would expect FunctionPartitioning to be "as powerful as" any other partitioning. But I don't think it's used much anyways.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit but do both of these need a default? I can see prefix having a default because it's a new argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also IMO the parameter should be named "filename", and/or we should have a docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, when the parse() method only accepted a path, we used to pass the filename there, as FilenamePartitioning only needs the filename and not the directory. But, now that it is expecting both a directory and filename prefix, so I believe a directory will not be required for FilenamePartitioning, thus using an empty string as default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed prefix to filename in PartitionPathFormat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still seems off, but I can't figure out how to hit this case.
|
Thanks. @westonpace any comments here? |
|
@github-actions autotune |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at the R change and it boils down to how we want to expose this to R. However, I don't think we want to change the interface to PartitioningFactory. It makes sense that Partitioning might change. We could justify it because users might create custom subclasses and we want to make it as easy as possible.
However, I think PartitioningFactory::Inspect should continue to take in std::vector<std::string>
…ngFactory__Inspect
4517671 to
6133121
Compare
8a6cc9c to
2ae2ea3
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems correct. One last question I think for me.
|
Benchmark runs are scheduled for baseline = f3e09b9 and contender = adb5b00. adb5b00 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR fixes the issue of the partitioning field having null values while using FilenamePartitioning.
For FilenamePartitioning, we should only remove the prefix and thus should not use
StripPrefixAndFilename(), which will remove the filename too along with the prefix.