Skip to content

Conversation

@sanjibansg
Copy link
Contributor

This PR fixes the issue of the partitioning field having null values while using FilenamePartitioning.
For FilenamePartitioning, we should only remove the prefix and thus should not use StripPrefixAndFilename(), which will remove the filename too along with the prefix.

@github-actions
Copy link

Comment on lines 283 to 287
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should make it the responsibility of the partitioning implementation to strip the filename, instead of hardcoding an exception. (After all, what if the user wants to define a custom filename-based partitioning scheme?) Or we could pass both the path and the filename separately to partitioning->Parse.

Copy link
Contributor Author

@sanjibansg sanjibansg Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it's like passing the entire info.path() in the Parse() method and then various partitioning implementations will do it in the way they want? I think we can then strip the prefix/filename in the ParseKeys method of each of the partitioning modes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can still strip the prefix, but we can let the partitioning handle the rest.

And in that case it might make sense to split the remainder of the path and the filename for the partitioning implementation too and just pass both as arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We should still strip the prefix since presumably we don't want partitioning to depend on the prefix itself.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may pass both the filename and the path(other than the prefix) to the Parse() method, but the filename is only required for FilenamePartitioning I believe, and the path will be required by the Directory & Hive Partitioning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. From the point of view of datasets however it doesn't matter that some implementations only need some of the info.

CC @westonpace for thoughts

Copy link
Contributor Author

@sanjibansg sanjibansg Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest change, I modified the StripPrefixAndFilename() method to return a PartitionPathFormat object which will contain both the directory and filename prefix and then passing that to the Parse() method which now expects both the directory and filename-prefix.

We can modify the Parse() method as well to accept an object of PartitionPathFormat that way it will be symmetrical to the Format() method. But then, we need to implement similar changes to PyArrow, and I believe then we have to define an object of PartitionPathFormat first to use the partitioning.parse() method in PyArrow.

@sanjibansg sanjibansg requested a review from lidavidm April 26, 2022 20:16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the default parameter values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though yes, it would be better if we could pass const PartitionPathFormat& instead. FWIW I don't think we have to expose it to Python. Or we can just make a namedtuple on the Python side, it doesn't have to be a C++ class wrapper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Removed the default parameter
  • Modified the Parse method to use a PartitionPathFormat object as an argument. As for the PyArrow interface, modified the parse() method to accept just the two strings (directory & prefix) and then uses a cppclass object to form the PartitionPathFormat object which is then passed into the internal Parse() method. Is this a good approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we passing both components?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, can we cover this with a test?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are using the PartitionPathFormat as the argument, the Parse() method will work accordingly as per the partitioning mode. Any particular test you want here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure that a filename partitioning works properly in this path, basically, since before it seems like it would have failed since the filename was being omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a round-trip test in PyArrow to check whether the partitions are read correctly. Do we need any tests other than that?

Comment on lines 877 to 878
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't we still want the filename in the path in case the partitioning factory is a filename PartitioningFactory?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Can we cover this with a test?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems off, but I can't figure out how to hit this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do the same changes in the Inspect() method which currently accepts a path. Instead of passing a vector of strings, we can then pass a vector of PartitionPathFormat object, and then the Inspect methods of individual partitioning modes will use either the directory or the filename accordingly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes the most sense, but we might want to split out that refactoring separately, and also make sure that we can hit this path in a unit test in the first place (I was trying this morning but couldn't) as I don't want to expand the scope of this PR too much.

Copy link
Contributor Author

@sanjibansg sanjibansg Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Inspect() methods now accept a PartitionPathFormat object as an argument with the latest commit. I have modified the tests and SplitFilenameAndPrefix() methods to return forms of PartitionPathFormat object as required. The Inspect() methods then extracts the directory or prefix accordingly whichever is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes in the Inspect() method, I think the R build is failing, I am trying to investigate on fixing it, but not very sure about the R implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not for this PR but I would expect FunctionPartitioning to be "as powerful as" any other partitioning. But I don't think it's used much anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but do both of these need a default? I can see prefix having a default because it's a new argument

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also IMO the parameter should be named "filename", and/or we should have a docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, when the parse() method only accepted a path, we used to pass the filename there, as FilenamePartitioning only needs the filename and not the directory. But, now that it is expecting both a directory and filename prefix, so I believe a directory will not be required for FilenamePartitioning, thus using an empty string as default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed prefix to filename in PartitionPathFormat

Comment on lines 877 to 878
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems off, but I can't figure out how to hit this case.

@lidavidm
Copy link
Member

Thanks.

@westonpace any comments here?

@westonpace westonpace self-requested a review April 29, 2022 20:34
@westonpace
Copy link
Member

@github-actions autotune

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the R change and it boils down to how we want to expose this to R. However, I don't think we want to change the interface to PartitioningFactory. It makes sense that Partitioning might change. We could justify it because users might create custom subclasses and we want to make it as easy as possible.

However, I think PartitioningFactory::Inspect should continue to take in std::vector<std::string>

@sanjibansg sanjibansg requested a review from westonpace May 5, 2022 14:10
@sanjibansg sanjibansg force-pushed the fix-FilenamePartitioning branch from 4517671 to 6133121 Compare May 21, 2022 14:24
@sanjibansg sanjibansg force-pushed the fix-FilenamePartitioning branch from 8a6cc9c to 2ae2ea3 Compare May 23, 2022 23:33
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct. One last question I think for me.

@sanjibansg sanjibansg requested a review from westonpace May 26, 2022 08:37
@sanjibansg sanjibansg deleted the fix-FilenamePartitioning branch May 27, 2022 01:35
@ursabot
Copy link

ursabot commented May 27, 2022

Benchmark runs are scheduled for baseline = f3e09b9 and contender = adb5b00. adb5b00 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.08% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.08% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] adb5b000 ec2-t3-xlarge-us-east-2
[Failed] adb5b000 test-mac-arm
[Finished] adb5b000 ursa-i9-9960x
[Finished] adb5b000 ursa-thinkcentre-m75q
[Finished] f3e09b9b ec2-t3-xlarge-us-east-2
[Finished] f3e09b9b test-mac-arm
[Finished] f3e09b9b ursa-i9-9960x
[Finished] f3e09b9b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants