ARROW-16302: [C++] Null values in partitioning field for FilenamePartitioning #12977

sanjibansg · 2022-04-24T23:05:58Z

This PR fixes null values appearing in the partitioning field when FilenamePartitioning is used.
For FilenamePartitioning we should remove only the prefix, so we should not call StripPrefixAndFilename(), which removes the filename along with the prefix.
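
To illustrate why stripping the filename yields nulls, here is a minimal pure-Python sketch. It is not Arrow's implementation: `strip_prefix`, `strip_prefix_and_filename`, and `parse_filename_keys` are hypothetical stand-ins, and the `_`-separated filename layout (for example `2009_11_part-0.parquet`) is assumed for the example.

```python
import posixpath


def strip_prefix(path, prefix):
    """Hypothetical helper: drop the base directory, keep the rest of the path."""
    if prefix and path.startswith(prefix):
        path = path[len(prefix):]
    return path.lstrip("/")


def strip_prefix_and_filename(path, prefix):
    """Hypothetical helper: drop the base directory and the file name."""
    return posixpath.dirname(strip_prefix(path, prefix))


def parse_filename_keys(path, field_names):
    """Read one '_'-separated segment of the file name per field.

    Fields with no matching segment are reported as None (null).
    """
    basename = posixpath.basename(path)
    segments = basename.split("_")[:len(field_names)] if basename else []
    values = segments + [None] * (len(field_names) - len(segments))
    return dict(zip(field_names, values))


full_path = "/data/2009_11_part-0.parquet"
fields = ["year", "month"]

# Filename already removed: nothing is left to parse, so every field is null.
nulls = parse_filename_keys(strip_prefix_and_filename(full_path, "/data"), fields)

# Only the prefix removed: the filename segments are still available.
parsed = parse_filename_keys(strip_prefix(full_path, "/data"), fields)
```

In this sketch `nulls` holds `None` for both fields, which mirrors the reported symptom, while `parsed` recovers the two segments as strings.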

github-actions · 2022-04-24T23:06:20Z

https://issues.apache.org/jira/browse/ARROW-16302

lidavidm · 2022-04-25T13:18:30Z

cpp/src/arrow/dataset/discovery.cc

I wonder if we should make it the responsibility of the partitioning implementation to strip the filename, instead of hardcoding an exception. (After all, what if the user wants to define a custom filename-based partitioning scheme?) Or we could pass both the path and the filename separately to partitioning->Parse.

So the idea is to pass the entire info.path() to Parse() and let each partitioning implementation handle it the way it needs? In that case we can strip the prefix/filename in the ParseKeys method of each partitioning mode.

Yeah, I think we can still strip the prefix, but we can let the partitioning handle the rest.

And in that case it might make sense to split the remainder of the path and the filename for the partitioning implementation too and just pass both as arguments.

(We should still strip the prefix since presumably we don't want partitioning to depend on the prefix itself.)

We could pass both the filename and the path (without the prefix) to Parse(), but I believe the filename is only needed by FilenamePartitioning, while the path is needed by Directory and Hive partitioning.

Sure. From the point of view of datasets however it doesn't matter that some implementations only need some of the info.

CC @westonpace for thoughts

With the latest change, StripPrefixAndFilename() returns a PartitionPathFormat object containing both the directory and the filename prefix, and these are passed to Parse(), which now expects both the directory and the filename prefix.

We could also change Parse() to accept a PartitionPathFormat object, which would make it symmetrical with Format(). But then we would need similar changes in PyArrow, and I believe users would have to construct a PartitionPathFormat object first before calling partitioning.parse() in PyArrow.
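
The directory/filename split discussed here can be sketched in plain Python. This is only an illustration under assumptions: the `PartitionPathFormat` below is a stand-in named tuple, not the C++ struct, its second member is called `filename` (following the rename made later in this PR), and `split_path`, `parse_directory`, and `parse_filename` are hypothetical helpers.

```python
import posixpath
from typing import NamedTuple


class PartitionPathFormat(NamedTuple):
    """Stand-in for the C++ struct: the two parts a partitioning may need."""
    directory: str
    filename: str


def split_path(relative_path):
    """Split a prefix-free path into its directory and file name."""
    directory, filename = posixpath.split(relative_path)
    return PartitionPathFormat(directory, filename)


def parse_directory(path, field_names):
    """Directory-style parsing: one directory level per field."""
    segments = [s for s in path.directory.split("/") if s]
    return dict(zip(field_names, segments))


def parse_filename(path, field_names):
    """Filename-style parsing: one '_'-separated segment per field."""
    segments = path.filename.split("_")
    return dict(zip(field_names, segments))


fields = ["year", "month"]
by_directory = parse_directory(split_path("2009/11/part-0.parquet"), fields)
by_filename = parse_filename(split_path("2009_11_part-0.parquet"), fields)
```

Both parsers receive the same object and each reads only the part it cares about, which is the point made above that the dataset layer does not need to know which part an implementation uses.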

cpp/src/arrow/dataset/partition.h

lidavidm · 2022-04-26T21:04:45Z

cpp/src/arrow/dataset/partition.h

Why do we need the default parameter values?

Though yes, it would be better if we could pass const PartitionPathFormat& instead. FWIW I don't think we have to expose it to Python. Or we can just make a namedtuple on the Python side, it doesn't have to be a C++ class wrapper.

Removed the default parameter

Modified the Parse method to take a PartitionPathFormat object as its argument. In the PyArrow interface, parse() accepts just the two strings (directory and prefix) and uses a cppclass object to build the PartitionPathFormat that is passed to the internal Parse() method. Is this a good approach?

cpp/src/arrow/dataset/partition.h

cpp/src/arrow/dataset/partition.cc

cpp/src/arrow/dataset/discovery.cc

lidavidm · 2022-04-27T12:34:28Z

cpp/src/arrow/dataset/file_parquet.cc

Why aren't we passing both components?

(Also, can we cover this with a test?)

Now that PartitionPathFormat is the argument, Parse() will behave according to the partitioning mode. Is there a particular test you want here?

Making sure that a filename partitioning works properly in this path, basically, since before it seems like it would have failed since the filename was being omitted.

I have added a round-trip test in PyArrow to check whether the partitions are read correctly. Do we need any tests other than that?

lidavidm · 2022-04-27T12:37:12Z

cpp/src/arrow/dataset/file_parquet.cc

Hmm, don't we still want the filename in the path in case the partitioning factory is a filename PartitioningFactory?

(Can we cover this with a test?)

This still seems off, but I can't figure out how to hit this case.

I think we can do the same changes in the Inspect() method which currently accepts a path. Instead of passing a vector of strings, we can then pass a vector of PartitionPathFormat object, and then the Inspect methods of individual partitioning modes will use either the directory or the filename accordingly?

That probably makes the most sense, but we might want to split out that refactoring separately, and also make sure that we can hit this path in a unit test in the first place (I was trying this morning but couldn't) as I don't want to expand the scope of this PR too much.

With the latest commit, the Inspect() methods accept PartitionPathFormat objects as their argument. I have modified the tests and the SplitFilenameAndPrefix() methods to return PartitionPathFormat objects as required. Each Inspect() method then extracts the directory or the prefix, whichever it needs.
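
A rough sketch of what inspecting a list of such objects could look like, in plain Python. Everything here is an assumption made for illustration: `PartitionPathFormat` is a stand-in named tuple, `inspect_fields` is a hypothetical helper, and the "all digits means int32, otherwise string" rule is a simplification rather than Arrow's actual inference logic.

```python
from typing import NamedTuple


class PartitionPathFormat(NamedTuple):
    """Stand-in for the C++ struct used in this PR."""
    directory: str
    filename: str


def inspect_fields(paths, field_names, use_filename):
    """Guess a type name per field from the directory or the file name."""
    columns = {name: [] for name in field_names}
    for path in paths:
        if use_filename:
            # Filename-style: '_'-separated segments of the file name.
            segments = path.filename.split("_")
        else:
            # Directory-style: one directory level per field.
            segments = [s for s in path.directory.split("/") if s]
        for name, value in zip(field_names, segments):
            columns[name].append(value)
    # Simplified rule: a field is "int32" only if every observed value is numeric.
    return {
        name: "int32" if values and all(v.isdigit() for v in values) else "string"
        for name, values in columns.items()
    }


filename_paths = [
    PartitionPathFormat("", "2009_us_part-0.parquet"),
    PartitionPathFormat("", "2010_fr_part-1.parquet"),
]
directory_paths = [PartitionPathFormat("2009/us", "part-0.parquet")]

from_filenames = inspect_fields(filename_paths, ["year", "country"], use_filename=True)
from_directories = inspect_fields(directory_paths, ["year", "country"], use_filename=False)
```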

With the changes to Inspect(), the R build is failing. I am investigating a fix, but I am not very familiar with the R implementation.

lidavidm · 2022-04-29T16:48:27Z

cpp/src/arrow/dataset/partition.h

Hmm, not for this PR but I would expect FunctionPartitioning to be "as powerful as" any other partitioning. But I don't think it's used much anyways.

lidavidm · 2022-04-29T16:49:12Z

python/pyarrow/_dataset.pyx

nit but do both of these need a default? I can see prefix having a default because it's a new argument

Also IMO the parameter should be named "filename", and/or we should have a docstring

Previously, when parse() accepted only a path, we passed the filename there, since FilenamePartitioning needs only the filename and not the directory. Now that it expects both a directory and a filename prefix, I believe the directory is not required for FilenamePartitioning, so it defaults to an empty string.

Renamed prefix to filename in PartitionPathFormat

lidavidm · 2022-04-29T17:23:02Z

cpp/src/arrow/dataset/file_parquet.cc

This still seems off, but I can't figure out how to hit this case.

lidavidm · 2022-04-29T18:46:07Z

Thanks.

@westonpace any comments here?

westonpace · 2022-05-03T07:16:12Z

@github-actions autotune

westonpace

I was looking at the R change and it boils down to how we want to expose this to R. However, I don't think we want to change the interface to PartitioningFactory. It makes sense that Partitioning might change. We could justify it because users might create custom subclasses and we want to make it as easy as possible.

However, I think PartitioningFactory::Inspect should continue to take in std::vector<std::string>

westonpace

This seems correct. One last question I think for me.

cpp/src/arrow/dataset/partition.cc

ursabot · 2022-05-27T03:51:14Z

Benchmark runs are scheduled for baseline = f3e09b9 and contender = adb5b00. adb5b00 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.08% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.08% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] adb5b000 ec2-t3-xlarge-us-east-2
[Failed] adb5b000 test-mac-arm
[Finished] adb5b000 ursa-i9-9960x
[Finished] adb5b000 ursa-thinkcentre-m75q
[Finished] f3e09b9b ec2-t3-xlarge-us-east-2
[Finished] f3e09b9b test-mac-arm
[Finished] f3e09b9b ursa-i9-9960x
[Finished] f3e09b9b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added Component: C++ Component: Python labels Apr 24, 2022

lidavidm reviewed Apr 25, 2022

sanjibansg requested a review from lidavidm April 26, 2022 20:16

lidavidm reviewed Apr 27, 2022

lidavidm approved these changes Apr 29, 2022

westonpace self-requested a review April 29, 2022 20:34

github-actions bot added the Component: R label Apr 29, 2022

westonpace reviewed May 5, 2022

sanjibansg requested a review from westonpace May 5, 2022 14:10

sanjibansg added 17 commits May 21, 2022 19:33

fix: use StripPrefix() for FilenamePartitioning

2c3cfb0

test: roundtrip test for partitioning

ba752aa

feat: modify Parse() to accept both directory and filename prefix

d666ce9

feat: ARROW_EXPORT PartitionPathFormat

8da2d5e

fix: remove shadowed variable fixed_path

ab1e387

feat: use PartitionPathFormat as argument to Parse methods

52ec5ee

test: Parse tests to use PartitionPathArgument object

ecaffb7

feat: modify Parse() method in PyArrow to use PartitionPathFormat

9d277cd

docs: update docstring of StripPrefixAndFilename

6d8c4f2

refactor: rename prefix to filename in PartitionPathFormat

6876f21

feat: Inspect() methods to use PartitionPathFormat

ec93579

fix: use PartitionPathFormat in dataset___PartitioningFactory__Inspec…

83ea5fd

…t for R

fix: using PartitionPathFormat in arrowExports for R

5b5cf95

fix: creating PartitionPathFormat object only in dataset___Partitioni…

c526065

…ngFactory__Inspect

fix: Remove R implementation, instead put check in InspectSchemas

c7d1136

fix: GetOrInferSchema() to have PartitionPathFormat in argument

267ac94

refactor: remove including partition.h in r/src/arrow_types.h

084fd96

fix: cpp lint

6133121

sanjibansg force-pushed the fix-FilenamePartitioning branch from 4517671 to 6133121 Compare May 21, 2022 14:24

sanjibansg added 5 commits May 23, 2022 12:52

feat: Parse() method to take the complete path

14b2711

feat: [Python] Parse method to take the entire path

a524acc

fix: python test changes for new parse method signature

e08b2e5

fix: cpp & python lint

1a61b0c

fix: modify tests to have correct parse strings

2ae2ea3

sanjibansg force-pushed the fix-FilenamePartitioning branch from 8a6cc9c to 2ae2ea3 Compare May 23, 2022 23:33

westonpace reviewed May 25, 2022

cpp/src/arrow/dataset/partition.cc

fix: processing segments in ParseKeys method of FunctionPartitioning

718c924

sanjibansg requested a review from westonpace May 26, 2022 08:37

westonpace approved these changes May 27, 2022

westonpace closed this in adb5b00 May 27, 2022

sanjibansg deleted the fix-FilenamePartitioning branch May 27, 2022 01:35

jorisvandenbossche mentioned this pull request May 31, 2022

MINOR: [Python] Fix Partitioning parse() docstring example #13269

Merged

asfimport mentioned this pull request Jul 28, 2022

[C++] Null values in partitioning field for FilenamePartitioning #31689

Closed
