ARROW-16420: [Python] pq.write_to_dataset always ignores partitioning #13062

AlenkaF · 2022-05-04T09:44:54Z

Remove the lines that unconditionally set partitioning and file_visitor in pq.write_to_dataset to None. This is a leftover from #12811 where additional pq.write_dataset keywords were exposed.
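As a rough illustration of the kind of bug described above (not the actual pyarrow source), the sketch below uses a stand-in writer, `fake_write_dataset`, and two hypothetical wrappers. All three names are invented for this example. The buggy wrapper resets both keywords to `None` before forwarding them, so whatever the caller passed is lost. The fixed wrapper passes them through.

```python
def fake_write_dataset(table, root_path, partitioning=None, file_visitor=None):
    """Stand-in for the underlying writer: report which options arrived."""
    return {"partitioning": partitioning, "file_visitor": file_visitor}


def write_to_dataset_buggy(table, root_path, partitioning=None,
                           file_visitor=None):
    # Leftover lines: the caller-supplied values are discarded here.
    partitioning = None
    file_visitor = None
    return fake_write_dataset(table, root_path,
                              partitioning=partitioning,
                              file_visitor=file_visitor)


def write_to_dataset_fixed(table, root_path, partitioning=None,
                           file_visitor=None):
    # The keywords are forwarded unchanged.
    return fake_write_dataset(table, root_path,
                              partitioning=partitioning,
                              file_visitor=file_visitor)


def visitor(written_file):
    """Placeholder callback; a real visitor would record the written file."""


buggy = write_to_dataset_buggy("table", "/tmp/out",
                               partitioning=["year"], file_visitor=visitor)
fixed = write_to_dataset_fixed("table", "/tmp/out",
                               partitioning=["year"], file_visitor=visitor)
```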

github-actions · 2022-05-04T09:47:00Z

https://issues.apache.org/jira/browse/ARROW-16420

github-actions · 2022-05-04T09:47:02Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm · 2022-05-04T12:07:18Z

Thanks!

Is it possible to add tests for these?

AlenkaF · 2022-05-05T07:02:41Z

I added a test that checks for partitioning and file_visitor being correctly passed in pq.write_to_dataset.

While writing the test I ran into another error. If basename_template is passed to pq.write_to_dataset (i.e. it is not None), the code skipped the check for existing_data_behavior, so the call to ds.write_dataset errored because existing_data_behavior was None rather than a string. I decided to fix that here as well, since it is also a leftover of mine, this time from #12838. I can do a separate PR if anyone prefers that.
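The failure mode described above can be sketched in plain Python. This is a hypothetical reconstruction, not the pyarrow source: the function names are invented, and the default values `"part-{i}.parquet"` and `"overwrite_or_ignore"` are assumptions made for the example. The point is only that the `existing_data_behavior` default has to be applied independently of whether `basename_template` was given.

```python
def resolve_defaults_buggy(basename_template=None,
                           existing_data_behavior=None):
    if basename_template is None:
        basename_template = "part-{i}.parquet"  # assumed default
        # Bug: the default is applied only on this branch, so passing
        # basename_template leaves existing_data_behavior as None.
        if existing_data_behavior is None:
            existing_data_behavior = "overwrite_or_ignore"  # assumed default
    return basename_template, existing_data_behavior


def resolve_defaults_fixed(basename_template=None,
                           existing_data_behavior=None):
    if basename_template is None:
        basename_template = "part-{i}.parquet"  # assumed default
    # Fix: this check no longer depends on basename_template.
    if existing_data_behavior is None:
        existing_data_behavior = "overwrite_or_ignore"  # assumed default
    return basename_template, existing_data_behavior


buggy = resolve_defaults_buggy(basename_template="chunk-{i}.parquet")
fixed = resolve_defaults_fixed(basename_template="chunk-{i}.parquet")
```

With the buggy variant the second element of the result is still `None`, which is the value that the downstream writer rejected.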

jorisvandenbossche

Thanks, looks perfect!

I think it is fine to include the other changes here as well, as they are very similar

lidavidm

LGTM.

I wonder for some of these 'conflicting' options, should we raise an error? For instance if the user passes both 'partitioning' and 'partition_cols', or 'metadata_collector' and 'file_visitor'.
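A minimal sketch of what such a check could look like, assuming the two pairs named above are treated as mutually exclusive. The helper name `check_conflicting_options` and the error messages are hypothetical. What the follow-up actually implements may differ.

```python
def check_conflicting_options(partitioning=None, partition_cols=None,
                              metadata_collector=None, file_visitor=None):
    """Raise ValueError if two options that overlap are both given."""
    if partitioning is not None and partition_cols is not None:
        raise ValueError(
            "Pass either 'partitioning' or 'partition_cols', not both."
        )
    if metadata_collector is not None and file_visitor is not None:
        raise ValueError(
            "Pass either 'metadata_collector' or 'file_visitor', not both."
        )


# A single option from each pair is accepted silently.
check_conflicting_options(partition_cols=["year"], metadata_collector=[])

# Passing both options of a pair raises.
try:
    check_conflicting_options(partitioning="hive", partition_cols=["year"])
    conflict_message = None
except ValueError as exc:
    conflict_message = str(exc)
```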

AlenkaF · 2022-05-10T09:34:03Z

Yes, that makes sense. Will do.

jorisvandenbossche · 2022-05-18T16:01:37Z

@AlenkaF do you want to do that here, or in a follow-up PR? (either way is fine)

AlenkaF · 2022-05-18T16:20:02Z

Sorry, am a bit distracted by other issues.
Let's do a follow-up so this PR can get closed. Will create a JIRA for it today.

AlenkaF · 2022-05-18T18:51:11Z

Created a JIRA for the follow-up:
https://issues.apache.org/jira/browse/ARROW-16610

ursabot · 2022-05-19T23:12:14Z

Benchmark runs are scheduled for baseline = 1cdedc4 and contender = 0a0d7fe. 0a0d7fe is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.2% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0a0d7fea ec2-t3-xlarge-us-east-2
[Failed] 0a0d7fea test-mac-arm
[Failed] 0a0d7fea ursa-i9-9960x
[Finished] 0a0d7fea ursa-thinkcentre-m75q
[Finished] 1cdedc4c ec2-t3-xlarge-us-east-2
[Failed] 1cdedc4c test-mac-arm
[Failed] 1cdedc4c ursa-i9-9960x
[Finished] 1cdedc4c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Fix write_to_dataset to not ignore partitioning and file_visitor keywords

3df165e

github-actions bot added the Component: Python label May 4, 2022

AlenkaF added 2 commits May 4, 2022 15:13

Add a test for partitioning keyword

0f11634

Rearrange the test to check file_visitor keyword also PLUS correct an …

67a1bd8

…error for existing_data_behavior check in the write_to_dataset

jorisvandenbossche approved these changes May 5, 2022

lidavidm approved these changes May 9, 2022

jorisvandenbossche closed this in 0a0d7fe May 19, 2022

AlenkaF deleted the ARROW-16420 branch June 6, 2022 08:47
