ARROW-14620: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior #11632

westonpace · 2021-11-05T23:41:13Z

No description provided.

github-actions · 2021-11-05T23:41:34Z

https://issues.apache.org/jira/browse/ARROW-14620


westonpace · 2021-11-06T02:26:27Z

AppVeyor is failing an S3FS test, which is unrelated to this change.

jorisvandenbossche · 2021-11-06T11:54:59Z

python/pyarrow/dataset.py


            def file_visitor(written_file):
                visited_paths.append(written_file.path)
+    existing_data_behavior : 'error' | 'overwrite' | 'delete_matching'


I know "overwrite_or_ignore" would get long, but on the other hand it is also more explicit. One could misinterpret "overwrite" as "overwrite the full dataset" instead of "overwrite matching files and ignore the others" (for me, "overwrite the full dataset" would mean it deletes all files, whether they would clash or not).

I'm fine with overwrite_or_ignore. I changed it. I also clarified the docstring for the option a little to be explicit that non-matching files are left alone.

Of course this then also makes it inconsistent with R ..

Let's stick with overwrite_or_ignore. Should we decide we need to change it at some point down the line, it would be a fairly minor change, even if we wanted to keep backwards compatibility with the old style. The R & Python dataset APIs are already pretty different.
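To make the distinction discussed in this thread concrete, here is a toy model of the three modes working on plain files. It is only a sketch and not how pyarrow implements them: `toy_write` and `listing` are hypothetical helpers, and the delete_matching branch assumes "matching" means "every directory that is about to receive a new file".

```python
import tempfile
from pathlib import Path

VALID = ("error", "overwrite_or_ignore", "delete_matching")


def toy_write(base_dir, new_files, existing_data_behavior="error"):
    """Toy model only. `new_files` maps a relative path to its bytes."""
    if existing_data_behavior not in VALID:
        raise ValueError(
            f"unknown existing_data_behavior: {existing_data_behavior!r}")
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)

    if existing_data_behavior == "error" and any(base.iterdir()):
        # Refuse to touch a destination that already holds anything.
        raise FileExistsError(f"{base} already contains data")

    if existing_data_behavior == "delete_matching":
        # Assumption: empty only the directories about to receive new files.
        for directory in {(base / rel).parent for rel in new_files}:
            if directory.is_dir():
                for old in directory.iterdir():
                    if old.is_file():
                        old.unlink()

    # Same-named files are replaced; all other existing files are left alone.
    for rel, payload in new_files.items():
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)


def listing(base_dir):
    """Return the relative paths of all files under base_dir, sorted."""
    base = Path(base_dir)
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())
```

In this model, overwrite_or_ignore replaces a/part-0 but keeps an unrelated a/old, whereas delete_matching clears directory a before writing and leaves directory b untouched. Neither mode deletes the whole dataset, which is the misreading the rename is meant to avoid.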

pitrou · 2021-11-08T14:55:18Z

python/pyarrow/_dataset.pyx

+    else:
+        raise ValueError(
+            ('existing_data_behavior must be one of error, ',
+             'overwrite_or_ignore or delete_matching')


Nit: add quotes around each possible value?

Good idea, added.
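For reference, a minimal sketch of what a quoted error message could look like. The helper name `check_existing_data_behavior` and the exact wording are assumptions, not the code that was merged. Note also that in plain Python, a comma between two string literals inside parentheses builds a tuple rather than one concatenated message, so the sketch builds a single string instead.

```python
# Hypothetical helper; the accepted values are the ones named in this PR.
VALID_BEHAVIORS = ("error", "overwrite_or_ignore", "delete_matching")


def check_existing_data_behavior(value):
    """Return value unchanged if it is accepted, else raise ValueError."""
    if value not in VALID_BEHAVIORS:
        # repr() puts quotes around each option, as suggested in the review.
        options = ", ".join(repr(v) for v in VALID_BEHAVIORS)
        raise ValueError(
            f"existing_data_behavior must be one of {options}; got {value!r}")
    return value
```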

pitrou · 2021-11-08T14:57:10Z

python/pyarrow/dataset.py

                  filesystem=None, file_options=None, use_threads=True,
-                  max_partitions=None, file_visitor=None):
+                  max_partitions=None, file_visitor=None,
+                  existing_data_behavior='error'):


Unrelated to this PR, but I would expect most of the parameters to be declared keyword-only. @jorisvandenbossche Thoughts?

Yes, that would indeed be good. But maybe let's leave that for another (non 6.0.1) PR? (could already mark the new keyword here)

Yes, definitely make it a separate JIRA.

ARROW-14632
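As an illustration of the keyword-only idea raised here, a bare `*` in the signature forces everything after it to be passed by name. `write_dataset_sketch` is a hypothetical stand-in with a shortened parameter list, not the real pyarrow signature.

```python
def write_dataset_sketch(data, base_dir, *, file_visitor=None,
                         existing_data_behavior="error"):
    """Hypothetical signature: parameters after * must be passed by name."""
    return {
        "base_dir": base_dir,
        "file_visitor": file_visitor,
        "existing_data_behavior": existing_data_behavior,
    }
```

Calling `write_dataset_sketch(data, "out", None, "delete_matching")` raises TypeError, so new options can later be added or reordered without silently changing the meaning of positional calls.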

jorisvandenbossche

Looks good! (just a few small docstring formatting comments)

python/pyarrow/dataset.py

python/pyarrow/tests/test_dataset.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

westonpace · 2021-11-08T19:52:14Z

Thank you very much for the improvements. I'll merge on green.

westonpace · 2021-11-08T22:35:32Z

Travis seems backed up and it has passed before (and the changes were all comments) so I'm going to merge.

ursabot · 2021-11-08T22:37:08Z

Benchmark runs are scheduled for baseline = 939db7f and contender = caf1e1e. caf1e1e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.03% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

…es it impossible to maintain old behavior
Closes #11632 from westonpace/bugfix/ARROW-14620--existing-data-behavior-missing-in-python
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>

chribag · 2022-02-22T19:59:49Z

Hi everyone, shouldn't the existing_data_behavior bindings be propagated higher up, into the parquet.py module?
Passing **kwargs through, as is already done for write_table, would do the trick, I think.
I am stuck using pandas.to_parquet with use_legacy_dataset=false, with no way to set the existing_data_behavior flag to overwrite_or_ignore.

jorisvandenbossche · 2022-02-24T15:12:58Z

@chribag good catch. Let's continue on the JIRA you opened: https://issues.apache.org/jira/browse/ARROW-15757
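The forwarding pattern suggested above can be sketched as follows. Both functions are hypothetical stand-ins (they are not the pandas or pyarrow.parquet functions), and the sketch only shows how a wrapper can pass an option through without having to know about it.

```python
def write_to_dataset_sketch(table, root_path, existing_data_behavior="error"):
    """Hypothetical low-level writer that understands the option."""
    return {"root_path": root_path,
            "existing_data_behavior": existing_data_behavior}


def to_parquet_sketch(table, root_path, **kwargs):
    """Hypothetical high-level wrapper: unknown options are forwarded as-is."""
    return write_to_dataset_sketch(table, root_path, **kwargs)
```

With this shape, `to_parquet_sketch(table, "out", existing_data_behavior="overwrite_or_ignore")` reaches the writer unchanged, and a misspelled option still fails loudly with a TypeError raised by the inner function.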

ARROW-13703: Added existing data behavior bindings to python

7c0494e

westonpace mentioned this pull request Nov 5, 2021

ARROW-14620: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior #11631

Closed

github-actions bot added the Component: Python label Nov 5, 2021

westonpace added 2 commits November 5, 2021 14:46

ARROW-14620: Missed a line in the original commit

419a7e6

ARROW-14620: sort_values(ignore_index) is not supported on older versions of pandas. Using reset_index instead

4398368

jorisvandenbossche reviewed Nov 6, 2021


ARROW-14620: Renamed overwrite to overwrite_or_ignore and clarified docs

a75eadb

pitrou reviewed Nov 8, 2021


ARROW-14620: Improving error message per PR suggestion

5fc42df

westonpace requested review from jorisvandenbossche and pitrou November 8, 2021 18:40

jorisvandenbossche approved these changes Nov 8, 2021


Apply suggestions from code review

9f7780b

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

westonpace closed this in caf1e1e Nov 8, 2021

westonpace deleted the bugfix/ARROW-14620--existing-data-behavior-missing-in-python branch January 6, 2022 08:14

This was referenced Feb 24, 2022

[Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior #30165

Closed

[Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior #31202

Closed
