ARROW-14620: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior #11632
…ions of pandas. Using reset_index instead
Appveyor is failing the S3FS test, which is unrelated.
python/pyarrow/dataset.py
Outdated
    def file_visitor(written_file):
        visited_paths.append(written_file.path)
    existing_data_behavior : 'error' | 'overwrite' | 'delete_matching'
I know "overwrite_or_ignore" would get long, but on the other hand it is also more explicit. One could misinterpret "overwrite" as "overwrite the full dataset" instead of "overwrite matching files and ignore the others" (and for me, "overwrite the full dataset" would mean that it deletes all files, whether they would clash or not).
I'm fine with overwrite_or_ignore. I changed it. I also clarified the docstring for the option a little to be explicit that non-matching files are left alone.
Of course, this then also makes it inconsistent with R.
Let's stick with overwrite_or_ignore. Should we decide we need to change it at some point down the line, it would be a fairly minor change, even if we wanted to keep backwards compatibility with the old style. The R and Python dataset APIs are already pretty different.
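For readers following the naming discussion, the difference between the three values can be sketched with a toy, stdlib-only model. This is not pyarrow's implementation: `write_files` and `listing` are hypothetical helpers, file contents are plain strings, and the `delete_matching` semantics (clear each directory that receives new files) are an assumption made for illustration. Only the `overwrite_or_ignore` behavior (overwrite same-named files, leave the rest alone) is taken directly from the thread.

```python
from pathlib import Path

MODES = ("error", "overwrite_or_ignore", "delete_matching")


def write_files(base_dir, new_files, existing_data_behavior="error"):
    """Toy model of the three modes (hypothetical helper, not pyarrow code).

    new_files maps a relative path such as "part=a/part-0.parquet" to text.
    """
    if existing_data_behavior not in MODES:
        raise ValueError(f"unknown mode: {existing_data_behavior!r}")

    base = Path(base_dir)
    # 'error': refuse to write if the destination already holds anything.
    if existing_data_behavior == "error" and base.exists() and any(base.iterdir()):
        raise FileExistsError(f"{base} already contains data")

    cleared = set()
    for rel_path, text in new_files.items():
        target = base / rel_path
        # 'delete_matching' (assumed semantics): the first time a directory
        # receives a file, remove the files it already contains.
        if existing_data_behavior == "delete_matching" and target.parent not in cleared:
            if target.parent.exists():
                for old in target.parent.iterdir():
                    if old.is_file():
                        old.unlink()
            cleared.add(target.parent)
        target.parent.mkdir(parents=True, exist_ok=True)
        # 'overwrite_or_ignore': same-named files are replaced here; files
        # with other names are never touched.
        target.write_text(text)


def listing(base_dir):
    """Return {relative path: content} for every file under base_dir."""
    base = Path(base_dir)
    return {
        p.relative_to(base).as_posix(): p.read_text()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }
```

In this model, writing `part-0` with `overwrite_or_ignore` into a directory that already holds `part-0` and `part-1` replaces the first and keeps the second, which is the reading the longer name is meant to make obvious.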
python/pyarrow/_dataset.pyx
Outdated
    else:
        raise ValueError(
            ('existing_data_behavior must be one of error, ',
             'overwrite_or_ignore or delete_matching')
Nit: add quotes around each possible value?
Good idea, added.
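As an illustration of the nit above, a minimal sketch of the validation with each allowed value quoted in the message. `check_existing_data_behavior` and `VALID_BEHAVIORS` are hypothetical names, not the code that was merged. Note that the message is built from adjacent string literals with no comma between them, so Python concatenates them into a single string.

```python
VALID_BEHAVIORS = ("error", "overwrite_or_ignore", "delete_matching")


def check_existing_data_behavior(value):
    """Return value unchanged if it is allowed, otherwise raise ValueError."""
    if value not in VALID_BEHAVIORS:
        raise ValueError(
            # Adjacent literals are joined at compile time; a comma between
            # them would pass a tuple to ValueError instead of one string.
            "existing_data_behavior must be one of 'error', "
            "'overwrite_or_ignore' or 'delete_matching'; "
            f"got {value!r}"
        )
    return value
```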
      filesystem=None, file_options=None, use_threads=True,
    - max_partitions=None, file_visitor=None):
    + max_partitions=None, file_visitor=None,
    + existing_data_behavior='error'):
Unrelated to this PR, but I would expect most of the parameters to be declared keyword-only. @jorisvandenbossche Thoughts?
Yes, that would indeed be good. But maybe let's leave that for another (non 6.0.1) PR? (could already mark the new keyword here)
Yes, definitely make it a separate JIRA.
Thanks!
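For context on the keyword-only suggestion in this thread, a small sketch of what such a declaration looks like in Python. `write_dataset_sketch` is a hypothetical stand-in with a shortened parameter list, not the real `write_dataset` signature. Parameters after the bare `*` can only be passed by name.

```python
def write_dataset_sketch(data, base_dir, *, file_visitor=None,
                         existing_data_behavior="error"):
    """Hypothetical stand-in: everything after '*' is keyword-only."""
    return {
        "base_dir": base_dir,
        "file_visitor": file_visitor,
        "existing_data_behavior": existing_data_behavior,
    }


# Passing the option by name works.
options = write_dataset_sketch(
    [], "out", existing_data_behavior="overwrite_or_ignore"
)

# Passing it positionally is rejected with a TypeError.
try:
    write_dataset_sketch([], "out", None, "overwrite_or_ignore")
except TypeError as exc:
    positional_error = exc
```

The benefit is that new options can later be added or reordered without silently changing the meaning of existing positional calls.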
jorisvandenbossche
left a comment
Looks good! (just a few small docstring formatting comments)
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Thank you very much for the improvements. I'll merge on green.
Travis seems backed up, and it has passed before (and the changes were all comments), so I'm going to merge.
Benchmark runs are scheduled for baseline = 939db7f and contender = caf1e1e. caf1e1e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…es it impossible to maintain old behavior Closes #11632 from westonpace/bugfix/ARROW-14620--existing-data-behavior-missing-in-python Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Hi everyone, shouldn't the
@chribag good catch. Let's continue on the JIRA you opened: https://issues.apache.org/jira/browse/ARROW-15757