Skip to content

Conversation

@westonpace
Copy link
Member

No description provided.

@github-actions
Copy link

github-actions bot commented Nov 5, 2021

@westonpace
Copy link
Member Author

Appveyor is failing S3FS test which is unrelated.

def file_visitor(written_file):
visited_paths.append(written_file.path)
existing_data_behavior : 'error' | 'overwrite' | 'delete_matching'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know "overwrite_or_ignore" would get long, but on the other hand it is also more explicit .. One could misinterpret "overwrite" as "overwrite the full dataset" instead of "overwrite matching files and ignore the others" (one for me "overwrite the full dataset" would mean it deletes all files, whether they would clash or not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with overwrite_or_ignore. I changed it. I also clarified the docstring for the option a little to be explicit that non-matching files are left alone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this then also makes it inconsistent with R ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with overwrite_or_ignore. Should we decide we need to change at some point down the line it would be a fairly minor change even if we wanted to keep backwards compatibility with the old style. The R & python dataset APIs are already pretty different.

else:
raise ValueError(
('existing_data_behavior must be one of error, ',
'overwrite_or_ignore or delete_matching')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add quotes around each possible value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, added.

filesystem=None, file_options=None, use_threads=True,
max_partitions=None, file_visitor=None):
max_partitions=None, file_visitor=None,
existing_data_behavior='error'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but I would expect most of the parameters to be declared keyword-only. @jorisvandenbossche Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would indeed be good. But maybe let's leave that for another (non 6.0.1) PR? (could already mark the new keyword here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely make it a separate JIRA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! (just a few small docstring formatting comments)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@westonpace
Copy link
Member Author

Thank you very much for the improvements. I'll merge on green.

@westonpace
Copy link
Member Author

Travis seems backed up and it has passed before (and the changes were all comments) so I'm going to merge.

@westonpace westonpace closed this in caf1e1e Nov 8, 2021
@ursabot
Copy link

ursabot commented Nov 8, 2021

Benchmark runs are scheduled for baseline = 939db7f and contender = caf1e1e. caf1e1e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.03% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

kou pushed a commit that referenced this pull request Nov 10, 2021
…es it impossible to maintain old behavior

Closes #11632 from westonpace/bugfix/ARROW-14620--existing-data-behavior-missing-in-python

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace deleted the bugfix/ARROW-14620--existing-data-behavior-missing-in-python branch January 6, 2022 08:14
@chribag
Copy link

chribag commented Feb 22, 2022

Hi everyone, shouldn't the existing_data_behavior bindings be propagated higher up here in the parquet.py module?
Passing **kwargs as is the case for write_table would do the trick I think.
I am finding myself stuck while using pandas.to_parquet with use_legacy_dataset=false and no way to set the existing_data_behavior flag to overwrite_or_ignore

@jorisvandenbossche
Copy link
Member

@chribag good catch. Let's continue on the JIRA you opened: https://issues.apache.org/jira/browse/ARROW-15757

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants