ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset #12811
Closed
Changes from all commits (19 commits)
23c62fd  Deprecate use_legacy_dataset=True in parquet.write_to_dataset and cor… (AlenkaF)
a75664e  Linter corrections (AlenkaF)
5a9f5c5  Rearrange the code a bit (AlenkaF)
359ddc0  Revert to using None and checking for partition_filename_cb keyward (AlenkaF)
eb891c9  Add info to partition_filename_cb depr msg and add it where missing (AlenkaF)
90e6b72  Add change to the write_to_dataset docstring (AlenkaF)
55069f0  Expose all write_dataset keywords to write_to_dataset (AlenkaF)
19c9440  Add keyword parameters to the docstrings (AlenkaF)
1417886  Remove format from write_to_dataset keywords (AlenkaF)
52b7f5a  Move the check and warning for partition_filename_cb (AlenkaF)
26ad057  Linter corrections (AlenkaF)
b0e4423  Add basename_tamplate to new implementation - mimic old implementation (AlenkaF)
0515a7d  Make corrections to the docstrings (AlenkaF)
4477b5f  Remove format from keywords (AlenkaF)
71c97dc  Remove file_options keyword (AlenkaF)
170ef52  Amend the tests to use use_legacy_dataset=use_legacy_dataset (AlenkaF)
aa33d71  Apply suggestions from code review (AlenkaF)
c2cf529  Update python/pyarrow/parquet/__init__.py (AlenkaF)
6c88c33  Correct _create_parquet_dataset_simple helper function and linter error (AlenkaF)
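For orientation, a minimal sketch of the usage these commits aim at, assuming the dataset-backed implementation becomes the default for parquet.write_to_dataset and that keywords such as basename_template are forwarded to pyarrow.dataset.write_dataset; the table contents and paths are made up for illustration, not taken from this PR:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# With the new default there is no need to pass use_legacy_dataset explicitly;
# the dataset-based writer partitions on the given columns.
pq.write_to_dataset(table, root_path="dataset_root", partition_cols=["year"])

# basename_template (a pyarrow.dataset.write_dataset option) controls output
# file names, replacing the legacy-only partition_filename_cb callback.
pq.write_to_dataset(
    table,
    root_path="dataset_root",
    partition_cols=["year"],
    basename_template="part-{i}.parquet",
)
```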
So this fails when using the new dataset implementation, because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword (to which chunk_size gets translated): ParquetFileWriteOptions doesn't support this keyword.

On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen in arrow/cpp/src/parquet/arrow/writer.h, lines 62 to 71 at 76d064c.

cc @westonpace do you remember if it has been discussed before how the row_group_size/chunk_size setting from Parquet fits into the dataset API?
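To make the mismatch concrete, a rough sketch (my own illustration, not code from this PR; whether make_write_options rejects row_group_size is an assumption based on the comment above):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10))})

# For a single file, row_group_size is consumed by the write_table call itself
# rather than by the writer properties.
pq.write_table(table, "single.parquet", row_group_size=5)

# For datasets, per-file Parquet options go through the format object, and
# ParquetFileWriteOptions has no row_group_size/chunk_size equivalent.
fmt = ds.ParquetFileFormat()
opts = fmt.make_write_options(compression="snappy")   # supported
# opts = fmt.make_write_options(row_group_size=5)     # expected to be rejected
ds.write_dataset(table, "dataset_out", format=fmt, file_options=opts)
```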
The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily relate directly to Parquet row groups? It's more generic about how many rows are written in one go, but effectively it is therefore also a maximum Parquet row group size? (since those need to be written in one go)
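A sketch of what that would look like, assuming a pyarrow version where pyarrow.dataset.write_dataset exposes max_rows_per_group/min_rows_per_group; the 64k figure and output path are arbitrary:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(1_000_000))})

# Rows are flushed in batches of at most 64k, so no Parquet row group in the
# written files can exceed that size; min_rows_per_group keeps small incoming
# batches from producing tiny row groups.
ds.write_dataset(
    table,
    "dataset_grouped",
    format="parquet",
    max_rows_per_group=64 * 1024,
    min_rows_per_group=64 * 1024,
)
```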
Maybe we can open a follow-up JIRA for this one?
Created https://issues.apache.org/jira/browse/ARROW-16240