-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112
Conversation
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(already posting whatever I have right now)
The PartitionSet and ParquetPartitions classes can also be removed?
There a few helper methods like _get_filesystem_and_path and _mkdir_if_not_exists that are no longer used and can be removed as well
|
It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit. @jorisvandenbossche after the last review I have updated the marks in the tests, added I have also removed |
|
The HDFS failures seem to be related: https://github.com/ursacomputing/crossbow/actions/runs/7288206604/job/19860362598#step:6:9473 |
|
Yeah, missed some metadata there. Thanks, will correct! |
|
@jorisvandenbossche if I am understanding correctly |
|
@github-actions crossbow submit -g python-*-hdfs |
|
|
@github-actions crossbow submit hdfs |
|
I am not sure if raising for the |
|
Revision: 77b4ecb Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57
|
|
Still need to look into the failures with |
|
Ah, I guess the issue is the same it only fails in |
|
@github-actions crossbow submit hdfs |
|
Revision: 481a85c Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813
|
|
Thanks a lot @AlenkaF for the work here! |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
The However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists): |
…on-based implementation (apache#39112) ### Rationale for this change Legacy ParquetDataset has been deprecated for a while now, see apache#31529. This PR is removing the legacy implementation from the code. ### What changes are included in this PR? The PR is removing: - `ParquetDatasetPiece ` - `ParquetManifest` - `_ParquetDatasetMetadata ` - `ParquetDataset` The PR is renaming `_ParquetDatasetV2` to `ParquetDataset` which was removed. It is also updating the docstrings. The PR is updating: - `read_table` - `write_to_dataset` The PR is updating all the tests to not use `use_legacy_dataset` keyword or legacy parametrisation. ### Are these changes tested? Yes. ### Are there any user-facing changes? Deprecated code is removed. * Closes: apache#31303
…able in case dataset is not available (#46550) ### Rationale for this change Since we merged: - #39112 The fallback case were dataset is not available is not being tested on parquet.read_table (I did validate myself with a breakpoint). ### What changes are included in this PR? Adds a basic test that checks the exercises that code path and tests the read_table method from `ParquetFile` instead of `ParquetDataset` in case dataset is not available. ### Are these changes tested? Yes via CI. ### Are there any user-facing changes? No * GitHub Issue: #46373 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
Rationale for this change
Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR is removing the legacy implementation from the code.
What changes are included in this PR?
The PR is removing:
ParquetDatasetPieceParquetManifest_ParquetDatasetMetadataParquetDatasetThe PR is renaming
_ParquetDatasetV2toParquetDatasetwhich was removed. It is also updating the docstrings.The PR is updating:
read_tablewrite_to_datasetThe PR is updating all the tests to not use
use_legacy_datasetkeyword or legacy parametrisation.Are these changes tested?
Yes.
Are there any user-facing changes?
Deprecated code is removed.