Conversation

@AlenkaF AlenkaF commented Dec 6, 2023

Rationale for this change

Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR is removing the legacy implementation from the code.

What changes are included in this PR?

The PR is removing:

  • ParquetDatasetPiece
  • ParquetManifest
  • _ParquetDatasetMetadata
  • ParquetDataset

The PR is renaming _ParquetDatasetV2 to ParquetDataset, taking over the name of the removed class. It is also updating the docstrings.

The PR is updating:

  • read_table
  • write_to_dataset

The PR is updating all the tests so they no longer use the use_legacy_dataset keyword or the legacy parametrisation.

Are these changes tested?

Yes.

Are there any user-facing changes?

Deprecated code is removed.


@jorisvandenbossche jorisvandenbossche left a comment


(already posting whatever I have right now)

The PartitionSet and ParquetPartitions classes can also be removed?

There are a few helper methods, like _get_filesystem_and_path and _mkdir_if_not_exists, that are no longer used and can be removed as well.


AlenkaF commented Dec 11, 2023

It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit.

@jorisvandenbossche after the last review I have updated the marks in the tests and added use_legacy_dataset=None to the ParquetDataset class, read_table and write_to_dataset.

I have also removed **kwargs from ParquetDataset (previously _ParquetDatasetV2), which meant I had to remove the code connected to metadata, split_row_groups and validate_schema (raising an error instead), so I added a note in the docstrings in b6799cf. Not sure if that is well done, though.
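The keyword-removal pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual pyarrow implementation: the sentinel, the messages, and the plain-string return value are all placeholders.

```python
import warnings

_UNSET = object()  # sentinel so "not passed" can be told apart from "passed None"

def read_table(source, use_legacy_dataset=_UNSET, **kwargs):
    # The deprecated keyword stays in the signature so old call sites
    # get a clear message instead of an opaque TypeError.
    if use_legacy_dataset is not _UNSET:
        warnings.warn(
            "Passing 'use_legacy_dataset' is deprecated; the legacy "
            "implementation has been removed and this keyword is ignored.",
            FutureWarning, stacklevel=2)
    if kwargs:
        # Keywords tied to the removed legacy code now raise explicitly.
        raise TypeError(
            f"unexpected parquet read options: {sorted(kwargs)}")
    return f"read {source}"  # placeholder for the real read
```

With this shape, `read_table("f.parquet", use_legacy_dataset=True)` still works but emits a FutureWarning, while removed options such as `metadata=...` fail with an explicit TypeError naming the offending keyword.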


AlenkaF commented Dec 21, 2023

Yeah, missed some metadata there. Thanks, will correct!


AlenkaF commented Dec 21, 2023

@jorisvandenbossche if I am understanding correctly, read_parquet from the failed build is part of a legacy Filesystem and will be removed? So for the time being I can either add a ValueError if metadata is specified, or just remove it from the signature?


AlenkaF commented Dec 21, 2023

@github-actions crossbow submit -g python-*-hdfs

@github-actions

Invalid group(s) {'python-*-hdfs'}. Must be one of {'integration', 'example-cpp', 'verify-rc-source', 'fuzz', 'nightly', 'python', 'nightly-release', 'verify-rc-binaries', 'cpp', 'c-glib', 'conan', 'java', 'linux', 'go', 'example', 'verify-rc-wheels', 'r', 'packaging', 'test', 'nightly-tests', 'example-python', 'vcpkg', 'verify-rc-jars', 'linux-amd64', 'conda', 'homebrew', 'ruby', 'linux-arm64', 'verify-rc', 'verify-rc-source-macos', 'wheel', 'verify-rc-source-linux', 'nightly-packaging'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/7289911234


AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs

@jorisvandenbossche

I am not sure if raising for the metadata will be sufficient? My understanding is that those tests (for the legacy HDFS filesystem) are using hdfs.read_parquet -> pq.read_table by passing a legacy HDFS filesystem object, and with removing the legacy parquet code, we no longer support the legacy filesystems (and now as a next step, we can actually also remove those legacy filesystems).
So maybe we can just "xfail" those tests for now? And then when removing the HDFS legacy filesystem, we can see if we just remove those tests, or if we rewrite them using the new hdfs filesystem.
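Such an xfail mark might look roughly like the sketch below. The test name, the reason text, and the `raises=` type are assumptions about the actual test suite, not its real contents.

```python
import pytest

# Shared mark for tests that pass a legacy HDFS filesystem object,
# which the dataset-based parquet reader no longer accepts.
legacy_fs_xfail = pytest.mark.xfail(
    reason="legacy filesystems are not supported by the new parquet code",
    raises=TypeError,  # assumed failure mode; match the real error here
)

@legacy_fs_xfail
def test_read_parquet_with_legacy_hdfs():
    # Body elided; it would call hdfs.read_parquet with a legacy filesystem.
    ...
```

Using a single shared mark keeps the reason string in one place, so when the legacy HDFS filesystem is later removed, all the affected tests can be found and rewritten (or deleted) together.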

@github-actions

Revision: 77b4ecb

Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions


AlenkaF commented Dec 21, 2023

Still need to look into the failures with the "Unrecognized filesystem" error.


AlenkaF commented Dec 21, 2023

Ah, I guess the issue is the same; it only fails in _ensure_filesystem and not in read_parquet. Will add xfail marks there also.


AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs

@github-actions

Revision: 481a85c

Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

@jorisvandenbossche jorisvandenbossche merged commit b70ad0b into apache:main Dec 21, 2023
@jorisvandenbossche

Thanks a lot @AlenkaF for the work here!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche

The wide-dataframe case seems a genuine perf regression (and not a flaky outlier as the other listed cases). That might mean that for wide dataframes, the new code path is slower compared to the legacy dataset reader (since with this commit, also when specifying use_legacy_dataset=True, the new code path will be used).
That seems to match with the timing in the use_legacy_dataset=False case of the wide-dataframe benchmark, as now both benchmarks more or less show the same timing.

However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.DataFrame(np.random.rand(100, 10000))
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, "test_wide_dataframe.parquet")
In [7]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=True)
392 ms ± 4.67 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [8]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=False)
350 ms ± 11.5 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…on-based implementation (apache#39112)

* Closes: apache#31303
raulcd added a commit that referenced this pull request May 26, 2025
…able in case dataset is not available (#46550)

### Rationale for this change

Since we merged:
- #39112

The fallback case where dataset is not available was not being tested on parquet.read_table (I did validate this myself with a breakpoint).

### What changes are included in this PR?

Adds a basic test that exercises that code path and tests the read_table method from `ParquetFile` instead of `ParquetDataset` in case dataset is not available.

### Are these changes tested?

Yes via CI.

### Are there any user-facing changes?

No

* GitHub Issue: #46373

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

Successfully merging this pull request may close these issues.

[Python] Remove the legacy ParquetDataset custom python-based implementation
