Skip to content

Conversation

@frazar
Copy link
Contributor

@frazar frazar commented Oct 19, 2023

Rationale for this change

The C++ Parquet API already supports enabling CRC checksum for read and write operations.

CRC checksum are optional and can detect data corruption due to, for example, file storage issues or cosmic rays.

It would then be beneficial to expose this optional functionality to the Python API too.

This PR is based on a previous PR which became stale: #37439

What changes are included in this PR?

The PyArrow interface is expanded to include a page_checksum_enabled flag.

Are these changes tested?

[ ] NOT YET!

Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes #37242
Supersedes #37439

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch 6 times, most recently from 94ab787 to a137062 Compare October 24, 2023 20:00
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM

It's a pity that we don't have testing for parquet-testing ( https://github.com/apache/parquet-testing/tree/master/data ). So that it's a bit hard to verify the case the crc is corrupt. Maybe the case can be added later.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 25, 2023
@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

Would you mind rebase master and retrigger the CI? I don't know why continuous-integration/appveyor/pr failed :-(

@frazar

@AlenkaF
Copy link
Member

AlenkaF commented Oct 25, 2023

AppVeyor failure is not connected, there is already an issue opened: #38431

@frazar
Copy link
Contributor Author

frazar commented Oct 25, 2023

It's a pity that we don't have testing for parquet-testing ( https://github.com/apache/parquet-testing/tree/master/data ). So that it's a bit hard to verify the case the crc is corrupt.

I'm actually working on a very dumb unit test to verify CRC checks. However, my test passes instead of failing.. Will soon push my draft unit test so that I can get some feedback on that too.

@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

You can have a try at datapage_v1-corrupt-checksum.parquet and rle-dict-uncompressed-corrupt-checksum.parquet in parquet-testing repo. These would help.

When detect a Parquet CRC error, the underlying C++ will throw an Exception ( see https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L500 ).

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch 3 times, most recently from 481420c to 69a9ba8 Compare October 25, 2023 20:43
@frazar
Copy link
Contributor Author

frazar commented Oct 25, 2023

I have pushed my changes for the unit test. In particular, the test test_page_checksum_verification() of "python/pyarrow/tests/parquet/test_basic.py" does not pass because reading a corrupted file does not raise an exception!

I also tried reading the following corrupted files, as suggested:

But I had the same results, no exception is raised!

Am I missing something?

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch 2 times, most recently from 7aac42f to 30e8311 Compare October 26, 2023 20:13
@mapleFU
Copy link
Member

mapleFU commented Oct 28, 2023

But I had the same results, no exception is raised!

Have you finish it now? Let me take a look

@mapleFU
Copy link
Member

mapleFU commented Oct 28, 2023

I've open an pr to test with debug log. Let me dive into it tonight

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch from f4d6b72 to ef94bfa Compare October 28, 2023 17:36
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2023

I've add some logs in #38501

The ParquetFragmentScanOptions is built with crc, but Parquet reader didn't. Would @jorisvandenbossche mind help and take a look?

@frazar
Copy link
Contributor Author

frazar commented Oct 30, 2023

Added even more logs in this branch, and got a surprising result:

dataset: read_options_args: {'coerce_int96_timestamp_unit': None} scan_args {'pre_buffer': True, 'thrift_string_size_limit': None, 'thrift_container_size_limit': None, 'page_checksum_verification': True}
read_options: None
default_fragment_scan_options: None
build scanOptions with  {'pre_buffer': True, 'thrift_string_size_limit': None, 'thrift_container_size_limit': None, 'page_checksum_verification': True}
set_page_checksum_verification() called with: check_crc=true     <--- Here the C++ setter is called with argument true
page_checksum_verification_ is now: true
page_checksum_verification() called, returning: false            <--- Here the C++ getter is called, but returns false!
Open with crc: false
current page type is: 2, isset crc is: true
page_checksum_verification() called, returning: false
properties_.page_checksum_verification(): false
current_page_header_.__isset.crc: true
PageCanUseChecksum(page_type): true
page_checksum_verification() called, returning: false
current page type is: 0, isset crc is: true
page_checksum_verification() called, returning: false
properties_.page_checksum_verification(): false
current_page_header_.__isset.crc: true
PageCanUseChecksum(page_type): true
page_checksum_verification() called, returning: false

I see 2 possible explanations:

  • Either something is zero-ing page_checksum_verification_ without using the setter,
  • Or we are looking at methods call for two different instances

@mapleFU
Copy link
Member

mapleFU commented Nov 2, 2023

set_page_checksum_verification() called with: check_crc=true <--- Here the C++ setter is called with argument true
page_checksum_verification_ is now: true

Sorry for late reply because I'm so busy in these days. I'll try to find the reason out this weekend. I guess this is C++'s dataset's problem. Let me find it out.

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch from ef94bfa to b5ea1ec Compare November 2, 2023 17:12
@frazar frazar requested a review from westonpace as a code owner November 2, 2023 17:12
Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you for the changes! I think it is looking very good, I just have two small nits. Will try to have one last detailed look again today 👍

@frazar frazar force-pushed the parquet/python-support-crc-NEW branch from 74341e3 to fd30563 Compare November 14, 2023 00:09
Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick change! I have some suggestions about the tests, but we are close to ready 👍

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Nov 14, 2023
@frazar frazar force-pushed the parquet/python-support-crc-NEW branch from 2de1ab7 to 0a5c7e7 Compare November 14, 2023 19:12
frazar and others added 2 commits November 14, 2023 20:20
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
@frazar frazar force-pushed the parquet/python-support-crc-NEW branch from 0a5c7e7 to b25c124 Compare November 14, 2023 19:20
@AlenkaF
Copy link
Member

AlenkaF commented Nov 15, 2023

Thank you so much for quick iterations and patience with our suggestions 👍

I have re-run the integration job. The R failures do not seem to be related.

@frazar
Copy link
Contributor Author

frazar commented Nov 15, 2023

It seems like the "Integration / AMD64 Conda Integration Test (pull_request)" jobs is cancelled after 60m due to a timeout. Not sure why it's taking so long, it took only 35m on other PRs that were recently closed.

@AlenkaF
Copy link
Member

AlenkaF commented Nov 16, 2023

There are some PRs that also have the integration job time close to 60min (57min for example) and some PRs are also timing out https://github.com/apache/arrow/actions/workflows/integration.yml?query=is%3Afailure. I do not think it is related but will run the job one more time just to see if it changes.

@mapleFU
Copy link
Member

mapleFU commented Nov 19, 2023

I've rerun the failed job. Would you mind move forward if ci passed? @AlenkaF

@AlenkaF
Copy link
Member

AlenkaF commented Nov 20, 2023

@github-actions crossbow submit -g python

@github-actions
Copy link

Revision: b25c124

Submitted crossbow builds: ursacomputing/crossbow @ actions-e7d4f3b41c

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.5.0 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.12 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.5.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-38-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@AlenkaF
Copy link
Member

AlenkaF commented Nov 20, 2023

@frazar would you mind adding "take" to the issue #37242 before I merge?

@frazar
Copy link
Contributor Author

frazar commented Nov 20, 2023

Thank you very much for all your guidance and reviews. I felt welcomed and supported in all steps of the process.

One question: which could be the first official release of the pyarrow package including this feature? Not asking for an ETA, just a broad estimate. Thanks!!

@mapleFU
Copy link
Member

mapleFU commented Nov 20, 2023

One question: which could be the first official release of the pyarrow package including this feature? Not asking for an ETA, just a broad estimate. Thanks!!

It would be in 15.0.0 . Currently we're just released 14.0.0, 14 only accept severe bugfixes.

@AlenkaF AlenkaF merged commit 68ba49d into apache:main Nov 20, 2023
@AlenkaF AlenkaF removed the awaiting change review Awaiting change review label Nov 20, 2023
@mapleFU
Copy link
Member

mapleFU commented Nov 20, 2023

Thanks @frazar !

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 68ba49d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…Page CRC (apache#38360)

### Rationale for this change

The C++ Parquet API already supports enabling CRC checksum for read and write operations.

CRC checksum are optional and can detect data corruption due to, for example, file storage issues or [cosmic rays](https://en.wikipedia.org/wiki/Soft_error).

It would then be beneficial to expose this optional functionality to the Python API too.

This PR is based on a previous PR which became stale: apache#37439

### What changes are included in this PR?

The PyArrow interface is expanded to include a `page_checksum_enabled` flag.

### Are these changes tested?

[ ] NOT YET!

### Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes apache#37242
Supersedes apache#37439
* Closes: apache#37242

Lead-authored-by: Francesco Zardi <frazar0@hotmail.it>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python][Parquet] Parquet Support write and validate CRC

4 participants