GH-37242: [Python][Parquet] Parquet Support write and validate Page CRC #38360
94ab787 to
a137062
Compare
General LGTM
It's a pity that we don't have tests that use the parquet-testing data ( https://github.com/apache/parquet-testing/tree/master/data ), so it's a bit hard to verify the case where the CRC is corrupt. Maybe that case can be added later.
Would you mind rebasing on master and retriggering the CI? I don't know why …
The AppVeyor failure is unrelated; there is already an issue open for it: #38431
I'm actually working on a very dumb unit test to verify CRC checks. However, my test passes instead of failing. I will soon push my draft unit test so that I can get some feedback on that too.
You can have a try at … When a Parquet CRC error is detected, the underlying C++ code will throw an exception (see https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L500 ).
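To make the expected behaviour concrete, here is a minimal standard-library sketch of what a page checksum check amounts to. It assumes the page CRC is the standard CRC-32 (the GZip variant) computed over the serialized page bytes; `PageChecksumError`, `page_crc`, `verify_page` and `flip_bit` are hypothetical names used only for illustration and are not part of Arrow or PyArrow.

```python
import zlib


class PageChecksumError(OSError):
    """Hypothetical error type; the Arrow C++ reader raises its own exception."""


def page_crc(page_bytes: bytes) -> int:
    """Standard CRC-32 (GZip variant) as an unsigned 32-bit integer."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bytes:
    """Return the page bytes if the checksum matches, otherwise raise."""
    actual = page_crc(page_bytes)
    if actual != stored_crc:
        raise PageChecksumError(
            f"CRC mismatch: stored {stored_crc:#010x}, computed {actual:#010x}"
        )
    return page_bytes


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Emulate corruption by flipping a single bit of one byte."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


page = b"example page payload"  # stand-in for a serialized Parquet page
crc = page_crc(page)            # what a writer would store in the page header

verify_page(page, crc)  # intact page: passes silently

try:
    verify_page(flip_bit(page, 3), crc)  # corrupted page: must raise
except PageChecksumError as exc:
    print(exc)
```

A test for the Python bindings would follow the same shape: write with checksums, corrupt some page bytes, and assert that a verified read raises instead of returning data.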
481420c to
69a9ba8
Compare
I have pushed my changes for the unit test. In particular, the test … I also tried reading the following corrupted files, as suggested: … But I got the same result: no exception is raised! Am I missing something?
7aac42f to
30e8311
Compare
Have you finished it now? Let me take a look.
I've opened a PR to test with debug logs. Let me dive into it tonight.
f4d6b72 to
ef94bfa
Compare
I've added some logs in #38501. The …
Added even more logs in this branch, and got a surprising result: I see 2 possible explanations:
Sorry for the late reply; I have been very busy these days. I'll try to find the cause this weekend. I suspect this is a problem in the C++ dataset code. Let me look into it.
ef94bfa to
b5ea1ec
Compare
AlenkaF
left a comment
Great, thank you for the changes! I think it is looking very good, I just have two small nits. Will try to have one last detailed look again today 👍
74341e3 to
fd30563
Compare
AlenkaF
left a comment
Thanks for the quick change! I have some suggestions about the tests, but we are close to ready 👍
2de1ab7 to
0a5c7e7
Compare
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
0a5c7e7 to
b25c124
Compare
Thank you so much for the quick iterations and for your patience with our suggestions 👍 I have re-run the integration job. The R failures do not seem to be related.
It seems that the "Integration / AMD64 Conda Integration Test (pull_request)" job was cancelled after 60 min due to a timeout. Not sure why it's taking so long; it took only 35 min on other PRs that were recently closed.
Some other PRs also have integration job times close to 60 min (57 min, for example), and some are also timing out (see https://github.com/apache/arrow/actions/workflows/integration.yml?query=is%3Afailure ). I do not think it is related, but I will run the job one more time to see if anything changes.
I've rerun the failed job. Would you mind moving forward if CI passes? @AlenkaF
@github-actions crossbow submit -g python
Revision: b25c124
Submitted crossbow builds: ursacomputing/crossbow @ actions-e7d4f3b41c
AlenkaF
left a comment
Thank you!
Thank you very much for all your guidance and reviews. I felt welcomed and supported at every step of the process. One question: which would be the first official release of the …
It would be in 15.0.0. We have just released 14.0.0, and the 14.x line only accepts severe bug fixes.
Thanks @frazar!
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 68ba49d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
…Page CRC (apache#38360)

### Rationale for this change

The C++ Parquet API already supports enabling CRC checksums for read and write operations. CRC checksums are optional and can detect data corruption due to, for example, file storage issues or [cosmic rays](https://en.wikipedia.org/wiki/Soft_error).

It would then be beneficial to expose this optional functionality to the Python API too.

This PR is based on a previous PR which became stale: apache#37439

### What changes are included in this PR?

The PyArrow interface is expanded to include a `page_checksum_enabled` flag.

### Are these changes tested?

[ ] NOT YET!

### Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes apache#37242
Supersedes apache#37439

* Closes: apache#37242

Lead-authored-by: Francesco Zardi <frazar0@hotmail.it>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Rationale for this change
The C++ Parquet API already supports enabling CRC checksums for read and write operations.
CRC checksums are optional and can detect data corruption due to, for example, file storage issues or cosmic rays.
It would then be beneficial to expose this optional functionality to the Python API too.
This PR is based on a previous PR which became stale: #37439
What changes are included in this PR?
The PyArrow interface is expanded to include a `page_checksum_enabled` flag.
Are these changes tested?
[ ] NOT YET!
Are there any user-facing changes?
The change is backward compatible. An additional, optional keyword argument is added to some interfaces.
Closes #37242
Supersedes #37439