@mapleFU (Member) commented Aug 29, 2023

Rationale for this change

The C++ Parquet API already supports CRC checksums for both reading and writing.

Storage systems such as S3 generally ensure that stored data stays intact, but local storage such as HDDs or SSDs can corrupt data, and the network can deliver bad results. Having CRC checks helps detect these cases.

It would therefore be useful to have CRC support in the Python code as well.

What changes are included in this PR?

Nothing changes in the C++ codebase. This PR is a wrapper for using CRC in PyArrow.

Are these changes tested?

  • Will do

Are there any user-facing changes?

Yes, users can write and verify CRCs for Parquet pages.

@github-actions

⚠️ GitHub issue #37242 has been automatically assigned in GitHub to PR creator.

@mapleFU (Member, Author) commented Oct 25, 2023

Closing, since there is a newer implementation.

@mapleFU mapleFU closed this Oct 25, 2023
AlenkaF added a commit that referenced this pull request Nov 20, 2023
…RC (#38360)

### Rationale for this change

The C++ Parquet API already supports enabling CRC checksums for read and write operations.

CRC checksums are optional and can detect data corruption due to, for example, file storage issues or [cosmic rays](https://en.wikipedia.org/wiki/Soft_error).

It would then be beneficial to expose this optional functionality to the Python API too.
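
To illustrate the kind of corruption a page checksum catches, here is a minimal sketch using only the Python standard library. It assumes, per my reading of the Parquet format definition, that the page CRC is the standard CRC32 (the same one GZip uses) computed over the page bytes as written to disk. The payload and the `page_is_intact` helper are hypothetical and are not part of PyArrow or of this PR.

```python
import zlib

# Hypothetical stand-in for the serialized bytes of one Parquet page.
page_bytes = b"example page payload" * 4

# The writer would store this value alongside the page.
stored_crc = zlib.crc32(page_bytes)


def page_is_intact(data: bytes, expected_crc: int) -> bool:
    """Recompute the CRC32 of the data and compare it with the stored value."""
    return zlib.crc32(data) == expected_crc


# Simulate a soft error by flipping a single bit in the payload.
corrupted = bytearray(page_bytes)
corrupted[5] ^= 0x01

print(page_is_intact(page_bytes, stored_crc))
print(page_is_intact(bytes(corrupted), stored_crc))
```

A reader with verification enabled performs the equivalent comparison per page and reports an error instead of returning silently corrupted data.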

This PR is based on a previous PR which became stale: #37439

### What changes are included in this PR?

The PyArrow interface is expanded to include a `page_checksum_enabled` flag.

### Are these changes tested?

[ ] NOT YET!

### Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes #37242
Supersedes #37439
* Closes: #37242

Lead-authored-by: Francesco Zardi <frazar0@hotmail.it>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…Page CRC (apache#38360)

Development

Successfully merging this pull request may close these issues.

[Python][Parquet] Parquet Support write and validate CRC
