GH-45750: [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360
Conversation
mapleFU
left a comment
Is CDC a part of the Parquet spec? Or is it a PoC?
It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.
Thanks for doing this @kszucs! I like how this doesn't need any changes to readers. Questions:
@github-actions crossbow submit test-conda-cpp-valgrind
Revision: 1cc2e4b Submitted crossbow builds: ursacomputing/crossbow @ actions-6a43d39b56
@github-actions crossbow submit preview-docs
pitrou
left a comment
Congratulations @kszucs :)
Revision: 1cc2e4b Submitted crossbow builds: ursacomputing/crossbow @ actions-81dc2a98bf
wgtmac
left a comment
+1 (for the C++ part)
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit dd94c90. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
…edColumnWriterImpl` (#45688)

### Rationale for this change
I am planning to introduce `WriteBatchInternal` and `WriteBatchSpacedInternal` private methods in apache/arrow#45360, which would have required specifying `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` as friend functions. Then I noticed that these functions could be consolidated into the column writer, making the implementation simpler.

### What changes are included in this PR?
- Move `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` to be methods on `TypedColumnWriterImpl`.
- Remove the column writer argument and reorder their parameters to align with the public `WriteArrow` method.
- Use more explicit type parameter names.

### Are these changes tested?
Existing tests should cover these.

### Are there any user-facing changes?
No, these are private functions and methods.

Resolves apache/arrow#45690
* GitHub Issue: #45690

Authored-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storages. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited for uncompressed, row-major formats. Thanks to Parquet's unique features, however, I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column.
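To illustrate the general idea, here is a minimal sketch of a gear-hash-based content-defined chunker over a byte stream. This is not the PR's actual implementation (which operates on Arrow column values, not raw bytes); the `GEAR` table seeding, mask, and size bounds below are illustrative assumptions.

```python
# Illustrative sketch only: a simplified gear-hash content-defined chunker.
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte
MASK = (1 << 16) - 1                 # boundary probability ~1/2^16 -> ~64 KiB average chunks
MIN_SIZE, MAX_SIZE = 16_384, 262_144 # hard lower/upper bounds on chunk length

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen from the content itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF  # rolling gear hash
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        # Cut when the hash matches the mask pattern, or the chunk grows too large.
        if (h & MASK) == 0 or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on local content, inserting or deleting rows shifts the data without invalidating most downstream chunk boundaries, which is what keeps the produced data pages stable across file revisions.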
Deduplication efficiency
The feature enables efficient data deduplication for compressed Parquet files on content-addressable storage (CAS) systems such as the Hugging Face Hub. A purpose-built evaluation tool is available at https://github.com/kszucs/de; it was used during development to continuously check the improvements and to visually inspect the results. Please take a look at the repository's readme to see how different changes made to Parquet files affect the deduplication ratio when they are stored in CAS systems.
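For intuition on how a deduplication ratio is measured, here is a toy sketch of the accounting a simple CAS would do. This is not the evaluation tool above; the `dedup_ratio` helper and its signature are hypothetical.

```python
# Hypothetical illustration: deduplication accounting in a simple CAS.
# Each revision is represented as its list of chunks (e.g. the output of a
# content-defined chunker); identical chunks are stored only once.
import hashlib

def dedup_ratio(chunked_revisions: list[list[bytes]]) -> float:
    """Logical bytes across all revisions divided by unique bytes actually stored."""
    stored: dict[str, int] = {}  # content hash -> chunk length
    logical = 0
    for chunks in chunked_revisions:
        for chunk in chunks:
            stored[hashlib.sha256(chunk).hexdigest()] = len(chunk)
            logical += len(chunk)
    return logical / max(sum(stored.values()), 1)
```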
Some results calculated on all revisions of datasets.parquet
Some results calculated on all revisions of food.parquet
Chunk size shows the actual storage required to store the CDC-chunked Parquet files in a simple CAS implementation.

What changes are included in this PR?
A new column chunker implementation based on a CDC algorithm; see the docstrings for more details. The implementation is added to the C++ Parquet writer and exposed in PyArrow as well.
Are these changes tested?
Yes. Tests have been added to the C++ implementation as well as the exposed PyArrow API.
Are there any user-facing changes?
There are two new Parquet writer properties on the C++ side:

- `enable_content_defined_chunking()` to enable the feature
- `content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor)` to provide additional options

There is a new `pq.write_table(..., use_content_defined_chunking=)` keyword argument to expose the feature on the Python side. I marked all user-facing changes as EXPERIMENTAL.
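A minimal usage sketch of the Python-side keyword described above (EXPERIMENTAL). The boolean form comes from this PR's description; passing a dict of options mirroring the C++ `min_chunk_size`/`max_chunk_size`/`norm_factor` parameters is an assumption about the accepted spelling, not a documented guarantee.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(100_000)),
                  "value": [float(i) for i in range(100_000)]})

# Enable content-defined chunking with the default settings.
pq.write_table(table, "cdc_default.parquet", use_content_defined_chunking=True)

# Assumed option spelling: a dict mirroring the C++ writer properties
# (min_chunk_size, max_chunk_size, norm_factor); verify against the PyArrow docs.
# pq.write_table(table, "cdc_tuned.parquet",
#                use_content_defined_chunking={"min_chunk_size": 256 * 1024,
#                                              "max_chunk_size": 1024 * 1024})
```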