GH-45750: [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360
Conversation
mapleFU
left a comment
Is CDC a part of the Parquet spec? Or is it a PoC?
It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.
Thanks for doing this @kszucs! I like how this doesn't need any changes to readers. Questions:
@github-actions crossbow submit test-conda-cpp-valgrind
Revision: 1cc2e4b Submitted crossbow builds: ursacomputing/crossbow @ actions-6a43d39b56
@github-actions crossbow submit preview-docs
pitrou
left a comment
Congratulations @kszucs :)
Revision: 1cc2e4b Submitted crossbow builds: ursacomputing/crossbow @ actions-81dc2a98bf
wgtmac
left a comment
+1 (for the C++ part)
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit dd94c90. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
…edColumnWriterImpl` (#45688)

### Rationale for this change
I am planning to introduce `WriteBatchInternal` and `WriteBatchSpacedInternal` private methods in apache/arrow#45360, which would have required specifying `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` as friend functions. Then I noticed that these functions could be consolidated into the column writer, making the implementation simpler.

### What changes are included in this PR?
- Move `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` to be methods on `TypedColumnWriterImpl`.
- Remove the column writer argument and reorder their parameters to align with the public `WriteArrow` method.
- Use more explicit type parameter names.

### Are these changes tested?
Existing tests should cover these.

### Are there any user-facing changes?
No, these are private functions and methods.

Resolves apache/arrow#45690
* GitHub Issue: #45690

Authored-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storages. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited for uncompressed, row-major formats. Thanks to Parquet's unique features, however, I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column.
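To illustrate the general idea, here is a minimal sketch of a gear-hash-based content-defined chunker over a byte stream. This is not the PR's actual implementation (which operates on Arrow column values, not raw bytes); the `GEAR` table seeding, mask, and size bounds below are illustrative assumptions.

```python
# Illustrative sketch only: a simplified gear-hash content-defined chunker.
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte
MASK = (1 << 16) - 1                 # boundary probability ~1/2^16 -> ~64 KiB average chunks
MIN_SIZE, MAX_SIZE = 16_384, 262_144 # hard lower/upper bounds on chunk length

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen from the content itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF  # rolling gear hash
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        # Cut when the hash matches the mask pattern, or the chunk grows too large.
        if (h & MASK) == 0 or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on local content, inserting or deleting rows shifts the data without invalidating most downstream chunk boundaries, which is what keeps the produced data pages stable across file revisions.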
Deduplication efficiency
The feature enables efficient data deduplication for compressed Parquet files on content-addressable storage (CAS) systems such as the Hugging Face Hub. A purpose-built evaluation tool is available at https://github.com/kszucs/de; it was used during development to continuously check the improvements and to visually inspect the results. Please take a look at the repository's readme to see how different changes made to Parquet files affect the deduplication ratio when they are stored in CAS systems.
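For intuition on how a deduplication ratio is measured, here is a toy sketch of the accounting a simple CAS would do. This is not the evaluation tool above; the `dedup_ratio` helper and its signature are hypothetical.

```python
# Hypothetical illustration: deduplication accounting in a simple CAS.
# Each revision is represented as its list of chunks (e.g. the output of a
# content-defined chunker); identical chunks are stored only once.
import hashlib

def dedup_ratio(chunked_revisions: list[list[bytes]]) -> float:
    """Logical bytes across all revisions divided by unique bytes actually stored."""
    stored: dict[str, int] = {}  # content hash -> chunk length
    logical = 0
    for chunks in chunked_revisions:
        for chunk in chunks:
            stored[hashlib.sha256(chunk).hexdigest()] = len(chunk)
            logical += len(chunk)
    return logical / max(sum(stored.values()), 1)
```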
Some results calculated on all revisions of datasets.parquet
Some results calculated on all revisions of food.parquet
Chunk size shows the actual storage required to store the CDC-chunked Parquet files in a simple CAS implementation.

What changes are included in this PR?
A new column chunker implementation based on a CDC algorithm; see the docstrings for more details. The implementation is added to the C++ Parquet writer and exposed in PyArrow as well.
Are these changes tested?
Yes. Tests have been added to the C++ implementation as well as the exposed PyArrow API.
Are there any user-facing changes?
There are two new Parquet writer properties on the C++ side:

- `enable_content_defined_chunking()` to enable the feature
- `content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor)` to provide additional options

There is a new `pq.write_table(..., use_content_defined_chunking=)` keyword argument to expose the feature on the Python side. I marked all user-facing changes as EXPERIMENTAL.
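A minimal usage sketch of the Python-side keyword described above (EXPERIMENTAL). The boolean form comes from this PR's description; passing a dict of options mirroring the C++ `min_chunk_size`/`max_chunk_size`/`norm_factor` parameters is an assumption about the accepted spelling, not a documented guarantee.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(100_000)),
                  "value": [float(i) for i in range(100_000)]})

# Enable content-defined chunking with the default settings.
pq.write_table(table, "cdc_default.parquet", use_content_defined_chunking=True)

# Assumed option spelling: a dict mirroring the C++ writer properties
# (min_chunk_size, max_chunk_size, norm_factor); verify against the PyArrow docs.
# pq.write_table(table, "cdc_tuned.parquet",
#                use_content_defined_chunking={"min_chunk_size": 256 * 1024,
#                                              "max_chunk_size": 1024 * 1024})
```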