
Conversation

@kszucs
Member

@kszucs kszucs commented Jan 27, 2025

Rationale for this change

I have been working on improving Parquet's deduplication efficiency for content-addressable storage. These systems generally use some kind of content-defined chunking (CDC) algorithm, which is better suited for uncompressed row-major formats. However, thanks to Parquet's unique features, I was able to reach good deduplication results by consistently chunking data pages, maintaining a gearhash-based chunker for each column.
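For readers less familiar with CDC, the sketch below shows a minimal gear-hash chunker over a plain byte stream. It is only an illustration of the general technique: the gear table, mask, and size constants are assumptions, and the actual chunker in this PR works per column on the value stream while respecting Parquet page boundaries. Because a boundary depends only on the bytes immediately preceding it, inserting or deleting rows only shifts the chunks around the edit, which is what makes the resulting chunks deduplicate well.

```python
# Minimal, illustrative gear-hash CDC sketch; constants and structure are
# assumptions and do not mirror the parquet-cpp implementation.
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one 64-bit value per byte
MASK64 = (1 << 64) - 1

def cdc_boundaries(data: bytes, min_size=256 * 1024, max_size=1024 * 1024,
                   mask=(1 << 18) - 1):
    """Yield chunk end offsets: a boundary is declared when the low bits of the
    rolling gear hash are all zero, subject to the min/max chunk sizes."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # rolling gear hash update
        size = i - start + 1
        if size < min_size:
            continue
        if (h & mask) == 0 or size >= max_size:
            yield i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield len(data)  # trailing chunk
```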

Deduplication efficiency

The feature enables efficient data deduplication for compressed Parquet files on content-addressable storage (CAS) systems such as the Hugging Face Hub. A purpose-built evaluation tool is available at https://github.com/kszucs/de; it was used during development to continuously check the improvements and to visually inspect the results. Please take a look at the repository's README to see how different changes made to Parquet files affect the deduplication ratio when they are stored in CAS systems.

Some results calculated on all revisions of datasets.parquet

❯ de stats /tmp/datasets                                                                                                  
Writing CDC Parquet files with ZSTD compression                                                                           
100%|███████████████████████████████████████████████████████████████████████████████████| 194/194 [00:12<00:00, 15.73it/s]
Writing CDC Parquet files with Snappy compression                                                                         
100%|███████████████████████████████████████████████████████████████████████████████████| 194/194 [00:10<00:00, 17.95it/s]
Estimating deduplication for Parquet                                                                                      
Estimating deduplication for CDC ZSTD                                                                                     
Estimating deduplication for CDC Snappy                                                                                   
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃            ┃            ┃            ┃     Compressed Chunk ┃             ┃    Compressed Dedup ┃    Transmitted XTool ┃
┃ Title      ┃ Total Size ┃ Chunk Size ┃                 Size ┃ Dedup Ratio ┃               Ratio ┃                Bytes ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Parquet    │   16.2 GiB │   15.0 GiB │             13.4 GiB │         93% │                 83% │             13.5 GiB │
│ CDC ZSTD   │    8.8 GiB │    5.6 GiB │              5.6 GiB │         64% │                 64% │              6.0 GiB │
│ CDC Snappy │   16.2 GiB │    8.6 GiB │              8.1 GiB │         53% │                 50% │              9.4 GiB │
└────────────┴────────────┴────────────┴──────────────────────┴─────────────┴─────────────────────┴──────────────────────┘

Some results calculated on all revisions of food.parquet

❯ de stats /tmp/food --max-processes 4                                                                                    
Writing CDC Parquet files with ZSTD compression                                                                           
100%|█████████████████████████████████████████████████████████████████████████████████████| 32/32 [10:28<00:00, 19.64s/it]
Writing CDC Parquet files with Snappy compression                                                                         
100%|█████████████████████████████████████████████████████████████████████████████████████| 32/32 [08:11<00:00, 15.37s/it]
Estimating deduplication for Parquet                                                                                      
Estimating deduplication for CDC ZSTD                                                                                     
Estimating deduplication for CDC Snappy                                                                                   
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃            ┃            ┃            ┃     Compressed Chunk ┃             ┃    Compressed Dedup ┃    Transmitted XTool ┃
┃ Title      ┃ Total Size ┃ Chunk Size ┃                 Size ┃ Dedup Ratio ┃               Ratio ┃                Bytes ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Parquet    │  182.6 GiB │  148.0 GiB │            140.5 GiB │         81% │                 77% │            146.4 GiB │
│ CDC ZSTD   │  107.1 GiB │   58.0 GiB │             57.9 GiB │         54% │                 54% │             66.2 GiB │
│ CDC Snappy │  176.7 GiB │   79.6 GiB │             77.2 GiB │         45% │                 44% │            101.0 GiB │
└────────────┴────────────┴────────────┴──────────────────────┴─────────────┴─────────────────────┴──────────────────────┘

Chunk Size shows the actual storage required to store the CDC-chunked Parquet files in a simple CAS implementation.

What changes are included in this PR?

A new column chunker implementation based on the CDC algorithm; see the docstrings for more details. The implementation is added to the C++ Parquet writer and exposed in PyArrow as well.

Are these changes tested?

Yes. Tests have been added to the C++ implementation as well as the exposed PyArrow API.

Are there any user-facing changes?

There are two new Parquet writer properties on the C++ side:

  • enable_content_defined_chunking() to enable the feature
  • content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor) to provide additional options

There is also a new pq.write_table(..., use_content_defined_chunking=) keyword argument exposing the feature on the Python side.
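A short usage sketch on the Python side. The use_content_defined_chunking keyword is the one added by this PR; the dict form for passing tuning options is an assumption based on the C++ options above and may differ from the final binding:

```python
# Hedged usage sketch; passing True enables CDC with default options. The dict
# form mirroring the C++ options (min_chunk_size, max_chunk_size, norm_factor)
# is an assumption about the Python binding, not a confirmed API.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000)), "label": ["x"] * 1_000_000})

# Enable the feature with default chunking options
pq.write_table(table, "cdc_default.parquet", use_content_defined_chunking=True)

# Hypothetical: tune the chunk size bounds explicitly
pq.write_table(
    table,
    "cdc_tuned.parquet",
    use_content_defined_chunking={
        "min_chunk_size": 256 * 1024,
        "max_chunk_size": 1024 * 1024,
    },
)
```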

I marked all user-facing changes as EXPERIMENTAL.

Member

@mapleFU mapleFU left a comment


Is CDC a part of the Parquet spec? Or is it a PoC?

@kszucs
Member Author

kszucs commented Jan 28, 2025

Is CDC a part of the Parquet spec? Or is it a PoC?

It is not. You can think of it as an implementation-specific feature, similar to the existing options that specify how record batches and pages are split.
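For comparison, here is a brief sketch showing the existing splitting knobs next to the new one (the existing parameter names are current pyarrow options, the new keyword is the one added by this PR, and the concrete values are arbitrary):

```python
# Existing options already let the writer decide how record batches and pages
# are split; the new flag adds a content-defined variant. Values are arbitrary.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(10_000))})

pq.write_table(
    table,
    "example.parquet",
    row_group_size=128 * 1024,          # existing: max rows per row group
    data_page_size=1024 * 1024,         # existing: target bytes per data page
    use_content_defined_chunking=True,  # new in this PR (EXPERIMENTAL)
)
```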

@kszucs kszucs force-pushed the content-defined-chunking branch 2 times, most recently from 9919b4f to 6fc058d on February 6, 2025 15:49
@rok
Member

rok commented Feb 11, 2025

Thanks for doing this @kszucs ! I like how this doesn't need any changes to readers.

Questions:

  • As it stands in this PR, CDC is either on or off for all columns. How about enabling it per column? In the general case some columns might not be worthy candidates for it.
  • The use case described in the HF blog post involves rows being added or removed with not much else changed. Wouldn't it then make sense to first try a shortcut deduplication where, if we identify a duplication in the first column, we check for the same duplication at the same indices in all other columns before running a full hashing pass? (A rough sketch of this idea follows below.)
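Purely as an illustration of the shortcut described in the second bullet, and not something implemented in this PR, a hypothetical sketch could look like this (all names and data structures are invented):

```python
# Hypothetical sketch of the "check the first column first" shortcut; the
# chunk_digest helper and the row-range representation are invented here.
import hashlib

def chunk_digest(values) -> str:
    """Stable digest of a slice of column values (illustrative only)."""
    return hashlib.sha256(repr(list(values)).encode()).hexdigest()

def shortcut_dedup(old_cols, new_cols, row_ranges):
    """Return the row ranges that can be reused wholesale: the first column is
    checked first, and only on a match are the remaining columns compared at
    the same indices, avoiding a full hashing pass for clearly changed ranges."""
    names = list(old_cols)
    first, rest = names[0], names[1:]
    reusable = []
    for start, stop in row_ranges:
        if chunk_digest(new_cols[first][start:stop]) != chunk_digest(old_cols[first][start:stop]):
            continue  # first column already differs, skip the expensive checks
        if all(chunk_digest(new_cols[name][start:stop]) == chunk_digest(old_cols[name][start:stop])
               for name in rest):
            reusable.append((start, stop))
    return reusable
```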

@kszucs kszucs force-pushed the content-defined-chunking branch 2 times, most recently from e82efee to f854ee8 on February 20, 2025 14:01
@kszucs kszucs force-pushed the content-defined-chunking branch from f854ee8 to 58e1a82 on February 21, 2025 16:09
@kszucs kszucs force-pushed the content-defined-chunking branch from 58e1a82 to 7556467 on February 21, 2025 16:10
@kszucs kszucs marked this pull request as ready for review February 24, 2025 20:47
@kszucs kszucs force-pushed the content-defined-chunking branch from 9275cad to 1cc2e4b on May 13, 2025 14:45
@kszucs
Member Author

kszucs commented May 13, 2025

@github-actions crossbow submit test-conda-cpp-valgrind

@github-actions

Revision: 1cc2e4b

Submitted crossbow builds: ursacomputing/crossbow @ actions-6a43d39b56

Task Status
test-conda-cpp-valgrind GitHub Actions

@kszucs
Member Author

kszucs commented May 13, 2025

I collected the possible follow-ups; once the PR is merged I will create the corresponding tickets:

@pitrou
Member

pitrou commented May 13, 2025

@github-actions crossbow submit preview-docs

Member

@pitrou pitrou left a comment


Congratulations @kszucs :)

@github-actions

Revision: 1cc2e4b

Submitted crossbow builds: ursacomputing/crossbow @ actions-81dc2a98bf

Task Status
preview-docs GitHub Actions

Member

@wgtmac wgtmac left a comment


+1 (for the C++ part)

@pitrou pitrou merged commit dd94c90 into apache:main May 13, 2025
34 of 35 checks passed
@kszucs
Member Author

kszucs commented May 13, 2025

Thanks @pitrou @wgtmac @kou @mapleFU for the reviews!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit dd94c90.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
…edColumnWriterImpl` (#45688)

### Rationale for this change

I am planning to introduce `WriteBatchInternal` and `WriteBatchSpacedInternal` private methods in apache/arrow#45360, which would have required specifying `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` as friend functions. Then I noticed that these functions could be consolidated into the column writer, making the implementation simpler.

### What changes are included in this PR?

- Move `WriteArrowSerialize`, `WriteArrowZeroCopy` and `WriteTimestamps` to be methods on `TypedColumnWriterImpl`.
- Remove the column writer argument and reorder their parameters to align with the public `WriteArrow` method.
- Use more explicit type parameter names.

### Are these changes tested?

Existing tests should cover these.

### Are there any user-facing changes?

No, these are private functions and methods.

Resolves apache/arrow#45690
* GitHub Issue: #45690

Authored-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>