Merged
102 commits
256b83f
[C++][Python][Parquet] Implement Content-Defined Chunking for the Par…
kszucs Jan 16, 2025
e699e9a
always roll values
kszucs Jan 27, 2025
ba53621
add faster paths for flat arrays
kszucs Jan 27, 2025
bfa5cbd
normalize chunk sizes according to fastcdc algorithm
kszucs Jan 30, 2025
61617b4
missing header and fix level_offset incrementation
kszucs Jan 30, 2025
cc985bf
don't use normalization by default
kszucs Jan 31, 2025
1fedb89
use constexpr for gear hash tables
kszucs Feb 5, 2025
7c4d716
don't include logging
kszucs Feb 6, 2025
ffcea22
please msvc
kszucs Feb 6, 2025
9f3896e
increase the min/max bands around the avg chunk size
kszucs Feb 7, 2025
32ad613
use a chunk struct instead of a tuple to carry boundary information
kszucs Feb 14, 2025
2886e17
split implementation and header files
kszucs Feb 14, 2025
ee6a715
change the api to define min_chunk_size and max_chunk_size and automa…
kszucs Feb 17, 2025
616e76d
additional testing (more types, dictionary encoding, nullable types)
kszucs Feb 21, 2025
002a37d
test cases for binary-like types
kszucs Feb 21, 2025
1eb6f4c
reduce duplication in testing
kszucs Feb 21, 2025
c9b42b4
reduce duplication in testing
kszucs Feb 21, 2025
50ce77c
refactoring + testing + introduce norm_factor parameter
kszucs Feb 24, 2025
a20ebbf
reduce the testing data size to make the test cases quicker
kszucs Feb 24, 2025
47aa8b0
increase testing data size
kszucs Feb 24, 2025
86e348f
add a custom array generator to always produce the same array
kszucs Feb 24, 2025
960883a
address review comments
kszucs Feb 27, 2025
3a92662
rename GEAR_HASH_TABLE to GEARHASH_TABLE
kszucs Mar 1, 2025
9208bd3
some docstrings about CDC
kszucs Mar 3, 2025
1237216
place the gearhash table to a separate header
kszucs Mar 3, 2025
5d187d5
more CDC docstrings
kszucs Mar 3, 2025
8b8722d
address review comments
kszucs Mar 5, 2025
6d63050
rename files to chunker_internal_* to avoid installing the headers
kszucs Mar 5, 2025
614f5df
prefer to throw parquet exception rather than returning arrow status
kszucs Mar 5, 2025
a2c15b0
add reference to chunk size normalization
kszucs Mar 5, 2025
dd21d23
add a comment about AddDataPage() at the end of each chunk
kszucs Mar 5, 2025
5154d01
address review comments
kszucs Mar 6, 2025
4cb991f
fix generated header name
kszucs Mar 6, 2025
0b868ca
more docstring for CDC arguments
kszucs Mar 6, 2025
02143fc
prefer templated GenerateArray rather than macro
kszucs Mar 6, 2025
34dbf5b
don't hash undefined null values; reduce generated code size by dispa…
kszucs Mar 7, 2025
9792acb
only hash non-null values in the nested case as well
kszucs Mar 7, 2025
7485762
add docstrings to the hashtable generating pythons script
kszucs Mar 7, 2025
9c3ea99
prefer to use signed integers as size arguments
kszucs Mar 7, 2025
c439e59
use type aliases for better readability in tests
kszucs Mar 7, 2025
3a31d93
use explicit struct instead of tuples for the test case configuration
kszucs Mar 7, 2025
b3b2b3e
add a boolean test case
kszucs Mar 7, 2025
1dd53e9
describe test utilities in more details
kszucs Mar 7, 2025
e39c243
fix: use .getValue() for binary arrays
kszucs Mar 10, 2025
5a9dd37
add more details about calculating the mask
kszucs Mar 10, 2025
1908918
Address review comments
kszucs Mar 13, 2025
40b175c
Separate include groups with a new line
kszucs Mar 13, 2025
a9635b0
Remove Chunk constructor and hide implementation using PIMPL
kszucs Mar 14, 2025
983ade9
Prefer templated methods over macros
kszucs Mar 17, 2025
cc88a79
Use VisitType instead of manual switch based dispatching
kszucs Mar 17, 2025
2e38fc0
Refactor CDC settings + add python docstrings
kszucs Mar 17, 2025
4558a6c
Fix python linting error
kszucs Mar 17, 2025
119393a
Calculate mask bits using arrow bit utils
kszucs Mar 17, 2025
d7f3666
Raise from WriteBatch() and WriteBatchSpaced() if CDC is enabled
kszucs Mar 18, 2025
1b67e6b
Test that WriteBatch() and WriteBatchSpaced() raises with CDC enabled
kszucs Mar 19, 2025
53282cc
Add tests for the multi-row-group use case
kszucs Mar 19, 2025
7613929
Support extension types
kszucs Mar 20, 2025
4e7dc0b
Test sliced tables
kszucs Mar 20, 2025
804b00d
Add comments about the hash value generation
kszucs Mar 20, 2025
9735c4c
Test that dictionary fallback is being triggered during testing
kszucs Mar 20, 2025
9e2434a
Disabled unity build for mingw
kszucs Mar 22, 2025
f4a2869
Do more validation for the chunk size parameters and norm_factor
kszucs Mar 22, 2025
1d9cbc3
Reorder the validation to prevent UB in case of shifting with more th…
kszucs Mar 22, 2025
433d263
Mark ContentDefinedChunker as PARQUET_EXPORT to prevent link errors o…
kszucs Mar 22, 2025
496e2e5
Add test for the pyarrow API and fix UB error
kszucs Mar 22, 2025
c99e7cf
Test mask values calculated from the parameters
kszucs Mar 22, 2025
629d7c4
Add ValidateChunks() sanity checks in debug builds
kszucs Mar 26, 2025
4393e91
Simplify test assertion
kszucs Mar 26, 2025
4d61fbe
Do not trigger an AddDataPage after the last chunk
kszucs Mar 26, 2025
5604ab6
Re-enable unity in the mingw build
kszucs Mar 27, 2025
724d9b3
Remove unreachable branch
kszucs Mar 27, 2025
b2fc28b
Use unity build again on MinGW and include window fixup in schema.cc
kszucs Mar 27, 2025
8d6c8ec
Add docstring to WriterProperties; add documentation to the pyarrow p…
kszucs Mar 28, 2025
899823b
Mark the `use_content_defined_chunking` argument as experimental in t…
kszucs Mar 28, 2025
2b74c37
Add test configuration for ParquetDataPageVersion::V2
kszucs Mar 28, 2025
b9ef818
Address review comments
kszucs Mar 29, 2025
d49327e
Address review comments
kszucs Apr 2, 2025
7e04246
Use CDCOptions instead of arguments
kszucs Apr 2, 2025
3ddd529
Address review comments
kszucs Apr 2, 2025
e6ecef2
Address review comments
kszucs Apr 2, 2025
c6444c0
Use optional to store the cdc chunker in the column writer
kszucs Apr 2, 2025
52b7a40
Improve test assertions; check null values for fixed size types
kszucs Apr 3, 2025
d13c89f
Add prepend test case
kszucs Apr 3, 2025
7aec1cd
Rename CDCOptions to CdcOptions
kszucs Apr 3, 2025
e2229ce
Some more comments in the test suite
kszucs Apr 3, 2025
6fe0223
Some more comments
kszucs Apr 4, 2025
feee7e7
Remove redundant assertions
kszucs Apr 4, 2025
2032b3a
Remove the capture in CalculateBinaryLike closure
kszucs Apr 4, 2025
61731c6
Migrate from DCHECK to ARROW_DCHECK
kszucs Apr 4, 2025
4966f9c
Use PLAIN encoding in the pyarrow test so that we can have stricter a…
kszucs Apr 4, 2025
ae5c929
Mention to use the same cdc parameters
kszucs Apr 4, 2025
0aa90c4
Change the multi row-group tests to use more columns
kszucs Apr 4, 2025
cd27277
address review comments
kszucs May 10, 2025
5a78e86
Use anonymous namespace for CalculateMask
kszucs May 10, 2025
4721f00
Correct error message for min_chunk_size=0
kszucs May 10, 2025
9b4522d
Rename norm_factor to norm_level to better reflect that it is an inte…
kszucs May 10, 2025
8f56430
Add note about norm_level recommended range
kszucs May 10, 2025
893465a
Make content defined chunking branches unlikely
kszucs May 10, 2025
768743c
Address review comments
kszucs May 12, 2025
cb5e16c
Assert on exception message for WriteBatchSpaced and WriteBatch if CD…
kszucs May 12, 2025
ab3f86e
Test JSON extension type instead of UUID because since UUID is not a …
kszucs May 12, 2025
1cc2e4b
Reduce the number of test cases for ASAN/Valgrind builds and add more…
kszucs May 13, 2025
1 change: 1 addition & 0 deletions .gitattributes
@@ -1,4 +1,5 @@
cpp/src/arrow/util/bpacking_*_generated.h linguist-generated=true
cpp/src/parquet/chunker_*_generated.h linguist-generated=true
cpp/src/generated/*.cpp linguist-generated=true
cpp/src/generated/*.h linguist-generated=true
go/**/*.s linguist-generated=true
3 changes: 3 additions & 0 deletions cpp/src/parquet/CMakeLists.txt
@@ -160,6 +160,7 @@ set(PARQUET_SRCS
arrow/writer.cc
bloom_filter.cc
bloom_filter_reader.cc
chunker_internal.cc
column_reader.cc
column_scanner.cc
column_writer.cc
@@ -399,6 +400,8 @@ add_parquet_test(writer-test
file_serialize_test.cc
stream_writer_test.cc)

add_parquet_test(chunker-test SOURCES chunker_internal_test.cc)

add_parquet_test(arrow-test
SOURCES
arrow/arrow_metadata_test.cc
429 changes: 429 additions & 0 deletions cpp/src/parquet/chunker_internal.cc

Large diffs are not rendered by default.

144 changes: 144 additions & 0 deletions cpp/src/parquet/chunker_internal.h
@@ -0,0 +1,144 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <cstdint>
#include <vector>

#include "arrow/array.h"
#include "parquet/level_conversion.h"

namespace parquet::internal {

// Represents a chunk of data with level offsets and value offsets due to the
// record shredding for nested data.
struct Chunk {
// The start offset of this chunk inside the given levels
int64_t level_offset;
// The start offset of this chunk inside the given values array
int64_t value_offset;
// The length of the chunk in levels
int64_t levels_to_write;
};

/// CDC (Content-Defined Chunking) is a technique that divides data into variable-sized
/// chunks based on the content of the data itself, rather than using fixed-size
/// boundaries.
///
/// For example, given this sequence of values in a column:
///
/// File1: [1,2,3, 4,5,6, 7,8,9]
/// chunk1 chunk2 chunk3
///
/// Assume there is an inserted value between 3 and 4:
///
/// File2: [1,2,3,0, 4,5,6, 7,8,9]
/// new-chunk chunk2 chunk3
///
/// The chunking process will adjust to maintain stable boundaries across data
/// modifications. Each chunk defines a new parquet data page which is contiguously
/// written out to the file. Since each page is compressed independently, the files' contents
/// would look like the following with unique page identifiers:
///
/// File1: [Page1][Page2][Page3]...
/// File2: [Page4][Page2][Page3]...
Member:
I just don't quite understand how the rolling hash can perfectly produce Page1 and Page4 as above. I need to read the paper and blogs more carefully, but I cannot promise that my math background allows me to totally understand it. :)

Member:
I think that is a made-up example, not actual data that you can reproduce with specific CDC settings :)

Member Author:
Yes, it is trying to highlight the behavior since I would need a lot more values for a reproducible example.

///
/// The parquet file is then uploaded to a content-addressable storage (CAS) system
/// which splits the byte stream into content-defined blobs. The CAS system will
/// calculate a unique identifier for each blob, then store the blob in a key-value store.
/// If the same blob is encountered again, the system can refer to the hash instead of
/// physically storing the blob again. In the example above, the CAS system would store
/// Page1, Page2, Page3, and Page4 only once and the required metadata to reassemble the
/// files.
/// While the deduplication is performed by the CAS system, the parquet chunker makes it
/// possible to efficiently deduplicate the data by consistently dividing the data into
/// chunks.
///
/// Implementation details:
///
/// Only the parquet writer needs to be aware of content-defined chunking; the reader
/// doesn't need to know about it. Each parquet column writer holds a
/// ContentDefinedChunker instance depending on the writer's properties. The chunker's
/// state is maintained across the entire column without being reset between pages and row
/// groups.
///
/// The chunker receives the record shredded column data (def_levels, rep_levels, values)
/// and goes over the (def_level, rep_level, value) triplets one by one while adjusting
/// the column-global rolling hash based on the triplet. Whenever the rolling hash matches
/// a predefined mask, the chunker creates a new chunk. The chunker returns a vector of
/// Chunk objects that represent the boundaries of the chunks.
/// Note that the boundaries are deterministically calculated exclusively based on the
/// data itself, so the same data will always produce the same chunks - given the same
/// chunker configuration.
///
/// References:
/// - FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data
/// Deduplication
/// https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
/// - Git is for Data (chunk size normalization used here is described in section 6.2.1):
/// https://www.cidrdb.org/cidr2023/papers/p43-low.pdf
class PARQUET_EXPORT ContentDefinedChunker {
public:
/// Create a new ContentDefinedChunker instance
///
/// @param level_info Information about definition and repetition levels
/// @param min_chunk_size Minimum chunk size in bytes
/// The rolling hash will not be updated until this size is reached for each chunk.
/// Note that all data sent through the hash function is counted towards the chunk
/// size, including definition and repetition levels if present.
/// @param max_chunk_size Maximum chunk size in bytes
/// The chunker creates a new chunk whenever the chunk size exceeds this value. The
/// chunk size distribution approximates a normal distribution between min_chunk_size
/// and max_chunk_size. Note that the parquet writer has a related `data_pagesize`
/// property that controls the maximum size of a parquet data page after encoding.
/// While setting `data_pagesize` to a smaller value than `max_chunk_size` doesn't
/// affect the chunking effectiveness, it results in more small parquet data pages.
/// @param norm_level Normalization level to center the chunk size around the average
/// size more aggressively, default 0.
/// Increasing the normalization level increases the probability of finding a chunk
/// boundary, improving the deduplication ratio, but also increases the number of
/// small chunks resulting in many small parquet data pages. The default value
/// provides a good balance between deduplication ratio and fragmentation.
/// Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the
/// expense of fragmentation.
ContentDefinedChunker(const LevelInfo& level_info, int64_t min_chunk_size,
int64_t max_chunk_size, int norm_level = 0);
~ContentDefinedChunker();

/// Get the chunk boundaries for the given column data
///
/// @param def_levels Definition levels
/// @param rep_levels Repetition levels
/// @param num_levels Number of levels
/// @param values Column values as an Arrow array
/// @return Vector of Chunk objects representing the chunk boundaries
std::vector<Chunk> GetChunks(const int16_t* def_levels, const int16_t* rep_levels,
int64_t num_levels, const ::arrow::Array& values);

private:
/// @brief Get the rolling hash mask used to determine chunk boundaries; exposed
/// for testing the mask calculation.
uint64_t GetRollingHashMask() const;

class Impl;
std::unique_ptr<Impl> impl_;

friend class TestCDC;
};

} // namespace parquet::internal
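
To make the docstring above (and the reviewer exchange about Page1/Page4) easier to follow, here is a minimal, self-contained Python sketch of gearhash-based content-defined chunking over a plain byte stream. It is not the Parquet implementation: the table is built the same way as the codegen script further down (seed 0), but the mask, the min/max sizes, and the exact update and reset rules are illustrative assumptions, and the real chunker walks (def_level, rep_level, value) triplets rather than raw bytes.

import hashlib
import random

# Deterministic 256-entry table of 64-bit values, built the same way as the codegen
# script below (MD5 of a fixed seed/index byte pattern); seed 0 is used here.
TABLE = [int(hashlib.md5(bytes([0] * 64 + [n] * 64)).hexdigest()[:16], 16)
         for n in range(256)]

MIN_SIZE, MAX_SIZE = 64, 1024   # illustrative bounds, not the library defaults
MASK = 0x3F << 12               # illustrative mask; in the C++ implementation the mask
                                # is derived from the chunk size parameters and norm_level

def chunk_lengths(data):
    """Split data into content-defined chunks and return their lengths."""
    lengths, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        size = i - start + 1
        if size < MIN_SIZE:
            continue            # skip hashing until the minimum chunk size is reached
        # Gear update: shift the hash and mix in the table entry for the new byte.
        rolling = ((rolling << 1) + TABLE[byte]) & 0xFFFFFFFFFFFFFFFF
        if (rolling & MASK) == 0 or size >= MAX_SIZE:
            lengths.append(size)    # content-defined (or forced) boundary
            start, rolling = i + 1, 0
    if start < len(data):
        lengths.append(len(data) - start)
    return lengths

rng = random.Random(42)
base = bytes(rng.randrange(256) for _ in range(4096))
edited = base[:500] + b"INSERTED" + base[500:]
# Chunks before the insertion are identical, the chunk containing the insertion
# typically changes, and later boundaries fall on the same content again -- which is
# what keeps Page2 and Page3 stable in the File1/File2 example above.
print(chunk_lengths(base)[:8])
print(chunk_lengths(edited)[:8])

The same mechanism is what the chunker applies to the record-shredded triplets: because a boundary depends only on recently hashed content, an edit perturbs at most the pages around it while the rest of the pages keep their byte-identical content.
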
125 changes: 125 additions & 0 deletions cpp/src/parquet/chunker_internal_codegen.py
@@ -0,0 +1,125 @@
#!/usr/bin/env python

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""
Produce the given number of gearhash tables for rolling hash calculations.

Each table consists of 256 64-bit integer values and by default 8 tables are
produced. The tables are written to a header file that can be included in the
C++ code.

The generated numbers are deterministic "random" numbers created by MD5 hashing
a fixed seed and the table index. This ensures that the tables are the same
across different runs and platforms. The method used to generate the numbers is
less important as long as they have a sufficiently uniform distribution.

Reference implementations:
- https://github.com/Borelset/destor/blob/master/src/chunking/fascdc_chunking.c
- https://github.com/nlfiedler/fastcdc-rs/blob/master/examples/table64.rs

Usage:
python chunker_internal_codegen.py [ntables]

ntables: Number of gearhash tables to generate (default 8); the
C++ implementation expects 8 tables, so this should not be
changed unless the C++ code is also updated.

The generated header file is written to ./chunker_internal_generated.h
"""

import hashlib
import pathlib
import sys
from io import StringIO


template = """\
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <cstdint>

namespace parquet::internal {{

constexpr int64_t kNumGearhashTables = {ntables};

constexpr uint64_t kGearhashTable[{ntables}][256] = {{
{content}}};

}} // namespace parquet::internal
"""


def generate_hash(n: int, seed: int):
"""Produce predictable hash values for a given seed and n using MD5.

The value can be arbitrary as long as it is deterministic and has a uniform
distribution. The MD5 hash is used to produce a 16 character hexadecimal
string which is then converted to a 64-bit integer.
"""
value = bytes([seed] * 64 + [n] * 64)
hasher = hashlib.md5(value)
return hasher.hexdigest()[:16]


def generate_hashtable(seed: int, length=256):
"""Generate and render a single gearhash table."""
table = [generate_hash(n, seed=seed) for n in range(length)]

out = StringIO()
out.write(f" {{// seed = {seed}\n")
for i in range(0, length, 4):
values = [f"0x{value}" for value in table[i : i + 4]]
values = ", ".join(values)
out.write(f" {values}")
if i < length - 4:
out.write(",\n")
out.write("}")

return out.getvalue()


def generate_header(ntables=8, relative_path="chunker_internal_generated.h"):
"""Generate a header file with multiple gearhash tables."""
path = pathlib.Path(__file__).parent / relative_path
tables = [generate_hashtable(seed) for seed in range(ntables)]
content = ",\n".join(tables)
text = template.format(ntables=ntables, content=content)
path.write_text(text)


if __name__ == "__main__":
ntables = int(sys.argv[1]) if len(sys.argv) > 1 else 8
generate_header(ntables)
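
As a quick, hedged illustration of the determinism claim in the module docstring, the snippet below recomputes the first entry of the seed-0 table with plain hashlib and compares it against generate_hash; the import assumes chunker_internal_codegen.py is on the Python path.

import hashlib
from chunker_internal_codegen import generate_hash

# Recompute the first value of the seed-0 table independently and compare.
expected = hashlib.md5(bytes([0] * 64 + [0] * 64)).hexdigest()[:16]
assert generate_hash(0, seed=0) == expected
print(f"0x{expected}")  # same value on every run and platform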