GH-37293: [C++][Parquet] Encoding: Add Benchmark for DELTA_BYTE_ARRAY #37641
Conversation
cpp/src/parquet/encoding.cc
Outdated
```cpp
buffer[i].ptr = data_ptr;
buffer[i].len += prefix_len_ptr[i];
data_ptr += buffer[i].len;
// If the prefix length is zero, the prefix can be ignored.
```
This is a (maybe unused) optimization: when the prefix length is 0, it avoids a round of memcpy.
Does it make benchmarks better or worse?
Better, of course, especially when many prefix lengths are 0.
@pitrou I can open a separate PR for this. I think we can optimize two cases:
- Prefix length == 0
- Suffix length == 0

Each case can be optimized to avoid copying and memory allocation. I can split that out.
```cpp
std::vector<ByteArray> values;
std::vector<uint8_t> buf(max_length * array_size);
values.resize(array_size);
prefixed_random_byte_array(array_size, /*seed=*/0, buf.data(), values.data(),
```
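For context, here is a minimal, illustrative sketch of what a prefix-sharing generator along these lines might do. The struct name `View`, the function name `GeneratePrefixedValues`, and all parameters below are placeholders, not the PR's actual `prefixed_random_byte_array` helper:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Illustrative stand-in for parquet::ByteArray.
struct View {
  uint32_t len;
  const uint8_t* ptr;
};

// Generate `n` values into `heap`; each value shares a prefix with its
// predecessor with probability `prefixed_probability`.
void GeneratePrefixedValues(int n, int max_length, double prefixed_probability,
                            std::vector<uint8_t>* heap, std::vector<View>* out) {
  std::default_random_engine rng(/*seed=*/0);
  std::uniform_real_distribution<double> share_prefix(0.0, 1.0);
  std::uniform_int_distribution<int> length_dist(1, max_length);
  std::uniform_int_distribution<int> byte_dist('a', 'z');

  heap->assign(static_cast<size_t>(n) * max_length, 0);
  out->resize(n);
  uint8_t* cursor = heap->data();
  for (int i = 0; i < n; ++i) {
    const int len = length_dist(rng);
    int prefix_len = 0;
    if (i > 0 && share_prefix(rng) < prefixed_probability) {
      // Copy a prefix from the previous value, capped at both lengths.
      prefix_len = std::min<int>(len, static_cast<int>((*out)[i - 1].len));
      std::memcpy(cursor, (*out)[i - 1].ptr, prefix_len);
    }
    for (int j = prefix_len; j < len; ++j) {
      cursor[j] = static_cast<uint8_t>(byte_dist(rng));
    }
    (*out)[i] = View{static_cast<uint32_t>(len), cursor};
    cursor += len;
  }
}
```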
CMIIW, /*prefixed_probability=*/0.5 is a little low: it means we can only expect about two consecutive string values to share a prefix, and it can easily generate a sequence that does not have good prefixes.
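(A back-of-the-envelope check of that intuition: if each value independently shares a prefix with its predecessor with probability p, the expected length of a prefix-sharing run is 1/(1 - p), so about 2 values at p = 0.5, but around 10 at p = 0.9 and 100 at p = 0.99.)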
Perhaps it would be interesting to also vary prefix length (max_size)?
I think it's enough to vary prefixed_probability. The benchmarking library supports integer parameters, so it can, for example, be passed as a percentage. Two or three values should be enough (for example 10, 90, 99).
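A minimal sketch of how such a percentage parameter could be wired into Google Benchmark; the benchmark name below is a placeholder and the benchmark body is elided:

```cpp
#include <benchmark/benchmark.h>

static void BM_DeltaByteArrayDecode(benchmark::State& state) {
  // Interpret the integer argument as the prefix-sharing probability in percent.
  const double prefixed_probability = static_cast<double>(state.range(0)) / 100.0;
  // ... generate input with that probability and decode it here ...
  for (auto _ : state) {
    benchmark::DoNotOptimize(prefixed_probability);
  }
}
// Run the same benchmark at 10%, 90% and 99%.
BENCHMARK(BM_DeltaByteArrayDecode)->Arg(10)->Arg(90)->Arg(99);

BENCHMARK_MAIN();  // Not needed inside Arrow's own benchmark harness.
```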
Ideally, though, we should try to generate data for which the encoding is very space-efficient. Is it the case here?
> we should try to generate data for which the encoding is very space-efficient. Is it the case here?
Would 99 percent be "space-efficient"?
Oh, I got it; let me mock up some tests here.
cpp/src/parquet/encoding.cc
Outdated
```cpp
}
}
PARQUET_THROW_NOT_OK(buffered_data_->Resize(data_size));
PARQUET_THROW_NOT_OK(buffered_data_->Resize(data_size, false));
```
Suggested change:

```cpp
PARQUET_THROW_NOT_OK(buffered_data_->Resize(data_size, false));
PARQUET_THROW_NOT_OK(buffered_data_->Resize(data_size, /*shrink_to_fit=*/false));
```
BTW, we lose the chance to release bulk memory if shrink_to_fit is false. Changing the default behavior of buffered_prefix_length_ above may not make a big difference, but the actual size of buffered_data_ may vary a lot during decoding.
Let me leave it as is for now; maybe we need a better method to release memory?
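One possible middle ground, sketched below under the assumption that the buffer is an arrow::ResizableBuffer; the helper name and the capacity threshold are made up for illustration, not part of this PR:

```cpp
#include <cstdint>

#include "arrow/buffer.h"
#include "arrow/status.h"

// Illustrative only: keep shrink_to_fit=false on the hot path, but release
// memory when the requested size is far below the current capacity.
arrow::Status ResizeForPage(arrow::ResizableBuffer* buffer, int64_t data_size) {
  const bool shrink = data_size < buffer->capacity() / 4;  // hypothetical threshold
  return buffer->Resize(data_size, /*shrink_to_fit=*/shrink);
}
```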
Slower encoding is expected, but what is the performance difference? 2x, 10x, 100x?
Oh, my fault. When I use 8-byte, purely random data, the other encodings are about 5x. I guess that's mainly because of memcpy and more checking. I'll test more ranges this week.
Here are some results. Decoding doesn't look bad; it's on par with dict decoding (in items/s; I'm not sure the bytes/s figure is consistently computed).
@pitrou I've tested the decoding/encoding on macOS; to avoid too many benchmarks, I configured the
Comparing with plain decoding and DELTA_LENGTH_BYTE_ARRAY, DELTA would be slower, because plain/delta-length decoding does not need to copy any data, and their speed is stable across different inputs. Comparing with DICT, dict decoding requires setting a dictionary, and parsing the dictionary involves a memcpy, so in the benchmark they're similar. However, in real use of DICT a ColumnChunk only sets the dictionary once, so dictionary encoding is still faster than DELTA. Currently DELTA doesn't have a speed advantage. However, if the data is prefixed data, like… cc @rok
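For illustration (not taken from the PR): per the Parquet format, DELTA_BYTE_ARRAY stores, for each value, the length of the prefix it shares with the previous value plus the remaining suffix, so prefix-heavy data compresses well. For example, the values "Hello", "Helsinki", "Help" would be stored as prefix lengths 0, 3, 3 and suffixes "Hello", "sinki", "p".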
@mapleFU Agreed! Users should be aware of why and when to use which encoding.
Yeah, decoding/encoding benchmarks alone cannot prove the ability of DELTA (though they're still important). I've tested encoding with compression, and DELTA gets OK results.
@pitrou I've updated the benchmark; would you mind taking a look?
cpp/src/parquet/encoding.cc
Outdated
```cpp
// all the prefix lengths are buffered in buffered_prefix_length_.
PARQUET_THROW_NOT_OK(buffered_prefix_length_->Resize(num_prefix * sizeof(int32_t)));
PARQUET_THROW_NOT_OK(buffered_prefix_length_->Resize(num_prefix * sizeof(int32_t),
                                                     /*shrink_to_fit=*/false));
```
Is this desirable? If we decode a large page first and then a small page, the memory wouldn't be released. Does the page size always stay similar?
Hmm, personally I don't think the page will be too large, but a user might set a large page size, so I'll remove that.
```cpp
if (buffer[i].len == 0) {
  // If the buffer (suffix) length is zero, the memcpy to data_ptr can be skipped;
  // the value is just the previous value's prefix.
  buffer[i].ptr = reinterpret_cast<const uint8_t*>(prefix.data());
  buffer[i].len = prefix_len_ptr[i];
} else if (prefix_len_ptr[i] != 0) {
  // If the prefix length is zero, the prefix copy can be skipped.
  memcpy(data_ptr, prefix.data(), prefix_len_ptr[i]);
  // buffer[i] currently points to the string suffix.
  memcpy(data_ptr + prefix_len_ptr[i], buffer[i].ptr, buffer[i].len);
  buffer[i].ptr = data_ptr;
  buffer[i].len += prefix_len_ptr[i];
  data_ptr += buffer[i].len;
}
prefix = std::string_view{buffer[i]};
```

I tend to use code like this, but let me leave it to a later patch with a more careful design...
I improved and simplified the benchmark code. Here are the current results.
Nice! So currently the memcpy causes high CPU usage; let me optimize it with #37641 (comment) after this is merged, thanks!
pitrou left a comment
+1, let's wait for CI
Oops, I believe these CI failures are not caused by me...
Yes, the CI failures are unrelated.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 5978729. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-37293: [C++][Parquet] Encoding: Add Benchmark for DELTA_BYTE_ARRAY (apache#37641)

Closes: apache#37293

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Add a benchmark for DELTA_BYTE_ARRAY in Parquet, and do a tiny optimization.
What changes are included in this PR?
Add a benchmark for DELTA_BYTE_ARRAY in Parquet, and do a tiny optimization.
Are these changes tested?
No.
Are there any user-facing changes?
No.