GH-39583: [C++] Fix the issue of ExecBatchBuilder when appending consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) #39585
if (column_metadata.fixed_length == 0) {
  num_rows_left = std::max(num_rows_left, 8) - 8;
  ++num_bytes_skipped;
This was for boolean data, right? Is it ok to remove this?
Yes. Fixed-size types of length 0/1/2/4/8 use element-wise copying, i.e., there is no "word-wise then tail-bytes" copying for them. They never reach this method:
arrow/cpp/src/arrow/compute/light_array.cc
Lines 513 to 546 in 3acc2ea
case 0:
  CollectBits(source->buffers[1]->data(), source->offset, target->mutable_data(1),
              num_rows_before, num_rows_to_append, row_ids);
  break;
case 1:
  Visit(source, num_rows_to_append, row_ids,
        [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
          target->mutable_data(1)[num_rows_before + i] = *ptr;
        });
  break;
case 2:
  Visit(
      source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        reinterpret_cast<uint16_t*>(target->mutable_data(1))[num_rows_before + i] =
            *reinterpret_cast<const uint16_t*>(ptr);
      });
  break;
case 4:
  Visit(
      source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        reinterpret_cast<uint32_t*>(target->mutable_data(1))[num_rows_before + i] =
            *reinterpret_cast<const uint32_t*>(ptr);
      });
  break;
case 8:
  Visit(
      source, num_rows_to_append, row_ids,
      [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
        reinterpret_cast<uint64_t*>(target->mutable_data(1))[num_rows_before + i] =
            *reinterpret_cast<const uint64_t*>(ptr);
      });
  break;
So I think we can simplify this method by no longer special-casing the boolean type; hence the ARROW_DCHECK added several lines above.
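To illustrate why the element-wise path needs no tail handling, here is a minimal sketch of a gather by row id. `GatherElementWise` is a hypothetical helper written for this example, not Arrow's `Visit` or any other Arrow API, and it uses `std::vector` instead of Arrow buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of Arrow): appends source[row_ids[i]] to
// `target` for every i, copying one element at a time.
template <typename T>
void GatherElementWise(const std::vector<T>& source,
                       const std::vector<uint16_t>& row_ids,
                       std::vector<T>* target) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy requires a trivially copyable element type");
  const std::size_t num_rows_before = target->size();
  target->resize(num_rows_before + row_ids.size());
  for (std::size_t i = 0; i < row_ids.size(); ++i) {
    assert(static_cast<std::size_t>(row_ids[i]) < source.size());
    // Exactly sizeof(T) bytes are read and written per row, so the copy never
    // touches memory past the end of `source`, even when the id of the last
    // source row is repeated.
    std::memcpy(&(*target)[num_rows_before + i], &source[row_ids[i]], sizeof(T));
  }
}
```

Because every copy is exactly one element wide, there is no over-read to guard against, which is consistent with the point above that these widths do not need the tail-row computation.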
@zanmato1984 Can you rebase on git main?
Force-pushed from b426fe4 to 20d9a38.
Sure. Done.
@github-actions crossbow submit -g cpp
Revision: 20d9a38 Submitted crossbow builds: ursacomputing/crossbow @ actions-56dd6fe473
pitrou left a comment
LGTM, thank you @zanmato1984. The original code quality is quite bad but this improves things slightly.
@raulcd If you do another RC, this would be a good candidate fix to add.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 1dc3b81. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
…g consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (apache#39585)
### Rationale for this change
apache#39583 is a follow-up to apache#32570 (fixed by apache#39234). That issue and its fix only addressed variable-length types. It turns out that fixed-size types have the same issue.
### What changes are included in this PR?
Apply the same fix as apache#39234 to fixed-size types.
### Are these changes tested?
UT included.
### Are there any user-facing changes?
* Closes: apache#39583
Authored-by: zanmato1984 <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
#39583 is a follow-up to #32570 (fixed by #39234). That issue and its fix only addressed variable-length types. It turns out that fixed-size types have the same issue.
What changes are included in this PR?
Apply the same fix as #39234 to fixed-size types.
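The failure mode named in the title can be sketched as follows. This is an illustration under stated assumptions, not the code of this PR or of #39234. `NumRowsSafeForWordCopy` is a hypothetical function, `row_ids` is assumed to be in non-decreasing order, and the word-wise copy is assumed to read up to `num_tail_bytes_to_skip` bytes from a row's start. The idea is to walk `row_ids` from the back and count each source row's bytes once, dropping consecutive repeats of the same id together with it. Counting the bytes once per entry instead would let repeats of the last source row reach the threshold on their own, and an earlier entry with that same id would then be treated as safe for word-wise copying.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch, not Arrow's implementation. Returns how many leading
// entries of `row_ids` may be copied word-wise; the remaining tail entries
// must be copied byte by byte so that no read runs past the source buffer.
// Assumption: `row_ids` is sorted in non-decreasing order.
int NumRowsSafeForWordCopy(const std::vector<uint16_t>& row_ids,
                           int fixed_length, int num_tail_bytes_to_skip) {
  int num_rows_left = static_cast<int>(row_ids.size());
  int num_bytes_skipped = 0;
  while (num_rows_left > 0 && num_bytes_skipped < num_tail_bytes_to_skip) {
    --num_rows_left;
    const uint16_t row_id_removed = row_ids[num_rows_left];
    // Count the bytes of this source row once...
    num_bytes_skipped += fixed_length;
    // ...and drop every consecutive repeat of the same id along with it,
    // since repeats add no extra bytes between the row and the buffer end.
    while (num_rows_left > 0 && row_ids[num_rows_left - 1] == row_id_removed) {
      --num_rows_left;
    }
  }
  return num_rows_left;
}
```

For example, with 3-byte rows, an 8-byte threshold, and row ids `{7, 9, 9, 9, 9}`, counting per entry would stop after three entries (9 bytes) and leave one entry with id 9 on the word-wise path. The sketch above leaves none.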
Are these changes tested?
Yes, a unit test is included.
Are there any user-facing changes?
Closes: #39583