-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers #35565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers #35565
Conversation
|
@ursabot please benchmark |
|
Benchmark runs are scheduled for baseline = 14f9bf9 and contender = 3aed21eb2bd5546bbc74688d28d9f5b18ea67adf. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
|
@ursabot please benchmark |
|
Benchmark runs are scheduled for baseline = 14f9bf9 and contender = ca44c360dc793066adc7a4d7557cd2ec3fefa713. Results will be available as each benchmark for each run completes. |
I'm running the benchmarks again but, as best I can tell, these regressions are noise, though it is quite difficult to say for sure. E.g. on the 9960x there are 5 non-js regressions and 21 non-js improvements. On arm I see the same improvements and a bunch of regressions unrelated to the PR. |
|
I have tested the reproduction case in the original issue. Without this change I get about 10x worse than arrow 11. With this change I get the same performance as arrow 11. |
|
CC @rtpsw / @sanjibansg |
cpp/src/arrow/util/align_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be recursive for RUN_END_ENCODED?
max(GetMallocValuesAlignment(*array.run_ends()),
GetMallocValuesAlignment(*array.values()));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the function was named with something like RequiredSelfBuffersAlignment, I wouldn't make the bad assumption I've made above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been renamed to RequiredValueAlignmentForBuffer and it takes in a type now instead of an array.
cpp/src/arrow/util/align_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name suggestion: RequiredBuffersAlignment. I was confused by the "Get" when it's not getting the alignment of the existing buffers, but calculating the desired/required alignment for this array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to RequiredValueAlignmentForBuffer
I am not sure why it doesn't show up on the landing page linked from the bot comment (https://conbench.ursa.dev/compare/runs/80cbe13b10ca4d39b05e59e4b4d5037d...6eab503c9deb4e8ba81b02093e4604dc/), but looking at some of the individual impacted benchmarks, they seem to show a good speed-up for this specific run. For example: |
Whoops, they of course do show up, just have to sort by the z-score or change percentage in decreasing order, so that all the positive values show first. |
cpp/src/arrow/util/align_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since an array can have several buffers, I think this is too coarse-grained. Let's have something like:
int GetRequiredBufferAlignment(const DataType& type, int buffer_index) {
if (buffer_index == 0) {
// Either null bitmap or 8-bit union type ids
return 1;
}
switch (type.id()) {
case Type::INT16:
case Type::UINT16:
case Type::HALF_FLOAT:
return 2;
case Type::INT32:
case Type::UINT32:
case Type::FLOAT:
case Type::DATE32:
case Type::TIME32:
case Type::LIST: // Offsets may be cast to int32_t*, data is in child array
case Type::MAP: // This is a list array
case Type::DENSE_UNION: // Has an offsets buffer of int32_t*
case Type::INTERVAL_MONTHS: // Stored as int32_t*
return 4;
case Type::INT64:
case Type::UINT64:
case Type::DOUBLE:
case Type::LARGE_LIST: // Offsets may be cast to int64_t*
case Type::DATE64:
case Type::TIME64:
case Type::TIMESTAMP:
case Type::DURATION:
case Type::INTERVAL_DAY_TIME: // Stored as two contiguous 32-bit integers but may be
// cast to struct* containing both integers
return 8;
case Type::INTERVAL_MONTH_DAY_NANO: // Stored as two 32-bit integers and a 64-bit
// integer
return 16;
case Type::STRING:
case Type::BINARY: // Offsets may be cast to int32_t*, data is only uint8_t*
return (buffer_index == 1) ? 4 : 1;
case Type::LARGE_STRING:
case Type::LARGE_BINARY: // Offsets may be cast to int64_t*
return (buffer_index == 1) ? 8 : 1;
default:
// Everything else doesn't have buffers with non-trivial alignement requirements
return 1;
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, except I return 1 if buffer_index > 1 also (e.g. binary arrays)
I'm not sure where you got that from the C++ spec? If I try to understand https://en.cppreference.com/w/cpp/language/reinterpret_cast (especially the "Type Aliasing" paragraph), it seems only
|
|
Overall I'm rather lukewarm about the whole alignment concerns. "Alignment requirements" are generally extremely vague about why the requirements actually exist. Sometimes it's about not crashing on niche CPUs (such as SPARC), sometimes it's about not crashing on little-used SIMD instructions (x86 aligned loads), sometimes it's about avoiding undefined behaviour (which is mostly a language compliance concern, as far as alignment is concerned), sometimes it's about getting better performance (by avoiding memory accesses straddling cache lines or - worse - page boundaries). So ideally I think we should remove the entire code that reallocates buffers to fix their alignment, but short from that we should strive to be as conservative and granular as possible (see the suggestion I posted above). |
|
Also note that the discussion in #32276 was inconclusive. It started with claims of misalignment and then someone mentioned dereferencing a null pointer. Without a more concrete reproducer it is difficult to understand what happened there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So ideally I think we should remove the entire code that reallocates buffers to fix their alignment, but short from that we should strive to be as conservative and granular as possible (see the suggestion I posted above).
I tend to agree with narrowing the scope of alignment enforcement: it's part of the contract of the arrow format that buffers be aligned appropriately for the integers etc stored in them. Therefore to my mind we should rather avoid ever constructing an ArrayData which would fail CheckAlignment. If there is a producer which doesn't provide that guarantee (flight, GH-32276) then alignment should be enforced on those buffers before they escape the scope of the faulty producer. In the context of this PR, that'd mean: we only enforce alignment on buffers coming out of flight- and that may be the only producer for which we check alignment until some other producer is proven to occasionally emit misaligned data.
e.g. casting uint8_t* to uint32_t* is only safe if the buffer has 4-byte alignment
To be more pedantic about this: It is only legal to reinterpret_cast from a pointer to bytes to some other type T if those bytes are storage for an existing (lifetime has started) object of type T (or a type similar to T). The bytes which we're casting must therefore be aligned to alignof(T) because an object of T can never start its lifetime in unaligned storage. When we mmap some bytes which correspond to a buffer of signed 32 bit integers, we're implicitly treating those bytes as storage where integers have started their lifetime (since we're not in c++23 we can't make this explicit with start_lifetime_as_array). Since integers are fairly trivial (and this sort of technically-UB-but-what-could-go-wrong is so widely depended upon in C++ development), this is a liberty I don't expect any compiler to punish us for.
To be nitpicky, the Arrow format doesn't require anything about alignment. Implementations can choose to enforce some requirements if they wish to (which implies the kind of opportunistic realignment shenanigans discussed in this PR). Also, my view is that the problems with unaligned buffers are largely theoretical until a concrete problem is reliably diagnosed as such. |
|
I tried for a while to reproduce some kind of alignment related issue on godbolt and was unsuccessful. I then did some research. Most of the examples I could find either: A. Demonstrated potential performance penalties (which were often cpu-specific) associated with the unaligned loads I think A is mitigated by the fact that any performance benefit is unlikely to outweigh the cost of the allocation / copy required by EnsureAlignment. B is not important as we don't support 32-bit builds (outside of maybe windows 32-bit) that I'm aware of. The root of all this discussion is from some assertions in (what is now) So how do we feel about:
|
…e a compiler error
… unaligned buffer behavior
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
… we are properly looking for the offsets at index 2 and not index 1
8c23e4f to
ad6e3df
Compare
|
I believe I have now addressed all review comments. |
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM!
…4-byte aligned buffers to requiring value-aligned buffers (#35565) Various compute kernels and Acero internals rely on type punning. This is only safe when the buffer has appropriate alignment (e.g. casting uint8_t* to uint32_t* is only safe if the buffer has 4-byte alignment). To avoid errors we enforced 64-byte alignment in Acero. However, this is too strict. While Arrow's allocators will always generate 64-byte aligned buffers this is not the case for numpy's allocators (and presumably many others). This PR relaxes the constraint so that we only require value-aligned buffers. The main complexity here is determining which buffers need aligned and how much. A special flag kMallocAlignment is added which can be specified when calling CheckAlignment or EnforceAlignment to only require value-alignment and not a particular number. Yes No * Closes: #35498 Lead-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
|
@github-actions crossbow submit test-build-vcpkg-win |
|
Revision: c434bb2 Submitted crossbow builds: ursacomputing/crossbow @ actions-7cac2ee9c0
|
|
Benchmark runs are scheduled for baseline = 05fe0d2 and contender = 1951a1a. 1951a1a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Rationale for this change
Various compute kernels and Acero internals rely on type punning. This is only safe when the buffer has appropriate alignment (e.g. casting uint8_t* to uint32_t* is only safe if the buffer has 4-byte alignment). To avoid errors we enforced 64-byte alignment in Acero. However, this is too strict. While Arrow's allocators will always generate 64-byte aligned buffers this is not the case for numpy's allocators (and presumably many others). This PR relaxes the constraint so that we only require value-aligned buffers.
What changes are included in this PR?
The main complexity here is determining which buffers need aligned and how much. A special flag kMallocAlignment is added which can be specified when calling CheckAlignment or EnforceAlignment to only require value-alignment and not a particular number.
Are these changes tested?
Yes
Are there any user-facing changes?
No