GH-41974: [C++][Compute] Support more precise pre-allocation and more pre-allocated types for ScalarExecutor and VectorExecutor #41975
Conversation
cpp/src/arrow/compute/exec.cc
Outdated
ComputeDataPreallocate ensures that every element of data_preallocated_ satisfies the DCHECK.
It's unnecessary to keep this here.
Maybe keep this, since it only runs in debug builds?
We can keep the DCHECK you suggested before (https://github.com/apache/arrow/pull/41975/files#r1636669846). In essence, they check the same thing; otherwise there would be two duplicate DCHECKs.
cpp/src/arrow/compute/exec.cc
Outdated
have_chunked_arrays is only used in the kernel_->can_execute_chunkwise == false branch.
I like this code move, though I could only understand it by looking at the commits one by one.
cpp/src/arrow/compute/exec.cc
Outdated
We could use a more precise pre-allocation strategy by checking the input batches, as ScalarExecutor does.
This precise strategy (checking the null status of the input batches) lets us skip some unnecessary pre-allocations, and it improves situations like the one below:
before:
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 4395737 ns 4393473 ns 138 items_per_second=119.333M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10 5109157 ns 5106714 ns 129 items_per_second=102.666M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2 6586773 ns 6583881 ns 104 items_per_second=79.6321M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1 1868864 ns 1867719 ns 364 items_per_second=280.71M/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0 3133277 ns 3131282 ns 248 items_per_second=167.436M/s null_percent=0 size=524.288k
after:
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 1813275 ns 1812657 ns 392 items_per_second=289.237M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10 2711716 ns 2710747 ns 255 items_per_second=193.411M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2 4460442 ns 4459029 ns 154 items_per_second=117.579M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1 248396 ns 248270 ns 2695 items_per_second=2.11176G/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0 878407 ns 878028 ns 769 items_per_second=597.12M/s null_percent=0 size=524.288k
FWIW I'm working on Take at the moment and setting the chunked exec kernels #41700
Thanks for the reminder, I will take a look.
The current PR only optimizes the chunked-array execution logic on the null pre-allocation side.
Force-pushed from bdcd19d to e2f1a1f.
cpp/src/arrow/compute/exec.cc
Outdated
Would the name "invalid" be ambiguous here? Could it be ALL_NULL?
Yes, it can be replaced by ALL_NULL with ALL_NULL = 3, so we can keep the original NullGeneralization::type of each chunk when doing the & operation in type Get(const ChunkedArray& chunk_array).
cpp/src/arrow/compute/exec.cc
Outdated
What would the NA type do here?
data_preallocated_ will be filled in ComputeDataPreallocate, and the NA type will not be added to data_preallocated_.
Yes, it seems this could be a DCHECK?
fixed_size_binary<0> can have bit_width == 0
fixed_size_binary<0> can have bit_width == 0
Yes, fixed_size_binary<0> can be added to data_preallocated_ normally in ComputeDataPreallocate, and that function makes sure every element in data_preallocated_ satisfies >= 0. So we just add a DCHECK here.
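A minimal standalone illustration of that corner case (not code from this PR): fixed_size_binary(0) is a valid type whose fixed width is zero bits, so a pre-allocation check has to accept bit_width == 0, i.e. a DCHECK_GE rather than a DCHECK_GT.

```cpp
#include <iostream>
#include <arrow/api.h>
#include <arrow/util/checked_cast.h>

int main() {
  // fixed_size_binary<0> has byte_width == 0, hence bit_width() == 0.
  auto ty = arrow::fixed_size_binary(0);
  const auto& fw = arrow::internal::checked_cast<const arrow::FixedWidthType&>(*ty);
  std::cout << fw.bit_width() << std::endl;  // prints 0
  return 0;
}
```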
cpp/src/arrow/compute/exec.cc
Outdated
Would this introduce an undefined value here? (return 0)
Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).
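A generic illustration of the loop shape under discussion (hypothetical code, not taken from the PR): starting the iteration at index 0 with a well-defined initial value avoids any undefined result for the empty case and gives the optimizer simpler bounds to reason about.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: combine per-chunk flags without peeling off the first element.
int CombineFlags(const std::vector<int>& per_chunk_flags) {
  int combined = 0;  // well-defined even when there are no chunks
  for (std::size_t i = 0; i < per_chunk_flags.size(); ++i) {
    combined |= per_chunk_flags[i];
  }
  return combined;
}
```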
cpp/src/arrow/compute/exec.cc
Outdated
Current:
    const ArrayData* curr_chunk = chunk_array.chunk(chunk_idx)->data().get();
    value.SetArray(*curr_chunk);
Suggested:
    const ArrayData& curr_chunk = *chunk_array.chunk(chunk_idx)->data();
    value.SetArray(curr_chunk);
Or construct an ArraySpan?
cpp/src/arrow/compute/exec.cc
Outdated
Yes, it seems this could be a DCHECK?
cpp/src/arrow/compute/exec.cc
Outdated
Maybe keep this, since it only runs in debug builds?
cpp/src/arrow/compute/exec.cc
Outdated
What's the difference between it and SetupPreallocation in Scalar?
Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?
Was this code churn necessary to achieve the goals of this PR?
Yes, that's one of the goals of this PR: we can use a more precise pre-allocation approach for the vector executor.
The main change is to support detecting null values when kernel_->null_handling == NullHandling::INTERSECTION, so that some unnecessary pre-allocations can be skipped, including null detection for chunked arrays, which was not done before. The logic here is the same as ScalarExecutor::SetupPreallocation.
Because this logic is decoupled from the scheduling in VectorExecutor::Execute, it seemed better to extract it.
What's the difference between it and SetupPreallocation in Scalar?
The difference between this function and ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide whether PrepareOutput has to be executed per chunk via span_iterator_; VectorExecutor does not need that.
The difference between this function and ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide whether PrepareOutput has to be executed per chunk via span_iterator_; VectorExecutor does not need that.
Do you think we can extract SetupPreallocation into the base class, pull the common logic out, and then implement the chunked-array handling with:
    void SetupPreallocation(...) final {
      Base::SetupPreallocation(...);
      ...
    }
?
And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?
And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?
The kernel output is never null, so a validity bitmap does not need to be allocated.
validity_preallocated_ only pre-allocates the validity bitmap buffer, which is unnecessary for OUTPUT_NOT_NULL.
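A self-contained sketch of the mapping described above; the enum names mirror arrow::compute::NullHandling, but the function and everything else here are illustrative stand-ins rather than the PR's actual code.

```cpp
// Illustrative only: how null_handling could drive validity-bitmap pre-allocation.
enum class NullHandling {
  INTERSECTION,             // output validity is the intersection of input validities
  COMPUTED_PREALLOCATE,     // kernel computes validity into a pre-allocated bitmap
  COMPUTED_NO_PREALLOCATE,  // kernel allocates its own validity bitmap if needed
  OUTPUT_NOT_NULL           // output never contains nulls
};

enum class NullGen { ALL_VALID, PERHAPS_NULL, ALL_NULL };

bool NeedValidityPreallocation(NullHandling handling, NullGen inputs) {
  switch (handling) {
    case NullHandling::COMPUTED_PREALLOCATE:
      return true;
    case NullHandling::INTERSECTION:
      // Skip the bitmap when no input can contain nulls.
      return inputs != NullGen::ALL_VALID;
    case NullHandling::OUTPUT_NOT_NULL:          // output is never null
    case NullHandling::COMPUTED_NO_PREALLOCATE:  // kernel handles allocation itself
    default:
      return false;
  }
}
```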
cpp/src/arrow/compute/exec.cc
Outdated
list-views don't need an extra offset/size, so you don't need the added_length=1.
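A small self-contained sketch of that point (stand-in names, not the PR's code): a classic list type pre-allocates length + 1 offsets, while a list-view type pre-allocates an offsets buffer and a sizes buffer with exactly length entries each.

```cpp
#include <vector>

// Stand-ins for arrow::Type::type and the buffer pre-allocation descriptor.
enum class Kind { List, LargeList, ListView, LargeListView };

struct Prealloc {
  int bit_width;
  int added_length;  // extra slots beyond the output length
};

std::vector<Prealloc> DataPreallocationFor(Kind kind) {
  switch (kind) {
    case Kind::List:
      return {{32, /*added_length=*/1}};  // offsets: length + 1 entries
    case Kind::LargeList:
      return {{64, /*added_length=*/1}};
    case Kind::ListView:
      return {{32, 0}, {32, 0}};          // offsets and sizes: length entries each
    case Kind::LargeListView:
      return {{64, 0}, {64, 0}};
  }
  return {};
}
```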
cpp/src/arrow/compute/exec.cc
Outdated
Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).
cpp/src/arrow/compute/exec.cc
Outdated
You can use a for loop here.
cpp/src/arrow/compute/exec.cc
Outdated
If num_chunks() == 0, we should return ALL_VALID instead of ALL_NULL or PERHAPS_NULL, like Get does above (it doesn't check .length() at all, which I think was a missed opportunity to make ALL_VALID more likely).
cpp/src/arrow/compute/exec.cc
Outdated
FWIW I'm working on Take at the moment and setting the chunked exec kernels #41700
cpp/src/arrow/compute/exec.cc
Outdated
Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?
Force-pushed from 2979750 to 5e2207d.
Force-pushed from 5e2207d to c9a13a8.
felipecrv
left a comment
I will finish the review later. I need to check the code movement more carefully. It would help to have a commit history with two commits:
- refactoring the setup preallocation logic in-place
- extracting that refactored logic to a function
Code movement + changes mixed together takes more time to review.
cpp/src/arrow/compute/exec.cc
Outdated
fixed_size_binary<0> can have bit_width == 0
Force-pushed from d0a599b to cf02207.
It's my fault. Thanks so much for your review @felipecrv @mapleFU.
felipecrv
left a comment
I finally understand the first goal of the PR. And I disapprove of it. I want kernel dispatching logic to depend solely on types and immutable properties of the kernels.
Extending the range of types that are pre-allocated is definitely a good idea though.
cpp/src/arrow/compute/exec.cc
Outdated
I don't think this logic is safe. If null_handling is INTERSECTION the function might want to follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection.
And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.
If you want the speed benefits of not pre-allocating a validity bitmap when not necessary, declare the kernel as COMPUTED_NO_PREALLOCATE and work on the kernel logic to allocate only when needed. The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection
Are you talking about PropagateNulls? Currently it seems this logic is used when the kernel sets INTERSECTION.
If not, my understanding is that if the kernel's internal execution must follow a single code path that uses the pre-allocated buffer, COMPUTED_PREALLOCATE can be set statically, which should be safer for that kernel?
Back to the validity-buffer pre-allocation: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.
And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.
From a performance perspective, the current logic does introduce some runtime branches for INTERSECTION, but compared to PropagateNulls, the pass through NullGeneralization::Get introduces very few branches.
The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
Yes, the current pre-allocation logic for INTERSECTION, which introduced NullPropagator and NullGeneralization, is complicated.
@pitrou What do you think of this PR for VectorExecutor's INTERSECTION validity-bitmap preallocation?
It seems that you have changed the logic here before.
Besides that, the current ScalarExecutor already pre-allocates based on NullGeneralization in INTERSECTION mode.
Sorry, by branches I meant branches in the state-space the execution framework and the kernels might traverse. Meaning that without duplication of test cases for every kernel, we might hide a latent bug until an unlikely array configuration causes a SIGSEGV on the non-pre-allocated validity buffer.
Back to the validity-buffer pre-allocation: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.
By statically I mean a one-time configuration of the kernel_ when it's added to the function in the registry.
For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies pre-allocated validity bitmap buffer. No matter what arrays are passed as input.
For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies a pre-allocated validity bitmap buffer, no matter what arrays are passed as input.
I see, thank you for your explanation @felipecrv .
Do you think the changes in the third commit of this PR are reasonable? If so, I will revert the first two commits. The third commit is mainly related to pre-allocation of the output data buffer and refactors part of the code, which can bring the benefits mentioned by @mapleFU.
In addition, I will create an issue to track pre-allocation of the validity buffer in INTERSECTION mode, because currently ScalarExecutor does pre-allocation according to NullGeneralization in that mode.
I think the handling of list-view types in the pre-allocation code is good. The 3rd commit has a mix of things in it.
Yes, apart from the [Large]ListView support and chunked-array NullGeneralization support, the rest are minor refactorings (some DCHECKs are unnecessary for now).
@felipecrv I'm just thinking about whether there are some techniques to denote an …
It's tricky to re-use, given that Buffers wrapped in shared_ptrs are supposed to be immutable. You never know if someone else is also holding a reference to that shared_ptr.
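A tiny illustration of that constraint (generic C++, ignoring concurrency, not Arrow API usage from this PR): storage owned through a shared_ptr can only be safely mutated in place when the current holder is provably the sole owner.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical reuse check: only scribble over the bytes if nobody else
// can observe them through another shared_ptr copy.
void MaybeReuse(std::shared_ptr<std::vector<uint8_t>>& buf) {
  if (buf.use_count() == 1) {
    // Sole owner: reusing the storage cannot break another reader.
    std::fill(buf->begin(), buf->end(), 0);
  } else {
    // Shared: treat it as immutable and allocate a fresh copy instead.
    buf = std::make_shared<std::vector<uint8_t>>(buf->begin(), buf->end());
  }
}
```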
Force-pushed from c8616c6 to 70d239e.
cpp/src/arrow/compute/exec.cc
Outdated
This does not follow the same constraint as Array above: "Do not count the bits if they haven't been counted already".
I think you should instead just iterate on the chunks, such as (untested):
    static type Get(const ArraySpan& arr) {
      // Do not count the bits if they haven't been counted already
      if ((arr.null_count == 0) || (arr.buffers[0].data == nullptr)) {
        return ALL_VALID;
      }
      if (arr.null_count == arr.length) {
        return ALL_NULL;
      }
      return PERHAPS_NULL;
    }

    static type Get(const ChunkedArray& chunk_array) {
      std::optional<type> current_gen;
      for (const auto& chunk : chunk_array.chunks()) {
        if (chunk->length() == 0) {
          continue;
        }
        // Wrap the chunk's ArrayData in an ArraySpan so the overload above can be reused.
        const type chunk_gen = Get(ArraySpan(*chunk->data()));
        if (current_gen.has_value() && chunk_gen != *current_gen) {
          return PERHAPS_NULL;
        }
        current_gen = chunk_gen;
      }
      return current_gen.value_or(ALL_VALID);
    }
Thanks for this, addressed.
2. support NullGeneralization for chunked-array 3. simple refactor some execute logic and remove finished TODO
Force-pushed from ffda81d to 7bbe96b.
cc @pitrou, I think this can be moved forward?
Rationale for this change
Mainly for the following improvements and code simplification:
- NullGeneralization does not support checking the null status of chunked arrays, which may prevent some pre-allocation optimizations.
- ComputeDataPreallocate does not support pre-allocation for [Large]ListView.
What changes are included in this PR?
Described above.
Are these changes tested?
Yes, covered by existing tests.
Are there any user-facing changes?
No