GH-41974: [C++][Compute] Support more precise pre-allocation and more pre-allocated types for ScalarExecutor and VectorExecutor #41975
Conversation
cpp/src/arrow/compute/exec.cc
Outdated
ComputeDataPreallocate ensures that every element of data_preallocated_ satisfies the DCHECK.
It's unnecessary to keep this here.
Maybe keep this, since it only runs in debug builds?
We can keep the DCHECK you suggested before (https://github.com/apache/arrow/pull/41975/files#r1636669846). In essence, they check the same thing; otherwise there would be two duplicate DCHECKs.
cpp/src/arrow/compute/exec.cc
Outdated
have_chunked_arrays is only used in the kernel_->can_execute_chunkwise == false branch.
I like this code move, though I could only understand it by looking at the commits one by one.
cpp/src/arrow/compute/exec.cc
Outdated
We could use a more precise pre-allocation strategy by checking the input batches, as ScalarExecutor does.
This precise strategy (checking the null status of the input batches) lets us skip some unnecessary pre-allocations, and it improves situations like the one below:
before:
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 4395737 ns 4393473 ns 138 items_per_second=119.333M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10 5109157 ns 5106714 ns 129 items_per_second=102.666M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2 6586773 ns 6583881 ns 104 items_per_second=79.6321M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1 1868864 ns 1867719 ns 364 items_per_second=280.71M/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0 3133277 ns 3131282 ns 248 items_per_second=167.436M/s null_percent=0 size=524.288k
after:
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 1813275 ns 1812657 ns 392 items_per_second=289.237M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10 2711716 ns 2710747 ns 255 items_per_second=193.411M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2 4460442 ns 4459029 ns 154 items_per_second=117.579M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1 248396 ns 248270 ns 2695 items_per_second=2.11176G/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0 878407 ns 878028 ns 769 items_per_second=597.12M/s null_percent=0 size=524.288k
FWIW I'm working on Take at the moment and setting the chunked exec kernels #41700
Thanks for the reminder, I will take a look.
The current PR only optimizes the chunked-array execution logic on the null pre-allocation side.
Force-pushed from bdcd19d to e2f1a1f.
cpp/src/arrow/compute/exec.cc
Outdated
Would the name "invalid" be ambiguous here? Could it be ALL_NULL?
Yes, it can be replaced by ALL_NULL with ALL_NULL = 3, so we can keep the original NullGeneralization::type of each chunk when doing the & operation in type Get(const ChunkedArray& chunk_array).
cpp/src/arrow/compute/exec.cc
Outdated
What would the NA type do here?
data_preallocated_ will be filled in ComputeDataPreallocate, and the NA type will not be added to data_preallocated_.
Yes, it seems this could be a DCHECK?
fixed_size_binary<0> can have bit_width == 0
fixed_size_binary<0> can have bit_width == 0
Yes, fixed_size_binary<0> can be added to data_preallocated_ normally in ComputeDataPreallocate, and that function makes sure every element in data_preallocated_ satisfies >= 0. So we just add a DCHECK here.
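A minimal standalone illustration of that corner case (not code from this PR): fixed_size_binary(0) is a valid type whose fixed width is zero bits, so a pre-allocation check has to accept bit_width == 0, i.e. a DCHECK_GE rather than a DCHECK_GT.

```cpp
#include <iostream>
#include <arrow/api.h>
#include <arrow/util/checked_cast.h>

int main() {
  // fixed_size_binary<0> has byte_width == 0, hence bit_width() == 0.
  auto ty = arrow::fixed_size_binary(0);
  const auto& fw = arrow::internal::checked_cast<const arrow::FixedWidthType&>(*ty);
  std::cout << fw.bit_width() << std::endl;  // prints 0
  return 0;
}
```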
cpp/src/arrow/compute/exec.cc
Outdated
Would this introduce an undefined value here? (return 0)
Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).
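A generic illustration of the loop shape under discussion (hypothetical code, not taken from the PR): starting the iteration at index 0 with a well-defined initial value avoids any undefined result for the empty case and gives the optimizer simpler bounds to reason about.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: combine per-chunk flags without peeling off the first element.
int CombineFlags(const std::vector<int>& per_chunk_flags) {
  int combined = 0;  // well-defined even when there are no chunks
  for (std::size_t i = 0; i < per_chunk_flags.size(); ++i) {
    combined |= per_chunk_flags[i];
  }
  return combined;
}
```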
cpp/src/arrow/compute/exec.cc
Outdated
Current:
    const ArrayData* curr_chunk = chunk_array.chunk(chunk_idx)->data().get();
    value.SetArray(*curr_chunk);
Suggested:
    const ArrayData& curr_chunk = *chunk_array.chunk(chunk_idx)->data();
    value.SetArray(curr_chunk);
Or construct an ArraySpan?
cpp/src/arrow/compute/exec.cc
Outdated
Yes, it seems this could be a DCHECK?
cpp/src/arrow/compute/exec.cc
Outdated
Maybe keep this, since it only runs in debug builds?
cpp/src/arrow/compute/exec.cc
Outdated
What's the difference between it and SetupPreallocation in Scalar?
Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?
Was this code churn necessary to achieve the goals of this PR?
Yes, that's one of the goals of this PR: we can use a more precise pre-allocation approach for the vector executor.
The main change is to support detecting null values when kernel_->null_handling == NullHandling::INTERSECTION, so that some unnecessary pre-allocations can be skipped, including null detection for chunked arrays, which was not done before. The logic here is the same as ScalarExecutor::SetupPreallocation.
Because this logic is decoupled from the scheduling in VectorExecutor::Execute, it seemed better to extract it.
What's the difference between it and SetupPreallocation in Scalar?
The difference between this function and ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide whether PrepareOutput has to be executed per chunk via span_iterator_; VectorExecutor does not need that.
The difference between this function and ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide whether PrepareOutput has to be executed per chunk via span_iterator_; VectorExecutor does not need that.
Do you think we can extract SetupPreallocation into the base class, pull the common logic out, and then implement the chunked-array handling with:
    void SetupPreallocation(...) final {
      Base::SetupPreallocation(...);
      ...
    }
?
And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?
And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?
The kernel output is never null, so a validity bitmap does not need to be allocated.
validity_preallocated_ only pre-allocates the validity bitmap buffer, which is unnecessary for OUTPUT_NOT_NULL.
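A self-contained sketch of the mapping described above; the enum names mirror arrow::compute::NullHandling, but the function and everything else here are illustrative stand-ins rather than the PR's actual code.

```cpp
// Illustrative only: how null_handling could drive validity-bitmap pre-allocation.
enum class NullHandling {
  INTERSECTION,             // output validity is the intersection of input validities
  COMPUTED_PREALLOCATE,     // kernel computes validity into a pre-allocated bitmap
  COMPUTED_NO_PREALLOCATE,  // kernel allocates its own validity bitmap if needed
  OUTPUT_NOT_NULL           // output never contains nulls
};

enum class NullGen { ALL_VALID, PERHAPS_NULL, ALL_NULL };

bool NeedValidityPreallocation(NullHandling handling, NullGen inputs) {
  switch (handling) {
    case NullHandling::COMPUTED_PREALLOCATE:
      return true;
    case NullHandling::INTERSECTION:
      // Skip the bitmap when no input can contain nulls.
      return inputs != NullGen::ALL_VALID;
    case NullHandling::OUTPUT_NOT_NULL:          // output is never null
    case NullHandling::COMPUTED_NO_PREALLOCATE:  // kernel handles allocation itself
    default:
      return false;
  }
}
```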
cpp/src/arrow/compute/exec.cc
Outdated
list-views don't need an extra offset/size, so you don't need the added_length=1.
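A small self-contained sketch of that point (stand-in names, not the PR's code): a classic list type pre-allocates length + 1 offsets, while a list-view type pre-allocates an offsets buffer and a sizes buffer with exactly length entries each.

```cpp
#include <vector>

// Stand-ins for arrow::Type::type and the buffer pre-allocation descriptor.
enum class Kind { List, LargeList, ListView, LargeListView };

struct Prealloc {
  int bit_width;
  int added_length;  // extra slots beyond the output length
};

std::vector<Prealloc> DataPreallocationFor(Kind kind) {
  switch (kind) {
    case Kind::List:
      return {{32, /*added_length=*/1}};  // offsets: length + 1 entries
    case Kind::LargeList:
      return {{64, /*added_length=*/1}};
    case Kind::ListView:
      return {{32, 0}, {32, 0}};          // offsets and sizes: length entries each
    case Kind::LargeListView:
      return {{64, 0}, {64, 0}};
  }
  return {};
}
```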
cpp/src/arrow/compute/exec.cc
Outdated
Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).
cpp/src/arrow/compute/exec.cc
Outdated
You can use a for loop here.
cpp/src/arrow/compute/exec.cc
Outdated
If num_chunks() == 0, we should return ALL_VALID instead of ALL_NULL or PERHAPS_NULL, like Get does above (it doesn't check .length() at all, which I think was a missed opportunity to make ALL_VALID more likely).
cpp/src/arrow/compute/exec.cc
Outdated
FWIW I'm working on Take at the moment and setting the chunked exec kernels #41700
cpp/src/arrow/compute/exec.cc
Outdated
Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?
Force-pushed from 2979750 to 5e2207d.
Force-pushed from 5e2207d to c9a13a8.
felipecrv
left a comment
I will finish the review later. I need to check the code movement more carefully. It would help to have a commit history with two commits:
- refactoring the setup preallocation logic in-place
- extracting that refactored logic to a function
Code movement + changes mixed together takes more time to review.
cpp/src/arrow/compute/exec.cc
Outdated
fixed_size_binary<0> can have bit_width == 0
Force-pushed from d0a599b to cf02207.
It's my fault. Thanks so much for your review @felipecrv @mapleFU.
felipecrv
left a comment
I finally understand the first goal of the PR. And I disapprove of it. I want kernel dispatching logic to depend solely on types and immutable properties of the kernels.
Extending the range of types that are pre-allocated is definitely a good idea though.
cpp/src/arrow/compute/exec.cc
Outdated
I don't think this logic is safe. If null_handling is INTERSECTION the function might want to follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection.
And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.
If you want the speed benefits of not pre-allocating a validity bitmap when not necessary, declare the kernel as COMPUTED_NO_PREALLOCATE and work on the kernel logic to allocate only when needed. The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection
Are you talking about PropagateNulls? Currently it seems this logic is used when the kernel sets INTERSECTION.
If not, my understanding is that if the kernel's internal execution must follow a single code path that uses the pre-allocated buffer, COMPUTED_PREALLOCATE can be set statically, which should be safer for that kernel?
Back to the validity-buffer pre-allocation: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.
And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.
From a performance perspective, the current logic does introduce some runtime branches for INTERSECTION, but compared to PropagateNulls, the pass through NullGeneralization::Get introduces very few branches.
The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
Yes, the current pre-allocation logic for INTERSECTION, which introduced NullPropagator and NullGeneralization, is complicated.
@pitrou What do you think of this PR for VectorExecutor's INTERSECTION validity-bitmap preallocation?
It seems that you have changed the logic here before.
Besides that, the current ScalarExecutor already pre-allocates based on NullGeneralization in INTERSECTION mode.
Sorry, by branches I meant branches in the state-space the execution framework and the kernels might traverse. Meaning that without duplication of test cases for every kernel, we might hide a latent bug until an unlikely array configuration causes a SIGSEGV on the non-pre-allocated validity buffer.
Back to the validity-buffer pre-allocation: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.
By statically I mean a one-time configuration of the kernel_ when it's added to the function in the registry.
For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies pre-allocated validity bitmap buffer. No matter what arrays are passed as input.
For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies a pre-allocated validity bitmap buffer, no matter what arrays are passed as input.
I see, thank you for your explanation @felipecrv .
Do you think the changes in the third commit of this PR are reasonable? If so, I will revert the first two commits. The third commit is mainly related to pre-allocation of the output data buffer and refactors part of the code, which can bring the benefits mentioned by @mapleFU.
In addition, I will create an issue to track pre-allocation of the validity buffer in INTERSECTION mode, because currently ScalarExecutor does pre-allocation according to NullGeneralization in that mode.
I think the handling of list-view types in the pre-allocation code is good. The 3rd commit has a mix of things in it.
Yes, apart from the [Large]ListView support and chunked-array NullGeneralization support, the rest are minor refactorings (some DCHECKs are unnecessary for now).
@felipecrv I'm just thinking about whether there are some techniques to denote an …
It's tricky to re-use, given that Buffers wrapped in shared_ptrs are supposed to be immutable. You never know if someone else is also holding a reference to that shared_ptr.
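A tiny illustration of that constraint (generic C++, ignoring concurrency, not Arrow API usage from this PR): storage owned through a shared_ptr can only be safely mutated in place when the current holder is provably the sole owner.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical reuse check: only scribble over the bytes if nobody else
// can observe them through another shared_ptr copy.
void MaybeReuse(std::shared_ptr<std::vector<uint8_t>>& buf) {
  if (buf.use_count() == 1) {
    // Sole owner: reusing the storage cannot break another reader.
    std::fill(buf->begin(), buf->end(), 0);
  } else {
    // Shared: treat it as immutable and allocate a fresh copy instead.
    buf = std::make_shared<std::vector<uint8_t>>(buf->begin(), buf->end());
  }
}
```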
Force-pushed from c8616c6 to 70d239e.
cpp/src/arrow/compute/exec.cc
Outdated
This does not follow the same constraint as Array above: "Do not count the bits if they haven't been counted already".
I think you should instead just iterate on the chunks, such as (untested):
    static type Get(const ArraySpan& arr) {
      // Do not count the bits if they haven't been counted already
      if ((arr.null_count == 0) || (arr.buffers[0].data == nullptr)) {
        return ALL_VALID;
      }
      if (arr.null_count == arr.length) {
        return ALL_NULL;
      }
      return PERHAPS_NULL;
    }

    static type Get(const ChunkedArray& chunk_array) {
      std::optional<type> current_gen;
      for (const auto& chunk : chunk_array.chunks()) {
        if (chunk->length() == 0) {
          continue;
        }
        // Wrap the chunk's ArrayData in an ArraySpan so the overload above can be reused.
        const type chunk_gen = Get(ArraySpan(*chunk->data()));
        if (current_gen.has_value() && chunk_gen != *current_gen) {
          return PERHAPS_NULL;
        }
        current_gen = chunk_gen;
      }
      return current_gen.value_or(ALL_VALID);
    }
Thanks for this, addressed.
2. support NullGeneralization for chunked-array 3. simple refactor some execute logic and remove finished TODO
Force-pushed from ffda81d to 7bbe96b.
cc @pitrou, I think this can be moved forward?
Rationale for this change
Mainly for the following improvements and code simplification:
- NullGeneralization does not support checking the null status of chunked arrays, which may prevent some pre-allocation optimizations.
- ComputeDataPreallocate does not support pre-allocation for [Large]ListView.
What changes are included in this PR?
Described above.
Are these changes tested?
Yes, covered by existing tests.
Are there any user-facing changes?
No