Skip to content

Conversation

@zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Apr 23, 2024

Rationale for this change

Recently I've been working on some improvement for Grouper and I found adding Reset function could be beneficial. Then I trace down to #33484 from a TODO in code. Here comes this PR.

What changes are included in this PR?

Add Reset function for all the concrete Grouper implementations, and eliminate the recreation of Grouper in AnyKeysSegmenter.

Also add more RowSegmenter cases covering AnyKeysSegmenter.

Are these changes tested?

Yes. Legacy UTs should cover it well. Also added some new UTs.

Are there any user-facing changes?

None.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 zanmato1984 changed the title Implement grouper reset GH-33484: [C++][Compute] Implement Grouper::Reset Apr 23, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review April 23, 2024 17:13
@zanmato1984 zanmato1984 requested a review from westonpace as a code owner April 23, 2024 17:13
@zanmato1984
Copy link
Contributor Author

cc @westonpace

@github-actions
Copy link

⚠️ GitHub issue #33484 has been automatically assigned in GitHub to PR creator.

rows_minibatch_.Clean();
map_.cleanup();
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool()));
return Status::OK();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should enlarge the size of the temp_stack_? Or the stack's internal states like top_ may no enough for the next bunch of operations after grouper reset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the temp vectors allocated by the temp_stack_ would be released at the end the of the call to each individual grouper member function (IMO that's why they are designed to be "temp vector"). Meaning by the time of Reset, the stack is empty already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see, thanks.
TempVectorStack will hold a resizable buffer allocated in TempVectorStack::Init function until the end of the grouper object.
But the TempVectorHolder will release the used space (already allocated in TempVectorStack ) at the call to each of individual grouper member function as you said.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a DCHECK that the temp_stack_ is empty, then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's probably reasonable. But considering current temp stack doesn't have a public method for size/empty check, and I have other open PRs for temp stack restructure, I'd add the necessary methods after other PRs are done and get this one rebased.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 26, 2024
@zanmato1984
Copy link
Contributor Author

Hi @pitrou , would you please help to take a look? Thanks.

: BaseRowSegmenter(key_types),
ctx_(ctx),
grouper_(nullptr),
grouper_(Grouper::Make(key_types, ctx).ValueOrDie()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reasonable possibility that this might fail, for example due to input types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without extra context, there is. But the factory method (cleverly) avoids this by a dummy instantiation of Grouper:

ARROW_RETURN_NOT_OK(Grouper::Make(key_types, ctx)); // check types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't it pointless to construct the Grouper twice? Can't the constructor simply take the Grouper that was constructed in the factory method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you are saying. Will update. Thanks for pointing this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

map_.clear();
offsets_.clear();
key_bytes_.clear();
num_groups_ = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about encoders_?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I saw in https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/row_encoder_internal.h, none of the encoders holds states that would change during the lifespan of the grouper.

rows_minibatch_.Clean();
map_.cleanup();
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool()));
return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a DCHECK that the temp_stack_ is empty, then?

// Assert next is the last (empty) segment.
ASSERT_OK_AND_ASSIGN(auto segment, segmenter->GetNextSegment(batch, offset));
ASSERT_TRUE(segment.offset >= batch.length);
ASSERT_TRUE(segment.is_open);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something be checked about segment.length here? Or is undetermined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last segment.length is supposed to be 0. Updated.

zanmato1984 and others added 2 commits May 14, 2024 23:07
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zanmato1984 . I'll declare this good to merge, though I'd love if @westonpace could double-check this.

@pitrou pitrou merged commit ada965f into apache:main May 14, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label May 14, 2024
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request May 14, 2024
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ada965f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

Recently I've been working on some improvement for `Grouper` and I found adding `Reset` function could be beneficial. Then I trace down to apache#33484 from a TODO in code. Here comes this PR.

### What changes are included in this PR?

Add `Reset` function for all the concrete `Grouper` implementations, and eliminate the recreation of `Grouper` in `AnyKeysSegmenter`.

Also add more `RowSegmenter` cases covering `AnyKeysSegmenter`.

### Are these changes tested?

Yes. Legacy UTs should cover it well. Also added some new UTs.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#33484

Lead-authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants