Skip to content

[VL] Change isDistinct of distinct aggregateExpression to false#9364

Closed
zml1206 wants to merge 2 commits intoapache:mainfrom
zml1206:agg_distinct
Closed

[VL] Change isDistinct of distinct aggregateExpression to false#9364
zml1206 wants to merge 2 commits intoapache:mainfrom
zml1206:agg_distinct

Conversation

@zml1206
Copy link
Copy Markdown
Contributor

@zml1206 zml1206 commented Apr 18, 2025

What changes were proposed in this pull request?

Vanilla spark will add an aggregate to remove duplicates for the distinct aggregation function, so velox does not need to process distinct. Moreover, currently velox distinct hash aggregation does not support spill.

How was this patch tested?

Already existing UT.

@zml1206 zml1206 marked this pull request as draft April 18, 2025 09:13
@github-actions github-actions bot added the CORE works for Gluten Core label Apr 18, 2025
@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot added the VELOX label Apr 18, 2025
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zhztheplayer
Copy link
Copy Markdown
Member

Perhaps this change can speed up distinct aggregation processing?

@zhztheplayer
Copy link
Copy Markdown
Member

Vanilla spark will add an aggregate to remove duplicates for the distinct aggregation function, so velox does not need to process distinct.

I am curious why Spark keeps the distinct flag after planing the distinct aggregation. Generally it's not needed by vanilla Spark as well. Have you investigated?

@zml1206 zml1206 marked this pull request as ready for review April 18, 2025 11:17
@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 18, 2025

I am curious why Spark keeps the distinct flag after planing the distinct aggregation. Generally it's not needed by vanilla Spark as well. Have you investigated?

Maybe Spark just didn't process it anymore, isDistinct is only used to create aggregate.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 18, 2025

Perhaps this change can speed up distinct aggregation processing?

Yes, it should also avoid the memory problem caused by velox distinct hash aggregation does not support spill.

@zml1206 zml1206 requested a review from zhztheplayer April 22, 2025 05:53
@zhztheplayer
Copy link
Copy Markdown
Member

Perhaps this change can speed up distinct aggregation processing?

Yes, it should also avoid the memory problem caused by velox distinct hash aggregation does not support spill.

I am checking the code. Do we actually pass the flag to the substrait plan? I didn't find the relevant code. Would you help highlight the code here? Or else we never passed this flag to native? cc @rui-mo

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 22, 2025

I am checking the code. Do we actually pass the flag to the substrait plan? I didn't find the relevant code. Would you help highlight the code here? Or else we never passed this flag to native? cc @rui-mo

Oh, I overlooked it. Yes, we did not pass isDistinct to the substrait plan. This change is unnecessary. Thank you. @zhztheplayer

@zml1206 zml1206 closed this Apr 22, 2025
@zhztheplayer
Copy link
Copy Markdown
Member

Got it. Thank you for the attempt anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants