Skip to content

[GLUTEN-9152][CORE] Avoid unnecessary serialization of hadoop conf#9153

Merged
zml1206 merged 9 commits intoapache:mainfrom
zml1206:9152
Apr 8, 2025
Merged

[GLUTEN-9152][CORE] Avoid unnecessary serialization of hadoop conf#9153
zml1206 merged 9 commits intoapache:mainfrom
zml1206:9152

Conversation

@zml1206
Copy link
Copy Markdown
Contributor

@zml1206 zml1206 commented Mar 27, 2025

What changes were proposed in this pull request?

Significantly reduce the closure serialization time of BasicScanExecTransformer and WholeStageTransformer.
(Fixes: #9152)

How was this patch tested?

@github-actions github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Mar 27, 2025
@github-actions
Copy link
Copy Markdown

#9152

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 requested review from JkSelf and philo-he March 27, 2025 09:48
rootPaths: Seq[String],
properties: Map[String, String],
serializableHadoopConf: Option[SerializableConfiguration] = None): ValidationResult = {
hadoopConf: Configuration): ValidationResult = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialize code is added for viewfs support as we need to pass some info to Velox. Not sure if the new code can be ser/de-ser and passed down.
The code itself is correct, but there is no viewfs related tests in GHA now so we may need to check this manually

Cc @wangyum @JkSelf

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But from the code logic, I did not find that velox uses the hadoop conf here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangyum Could you please help in verifying this PR in your ViewFS environment? My ViewFS test setup is currently not functioning properly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. will test it soon.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this patch on our cluster and it still works.


val sparkConf: SparkConf = sparkContext.getConf

val serializableHadoopConf: SerializableConfiguration = new SerializableConfiguration(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add @transient lazy could fix the issue

Copy link
Copy Markdown
Contributor Author

@zml1206 zml1206 Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If serialization is require, lazy is a solution to avoid serialization without enabling viewFS, but if serialization is not needed here, it would be simpler to use sparkContext.hadoopConfiguration directly.

Copy link
Copy Markdown
Member

@philo-he philo-he left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@JkSelf
Copy link
Copy Markdown
Contributor

JkSelf commented Mar 31, 2025

@zml1206 can you help to take a look the failed unit tests?

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Mar 31, 2025

@zml1206 can you help to take a look the failed unit tests?

It's strange that PR seems to cause Java heap oom, but JDK 17 passes. I haven't found the reason yet. I'm not sure whether it is related to PR or the test itself is unstable.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2025

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 marked this pull request as draft April 3, 2025 22:22
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 8, 2025

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 marked this pull request as ready for review April 8, 2025 06:07
@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 8, 2025

I can't reproduce it locally. I tried many ways, including lazy, @transient, local variables, etc. WholeStageTransformer in spark 3.5 and jdk8 will cause OOM in HashExpressionSuite. I guess it is related to jdk8's GC. Now there is no spark 3.5-jdk8 workflow. Can it be merged or roll back the changes of WholeStageTransformer? @JkSelf

@wangyum
Copy link
Copy Markdown
Member

wangyum commented Apr 8, 2025

@zml1206 Can we increase JVM heap here, e.g.:

<argLine>${argLine} -Xmx4g ${extraJavaTestArgs}</argLine>

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 8, 2025

@zml1206 Can we increase JVM heap here, e.g.:

<argLine>${argLine} -Xmx4g ${extraJavaTestArgs}</argLine>

Thank you, after #9209, there is no CI for spark3.5-jdk8.

@philo-he
Copy link
Copy Markdown
Member

philo-he commented Apr 8, 2025

I can't reproduce it locally. I tried many ways, including lazy, @transient, local variables, etc. WholeStageTransformer in spark 3.5 and jdk8 will cause OOM in HashExpressionSuite. I guess it is related to jdk8's GC. Now there is no spark 3.5-jdk8 workflow. Can it be merged or roll back the changes of WholeStageTransformer? @JkSelf

@zml1206, do you mean reverting the changes for WholeStageTransformer can make that test pass in the previous CI test? If yes, it's strange since that code path is for viewFS and viewFS is not verified in CI.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 8, 2025

I can't reproduce it locally. I tried many ways, including lazy, @transient, local variables, etc. WholeStageTransformer in spark 3.5 and jdk8 will cause OOM in HashExpressionSuite. I guess it is related to jdk8's GC. Now there is no spark 3.5-jdk8 workflow. Can it be merged or roll back the changes of WholeStageTransformer? @JkSelf

@zml1206, do you mean reverting the changes for WholeStageTransformer can make that test pass in the previous CI test? If yes, it's strange since that code path is for viewFS and viewFS is not verified in CI.

Yes, revert the changes for WholeStageTransformer can make that test pass in the previous CI test.

Copy link
Copy Markdown
Member

@philo-he philo-he left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can merge this patch first. If there is some issue found at user side, we can make a follow-up pr.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Apr 8, 2025

Thanks for everyone's review, if no one else has any comments, I'll merge it later.

@zml1206 zml1206 merged commit d27959c into apache:main Apr 8, 2025
47 checks passed
@zml1206 zml1206 deleted the 9152 branch December 9, 2025 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[VL] Avoid unnecessary serialization of hadoop conf

6 participants