Skip to content

[CORE] Refactor columnar noop write rule#8422

Merged
jackylee-ch merged 4 commits intoapache:mainfrom
jackylee-ch:refact_columnar_noop_write_rule
Jan 9, 2025
Merged

[CORE] Refactor columnar noop write rule#8422
jackylee-ch merged 4 commits intoapache:mainfrom
jackylee-ch:refact_columnar_noop_write_rule

Conversation

@jackylee-ch
Copy link
Copy Markdown
Contributor

@jackylee-ch jackylee-ch commented Jan 4, 2025

What changes were proposed in this pull request?

Refactor NoopWrite support, move NoopWrite rule from NativeWritePostRule to GlutenNoopWriteRule to support all Spark versions, and change class name check to pattern matching.

How was this patch tested?

CI and new added tests

@github-actions github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Jan 4, 2025
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 4, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 4, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 8199ce6 to 39761ad Compare January 4, 2025 08:53
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 4, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch marked this pull request as draft January 4, 2025 18:01
@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 39761ad to 99ded60 Compare January 4, 2025 18:23
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 4, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 99ded60 to 21a5a58 Compare January 5, 2025 01:48
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 21a5a58 to 5beecab Compare January 5, 2025 02:29
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 5beecab to 6774a4e Compare January 5, 2025 07:05
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 6774a4e to 96c0985 Compare January 5, 2025 10:17
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 1cdf913 to 13119c4 Compare January 5, 2025 14:01
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@jackylee-ch
Copy link
Copy Markdown
Contributor Author

Run Gluten Clickhouse CI on x86

super.sparkConf
.set("spark.gluten.sql.columnar.forceShuffledHashJoin", "false")
.set(GlutenConfig.COLUMNAR_FORCE_SHUFFLED_HASH_JOIN_ENABLED.key, "false")
.set(GlutenConfig.NOOP_WRITER_ENABLED.key, "false")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test will report an error, as GlutenNoopWriterRule will add a FakeRowAdaptor node, which will cause the test check to fail, thus we default false here.

SPARK-30953: InsertAdaptiveSparkPlan should apply AQE on child plan of v2 write commands

@jackylee-ch jackylee-ch marked this pull request as ready for review January 6, 2025 03:50
@jackylee-ch
Copy link
Copy Markdown
Contributor Author

cc @JkSelf @philo-he

Copy link
Copy Markdown
Member

@philo-he philo-he left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. Thanks! cc @JkSelf

case class NativeWritePostRule(session: SparkSession) extends Rule[SparkPlan] {
private[datasources] def injectFakeRowAdaptor(command: SparkPlan, child: SparkPlan): SparkPlan = {
child match {
// if the child is columnar, we can just wrap&transfer the columnar data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:
wrap & transfer.


case class GlutenNoopWriterRule(session: SparkSession) extends Rule[SparkPlan] {
override def apply(p: SparkPlan): SparkPlan = p match {
case rc @ AppendDataExec(_, _, NoopWrite) if GlutenConfig.get.enableNoopWriter =>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note the below check is removed. Could you clarify this change?
write.getClass.getName == NOOP_WRITE && BackendsApiManager.getSettings.enableNativeWriteFiles()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note the below check is removed. Could you clarify this change? write.getClass.getName == NOOP_WRITE && BackendsApiManager.getSettings.enableNativeWriteFiles()

We can directly check the NoopWrite here, so we don't need the class name check now. As for BackendsApiManager.getSettings.enableNativeWriteFiles(), we have a better config now.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 25903b1 to 805a115 Compare January 7, 2025 06:07
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 805a115 to aec0545 Compare January 7, 2025 06:32
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from aec0545 to b09a2d9 Compare January 7, 2025 09:00
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
qe.executedPlan match {
case plan @ (_: DataWritingCommandExec | _: V2TableWriteExec) =>
noLocalread = collect(plan) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the child plan check as we would add FackRowAdaptor, and the check has already been remove since 3.4.0.

assert(plan.isInstanceOf[V2TableWriteExec])
val childPlan = plan.asInstanceOf[V2TableWriteExec].child
assert(childPlan.isInstanceOf[FakeRowAdaptor])
assert(childPlan.asInstanceOf[FakeRowAdaptor].child.isInstanceOf[AdaptiveSparkPlanExec])
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine the child plan check

@jackylee-ch
Copy link
Copy Markdown
Contributor Author

noopWrite config has been removed and I've fixed the failed tests. PTAL @philo-he @JkSelf

var fakeRowAdaptor: Option[FakeRowAdaptor] = None

override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
fakeRowAdaptor = qe.executedPlan.collectFirst { case f: FakeRowAdaptor => f }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackylee-ch FakeRowAdaptor is used in spark 32 and 33. Why we need to add this check in spark 35 test folder?

Copy link
Copy Markdown
Contributor Author

@jackylee-ch jackylee-ch Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GlutenNoopWriterRule would add a FakeRowAdaptor after v2 write command while writing to noop source. This PR would let GlutenNoopWriterRule work for all Spark versions.

Copy link
Copy Markdown
Contributor

@JkSelf JkSelf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your work.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 8, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 4c3153b to fd48d5b Compare January 8, 2025 08:55
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 8, 2025

Run Gluten Clickhouse CI on x86

@jackylee-ch
Copy link
Copy Markdown
Contributor Author

Any more question about this pr? @philo-he

@jackylee-ch jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from fd48d5b to 887e94b Compare January 9, 2025 02:04
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 9, 2025

Run Gluten Clickhouse CI on x86

@philo-he philo-he changed the title [CORE] Refact columnar noop write rule [CORE] Refactor columnar noop write rule Jan 9, 2025
@jackylee-ch jackylee-ch merged commit d101cb8 into apache:main Jan 9, 2025
* ColumnarToRow operation for NoopWrite. Since NoopWrite does not actually perform any data
* operations, it can accept input data in either row-based or columnar format.
*/
case class GlutenNoopWriterRule(session: SparkSession) extends Rule[SparkPlan] {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such rule could be placed in this folder.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot move to that folder as the NoopWrite can only be accessed under org.apache.spark.sql.execution.datasources.noop

}

case class NativeWritePostRule(session: SparkSession) extends Rule[SparkPlan] {
private[datasources] def injectFakeRowAdaptor(command: SparkPlan, child: SparkPlan): SparkPlan = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this API only called by GlutenNoopWriterRule after the change? Could move to the rule file if so.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is also needed in NativeWritePostRule

@baibaichen
Copy link
Copy Markdown
Contributor

@jackylee-ch would you pelase writing some comments for your PR? thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants