[CORE] Refactor columnar noop write rule by jackylee-ch · Pull Request #8422 · apache/gluten

jackylee-ch · 2025-01-04T07:48:15Z

What changes were proposed in this pull request?

Refactor NoopWrite support: move the NoopWrite handling out of NativeWritePostRule into a new GlutenNoopWriterRule so that it applies to all Spark versions, and replace the class-name check with pattern matching.

How was this patch tested?

CI and newly added tests.

github-actions · 2025-01-04T07:48:31Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2025-01-04T07:48:46Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-04T08:54:02Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-04T18:23:51Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T01:49:26Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T02:29:53Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T07:05:58Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T10:17:57Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T13:48:34Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T14:01:48Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-05T14:23:41Z

Run Gluten Clickhouse CI on x86

jackylee-ch · 2025-01-06T01:47:01Z

Run Gluten Clickhouse CI on x86

jackylee-ch · 2025-01-06T03:50:08Z

...la/org/apache/spark/sql/execution/adaptive/clickhouse/ClickHouseAdaptiveQueryExecSuite.scala

    super.sparkConf
-      .set("spark.gluten.sql.columnar.forceShuffledHashJoin", "false")
+      .set(GlutenConfig.COLUMNAR_FORCE_SHUFFLED_HASH_JOIN_ENABLED.key, "false")
+      .set(GlutenConfig.NOOP_WRITER_ENABLED.key, "false")


The following test fails because GlutenNoopWriterRule adds a FakeRowAdaptor node, which breaks the test's plan check, so we set the option to false here.

SPARK-30953: InsertAdaptiveSparkPlan should apply AQE on child plan of v2 write commands

jackylee-ch · 2025-01-06T03:50:59Z

cc @JkSelf @philo-he

philo-he

Some comments. Thanks! cc @JkSelf

philo-he · 2025-01-07T03:39:08Z

...it/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenWriterColumnarRules.scala

-  case class NativeWritePostRule(session: SparkSession) extends Rule[SparkPlan] {
+  private[datasources] def injectFakeRowAdaptor(command: SparkPlan, child: SparkPlan): SparkPlan = {
+    child match {
+      // if the child is columnar, we can just wrap&transfer the columnar data


Nit:
wrap & transfer.

shims/common/src/main/scala/org/apache/gluten/config/GlutenConfig.scala

philo-he · 2025-01-07T03:50:15Z

...it/src/main/scala/org/apache/spark/sql/execution/datasources/noop/GlutenNoopWriterRule.scala

+
+case class GlutenNoopWriterRule(session: SparkSession) extends Rule[SparkPlan] {
+  override def apply(p: SparkPlan): SparkPlan = p match {
+    case rc @ AppendDataExec(_, _, NoopWrite) if GlutenConfig.get.enableNoopWriter =>


I note the below check is removed. Could you clarify this change?
write.getClass.getName == NOOP_WRITE && BackendsApiManager.getSettings.enableNativeWriteFiles()

We can directly check the NoopWrite here, so we don't need the class name check now. As for BackendsApiManager.getSettings.enableNativeWriteFiles(), we have a better config now.
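
To illustrate the difference discussed above, here is a minimal sketch that contrasts a class-name comparison with a direct pattern match on a singleton. It is a toy model, not Gluten or Spark code: `Write`, `NoopWrite`, `FileWrite`, `Scan`, and `AppendData` are hypothetical stand-ins defined only for this example.

```scala
// Toy model only: these types are stand-ins, not Spark's Write / NoopWrite / AppendDataExec.
object PatternMatchSketch {
  sealed trait Write
  case object NoopWrite extends Write
  final case class FileWrite(path: String) extends Write

  sealed trait Plan
  final case class Scan(columnar: Boolean) extends Plan
  final case class AppendData(child: Plan, write: Write) extends Plan

  // Old style: compare the runtime class name of the write with a string constant.
  val NoopWriteClassName: String = NoopWrite.getClass.getName

  def isNoopByName(p: Plan): Boolean = p match {
    case AppendData(_, w) => w.getClass.getName == NoopWriteClassName
    case _                => false
  }

  // New style: match the singleton object directly in the pattern.
  def isNoopByPattern(p: Plan): Boolean = p match {
    case AppendData(_, NoopWrite) => true
    case _                        => false
  }
}
```

Both functions agree on this toy model, but the pattern-matching form is checked by the compiler, whereas a renamed class would silently break the string comparison.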

github-actions · 2025-01-07T05:47:08Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-07T06:08:00Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-07T06:33:16Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-07T09:01:01Z

Run Gluten Clickhouse CI on x86

jackylee-ch · 2025-01-07T09:01:35Z

...la/org/apache/spark/sql/execution/adaptive/clickhouse/ClickHouseAdaptiveQueryExecSuite.scala

+        override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
+          qe.executedPlan match {
+            case plan @ (_: DataWritingCommandExec | _: V2TableWriteExec) =>
+              noLocalread = collect(plan) {


Remove the child plan check, since we now add a FakeRowAdaptor; the check has already been removed since 3.4.0.

jackylee-ch · 2025-01-07T09:01:58Z

...la/org/apache/spark/sql/execution/adaptive/clickhouse/ClickHouseAdaptiveQueryExecSuite.scala

+        assert(plan.isInstanceOf[V2TableWriteExec])
+        val childPlan = plan.asInstanceOf[V2TableWriteExec].child
+        assert(childPlan.isInstanceOf[FakeRowAdaptor])
+        assert(childPlan.asInstanceOf[FakeRowAdaptor].child.isInstanceOf[AdaptiveSparkPlanExec])


Refine the child plan check
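
The expected plan shape in the assertions above (write command, then FakeRowAdaptor, then the adaptive plan) can also be expressed as a single nested pattern. The following is only a sketch on a toy plan model; `V2Write`, `FakeRowAdaptor`, and `AdaptivePlan` here are hypothetical case classes, not the Spark or Gluten operators of similar names.

```scala
// Toy model only: simplified stand-ins for the operators named in the test.
object PlanCheckSketch {
  sealed trait Plan
  final case class AdaptivePlan(name: String) extends Plan
  final case class FakeRowAdaptor(child: Plan) extends Plan
  final case class V2Write(child: Plan) extends Plan

  // True only when the plan is: write -> FakeRowAdaptor -> adaptive plan.
  def hasExpectedShape(plan: Plan): Boolean = plan match {
    case V2Write(FakeRowAdaptor(_: AdaptivePlan)) => true
    case _                                        => false
  }
}
```

A nested pattern states the whole shape in one place, which avoids the chain of `isInstanceOf` and `asInstanceOf` calls; whether that reads better in the real suite is a judgment call.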

jackylee-ch · 2025-01-07T11:23:46Z

The noopWrite config has been removed and I've fixed the failing tests. PTAL @philo-he @JkSelf

JkSelf · 2025-01-08T01:52:26Z

...35/src/test/scala/org/apache/spark/sql/execution/datasources/GlutenNoopWriterRuleSuite.scala

+    var fakeRowAdaptor: Option[FakeRowAdaptor] = None
+
+    override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
+      fakeRowAdaptor = qe.executedPlan.collectFirst { case f: FakeRowAdaptor => f }


@jackylee-ch FakeRowAdaptor is used in spark 32 and 33. Why do we need to add this check in the spark 35 test folder?

GlutenNoopWriterRule adds a FakeRowAdaptor as the child of the v2 write command when writing to the noop source. This PR makes GlutenNoopWriterRule work for all Spark versions.
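
As a rough illustration of the behaviour described in this thread, the sketch below models a rule that rewrites a noop append so that its child is wrapped in a FakeRowAdaptor, dropping a columnar-to-row conversion when one is present. Everything here is a simplified assumption: the plan types are hypothetical case classes, and the real `injectFakeRowAdaptor` in Gluten works on Spark's `SparkPlan` and handles more cases than shown.

```scala
// Toy model only: a simplified stand-in for the rule, not Gluten's implementation.
object NoopRuleSketch {
  sealed trait Write
  case object NoopWrite extends Write
  final case class OtherWrite(name: String) extends Write

  sealed trait Plan
  final case class ColumnarScan(table: String) extends Plan
  final case class ColumnarToRow(child: Plan) extends Plan
  final case class FakeRowAdaptor(child: Plan) extends Plan
  final case class AppendData(child: Plan, write: Write) extends Plan

  // If the child converts columnar data to rows, skip the conversion and
  // wrap the columnar child directly; otherwise wrap the child as it is.
  def injectFakeRowAdaptor(child: Plan): Plan = child match {
    case ColumnarToRow(columnar) => FakeRowAdaptor(columnar)
    case other                   => FakeRowAdaptor(other)
  }

  // Only appends whose write is the noop singleton are rewritten.
  def applyRule(plan: Plan): Plan = plan match {
    case AppendData(child, NoopWrite) => AppendData(injectFakeRowAdaptor(child), NoopWrite)
    case other                        => other
  }
}
```

In this model, `applyRule(AppendData(ColumnarToRow(ColumnarScan("t")), NoopWrite))` yields `AppendData(FakeRowAdaptor(ColumnarScan("t")), NoopWrite)`, while a plan with any other write is returned unchanged.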

backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxRuleApi.scala

JkSelf

LGTM. Thanks for your work.

github-actions · 2025-01-08T08:35:13Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-08T08:56:23Z

Run Gluten Clickhouse CI on x86

jackylee-ch · 2025-01-09T01:11:26Z

Any more questions about this PR? @philo-he

github-actions · 2025-01-09T02:04:31Z

Run Gluten Clickhouse CI on x86

zhztheplayer · 2025-01-09T12:28:55Z

...it/src/main/scala/org/apache/spark/sql/execution/datasources/noop/GlutenNoopWriterRule.scala

+ * ColumnarToRow operation for NoopWrite. Since NoopWrite does not actually perform any data
+ * operations, it can accept input data in either row-based or columnar format.
+ */
+case class GlutenNoopWriterRule(session: SparkSession) extends Rule[SparkPlan] {


Such a rule could be placed in this folder.

We cannot move it to that folder, because NoopWrite can only be accessed from within org.apache.spark.sql.execution.datasources.noop.
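
The access restriction mentioned in this reply can be sketched with Scala's qualified `private[...]` modifier. This is an illustration under assumptions: nested objects stand in for packages here, and the names `noop`, `elsewhere`, `sample`, and `isNoop` are hypothetical and chosen only for the example.

```scala
// Toy model only: nested objects stand in for packages.
object VisibilitySketch {
  object noop {
    // Visible only inside the enclosing `noop` scope.
    private[noop] case object NoopWrite

    // Code inside `noop` may name NoopWrite in a pattern.
    def isNoop(w: Any): Boolean = w match {
      case NoopWrite => true
      case _         => false
    }

    // Hands out the singleton without exposing its name.
    def sample: Any = NoopWrite
  }

  object elsewhere {
    // val w = noop.NoopWrite  // would not compile: NoopWrite is not accessible here
    def check(w: Any): Boolean = noop.isNoop(w)
  }
}
```

Code that needs to pattern match on the restricted object therefore has to live inside the same scope, which is the constraint the reply describes for the rule's location.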

zhztheplayer · 2025-01-09T12:33:50Z

...it/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenWriterColumnarRules.scala

  }

-  case class NativeWritePostRule(session: SparkSession) extends Rule[SparkPlan] {
+  private[datasources] def injectFakeRowAdaptor(command: SparkPlan, child: SparkPlan): SparkPlan = {


Is this API only called by GlutenNoopWriterRule after the change? Could move to the rule file if so.

This API is also needed in NativeWritePostRule

baibaichen · 2025-01-09T12:48:55Z

@jackylee-ch would you please write some comments for your PR? Thanks!

github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Jan 4, 2025

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 8199ce6 to 39761ad Compare January 4, 2025 08:53

jackylee-ch marked this pull request as draft January 4, 2025 18:01

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 39761ad to 99ded60 Compare January 4, 2025 18:23

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 99ded60 to 21a5a58 Compare January 5, 2025 01:48

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 21a5a58 to 5beecab Compare January 5, 2025 02:29

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 5beecab to 6774a4e Compare January 5, 2025 07:05

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 6774a4e to 96c0985 Compare January 5, 2025 10:17

[CORE] Refact columnar noop write rule

13119c4

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 1cdf913 to 13119c4 Compare January 5, 2025 14:01

Merge remote-tracking branch 'upstream/main' into refact_columnar_noo…

5ac983e

…p_write_rule

jackylee-ch commented Jan 6, 2025

jackylee-ch marked this pull request as ready for review January 6, 2025 03:50

philo-he reviewed Jan 7, 2025

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 25903b1 to 805a115 Compare January 7, 2025 06:07

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 805a115 to aec0545 Compare January 7, 2025 06:32

remove noopWrite config and fix test failed

b09a2d9

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from aec0545 to b09a2d9 Compare January 7, 2025 09:00

jackylee-ch commented Jan 7, 2025

JkSelf reviewed Jan 8, 2025

backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxRuleApi.scala

JkSelf approved these changes Jan 8, 2025

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from 4c3153b to fd48d5b Compare January 8, 2025 08:55

add comments

887e94b

jackylee-ch force-pushed the refact_columnar_noop_write_rule branch from fd48d5b to 887e94b Compare January 9, 2025 02:04

philo-he approved these changes Jan 9, 2025

philo-he changed the title from "[CORE] Refact columnar noop write rule" to "[CORE] Refactor columnar noop write rule" Jan 9, 2025

jackylee-ch merged commit d101cb8 into apache:main Jan 9, 2025

zhztheplayer reviewed Jan 9, 2025
