[GLUTEN-8580][CORE][Part-2] Don't validate project generated by PushDownInputFileExpression#8585

Merged
zhztheplayer merged 3 commits into apache:main from zml1206:8580-2
Jan 23, 2025

Conversation

@zml1206
Contributor

@zml1206 zml1206 commented Jan 21, 2025

What changes were proposed in this pull request?

(Fixes: #8580)

How was this patch tested?

@github-actions github-actions bot added the CORE works for Gluten Core label Jan 21, 2025
@github-actions

#8580

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 changed the title [GLUTEN-8580][CORE][Part-1] Don't validate project generated by PushDownInputFileExpression [GLUTEN-8580][CORE][Part-2] Don't validate project generated by PushDownInputFileExpression Jan 21, 2025
@zml1206 zml1206 requested a review from zhztheplayer January 22, 2025 01:17
@zhztheplayer
Member

zhztheplayer commented Jan 22, 2025

I'd like to attach an example of the query optimization performed by this feature, to help one better understand how #7124 works (it helped me when revisiting the code):

1. Input plan:

CollectLimit 100
+- Project [input_file_name() AS input_file_name()#208, a#195L]
   +- Union
      :- Project [a#195L]
      :  +- BatchScan json file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1[a#195L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint> RuntimeFilters: []
      +- Project [l_orderkey#76L AS a#207L]
         +- BatchScan parquet file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-data-parquet-velox/lineitem[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<l_orderkey:bigint> RuntimeFilters: []

2. Plan after applying the pre-offload rule:

Project [input_file_name#169 AS input_file_name()#164, a#151L]
+- Union
   :- Project [a#151L, input_file_name#169]
   :  +- Project [a#151L, input_file_name() AS input_file_name#169]
   :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
   +- Project [l_orderkey#76L AS a#163L, input_file_name#170]
      +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
         +- BatchScan[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

3. Plan after applying offload rules:

CollectLimit 100
+- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
   +- ColumnarUnion
      :- ProjectExecTransformer [a#151L, input_file_name#169]
      :  +- Project [a#151L, input_file_name() AS input_file_name#169]
      :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
      +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
         +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
            +- BatchScanExecTransformer[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

4. Plan after applying the post-offload rule:

CollectLimit 100
+- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
   +- ColumnarUnion
      :- ProjectExecTransformer [a#151L, input_file_name#169]
      :  +- Project [a#151L, input_file_name() AS input_file_name#169]
      :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
      +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
         +- BatchScanExecTransformer[l_orderkey#76L, input_file_name#170] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

Comment on lines +28 to +31
plan.foreachUp {
  case p if FallbackTags.maybeOffloadable(p) => addFallbackTag(p)
  case _ =>
}
Member

Why is this change needed? Thanks.

Contributor Author

Plans that have already been tagged do not need to be validated again, and issue 1 in #8580 can be resolved.
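The effect of the guard in this diff can be sketched with a toy model. The `Node` and `Tags` types below are hypothetical stand-ins invented for the sketch, not Gluten's actual `SparkPlan` or `FallbackTags` API: a node that already carries a fallback tag is skipped by the traversal, so the costly validation (and its warning log) runs at most once per node.

```scala
import scala.collection.mutable

// Hypothetical stand-ins, NOT Gluten's SparkPlan/FallbackTags API.
case class Node(name: String, children: Seq[Node] = Nil)

object Tags {
  val reasons = mutable.Map.empty[String, String] // node name -> fallback reason
  // A node "may be offloadable" only if nothing has tagged it for fallback yet.
  def maybeOffloadable(n: Node): Boolean = !reasons.contains(n.name)
}

var validateCalls = 0
// Stands in for the costly validation that also emits the warning log.
def addFallbackTag(n: Node): Unit = {
  validateCalls += 1
  Tags.reasons(n.name) = "validation failed"
}

def foreachUp(n: Node)(f: Node => Unit): Unit = {
  n.children.foreach(foreachUp(_)(f))
  f(n)
}

// Mirrors the guarded traversal in the diff: already-tagged nodes are skipped.
def tagPlan(plan: Node): Unit = foreachUp(plan) {
  case p if Tags.maybeOffloadable(p) => addFallbackTag(p)
  case _ => // already tagged: do not validate (or log) again
}
```

Pre-tagging the Project generated by `PushDownInputFileExpression` means `tagPlan` leaves it alone and only validates the remaining nodes, which is the behavior described here.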

Member

I see. Could you see whether the exclusive tag can help?

Contributor Author

@zml1206 zml1206 Jan 22, 2025

I don't quite understand. The exclusive tag is not related to validation; the warning log is emitted by the validate call in addFallbackTag.

Member

An exclusive tag could be added by PushDownInputFileExpression so that the validator doesn't add another tag.

That said, I am fine with both approaches.

Contributor Author

Do you have any other comments, @zhztheplayer? Thank you.

}

def addFallbackTag(plan: SparkPlan): SparkPlan = {
  FallbackTags.add(plan, "fallback input file expression")
Member

@zhztheplayer zhztheplayer Jan 22, 2025

Can we rephrase this as something like "The Project was added by rule PushDownInputFileExpression; it's not offload-able by design"? Thanks.

Contributor Author

@zml1206 zml1206 Jan 22, 2025

I thought about it: this Project will eventually be removed or collapsed, so isn't it more appropriate to keep the original message?

Member

OK. I am fine with the message then.

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206
Contributor Author

zml1206 commented Jan 22, 2025

Another problem was discovered: a SparkPlan without a logical link will not display its fallback reason on the UI. This will be solved in the next PR, which will look at how to modify GlutenFallbackReporter and whether it is necessary to copy the fallback reason from the physical plan to the logical plan.

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206
Contributor Author

zml1206 commented Jan 22, 2025

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer merged commit 2e27a52 into apache:main Jan 23, 2025
baibaichen pushed a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025
@zml1206 zml1206 deleted the 8580-2 branch December 9, 2025 08:13

Labels

CORE works for Gluten Core
ready to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Don't report 'Not supported to map spark function name to substrait function name: input_file_name(), class name: InputFileName.'

2 participants