[GLUTEN-8580][CORE][Part-2] Don't validate project generated by PushDownInputFileExpression#8585

Merged
zhztheplayer merged 3 commits into apache:main from zml1206:8580-2
Jan 23, 2025

Conversation

@zml1206
Contributor

@zml1206 zml1206 commented Jan 21, 2025

What changes were proposed in this pull request?

(Fixes: #8580)

How was this patch tested?

@github-actions github-actions bot added the CORE works for Gluten Core label Jan 21, 2025
@github-actions

#8580

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 changed the title [GLUTEN-8580][CORE][Part-1] Don't validate project generated by PushDownInputFileExpression [GLUTEN-8580][CORE][Part-2] Don't validate project generated by PushDownInputFileExpression Jan 21, 2025
@zml1206 zml1206 requested a review from zhztheplayer January 22, 2025 01:17
@zhztheplayer
Member

zhztheplayer commented Jan 22, 2025

I'd like to attach an example of the query optimization performed by this feature, to help one better understand how #7124 works (it helped me when revisiting the code):

1. Input plan:

CollectLimit 100
+- Project [input_file_name() AS input_file_name()#208, a#195L]
   +- Union
      :- Project [a#195L]
      :  +- BatchScan json file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1[a#195L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint> RuntimeFilters: []
      +- Project [l_orderkey#76L AS a#207L]
         +- BatchScan parquet file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-data-parquet-velox/lineitem[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<l_orderkey:bigint> RuntimeFilters: []

2. Plan after applying the pre-offload rule:

Project [input_file_name#169 AS input_file_name()#164, a#151L]
+- Union
   :- Project [a#151L, input_file_name#169]
   :  +- Project [a#151L, input_file_name() AS input_file_name#169]
   :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
   +- Project [l_orderkey#76L AS a#163L, input_file_name#170]
      +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
         +- BatchScan[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

3. Plan after applying offload rules:

CollectLimit 100
+- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
   +- ColumnarUnion
      :- ProjectExecTransformer [a#151L, input_file_name#169]
      :  +- Project [a#151L, input_file_name() AS input_file_name#169]
      :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
      +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
         +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
            +- BatchScanExecTransformer[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

4. Plan after applying the post-offload rule:

CollectLimit 100
+- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
   +- ColumnarUnion
      :- ProjectExecTransformer [a#151L, input_file_name#169]
      :  +- Project [a#151L, input_file_name() AS input_file_name#169]
      :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
      +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
         +- BatchScanExecTransformer[l_orderkey#76L, input_file_name#170] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []

Comment on lines +28 to +31
plan.foreachUp {
  case p if FallbackTags.maybeOffloadable(p) => addFallbackTag(p)
  case _ =>
}
Member

Why is this change needed? Thanks.

Contributor Author

Plans that have already been tagged do not need to be validated again, and issue 1 in #8580 can be resolved.
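The effect of the guard in this diff can be sketched with a toy model. The `Node` and `Tags` types below are hypothetical stand-ins invented for the sketch, not Gluten's actual `SparkPlan` or `FallbackTags` API: a node that already carries a fallback tag is skipped by the traversal, so the costly validation (and its warning log) runs at most once per node.

```scala
import scala.collection.mutable

// Hypothetical stand-ins, NOT Gluten's SparkPlan/FallbackTags API.
case class Node(name: String, children: Seq[Node] = Nil)

object Tags {
  val reasons = mutable.Map.empty[String, String] // node name -> fallback reason
  // A node "may be offloadable" only if nothing has tagged it for fallback yet.
  def maybeOffloadable(n: Node): Boolean = !reasons.contains(n.name)
}

var validateCalls = 0
// Stands in for the costly validation that also emits the warning log.
def addFallbackTag(n: Node): Unit = {
  validateCalls += 1
  Tags.reasons(n.name) = "validation failed"
}

def foreachUp(n: Node)(f: Node => Unit): Unit = {
  n.children.foreach(foreachUp(_)(f))
  f(n)
}

// Mirrors the guarded traversal in the diff: already-tagged nodes are skipped.
def tagPlan(plan: Node): Unit = foreachUp(plan) {
  case p if Tags.maybeOffloadable(p) => addFallbackTag(p)
  case _ => // already tagged: do not validate (or log) again
}
```

Pre-tagging the Project generated by `PushDownInputFileExpression` means `tagPlan` leaves it alone and only validates the remaining nodes, which is the behavior described here.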

Member

I see. Could you see whether the exclusive tag can help?

Contributor Author

@zml1206 zml1206 Jan 22, 2025

I don't quite understand. The exclusive tag is not related to validation; the warning log is emitted by the validate call in addFallbackTag.

Member

An exclusive tag could be added by PushDownInputFileExpression so that the validator doesn't add another tag.

That said, I am fine with both approaches.

Contributor Author

Do you have any other comments, @zhztheplayer? Thank you.

}

def addFallbackTag(plan: SparkPlan): SparkPlan = {
  FallbackTags.add(plan, "fallback input file expression")
Member

@zhztheplayer zhztheplayer Jan 22, 2025

Can we rephrase this as something like "The Project was added by rule PushDownInputFileExpression; it's not offload-able by design"? Thanks.

Contributor Author

@zml1206 zml1206 Jan 22, 2025

I thought about it: this Project will eventually be removed or collapsed, so isn't it more appropriate to keep the original message?

Member

OK. I am fine with the message then.

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206
Contributor Author

zml1206 commented Jan 22, 2025

Another problem was discovered: a SparkPlan without a logical link will not display its fallback reason on the UI. This will be solved in the next PR, which will look at how to modify GlutenFallbackReporter and whether it is necessary to copy the fallback reason from the physical plan to the logical plan.

@github-actions

Run Gluten Clickhouse CI on x86

@zml1206
Contributor Author

zml1206 commented Jan 22, 2025

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer merged commit 2e27a52 into apache:main Jan 23, 2025
baibaichen pushed a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025
@zml1206 zml1206 deleted the 8580-2 branch December 9, 2025 08:13

Labels

CORE works for Gluten Core
ready to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Don't report 'Not supported to map spark function name to substrait function name: input_file_name(), class name: InputFileName.'

2 participants