Skip to content

[VL] Fix function input_file_name() outputs empty string in certain query plan patterns#7124

Merged
zhztheplayer merged 4 commits intoapache:mainfrom
zml1206:input_file
Sep 6, 2024
Merged

[VL] Fix function input_file_name() outputs empty string in certain query plan patterns#7124
zhztheplayer merged 4 commits intoapache:mainfrom
zml1206:input_file

Conversation

@zml1206
Copy link
Copy Markdown
Contributor

@zml1206 zml1206 commented Sep 5, 2024

What changes were proposed in this pull request?

The Spark implementations of input_file_name/input_file_block_start/input_file_block_length uses a thread local to stash the file name and retrieve it from the function. If there is a transformer node between project input_file_function and scan, the result of input_file_name is an empty string. So we should push down input_file_function to transformer scan or add fallback project of input_file_function before fallback scan.

The processing logic is as follows:
1.Before offload, add new project before leaf node and push down input file expression to the new project
2.Normal offload project and scan
3.After offload, if scan be offloaded, push down input file expression into scan and remove project

How was this patch tested?

UT

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Sep 5, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 5, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 5, 2024

Run Gluten Clickhouse CI

@zml1206 zml1206 marked this pull request as draft September 5, 2024 07:06
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 5, 2024

Run Gluten Clickhouse CI

@zml1206 zml1206 marked this pull request as ready for review September 5, 2024 10:14
@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Sep 5, 2024

cc @zhztheplayer Thank you.

Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking good to me. Thank you @zml1206.

cc @liuneng1994, if you are considering switching to this way, perhaps we need a PR for CH individually to add the rules to CH rule list after this PR gets landed.

ProjectExec(newProjectList, newChild)
}

private def rewriteExpr(expr: Expression, replacedExprs: Map[String, Alias]): Expression = {
Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replacedExprs: mutable.Map[String, Alias]

}
}

def addMetadataCol(plan: SparkPlan, replacedExprs: Map[String, Alias]): SparkPlan = plan match {
Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replacedExprs: mutable.Map[String, Alias]

def injectLegacy(injector: LegacyInjector): Unit = {
// Gluten columnar: Transform rules.
injector.injectTransform(_ => RemoveTransitions)
injector.injectTransform(_ => PushDownInputFileExpressionBeforeLeaf)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do a rename for the two rules. Since they seem to always be used together as a rule pair.

Suggested change
injector.injectTransform(_ => PushDownInputFileExpressionBeforeLeaf)
injector.injectTransform(_ => PushDownInputFileExpression.PreOffload)

injector.injectTransform(_ => TransformPreOverrides())
injector.injectTransform(_ => RemoveNativeWriteFilesSortAndProject())
injector.injectTransform(c => RewriteTransformer.apply(c.session))
injector.injectTransform(_ => PushDownInputFileExpressionToScan)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly,

Suggested change
injector.injectTransform(_ => PushDownInputFileExpressionToScan)
injector.injectTransform(_ => PushDownInputFileExpression.PostOffload)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 6, 2024

Run Gluten Clickhouse CI

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 6, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer changed the title [VL] Fix input_file_name results in empty string [VL] Fix function input_file_name() outputs empty string in certain query plan patterns Sep 6, 2024
@zhztheplayer zhztheplayer merged commit 8078f24 into apache:main Sep 6, 2024
dcoliversun pushed a commit to dcoliversun/gluten that referenced this pull request Sep 11, 2024
zml1206 added a commit to zml1206/gluten that referenced this pull request Sep 19, 2024
zml1206 added a commit to zml1206/gluten that referenced this pull request Sep 20, 2024
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
@zml1206 zml1206 deleted the input_file branch December 9, 2025 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants