[VL] Fix input_file_name results in empty string #6517
zml1206 wants to merge 1 commit into apache:main
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename the commit message and pull request title in the following format? See also:
|
Run Gluten Clickhouse CI |
|
cc @gaoyangxiaozhu thanks |
|
cc @gaoyangxiaozhu Can you help take a look if you have time? Thank you. |
|
@JkSelf Can you help take a look if you have time? Thank you. |
|
838b5d4 to 52bdaf1
|
    } else {
      b.copy(output = genNewOutput(b.output).asInstanceOf[Seq[AttributeReference]])
    }
    case b: BatchScanExecTransformer =>
Is there a real case where we see a BatchScanExecTransformer in OffloadProject, given that OffloadOthers is executed after OffloadProject?
I agree. This code comes from the previous PR; should it be removed in this PR?
|
I encountered the same problem. In Delta, input_file_name and monotonically_increasing_id are used at the same time. monotonically_increasing_id is a stateful function, which is not easy to support natively. The existing logic loses the fallback tag on the child of input_file_name, resulting in an incorrect fallback. Example plan:
|
@zhztheplayer Thank you for the review. This PR can be reviewed later; I am reworking this logic.
|
|
@zml1206 OK, so let's mark the PR as draft until it's ready?
|
New PR: #7124
What changes were proposed in this pull request?
The Spark implementation of input_file_name uses a thread-local to stash the file name, and the function retrieves it from there.
If there is a transformer node between the project that contains input_file_name and the scan, input_file_name returns an empty string.
For example, reading a Delta Lake table requires a union of the checkpoint Parquet file and the JSON files, followed by an order by input_file_name to get the Parquet data files; with this bug, the resulting Parquet file list is wrong.
So we should either push input_file_name down to the transformer scan or add a fallback project before the fallback scan.
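The thread-local behaviour described above can be illustrated with a minimal, self-contained sketch. This is not Spark's or Gluten's actual code: FileNameHolder, Row, rowScan, nativeScan, nativeScanWithFileName, projectFromHolder and projectFromColumn are all hypothetical names, and the sketch assumes that a natively executed scan simply never writes the thread-local. It only shows why a project that reads the holder can see an empty string, and why carrying the file name as a regular column avoids the problem.

```scala
object InputFileNameSketch {
  // Hypothetical stand-in for a thread-local file-name holder (not Spark's real class).
  object FileNameHolder {
    private val current = new ThreadLocal[String] {
      override protected def initialValue(): String = ""
    }
    def set(path: String): Unit = current.set(path)
    def get: String = current.get()
    def unset(): Unit = current.remove()
  }

  // Toy row: a value plus an optional file-name column.
  final case class Row(value: Int, fileName: Option[String] = None)

  // Row-based scan: stashes the file name in the holder before handing out each row.
  def rowScan(path: String, values: Seq[Int]): Iterator[Row] =
    values.iterator.map { v => FileNameHolder.set(path); Row(v) }

  // Assumption: a native-style scan produces rows without touching the holder.
  def nativeScan(path: String, values: Seq[Int]): Iterator[Row] =
    values.iterator.map(v => Row(v))

  // Pushdown variant: the scan itself emits the file name as a column.
  def nativeScanWithFileName(path: String, values: Seq[Int]): Iterator[Row] =
    values.iterator.map(v => Row(v, Some(path)))

  // Project that evaluates the file name by reading the thread-local holder.
  def projectFromHolder(rows: Iterator[Row]): List[(Int, String)] =
    rows.map(r => (r.value, FileNameHolder.get)).toList

  // Project that reads the file name from the column produced by the scan.
  def projectFromColumn(rows: Iterator[Row]): List[(Int, String)] =
    rows.map(r => (r.value, r.fileName.getOrElse(""))).toList
}
```

In this toy model, projectFromHolder over rowScan sees the path, projectFromHolder over nativeScan sees an empty string, and projectFromColumn over nativeScanWithFileName sees the path again, which mirrors the idea of pushing input_file_name down to the scan.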
How was this patch tested?
UT