Skip to content

[VL] Fix input_file_name results in empty string#6517

Closed
zml1206 wants to merge 1 commit intoapache:mainfrom
zml1206:fix_input_file_name_empty_string
Closed

[VL] Fix input_file_name results in empty string#6517
zml1206 wants to merge 1 commit intoapache:mainfrom
zml1206:fix_input_file_name_empty_string

Conversation

@zml1206
Copy link
Copy Markdown
Contributor

@zml1206 zml1206 commented Jul 19, 2024

What changes were proposed in this pull request?

The Spark implementation of input_file_name uses a thread local to stash the file name and retrieve it from the function.

If there is a transformer node between project input_file_name and scan, the result of input_file_name is an empty string.
For example, read delta lake table need union checkpoint parquet file and json file, then order by input_file_name to get parquet data files, it will get wrong parquet file list.
So we should push down input_file_name to transformer scan or add fallback project before fallback scan

How was this patch tested?

UT

@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

2 similar comments
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jul 19, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer changed the title [VL] Fix input_file_name results empty string [VL] Fix input_file_name results in empty string Jul 22, 2024
@zhztheplayer
Copy link
Copy Markdown
Member

cc @gaoyangxiaozhu thanks

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jul 24, 2024

cc @gaoyangxiaozhu Can you help take a look if you have time? Thank you.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jul 31, 2024

@JkSelf Can you help take a look if you have time? Thank you.

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Aug 6, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 6, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer self-requested a review August 28, 2024 07:46
@zhztheplayer zhztheplayer force-pushed the fix_input_file_name_empty_string branch from 838b5d4 to 52bdaf1 Compare September 4, 2024 08:19
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 4, 2024

Run Gluten Clickhouse CI

} else {
b.copy(output = genNewOutput(b.output).asInstanceOf[Seq[AttributeReference]])
}
case b: BatchScanExecTransformer =>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there real case that we see a BatchScanExecTransformer in OffloadProject? Since OffloadOthers is executed after OffloadProject.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is the previous PR code, should it be removed in this PR?

@liuneng1994
Copy link
Copy Markdown
Contributor

I encountered the same problem. In delta, input_file_name and monotonically_increasing_id are used at the same time. Monotonically_increasing_id is a state function, which is not easy to support natively. The existing logic will lose the fallback tag of the child of input_file_name, resulting in incorrect fallback.

example plan

BroadcastHashJoin [l_orderkey#363L], [l_orderkey#906L], Inner, BuildRight, false
:- Project [l_orderkey#363L]
:  +- Filter isnotnull(l_orderkey#363L)
:     +- Filter UDF()
:        +- Scan ExistingRDD mergeMaterializedSource[l_orderkey#363L,l_partkey#364L,l_suppkey#365L,l_linenumber#366L,l_quantity#367,l_extendedprice#368,l_discount#369,l_tax#370,l_returnflag#903,l_linestatus#372,l_shipdate#373,l_commitdate#374,l_receiptdate#375,l_shipinstruct#376,l_shipmode#377,l_comment#378]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=1263]
   +- Filter isnotnull(l_orderkey#906L)
      +- Project [l_orderkey#906L, _row_id_#1183L, input_file_name() AS _file_name_#1201]
         +- Project [l_orderkey#906L, monotonically_increasing_id() AS _row_id_#1183L]
            +- FileScan mergetree [l_orderkey#906L] Batched: true, DataFilters: [], Format: MergeTree, Location: TahoeBatchFileIndex(1 paths)[file:/home/admin1/github/gazelle-jni/backends-clickhouse/target/scal..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Sep 5, 2024

@zhztheplayer Thank you for review. This pr can be reviewed later. I am reconstructing this logic.

  1. Add a new before offload rule to push down the input file to before scan, and add an additional project. The output is the output+inputfile of the scan.
  2. Normal offload project, the inputfile project before scan will not be offloaded.
  3. Add a new rule after offload. If scan offloads, push the inputfile to scan and delete the project.
    In this way, the overall logic is clearer, the offload logic will not be invaded, and RAS can also be used directly.
    cc @liuneng1994

@zhztheplayer
Copy link
Copy Markdown
Member

@zml1206 OK, so let's mark the PR as draft before it's ready?

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Sep 5, 2024

new PR #7124

@zml1206 zml1206 closed this Sep 5, 2024
@zml1206 zml1206 deleted the fix_input_file_name_empty_string branch December 9, 2025 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants