[VL] Fix input_file_name results in empty string by zml1206 · Pull Request #6517 · apache/gluten

zml1206 · 2024-07-19T08:19:30Z

What changes were proposed in this pull request?

The Spark implementation of input_file_name uses a thread local to stash the file name and retrieve it from the function.

If there is a transformer node between the project that evaluates input_file_name and the scan, input_file_name returns an empty string.
For example, reading a Delta Lake table requires a union of the checkpoint Parquet file and the JSON files, followed by an order by input_file_name to get the Parquet data files; with this bug, the resulting Parquet file list is wrong.
So we should either push input_file_name down into the transformer scan, or add a fallback project before the fallback scan.
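
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is a toy model, not Spark or Gluten code: `InputFileNameSketch`, `scan`, `buffered`, and `project` are hypothetical names, and the buffering operator is only an assumed stand-in for a node that sits between the scan and the project. It shows how a function that reads a thread-local can see an empty string once it no longer runs in lockstep with the scan.

```scala
object InputFileNameSketch {
  // Hypothetical stand-in for a thread-local file-name holder; defaults to "".
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = ""
  }

  // Toy version of input_file_name(): just reads the thread-local.
  def inputFileName(): String = current.get()

  // Toy scan: sets the thread-local as each row is pulled and clears it
  // once all files have been read.
  def scan(files: Seq[(String, Seq[Int])]): Iterator[Int] = {
    val rows = files.iterator.flatMap { case (name, data) =>
      data.iterator.map { v => current.set(name); v }
    }
    new Iterator[Int] {
      def hasNext: Boolean = {
        val more = rows.hasNext
        if (!more) current.remove() // back to the default ""
        more
      }
      def next(): Int = rows.next()
    }
  }

  // Assumed intermediate operator: consumes its whole input before emitting.
  def buffered(in: Iterator[Int]): Iterator[Int] = in.toVector.iterator

  // Toy project that appends inputFileName() to every row.
  def project(in: Iterator[Int]): List[(Int, String)] =
    in.map(v => (v, inputFileName())).toList

  val files: Seq[(String, Seq[Int])] =
    Seq("a.parquet" -> Seq(1, 2), "b.json" -> Seq(3))

  // Project directly above the scan: each row sees its own file name.
  def direct: List[(Int, String)] = project(scan(files))

  // Project separated from the scan: the thread-local is already cleared.
  def withBuffer: List[(Int, String)] = project(buffered(scan(files)))
}
```

In this model, `direct` pairs every row with the file it came from, while `withBuffer` pairs every row with the empty string, which mirrors the symptom reported in this PR.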

How was this patch tested?

UT

github-actions · 2024-07-19T08:19:46Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2024-07-19T08:20:01Z

Run Gluten Clickhouse CI

github-actions · 2024-07-19T09:31:02Z

Run Gluten Clickhouse CI

zml1206 · 2024-07-19T13:52:11Z

Run Gluten Clickhouse CI

zhztheplayer · 2024-07-22T03:31:08Z

cc @gaoyangxiaozhu thanks

zml1206 · 2024-07-24T02:11:21Z

cc @gaoyangxiaozhu Can you help take a look if you have time? Thank you.

zml1206 · 2024-07-31T07:19:59Z

@JkSelf Can you help take a look if you have time? Thank you.

github-actions · 2024-08-06T01:34:12Z

Run Gluten Clickhouse CI

github-actions · 2024-09-04T08:20:31Z

Run Gluten Clickhouse CI

zhztheplayer · 2024-09-04T08:06:28Z

+        } else {
+          b.copy(output = genNewOutput(b.output).asInstanceOf[Seq[AttributeReference]])
+        }
      case b: BatchScanExecTransformer =>


Is there a real case where we see a BatchScanExecTransformer in OffloadProject? OffloadOthers is executed after OffloadProject.

zml1206 · Sep 4, 2024

I agree. This code comes from a previous PR; should it be removed in this PR?

liuneng1994 · 2024-09-04T09:35:17Z

I encountered the same problem. Delta uses input_file_name and monotonically_increasing_id at the same time. monotonically_increasing_id is a stateful function, which is not easy to support natively. The existing logic loses the fallback tag on the child of input_file_name, resulting in an incorrect fallback.

example plan

BroadcastHashJoin [l_orderkey#363L], [l_orderkey#906L], Inner, BuildRight, false
:- Project [l_orderkey#363L]
:  +- Filter isnotnull(l_orderkey#363L)
:     +- Filter UDF()
:        +- Scan ExistingRDD mergeMaterializedSource[l_orderkey#363L,l_partkey#364L,l_suppkey#365L,l_linenumber#366L,l_quantity#367,l_extendedprice#368,l_discount#369,l_tax#370,l_returnflag#903,l_linestatus#372,l_shipdate#373,l_commitdate#374,l_receiptdate#375,l_shipinstruct#376,l_shipmode#377,l_comment#378]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=1263]
   +- Filter isnotnull(l_orderkey#906L)
      +- Project [l_orderkey#906L, _row_id_#1183L, input_file_name() AS _file_name_#1201]
         +- Project [l_orderkey#906L, monotonically_increasing_id() AS _row_id_#1183L]
            +- FileScan mergetree [l_orderkey#906L] Batched: true, DataFilters: [], Format: MergeTree, Location: TahoeBatchFileIndex(1 paths)[file:/home/admin1/github/gazelle-jni/backends-clickhouse/target/scal..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>

zml1206 · 2024-09-05T03:11:17Z

@zhztheplayer Thank you for the review. This PR can be reviewed later; I am reworking this logic.

Add a new pre-offload rule that pushes input_file_name down to just before the scan by adding an extra project there. Its output is the scan's output plus the input file column.
Projects are offloaded as usual, except that the input file project before the scan is not offloaded.
Add a new post-offload rule: if the scan is offloaded, push the input file column into the scan and delete the project.
This way the overall logic is clearer, the offload logic is not intruded upon, and RAS can be used directly as well.
cc @liuneng1994
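
The three steps proposed above can be sketched on a toy plan tree. This is only an illustration under stated assumptions: `Plan`, `Scan`, `Project`, `Filter`, the `offloaded` flag, and the `_input_file_` column name are all hypothetical and are not Gluten's actual classes or rules.

```scala
object PushDownInputFileSketch {
  // Hypothetical toy plan nodes; `offloaded` marks nodes that run natively.
  sealed trait Plan
  case class Scan(output: Seq[String], offloaded: Boolean = false) extends Plan
  case class Project(exprs: Seq[String], child: Plan, offloaded: Boolean = false) extends Plan
  case class Filter(cond: String, child: Plan, offloaded: Boolean = false) extends Plan

  val Call = "input_file_name()" // the function call as written in a project
  val Attr = "_input_file_"      // hypothetical name of the pushed-down column

  // Step 1 (before offload): add a project right above the scan whose output
  // is the scan output plus the input file column; upper projects refer to it.
  def pushDown(p: Plan): Plan = p match {
    case s: Scan          => Project(s.output :+ s"$Call AS $Attr", s)
    case Project(e, c, o) => Project(e.map(x => if (x == Call) Attr else x), pushDown(c), o)
    case Filter(c, ch, o) => Filter(c, pushDown(ch), o)
  }

  // Step 2 (offload): offload everything except the input file project.
  // `scanSupported` is an assumed switch for whether the scan can be offloaded.
  def offload(p: Plan, scanSupported: Boolean): Plan = p match {
    case s: Scan => s.copy(offloaded = scanSupported)
    case Project(e, s: Scan, _) if e.exists(_.startsWith(Call)) =>
      Project(e, offload(s, scanSupported), offloaded = false)
    case Project(e, c, _) => Project(e, offload(c, scanSupported), offloaded = true)
    case Filter(c, ch, _) => Filter(c, offload(ch, scanSupported), offloaded = true)
  }

  // Step 3 (after offload): if the scan was offloaded, move the input file
  // column into the scan and delete the extra project.
  def foldIntoScan(p: Plan): Plan = p match {
    case Project(e, s @ Scan(out, true), false) if e.exists(_.startsWith(Call)) =>
      s.copy(output = out :+ Attr)
    case Project(e, c, o) => Project(e, foldIntoScan(c), o)
    case Filter(c, ch, o) => Filter(c, foldIntoScan(ch), o)
    case s: Scan          => s
  }

  def rewrite(p: Plan, scanSupported: Boolean): Plan =
    foldIntoScan(offload(pushDown(p), scanSupported))

  val plan: Plan = Project(Seq("id", Call), Filter("id > 0", Scan(Seq("id"))))
}
```

With `scanSupported = true` the extra project disappears and the offloaded scan produces the input file column itself. With `scanSupported = false` the non-offloaded project stays directly above the non-offloaded scan, so in this model the thread-local lookup would still run next to the scan.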

zhztheplayer · 2024-09-05T05:31:27Z

@zml1206 OK, so let's mark the PR as draft before it's ready?

zml1206 · 2024-09-05T05:39:00Z

new PR #7124

zhztheplayer changed the title ~~[VL] Fix input_file_name results empty string~~ [VL] Fix input_file_name results in empty string Jul 22, 2024

github-actions bot added the CORE (works for Gluten Core) and VELOX labels Aug 6, 2024

zhztheplayer self-requested a review August 28, 2024 07:46

[VL] Fix input_file_name results empty string

52bdaf1

zhztheplayer force-pushed the fix_input_file_name_empty_string branch from 838b5d4 to 52bdaf1 September 4, 2024 08:19

zhztheplayer reviewed Sep 4, 2024


liuneng1994 mentioned this pull request Sep 5, 2024

[CH] Shuffle writer connects to CH pipeline #6723

Merged

zml1206 closed this Sep 5, 2024

zml1206 deleted the fix_input_file_name_empty_string branch December 9, 2025 08:13


3 participants
