[SPARK-42331][SQL] Fix metadata col can not been resolved #39870

ulysses-you · 2023-02-03T06:37:08Z

What changes were proposed in this pull request?

This PR makes the metadata output consistent during analysis: it checks the output and reuses the metadata attributes that already exist there.

This PR also deduplicates the metadata output when merging it into the output.

Why are the changes needed?

Consider the following process of resolving a metadata column:

Project (_metadata.file_size)
  Filter (_metadata.file_size > 0)
    Relation

1. ResolveReferences resolves _metadata.file_size for Filter.
2. ResolveReferences cannot resolve _metadata.file_size for Project, because Filter is not yet resolved (the data types do not match).
3. AddMetadataColumns then merges the metadata output into the output.
4. The next round of ResolveReferences cannot resolve _metadata.file_size for Project, because we filter out the conflicting names (the output already contains the metadata output). See the code:

    def isOutputColumn(col: MetadataColumn): Boolean = {
      outputNames.exists(name => resolve(col.name, name))
    }
    // filter out metadata columns that have names conflicting with output columns. if the table
    // has a column "line" and the table can produce a metadata column called "line", then the
    // data column should be returned, not the metadata column.
    hasMeta.metadataColumns.filterNot(isOutputColumn).toAttributes

We also cannot simply skip the metadata columns when filtering out conflicting names; otherwise the newly generated metadata attribute would have a different expr ID from the previous one.
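The interaction between name-based filtering and an already-merged metadata column can be sketched with a toy model. This is not Spark code: `Attr`, `filterByName`, and `reuseFromOutput` are hypothetical names, and the `isMetadata` flag is an assumed stand-in for however a real attribute marks itself as a metadata column.

```scala
// Toy model only. Attr and both helpers are hypothetical, not Spark classes or APIs.
object MetadataOutputSketch {
  // Assumption: an attribute is a name, a unique expr ID, and a metadata flag.
  final case class Attr(name: String, exprId: Long, isMetadata: Boolean = false)

  // Name-only filtering, mirroring filterNot(isOutputColumn) in the snippet above:
  // any metadata column whose name already appears in the output is dropped.
  def filterByName(output: Seq[Attr], metadataCols: Seq[Attr]): Seq[Attr] =
    metadataCols.filterNot(m => output.exists(_.name.equalsIgnoreCase(m.name)))

  // Sketch of the idea in this PR: if metadata attributes were already merged into
  // the output, reuse those exact attributes (same expr IDs); otherwise fall back
  // to name-based filtering so that data columns still win on a name conflict.
  def reuseFromOutput(output: Seq[Attr], metadataCols: Seq[Attr]): Seq[Attr] = {
    val alreadyMerged = output.filter(_.isMetadata)
    if (alreadyMerged.nonEmpty) alreadyMerged else filterByName(output, metadataCols)
  }

  def main(args: Array[String]): Unit = {
    val c = Attr("c", 1L)
    val fileSize = Attr("file_size", 2L, isMetadata = true)
    val mergedOutput = Seq(c, fileSize) // state after the metadata column was merged in

    println(filterByName(mergedOutput, Seq(fileSize)))    // metadata column is filtered away
    println(reuseFromOutput(mergedOutput, Seq(fileSize))) // same attribute, same expr ID
  }
}
```

In this sketch, name-only filtering returns nothing once the metadata column is part of the output, while the reuse variant returns the original attribute unchanged. A genuine data column that merely shares a name with a metadata column still hides that metadata column.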

One failing example:

SELECT _metadata.row_index FROM t WHERE _metadata.row_index >= 0;

Does this PR introduce any user-facing change?

Yes, this is a bug fix.

How was this patch tested?

Added tests for v1, v2, and streaming relations.

ulysses-you · 2023-02-03T09:01:28Z

cc @Yaohua628 @cloud-fan

cloud-fan · 2023-02-03T12:59:03Z

...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

I'm a bit confused, why do we care about metadata output anymore if it has been added to the output? The column in project will just be resolved as a normal attribute, right?

Yes, but there is a conflict in AddMetadataColumns. It adds a new Project over the original output; see

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 995 to 1002 in 6bb68b5

    val newNode = addMetadataCol(node, metaCols.map(_.exprId).toSet)
    // We should not change the output schema of the plan. We should project away the extra
    // metadata columns if necessary.
    if (newNode.sameOutput(node)) {
      newNode
    } else {
      Project(node.output, newNode)
    }

Then, in this case, the plan becomes:

Project (_metadata.file_size)
  Project (c)
    Filter (_metadata.file_size > 0)
      Relation (c, metadata output)

I guess the original idea is that we should always use children.metadataoutputs for resolution, to avoid exposing unnecessary metadata columns. Is that right?
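Why the outer Project depends on the metadata output can be illustrated with a small stand-alone model. Everything here is a hypothetical simplification, not Catalyst: the `Plan` nodes are toy classes, `resolve` is an invented helper, and it is an assumption of the sketch that Filter and Project simply pass their child's metadata output through.

```scala
// Toy model only. These Plan nodes are hypothetical and much simpler than Catalyst's.
object ExtraProjectSketch {
  final case class Attr(name: String, exprId: Long, isMetadata: Boolean = false)

  sealed trait Plan {
    def output: Seq[Attr]
    def metadataOutput: Seq[Attr]
  }

  final case class Relation(output: Seq[Attr], metadataOutput: Seq[Attr]) extends Plan

  final case class Filter(child: Plan) extends Plan {
    def output: Seq[Attr] = child.output
    def metadataOutput: Seq[Attr] = child.metadataOutput // assumption: passed through
  }

  final case class Project(output: Seq[Attr], child: Plan) extends Plan {
    def metadataOutput: Seq[Attr] = child.metadataOutput // assumption: passed through
  }

  // Hypothetical resolver: look in the child's output first, then its metadata output.
  def resolve(name: String, child: Plan): Option[Attr] =
    child.output.find(_.name == name).orElse(child.metadataOutput.find(_.name == name))

  def main(args: Array[String]): Unit = {
    val c = Attr("c", 1L)
    val fileSize = Attr("file_size", 2L, isMetadata = true)

    // Metadata column merged into the relation output, metadata output lost.
    val lost = Project(Seq(c), Filter(Relation(Seq(c, fileSize), Nil)))
    // Metadata column merged into the relation output, metadata output still available.
    val kept = Project(Seq(c), Filter(Relation(Seq(c, fileSize), Seq(fileSize))))

    println(resolve("file_size", lost))
    println(resolve("file_size", kept))
  }
}
```

Under these assumptions, the inserted Project (c) hides file_size from the outer Project's view of the output, so the lookup only succeeds when the relation still reports the column in its metadata output, and it then returns the attribute with the original expr ID.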

Fixed in #39895

It's actually a different issue; my test cases still fail with your PR.

After addMetadataCol, the output of newNode always differs from that of the original node (the metadata output is merged into the output), so an extra Project cannot be avoided. (Your PR fixes a special case of hiddenOutputTag with NaturalJoin.)
Then the only way to resolve the remaining metadata columns is to use the metadata output. But before this PR, the metadata output was lost once addMetadataCol had been called.

I see, the problem here is we reference metadata col twice in two different nodes. I think the issue is we add the extra project too early. We should only do it once for the root node.

I'm not sure it's safe to add a Project for the root node. One issue is that we do not know the output of the root node, because it is not resolved.

cloud-fan · 2023-02-07T12:45:11Z

...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

to be safe, let's only do it under case hasMeta: SupportsMetadataColumns

also add some comments to explain when it happens

cloud-fan · 2023-02-07T13:03:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala

ditto, move it under case relation: HadoopFsRelation =>

cloud-fan · 2023-02-07T13:03:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala

ditto, and add comments as well

Maybe we can add a util method in object FileFormat to reduce code duplication

ulysses-you · 2023-02-07T14:12:12Z

...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

-      // has a column "line" and the table can produce a metadata column called "line", then the
-      // data column should be returned, not the metadata column.
-      hasMeta.metadataColumns.filterNot(isOutputColumn).toAttributes
+      metadataOutputWithOutConflicts(hasMeta.metadataColumns.toAttributes)


One small difference is that we call toAttributes before filtering out conflicts. Seems fine.

ulysses-you · 2023-02-10T05:07:34Z

@cloud-fan any comments?

cloud-fan · 2023-02-13T04:43:32Z

thanks, merging to master/3.4!

Commit message (body identical to the pull request description above), with trailers:

Closes #39870 from ulysses-you/SPARK-42331.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5705436)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

ulysses-you · 2023-02-13T11:22:38Z

thank you @cloud-fan !

github-actions bot added SQL STRUCTURED STREAMING labels Feb 3, 2023

ulysses-you force-pushed the SPARK-42331 branch 3 times, most recently from 9fb9a46 to c5a7a37 Compare February 3, 2023 06:52

cloud-fan reviewed Feb 3, 2023

cloud-fan reviewed Feb 7, 2023

ulysses-you force-pushed the SPARK-42331 branch from c5a7a37 to e91a119 Compare February 7, 2023 12:40

cloud-fan reviewed Feb 7, 2023

ulysses-you force-pushed the SPARK-42331 branch from cff691d to 4e703d1 Compare February 7, 2023 14:09

Fix metadata col can not been resolved

5ba6565

ulysses-you force-pushed the SPARK-42331 branch from 4e703d1 to 5ba6565 Compare February 7, 2023 14:09

ulysses-you commented Feb 7, 2023

cloud-fan approved these changes Feb 13, 2023

cloud-fan closed this in 5705436 Feb 13, 2023

ulysses-you deleted the SPARK-42331 branch February 13, 2023 11:22

penghuo mentioned this pull request Feb 26, 2024

[FEATURE] Support partial indexing for skipping and covering index opensearch-project/opensearch-spark#89

Open
