[SPARK-42331][SQL] Fix metadata col can not been resolved #39870
I'm a bit confused, why do we care about metadata output anymore if it has been added to the output? The column in project will just be resolved as a normal attribute, right?
Yes, but there is a conflict in AddMetadataColumns: it adds a new Project over the original output, see
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Lines 995 to 1002 in 6bb68b5

    val newNode = addMetadataCol(node, metaCols.map(_.exprId).toSet)
    // We should not change the output schema of the plan. We should project away the extra
    // metadata columns if necessary.
    if (newNode.sameOutput(node)) {
      newNode
    } else {
      Project(node.output, newNode)
    }

then with this case, it will be:

    Project (_metadata.file_size)
      Project (c)
        Filter (_metadata.file_size > 0)
          Relation c, metadataOutput

I guess the original idea is that we should always use the children's metadata output for resolution, to avoid exposing unnecessary metadata columns, isn't it?
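A minimal sketch of why the extra Project appears, assuming a toy plan model. The `Plan`, `Relation`, `Filter`, and `Project` classes and `keepOutputSchema` below are hypothetical stand-ins written for illustration, not Spark's Catalyst classes; only the if/else shape follows the Analyzer snippet quoted above.

```scala
// Toy model only: these classes are hypothetical stand-ins, not Spark's Catalyst types.
object ExtraProjectSketch {
  sealed trait Plan { def output: Seq[String] }
  final case class Relation(output: Seq[String]) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan {
    def output: Seq[String] = child.output
  }
  final case class Project(output: Seq[String], child: Plan) extends Plan

  // Same shape as the quoted snippet: keep the node's original output schema by
  // projecting away the extra metadata columns when the rewritten output differs.
  def keepOutputSchema(node: Plan, newNode: Plan): Plan =
    if (newNode.output == node.output) newNode else Project(node.output, newNode)

  def main(args: Array[String]): Unit = {
    // Before: the relation exposes only the data column "c".
    val before = Filter("_metadata.file_size > 0", Relation(Seq("c")))
    // After merging the metadata column into the relation's output.
    val after = Filter("_metadata.file_size > 0", Relation(Seq("c", "_metadata")))
    // The outputs differ, so the rewritten Filter is wrapped in Project(c).
    println(keepOutputSchema(before, after))
  }
}
```

In this model the wrapper hides `_metadata` from any parent node, which is why the outer Project in the plan above can only find the column through the metadata output.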
Fixed in #39895
It's actually a different issue; my test cases still fail with your PR.
After addMetadataCol, the output of newNode is always different from that of the original node (the metadata output is merged into the output), so an extra Project cannot be avoided. (Your PR fixes a special case of hiddenOutputTag with NaturalJoin.)
Then the only way to resolve the remaining metadata columns is to use the metadata output. But before this PR, the metadata output is lost once addMetadataCol is called.
I see, the problem here is we reference metadata col twice in two different nodes. I think the issue is we add the extra project too early. We should only do it once for the root node.
I'm not sure it's safe to add a project for root node. One issue is we do not know the output of root node because it's not resolved.
...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
to be safe, let's only do it under case hasMeta: SupportsMetadataColumns
also add some comments to explain when it happens
addressed
ditto, move it under case relation: HadoopFsRelation =>
ditto, and add comments as well
Maybe we can add a util method in object FileFormat to reduce code duplication
addressed

      // has a column "line" and the table can produce a metadata column called "line", then the
      // data column should be returned, not the metadata column.
    - hasMeta.metadataColumns.filterNot(isOutputColumn).toAttributes
    + metadataOutputWithOutConflicts(hasMeta.metadataColumns.toAttributes)

One small difference is that we now call toAttributes before filtering out conflicts. Seems fine.
@cloud-fan any comments?
thanks, merging to master/3.4!
### What changes were proposed in this pull request?
This PR makes the metadata output consistent during analysis by checking the output and reusing the metadata columns that already exist there.
This PR also deduplicates the metadata output when merging it into the output.
### Why are the changes needed?
Consider the following process of resolving metadata columns:
```
Project (_metadata.file_size)
  Filter (_metadata.file_size > 0)
    Relation
```
1. `ResolveReferences` resolves `_metadata.file_size` for `Filter`.
2. `ResolveReferences` cannot resolve `_metadata.file_size` for `Project`, because `Filter` is not yet resolved (data type does not match).
3. `AddMetadataColumns` then merges the metadata output into the output.
4. The next round of `ResolveReferences` still cannot resolve `_metadata.file_size` for `Project`, since we filter out the conflicting names (the output already contains the metadata output); see the code:
```
def isOutputColumn(col: MetadataColumn): Boolean = {
  outputNames.exists(name => resolve(col.name, name))
}
// filter out metadata columns that have names conflicting with output columns. if the table
// has a column "line" and the table can produce a metadata column called "line", then the
// data column should be returned, not the metadata column.
hasMeta.metadataColumns.filterNot(isOutputColumn).toAttributes
```
We also cannot skip the metadata columns when filtering out conflicting names; otherwise the newly generated metadata attributes would have different expr IDs from the previous ones.
One failing example:
```sql
SELECT _metadata.row_index FROM t WHERE _metadata.row_index >= 0;
```
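The resolution failure described in this section can be sketched with a small self-contained model. Everything here is an illustrative assumption: `Attr`, `freshMetadataAttrs`, `metadataOutputBefore`, and `metadataOutputAfter` are hypothetical names, not Spark's API, and the "after" variant only mirrors the stated idea of checking the output and reusing the metadata columns found there.

```scala
// Toy model only: Attr and all functions below are hypothetical, not Spark's API.
object MetadataOutputSketch {
  final case class Attr(name: String, exprId: Long, isMetadataCol: Boolean = false)

  private var nextId = 100L

  // Stand-in for building attributes from metadata columns: each call allocates new ids.
  def freshMetadataAttrs(names: Seq[String]): Seq[Attr] = names.map { n =>
    nextId += 1
    Attr(n, nextId, isMetadataCol = true)
  }

  // Behavior described above: drop every metadata column whose name matches an output
  // column, even when that output column is the already-merged metadata column itself.
  def metadataOutputBefore(output: Seq[Attr], metadataNames: Seq[String]): Seq[Attr] =
    freshMetadataAttrs(metadataNames)
      .filterNot(m => output.exists(_.name.equalsIgnoreCase(m.name)))

  // Idea of the fix: if metadata columns were already merged into the output, reuse
  // them (same expr id); otherwise fall back to the name-conflict filter.
  def metadataOutputAfter(output: Seq[Attr], metadataNames: Seq[String]): Seq[Attr] = {
    val merged = output.filter(_.isMetadataCol)
    if (merged.nonEmpty) merged else metadataOutputBefore(output, metadataNames)
  }
}
```

With an output of `c` plus an already-merged `_metadata`, the first function returns nothing, so a parent node has nothing left to resolve against. The second returns the merged attribute with its original expr ID, while a genuine data column that shares a metadata column's name still wins.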
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
Added tests for v1, v2, and streaming relations.
Closes #39870 from ulysses-you/SPARK-42331.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5705436)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thank you @cloud-fan!