Skip to content

Conversation

@karenfeng
Copy link
Contributor

What changes were proposed in this pull request?

Sets references for NamedLambdaVariable and LambdaFunction.

Expression NamedLambdaVariable LambdaFunction
References before None All function references
References after self.toAttribute Function references minus arguments' references

In NestedColumnAliasing, this means that ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...) now references both attr and lv, rather than just attr. As a result, it will not be included in the nested column references.

Why are the changes needed?

Before, lambda key was referenced outside of lambda function.

Example 1

Before:

Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0]
+- 'Join Cross
   :- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
   :  +- LocalRelation <empty>, [kvs#0]
   +- LocalRelation <empty>, [keys#0]

After:

Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419]
+- Join Cross
   :- LocalRelation <empty>, [kvs#417]
   +- LocalRelation <empty>, [keys#418]

Example 2

Before:

Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0]
+- GlobalLimit 5
  +- LocalLimit 5
    +- Project [keys#0, _extract_v1#0 AS _extract_v1#0]
      +- GlobalLimit 5
        +- LocalLimit 5
          +- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0]
            +- LocalRelation <empty>, [kvs#0, keys#0]

After:

Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429]
+- GlobalLimit 5
  +- LocalLimit 5
    +- Project [keys#428, kvs#427]
      +- GlobalLimit 5
        +- LocalLimit 5
          +- LocalRelation <empty>, [kvs#427, keys#428]

Does this PR introduce any user-facing change?

No

How was this patch tested?

Scala unit tests for the examples above

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@github-actions github-actions bot added the SQL label Jun 3, 2021
@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43822/

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43822/

@HyukjinKwon HyukjinKwon changed the title [SPARK-35636] Lambda keys should not be referenced outside of the lambda function [SPARK-35636][SQL] Lambda keys should not be referenced outside of the lambda function Jun 4, 2021
@SparkQA
Copy link

SparkQA commented Jun 4, 2021

Test build #139299 has finished for PR 32773 at commit 9671af6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

also cc @viirya @maropu

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

HyukjinKwon pushed a commit that referenced this pull request Jun 4, 2021
…e lambda function

Sets `references` for `NamedLambdaVariable` and `LambdaFunction`.

| Expression  | NamedLambdaVariable | LambdaFunction |
| --- | --- | --- |
| References before | None | All function references |
| References after | self.toAttribute | Function references minus arguments' references |

In `NestedColumnAliasing`, this means that `ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...)` now references both `attr` and `lv`, rather than just `attr`. As a result, it will not be included in the nested column references.

Before, lambda key was referenced outside of lambda function.

Before:
```
Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0]
+- 'Join Cross
   :- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
   :  +- LocalRelation <empty>, [kvs#0]
   +- LocalRelation <empty>, [keys#0]
```

After:
```
Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419]
+- Join Cross
   :- LocalRelation <empty>, [kvs#417]
   +- LocalRelation <empty>, [keys#418]
```

Before:
```
Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0]
+- GlobalLimit 5
  +- LocalLimit 5
    +- Project [keys#0, _extract_v1#0 AS _extract_v1#0]
      +- GlobalLimit 5
        +- LocalLimit 5
          +- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0]
            +- LocalRelation <empty>, [kvs#0, keys#0]
```

After:
```
Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429]
+- GlobalLimit 5
  +- LocalLimit 5
    +- Project [keys#428, kvs#427]
      +- GlobalLimit 5
        +- LocalLimit 5
          +- LocalRelation <empty>, [kvs#427, keys#428]
```

No

Scala unit tests for the examples above

Closes #32773 from karenfeng/SPARK-35636.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 53a758b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Copy link
Member

Merged to master and branch-3.1.

@HyukjinKwon
Copy link
Member

Reverted in branch-3.1 due to broken tests.

viirya pushed a commit that referenced this pull request Jan 12, 2022
…nside a nested struct

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since #32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes #35170 from ulysses-you/SPARK-37855.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya pushed a commit that referenced this pull request Jan 12, 2022
…ray inside a nested struct

This is a backport of #35170 for branch-3.2.

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since #32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes #35175 from ulysses-you/SPARK-37855-branch-3.2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022
…nside a nested struct

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since apache#32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes apache#35170 from ulysses-you/SPARK-37855.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
…ray inside a nested struct

This is a backport of apache#35170 for branch-3.2.

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since apache#32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes apache#35175 from ulysses-you/SPARK-37855-branch-3.2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
…ray inside a nested struct

This is a backport of apache#35170 for branch-3.2.

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since apache#32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes apache#35175 from ulysses-you/SPARK-37855-branch-3.2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ray inside a nested struct

This is a backport of apache#35170 for branch-3.2.

### What changes were proposed in this pull request?

Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`.

### Why are the changes needed?

Since apache#32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute.

Talk more:
During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Add new test

Closes apache#35175 from ulysses-you/SPARK-37855-branch-3.2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit a58b8a8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants