[SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results #14411
Conversation
[SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?

./dev/run-tests and a new unit test on the problematic pattern.
cc @hvanhovell
@nsyca Could you update the PR description? Thanks!
We also need more test cases to prove it works as expected.
I included two examples of "good" cases in the JIRA to show that this fix only blocks cases where Spark would produce incorrect results. I need to find a place to host those "good" cases; I don't think AnalysisErrorSuite.scala is the right place.
I think the positive cases can move here.
@gatorsmile: thanks. I will add them to SubquerySuite.
Two good cases, which return the same result set with and without this proposed fix:

```scala
sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and exists (select 1 from t2 LIMIT 1)").show
```

The above query will return both rows from T1.

```scala
sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show
```

This one will return one row, but which row is returned is non-deterministic, depending on which row the innermost subquery picks first from T2.
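For readers trying these out, a minimal sketch of a setup the snippets assume (the table names come from the comment above; the two-row contents are illustrative guesses consistent with "both rows from T1"):

```scala
// Hypothetical two-row tables; inline VALUES avoids needing spark.implicits.
sql("select * from values (1), (2) as t1(c1)").createOrReplaceTempView("t1")
sql("select * from values (1), (2) as t2(c2)").createOrReplaceTempView("t2")
```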
It sounds great to me!
Ok, this looks pretty good. One overall comment: we are basically blacklisting operators here (which is fine IMO); should we check if there are any other operators we should care about? If there are, we might need to generalize the blacklisting (instead of working case by case).
Test build #3199 has finished for PR 14411 at commit
Thank you for your comment. There are quite a few patterns being blacklisted already, such as correlation under set operators (UNION, EXCEPT, INTERSECT), correlation outside of a WHERE/HAVING context, and correlation in the right table of a LEFT [OUTER] JOIN (or the left table of a RIGHT [OUTER] JOIN). I am working on discovering more issues in this area, but it looks like a bigger project to me.

I have a general idea that the rewrite of a correlated subquery to a join should not happen in the Analysis phase. We should build a logical plan to represent the subquery and perform the rewrite in the Optimizer phase instead. I am new to the Spark code and this is my first PR, so I'd like to keep it a small, self-contained project to gain confidence in working with the code.
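As a concrete illustration of one pattern on that blacklist (a sketch only; t1/t2/t3 and their columns are made up for the example), correlation under a set operator looks like this:

```scala
// The correlated predicate t1.c1 = t2.c2 sits below an INTERSECT, so it cannot
// safely be pulled above the set operator; the analyzer rejects this shape.
sql("""
  select c1 from t1
  where exists (select c2 from t2 where t1.c1 = t2.c2
                intersect
                select c2 from t3)
""")
```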
retest this please |
@nsyca We do not rewrite the subquery into a join during analysis. We rewrite subqueries into joins during optimization. We do two things during analysis:

```sql
select b.key, min(b.value)
from src b
group by b.key
having exists ( select a.key
                from src a
                where a.value > 'val_9' and a.value = min(b.value)
              )
```

I think we should also limit the use of `Sample`.
Test build #3200 has finished for PR 14411 at commit
First, my apologies for delaying the replies. I am travelling this week and only getting spontaneous connections. Thank you for your explanation of the implementation and the reasoning behind it; it is very helpful for a beginner like me.

My bad! What I meant in my previous comment on rewriting subqueries to joins is actually the moving of the correlated predicates from their original positions to outside the scopes of the subqueries, specifically the call to the function `pullOutCorrelatedPredicates()` -- I hope I got it right this time. I see this as one of the root causes of many problems. Bear with me, I don't have a good solution yet, as I am still getting myself familiar with the code.

Here is an example of the problems, in my opinion. With the rewrite, we cannot distinguish between the EXISTS form and the IN form of the original SQL. `select * from t1 where exists (select 1 from t2 where t1.c1=t2.c2)` and its IN form are represented the same way after the Analysis phase. This does not cause an issue because they are semantically equivalent. However, when we add the NOT, `select * from t1 where not exists (select 1 from t2 where t1.c1=t2.c2)` and its NOT IN form are NOT semantically equivalent when T2.C2 can produce NULL values.

Lastly, your comment on the operator SAMPLE seems right. I will give it a shot at adding it to this PR. Thanks again for your patience.
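That NULL behavior is easy to demonstrate. A minimal sketch (the tiny tables are illustrative, not from the PR):

```scala
// t2.c2 contains a NULL, which is exactly where NOT EXISTS and NOT IN diverge.
sql("select * from values (1), (2) as t1(c1)").createOrReplaceTempView("t1")
sql("select * from values (1), (cast(null as int)) as t2(c2)").createOrReplaceTempView("t2")

// NOT EXISTS: c1 = 2 finds no matching row in t2, so the row (2) is returned.
sql("select c1 from t1 where not exists (select 1 from t2 where t1.c1 = t2.c2)").show

// NOT IN: c1 <> NULL evaluates to unknown, never true, so no rows are returned.
sql("select c1 from t1 where c1 not in (select c2 from t2)").show
```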
@nsyca No problem. We actually support `NOT IN`. I have written a blogpost/notebook on the current implementation and state of subqueries in Spark 2.0 (including known issues): SQL Subqueries in Apache Spark 2.0
@hvanhovell, thanks for sharing the blog. I will read through it. It's nice to see the implementation of NOT IN done this way. I have an idea to do it differently, but let's move that to another place.

On the SAMPLE issue you raised, I think we should not flag an error. Here is what I tested: from the parser, the correlated predicate in the Filter operation comes after the sampling operation. We should be able to treat the semantics of the sampling as a one-time execution that is reused for every input from the outer table. Using the analogy I used for LIMIT as described in the JIRA SPARK-16804, the SAMPLE operation is not on the correlation path, and therefore moving the correlated predicate above the scope of the subquery does not change the semantics of the query. Your thoughts, please!
Have you had a chance to review my last update? Is there anything I should add/change in this PR?
@nsyca I think we do need to prevent sampling from being used. I have the following example:

```scala
range(0, 10).createOrReplaceTempView("tbl_a")
range(0, 10).select($"id", $"id" % 10 as "grp_id").createOrReplaceTempView("tbl_b")
range(0, 10).select($"id", $"id" % 10 as "grp_id").createOrReplaceTempView("tbl_c")
val plan = sql("""
  select *
  from tbl_a
  where not exists(
    select 1
    from tbl_b
    join (select *
          from tbl_c
          where tbl_c.id = tbl_a.id) tablesample(0.01 percent) c
    on c.grp_id = tbl_b.grp_id)
""")
```

In the resulting analyzed plan, the predicate has clearly been pulled out of the sampled relation. I don't think we want this. I am looking forward to discussing your NOT IN idea.
Yes, I agree that we need to block this case. I was under the impression that the tablesample clause is supported only when it references a base table, not a derived table; that's clearly not the case in Spark. I will add code to prevent it. Thanks.
Code and test case for blocking `tablesample` have been added.
```scala
case e: Expand =>
  failOnOuterReferenceInSubTree(e, "an EXPAND")
  e
case l @ LocalLimit(_, _) =>
```
Style: use `l: LocalLimit` instead of `l @ LocalLimit(_, _)`; it makes it a bit more readable. Same for `GlobalLimit` and `Sample`.
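For illustration, a sketch of the suggested style applied to the three operators named above (the message strings are assumptions, not quoted from the patch):

```scala
// Typed patterns avoid restating each operator's constructor arity.
case l: LocalLimit =>
  failOnOuterReferenceInSubTree(l, "a LIMIT")
  l
case g: GlobalLimit =>
  failOnOuterReferenceInSubTree(g, "a LIMIT")
  g
case s: Sample =>
  failOnOuterReferenceInSubTree(s, "a SAMPLE")
  s
```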
Done. Thanks for the comment. It looks more compact and does not break if those three operators change their arguments in the future.
LGTM - pending Jenkins.
@nsyca I have triggered a manual build. I'll merge as soon as it completes successfully.
Test build #3206 has finished for PR 14411 at commit
Test build #3207 has finished for PR 14411 at commit
Before submitting the PR, you can run this command to check the Scala style: `./dev/lint-scala`
Thanks, @gatorsmile. This time I ran the style check before pushing.
retest this please |
Test build #3208 has finished for PR 14411 at commit
Merging to master. Thanks for working on this! Ping me as soon as you open a JIRA for null-aware anti joins.
Thanks for getting the PR merged, and sorry for causing a few hiccups before I got it right. It's my first PR. I have opened a new JIRA, SPARK-16951, to track the NOT IN issue. Currently it contains little information, but I will start to fill it in over the next few days. Btw, would you mind making me (nsyca) the Assignee of SPARK-16804? I look forward to collaborating with you on future issues.
… PRs

## What changes were proposed in this pull request?

This PR backports two subquery related PRs to branch-2.0:

- #14411
- #15761

## How was this patch tested?

Added tests to `SubquerySuite`.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15772 from hvanhovell/SPARK-17337-2.0.
## What changes were proposed in this pull request?

This patch fixes the incorrect results produced by the rule ResolveSubquery in Catalyst's Analysis phase by returning an error message when a LIMIT is found on the path from the parent table to the correlated predicate in the subquery.

## How was this patch tested?

./dev/run-tests and a new unit test on the problematic pattern.
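As a sketch of the resulting behavior (written in the style of a ScalaTest suite; the exact error wording is an assumption, not quoted from the patch):

```scala
import org.apache.spark.sql.AnalysisException

// The problematic pattern from the discussion now fails at analysis time
// instead of silently returning a non-deterministic single row.
val e = intercept[AnalysisException] {
  sql("select c1 from t1 where exists " +
      "(select 1 from (select 1 from t2 limit 1) where t1.c1 = t2.c2)")
}
// e.getMessage should indicate that a correlated predicate under a LIMIT is not allowed.
```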