[SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results #14411

nsyca · 2016-07-29T21:56:47Z

What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase by returning an error message when the LIMIT is found in the path from the parent table to the correlated predicate in the subquery.

How was this patch tested?

./dev/run-tests
a new unit test on the problematic pattern.

…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.

gatorsmile · 2016-07-29T21:58:49Z

cc @hvanhovell

gatorsmile · 2016-07-29T22:01:24Z

@nsyca Could you update the PR description? Thanks!

gatorsmile · 2016-07-29T22:02:51Z

We also need more test cases to prove it work as expected.

nsyca · 2016-07-29T22:30:48Z

I include two examples of "good" case in the JIRA to show that this fix only blocks cases where Spark will produce incorrect results. I need to find a place to host those "good" cases. Don't think AnalysisErrorSuite.scala is the right place.

gatorsmile · 2016-07-29T22:36:54Z

I think the positive cases can move here.
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

nsyca · 2016-07-29T22:40:41Z

@gatorsmile: thanks. I will add them in SubquerySuite.

nsyca · 2016-07-29T22:43:37Z

Two good cases, which return the same result set, with and without this proposed fix:

sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and exists (select 1 from t2 LIMIT 1)").show

The above query will return both rows from T1.

sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show

This one above will return 1 row but which row will return is non-deterministic depending on what the first row from T2 will return from the innermost subquery.

gatorsmile · 2016-07-29T22:56:51Z

It sounds great to me!

hvanhovell · 2016-07-31T22:22:50Z

Ok, this looks pretty good. One overall comment: We are basically blacklisting operators here (which is fine IMO), should we check if there any other operators we should care about? If there are we might need to generalize the blacklisting (instead of working case by case)

SparkQA · 2016-07-31T23:51:39Z

Test build #3199 has finished for PR 14411 at commit edca333.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

nsyca · 2016-08-01T13:41:04Z

@hvanhovell,

Thank you for your comment. There are quite a few patterns being blacklisted already, such as correlation under set operators (UNION, EXCEPT, INTERSECT), correlation outside of WHERE/HAVING context, correlation in the right table of a LEFT [OUTER] JOIN (and the left table of a RIGHT [OUTER] JOIN]). I am working on discovering more issues in this area but it looks like a bigger project to me. I have a general idea that the rewrite of correlation subquery to join should not happen in the Analysis phase. We should build a Logical plan to represent the subquery and perform the rewrite at the Optimizer phase instead.

I am new to the Spark code and this is my first PR. So I'd like to make it a small, self-contained project to gain my confidence in working with the code.

gatorsmile · 2016-08-01T16:01:38Z

retest this please

hvanhovell · 2016-08-01T16:27:47Z

@nsyca We do not rewrite the subquery into a join during analysis. We rewrite subqueries into joins during optimization. We do two things during analysis:

We check if the subquery expression is valid. In order to do this we need to check if the query resolves (given the outer query), and that no outer references are used in (the children of) nodes for which the joining behavior is ill defined (UNION for instance).
We also rewrite IN/EXISTS/Scalar subquery expressions into a PredicateSubquery. We do this by extracting correlated predicates and by rewriting the intermediate tree. One could argue that this could also be done during optimization, but this was needed to get correlated predicates with aggregate functions (referencing the outer query) working (see the example below). For this we needed to push the complete outer condition into the Aggregate below the Having clause. Perhaps there is a simpler way of doing this though.

select b.key, min(b.value)
from src b
group by b.key
having exists ( select a.key
                from src a
                where a.value > 'val_9' and a.value = min(b.value)
                )

I think we should also limit the use of Sample, which also filters non-deterministically and might give us very wrong results as well.

SparkQA · 2016-08-01T18:25:41Z

Test build #3200 has finished for PR 14411 at commit 64184fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nsyca · 2016-08-01T23:45:00Z

@hvanhovell,

First, my apologies for delaying the replies. I am travelling this week, only getting spontaneous connections. Thank you for your explanation of the implementation and the reason behind the choice of the implementation. It is very helpful for a beginner like me.

My bad! What I meant in my previous comment on rewriting of subqueries to join is actually the moving of the positions of the correlated predicates from their original positions to outside of the scopes of subqueries, specifically, the call to the function pullOutCorrelatedPredicates() -- I hope I got it right this time. I see this as one of the root causes of many problems. Bear with me, I don't have a good solution as I am still getting myself familiar with the code. Here is an example of the problems, in my opinion. With the rewrite, we cannot distinct between the EXISTS form and IN form of the original SQL.

select * from t1 where exists (select 1 from t2 where t1.c1=t2.c2)
-and-
select * from t1 where t1.c1 in (select t2.c2 from t2)

are represented after Analysis phase. This does not have issue because they are semantically equivalent. However, when we add the NOT in

select * from t1 where not exists (select 1 from t2 where t1.c1=t2.c2)
-and-
select * from t1 where t1.c1 not in (select t2.c2 from t2)

are NOT semantically equivalent when T2.C2 can produce NULL values.

Lastly, your comment on the operator SAMPLE seems right. I will give it shot on adding it to this PR.

Thanks again for your patience.

hvanhovell · 2016-08-02T09:04:29Z

@nsyca No problem. We actually support NOT IN queries. We set the the PredicateSubquery.nullAware flag to true if we encounter an IN subquery expression. NOT IN is planned using a NULL aware anti-join in the optimizer.

I have written a blogpost/notebook on the current implementation and state of subqueries in Spark 2.0 (including known issues): SQL Subqueries in Apache Spark 2.0

nsyca · 2016-08-02T20:45:37Z

@hvanhovell , thanks for sharing the blog. I will read thru. It's nice to see the implementation of NOT IN this way. I have an idea to do it differently but let's move this to another place.

On the SAMPLE issue you raised, I think we should not flag an error. Here is what I tested:

Seq((1,1), (2,2)).toDF("c1","c2").createOrReplaceTempView("t1")
Seq((1,1), (2,2)).toDF("c3","c4").createOrReplaceTempView("t2")

scala> sql("select * from t1 where exists (select 1 from t2 tablesample(10 percent) s where c3=c1)").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter exists#29
   :  +- 'SubqueryAlias exists#29
   :     +- 'Project [unresolvedalias(1, None)]
   :        +- 'Filter ('c3 = 'c1)
   :           +- 'Sample 0.0, 0.1, false, 159
   :              +- 'UnresolvedRelation `t2`, s
   +- 'UnresolvedRelation `t1`

From the parser, the correlated predicate in the Filter operation is after the sampling operation. We should be able to treat the semantic of the sampling as an one-time execution and being reused for every input from the outer table. Using the analogy I used for LIMIT as described in the JIRA SPARK=16804, the SAMPLE operation is not on the correlation path and therefore the move of the correlated predicate above the scope of the subquery does not change the semantic of the query.

Your thoughts, please!

nsyca · 2016-08-05T09:53:08Z

@hvanhovell,

Have you had a chance to review my last update? Are there anything I should add/change in this PR?

hvanhovell · 2016-08-05T10:43:01Z

@nsyca I think we do need to prevent sampling from being used. I have the following example:

range(0, 10).createOrReplaceTempView("tbl_a")
range(0, 10).select($"id", $"id" % 10 as "grp_id").createOrReplaceTempView("tbl_b")
range(0, 10).select($"id", $"id" % 10 as "grp_id").createOrReplaceTempView("tbl_c")

val plan = sql("""
select *
from   tbl_a
where  not exists(
        select 1
        from   tbl_b
               join (select *
                     from   tbl_c
                     where  tbl_c.id = tbl_a.id) tablesample(0.01 percent) c
                on c.grp_id = tbl_b.grp_id)
""")

This results in the following analyzed plan:

Project [id#8L]
+- Filter NOT predicate-subquery#34 [(id#24L = id#8L)]
   :  +- SubqueryAlias predicate-subquery#34 [(id#24L = id#8L)]
   :     +- Project [1 AS 1#42, id#24L]
   :        +- Join Inner, (grp_id#29L = grp_id#19L)
   :           :- SubqueryAlias tbl_b
   :           :  +- Project [id#14L, (id#14L % cast(10 as bigint)) AS grp_id#19L]
   :           :     +- Range (0, 10, splits=8)
   :           +- SubqueryAlias c
   :              +- Sample 0.0, 1.0E-4, false, 968
   :                 +- Project [id#24L, grp_id#29L]
   :                    +- SubqueryAlias tbl_c
   :                       +- Project [id#24L, (id#24L % cast(10 as bigint)) AS grp_id#29L]
   :                          +- Range (0, 10, splits=8)
   +- SubqueryAlias tbl_a
      +- Range (0, 10, splits=8)

Clearly the predicate has been pulled out of a sampled relation. I don't think we want this.

I am looking forward to discuss your NOT IN approach. Could you open a JIRA for that?

nsyca · 2016-08-05T14:12:23Z

@hvanhovell ,

Yes. I agree that we need to block this case. I was under the impression that the tablesample clause is supported only when referenced to a base table, not a derived table. It's clearly not in Spark. I will add code to prevent it.

On the NOT IN topic, let me spend some time arranging my thought then I will open a JIRA.

thanks.

nsyca · 2016-08-05T17:50:08Z

@hvanhovell,

Code and test case for blocking TABLESAMPLE is in. Could you please review? Thanks.

hvanhovell · 2016-08-05T19:22:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

        case e: Expand =>
          failOnOuterReferenceInSubTree(e, "an EXPAND")
          e
+        case l @ LocalLimit(_, _) =>


Style: use l: LocalLimit instead of l @ LocalLimit(_, _) it makes it a bit more readable. Same for GlobalLimit and Sample.

nsyca · 2016-08-05T21:13:04Z

Done. Thanks for the comment. It looks more compact and does not break if those 3 operators change their arguments in the future.

hvanhovell · 2016-08-05T21:15:38Z

LGTM - pending Jenkins.

hvanhovell · 2016-08-07T17:00:04Z

@nsyca I have triggered a manual build. I'll merge as soon as it completes successfully.

SparkQA · 2016-08-07T17:01:15Z

Test build #3206 has finished for PR 14411 at commit ac43ab4.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-07T22:03:40Z

Test build #3207 has finished for PR 14411 at commit 631d396.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-08-07T22:27:44Z

Before you submitting the PR, you can run this command to check the scala style:

dev/lint-scala

nsyca · 2016-08-08T02:32:41Z

Thanks, @gatorsmile. This time I ran dev/lint-scala. Hope it's my last attempt to get this work thru.

gatorsmile · 2016-08-08T05:06:01Z

retest this please

SparkQA · 2016-08-08T09:17:15Z

Test build #3208 has finished for PR 14411 at commit 7eb9b2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-08-08T10:13:24Z

Merging to master. Thanks for working on this!

Ping me as soon as you open a JIRA for null aware anti joins.

nsyca · 2016-08-08T16:31:45Z

@hvanhovell,

Thanks for getting the PR merged and sorry for causing a few hiccups before I got it right. It's my first PR.

I have opened a new JIRA, SPARK-16951, to track the NOT IN issue. Currently it still contains little information but I will start to fill in in the next few days.

Btw, would you mind assigning me (nsyca) the Assignee of SPARK-16804?

Will look forward to collaborating with you in future issues.

… PRs ## What changes were proposed in this pull request? This PR backports two subquery related PRs to branch-2.0: - #14411 - #15761 ## How was this patch tested? Added a tests to `SubquerySuite`. Author: Nattavut Sutyanyong <nsy.can@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15772 from hvanhovell/SPARK-17337-2.0.

nsyca added 2 commits July 29, 2016 17:43

New positive test cases

edca333

Fix unit test case failure

64184fd

nsyca changed the title ~~[SPARK-16804][SQL] Correlated subqueries containing LIMIT return incorrect results~~ [SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results Aug 5, 2016

blocking TABLESAMPLE

29f82b0

hvanhovell reviewed Aug 5, 2016
View reviewed changes

Fixing code styling

ac43ab4

Correcting Scala test style

631d396

One (last) attempt to correct the Scala style tests

7eb9b2d

asfgit closed this in 06f5dc8 Aug 8, 2016

hvanhovell mentioned this pull request Nov 4, 2016

[SPARK-17337][SPARK-16804][SQL] Backport subquery related PRs [BRANCH-2.0] #15772

Closed

[SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results #14411

[SPARK-16804][SQL] Correlated subqueries containing non-deterministic operations return incorrect results #14411

Uh oh!

Conversation

nsyca commented Jul 29, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

gatorsmile commented Jul 29, 2016

Uh oh!

gatorsmile commented Jul 29, 2016

Uh oh!

gatorsmile commented Jul 29, 2016

Uh oh!

nsyca commented Jul 29, 2016

Uh oh!

gatorsmile commented Jul 29, 2016

Uh oh!

nsyca commented Jul 29, 2016

Uh oh!

nsyca commented Jul 29, 2016

Uh oh!

gatorsmile commented Jul 29, 2016

Uh oh!

hvanhovell commented Jul 31, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Jul 31, 2016

Uh oh!

nsyca commented Aug 1, 2016

Uh oh!

gatorsmile commented Aug 1, 2016

Uh oh!

hvanhovell commented Aug 1, 2016

Uh oh!

SparkQA commented Aug 1, 2016

Uh oh!

nsyca commented Aug 1, 2016

Uh oh!

hvanhovell commented Aug 2, 2016

Uh oh!

nsyca commented Aug 2, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nsyca commented Aug 5, 2016

Uh oh!

hvanhovell commented Aug 5, 2016

Uh oh!

nsyca commented Aug 5, 2016

Uh oh!

nsyca commented Aug 5, 2016

Uh oh!

hvanhovell Aug 5, 2016

Choose a reason for hiding this comment

Uh oh!

nsyca commented Aug 5, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hvanhovell commented Aug 5, 2016

Uh oh!

hvanhovell commented Aug 7, 2016

Uh oh!

SparkQA commented Aug 7, 2016

Uh oh!

SparkQA commented Aug 7, 2016

Uh oh!

gatorsmile commented Aug 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nsyca commented Aug 8, 2016

Uh oh!

gatorsmile commented Aug 8, 2016

Uh oh!

SparkQA commented Aug 8, 2016

Uh oh!

hvanhovell commented Aug 8, 2016

Uh oh!

nsyca commented Aug 8, 2016

Uh oh!

nsyca commented Jul 29, 2016 •

edited

Loading

hvanhovell commented Jul 31, 2016 •

edited

Loading

nsyca commented Aug 2, 2016 •

edited

Loading

nsyca commented Aug 5, 2016 •

edited

Loading

gatorsmile commented Aug 7, 2016 •

edited

Loading