[SPARK-17758][SQL] Last returns wrong result in case of empty partition #15348
Test build #66329 has finished for PR 15348 at commit
    // Update - Merge - Eval (empty partition at the end)
    val m2 = evaluator.merge(p2, p1, p0)
    assert(evaluator.eval(m2) === InternalRow(-99))
    }
Have a test to call update using a null input and then check the answer?
Done.
Thank you for fixing this! It is great to have unit tests to test individual aggregate functions. We can start to add more tests for other functions.
Test build #66351 has finished for PR 15348 at commit
Test build #66349 has finished for PR 15348 at commit
Spark does not preserve the order of partitions, so how can we make sure the result is Row3 rather than Row2?
You cannot. `Last` is not deterministic outside of ordered window functions. The only problem addressed here is that you can get inconsistent results if the last physical partition happens to be an empty one.
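To make the ordering point concrete, here is a small Spark-free sketch. The object and method names are hypothetical, rows are modeled as plain non-null strings, and this is not how Spark implements `Last`. It only shows that the same three partitions can yield a different "last" value depending on the order in which their partial results are merged, while an empty partition leaves the result unchanged wherever it appears.

```scala
// Hypothetical model: each partition is a Seq of non-null row labels.
object MergeOrderDemo {
  // Partial result of one partition: its last row, or None if it is empty.
  def lastOf(partition: Seq[String]): Option[String] = partition.lastOption

  // Merge two partial results; the right side wins only if it holds a value.
  def merge(left: Option[String], right: Option[String]): Option[String] =
    right.orElse(left)

  // Merge the partial results in the order the partitions are given.
  def lastAcross(partitions: Seq[Seq[String]]): Option[String] =
    partitions.map(lastOf).foldLeft(Option.empty[String])(merge)
}

val orderA = Seq(Seq("Row1", "Row2"), Seq("Row3"), Seq.empty[String])
val orderB = Seq(Seq("Row3"), Seq.empty[String], Seq("Row1", "Row2"))

val resultA = MergeOrderDemo.lastAcross(orderA) // Some("Row3")
val resultB = MergeOrderDemo.lastAcross(orderB) // Some("Row2")
```

Both outcomes are legitimate under this model; what would not be legitimate is an empty result caused solely by an empty partition being merged last.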
LGTM. Merging to master and branch 2.0.
## What changes were proposed in this pull request?

The result of the `Last` function can be wrong when the last partition processed is empty. It can return `null` instead of the expected value. For example, this can happen when we process partitions in the following order:

```
- Partition 1 [Row1, Row2]
- Partition 2 [Row3]
- Partition 3 []
```

In this case the `Last` function will currently return `null` instead of the value of `Row3`.

This PR fixes this by adding a `valueSet` flag to the `Last` function.

## How was this patch tested?

We only used end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in catalyst. I have added a `LastTestSuite` to test the `Last` aggregate function.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15348 from hvanhovell/SPARK-17758.

(cherry picked from commit 5fd54b9)
Signed-off-by: Yin Huai <yhuai@databricks.com>
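The idea behind the `valueSet` flag can be sketched in plain Scala, without any Spark dependency. Everything below (`LastBuffer`, `LastModel`, and the `update`/`merge`/`eval` signatures) is a hypothetical, simplified model for illustration, not the code from the patch. The buffer carries the last value together with a flag recording whether any value was stored, and the merge step prefers the right-hand buffer only when its flag is set.

```scala
// Hypothetical aggregation buffer: `last` is None for SQL NULL,
// `valueSet` records whether this buffer has stored a value at all.
final case class LastBuffer(last: Option[Int], valueSet: Boolean)

object LastModel {
  // Buffer of a partition that has not processed any row.
  val empty: LastBuffer = LastBuffer(None, valueSet = false)

  // Consume one input row; with ignoreNulls, a NULL input leaves the buffer as is.
  def update(buf: LastBuffer, input: Option[Int], ignoreNulls: Boolean = false): LastBuffer =
    if (ignoreNulls && input.isEmpty) buf
    else LastBuffer(input, valueSet = true)

  // Merge two partial buffers; the right side wins only if it actually stored a value.
  def merge(left: LastBuffer, right: LastBuffer): LastBuffer =
    if (right.valueSet) right else left

  // Final result of the aggregate.
  def eval(buf: LastBuffer): Option[Int] = buf.last
}

// Partitions from the description: [Row1, Row2], [Row3], [] (rows modeled as 1, 2, 3).
val p1 = Seq(Option(1), Option(2)).foldLeft(LastModel.empty)((b, x) => LastModel.update(b, x))
val p2 = LastModel.update(LastModel.empty, Some(3))
val p3 = LastModel.empty

val result = LastModel.eval(Seq(p1, p2, p3).reduce(LastModel.merge)) // Some(3)
```

A separate flag is needed, rather than a null check on `last`, because a stored NULL is a legitimate result when nulls are not ignored: a partition whose final row is NULL must still win over earlier partitions, while a partition with no rows must not.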