@hvanhovell

What changes were proposed in this pull request?

The result of the Last function can be wrong when the last partition processed is empty. It can return null instead of the expected value. For example, this can happen when we process partitions in the following order:

- Partition 1 [Row1, Row2]
- Partition 2 [Row3]
- Partition 3 []

In this case the Last function currently returns null instead of the value of Row3.

This PR fixes this by adding a valueSet flag to the Last function.
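
The effect of such a flag can be sketched with a simplified model of the aggregation buffer. This is a minimal illustration, not Spark's actual implementation: `LastFlagSketch`, `Buffer`, `partial`, `mergeNaive`, and `mergeWithFlag` are hypothetical names, `Option[Int]` stands in for a nullable column value, and the "naive" merge is an assumption about the pre-fix behaviour inferred from the description above.

```scala
// Simplified model of the Last aggregation buffer; NOT Spark's actual code.
object LastFlagSketch {
  // `last` is the most recent value seen (None models SQL null);
  // `valueSet` records whether this buffer has processed at least one row.
  final case class Buffer(last: Option[Int], valueSet: Boolean)

  val empty: Buffer = Buffer(None, valueSet = false)

  // Processing a row overwrites the value and marks the buffer as set.
  def update(buf: Buffer, row: Option[Int]): Buffer =
    Buffer(row, valueSet = true)

  // Build the partial buffer for one partition.
  def partial(rows: Seq[Option[Int]]): Buffer =
    rows.foldLeft(empty)(update)

  // Assumed pre-fix behaviour: the right-hand buffer always wins,
  // so an empty right-hand partition wipes out the result.
  def mergeNaive(left: Buffer, right: Buffer): Buffer =
    Buffer(right.last, left.valueSet || right.valueSet)

  // With the flag: the right-hand buffer wins only if it saw a row.
  def mergeWithFlag(left: Buffer, right: Buffer): Buffer =
    if (right.valueSet) right else left

  def main(args: Array[String]): Unit = {
    val p1 = partial(Seq(Some(1), Some(2))) // Partition 1 [Row1, Row2]
    val p2 = partial(Seq(Some(3)))          // Partition 2 [Row3]
    val p3 = partial(Seq.empty)             // Partition 3 []
    val parts = Seq(p1, p2, p3)
    println(parts.foldLeft(empty)(mergeNaive).last)    // None
    println(parts.foldLeft(empty)(mergeWithFlag).last) // Some(3)
  }
}
```

Without the flag, a merge cannot tell "this partition ended with a null" apart from "this partition had no rows", which is why a separate boolean is needed rather than a null check on the value.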

How was this patch tested?

We only used end-to-end tests for DeclarativeAggregateFunctions. I have added an evaluator for these functions so we can test them in Catalyst. I have added a LastTestSuite to test the Last aggregate function.
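
The idea of driving an aggregate through its phases without a full query can be sketched as follows. This is a hypothetical harness for illustration only: the `Aggregate` trait, the `Evaluator` class, and `LastInt` are assumed names that do not mirror Spark's `DeclarativeAggregateEvaluator` API, and buffers are plain Scala tuples rather than Catalyst rows.

```scala
// Hypothetical test harness; names are illustrative, not Spark's API.
object EvaluatorSketch {
  // The phases of an aggregate over input In, buffer B, and result Out.
  trait Aggregate[In, B, Out] {
    def initial: B
    def update(buffer: B, input: In): B
    def merge(left: B, right: B): B
    def eval(buffer: B): Out
  }

  // Drives an aggregate through update, merge, and eval directly.
  final class Evaluator[In, B, Out](agg: Aggregate[In, B, Out]) {
    def initialize(): B = agg.initial
    def update(rows: In*): B = rows.foldLeft(agg.initial)(agg.update)
    def merge(buffers: B*): B = buffers.foldLeft(agg.initial)(agg.merge)
    def eval(buffer: B): Out = agg.eval(buffer)
  }

  // Buffer for Last: (last value, valueSet flag). None models SQL null.
  type Buf = (Option[Int], Boolean)

  object LastInt extends Aggregate[Option[Int], Buf, Option[Int]] {
    val initial: Buf = (None, false)
    def update(buffer: Buf, input: Option[Int]): Buf = (input, true)
    def merge(left: Buf, right: Buf): Buf = if (right._2) right else left
    def eval(buffer: Buf): Option[Int] = buffer._1
  }

  def main(args: Array[String]): Unit = {
    val evaluator = new Evaluator[Option[Int], Buf, Option[Int]](LastInt)
    val p0 = evaluator.initialize()             // empty partition
    val p1 = evaluator.update(Some(1), Some(2)) // two rows
    val p2 = evaluator.update(Some(3))          // one row

    // Update - Merge - Eval with the empty partition merged last.
    println(evaluator.eval(evaluator.merge(p1, p2, p0))) // Some(3)

    // A null input is still a value: it replaces the earlier 1.
    println(evaluator.eval(evaluator.update(Some(1), None))) // None
  }
}
```

Testing each phase in isolation makes it possible to construct the exact buffer ordering that triggers a bug, which an end-to-end query cannot guarantee.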

@SparkQA
SparkQA commented Oct 4, 2016

Test build #66329 has finished for PR 15348 at commit 893fff5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DeclarativeAggregateEvaluator(function: DeclarativeAggregate, input: Seq[Attribute])
    • class LastTestSuite extends SparkFunSuite

// Update - Merge - Eval (empty partition at the end)
val m2 = evaluator.merge(p2, p1, p0)
assert(evaluator.eval(m2) === InternalRow(-99))
}

Could we have a test that calls update with a null input and then checks the answer?

Done.

@yhuai
yhuai commented Oct 5, 2016

Thank you for fixing this! It is great to have unit tests to test individual aggregate functions. We can start to add more tests for other functions.

@SparkQA
SparkQA commented Oct 5, 2016

Test build #66351 has finished for PR 15348 at commit 8b442de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Oct 5, 2016

Test build #66349 has finished for PR 15348 at commit 5ae49ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
davies commented Oct 5, 2016

Spark does not preserve the order of partitions. How can we make sure the result is Row3, rather than Row2?

@hvanhovell

You cannot. Last is not deterministic outside of ordered window functions. What this PR fixes is that you can get inconsistent results if the last physical partition happens to be an empty one.

@yhuai
Copy link
Contributor

yhuai commented Oct 5, 2016

LGTM. Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Oct 5, 2016
## What changes were proposed in this pull request?
The result of the `Last` function can be wrong when the last partition processed is empty. It can return `null` instead of the expected value. For example, this can happen when we process partitions in the following order:
```
- Partition 1 [Row1, Row2]
- Partition 2 [Row3]
- Partition 3 []
```
In this case the `Last` function will currently return a null, instead of the value of `Row3`.

This PR fixes this by adding a `valueSet` flag to the `Last` function.

## How was this patch tested?
We only used end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in Catalyst. I have added a `LastTestSuite` to test the `Last` aggregate function.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15348 from hvanhovell/SPARK-17758.

(cherry picked from commit 5fd54b9)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in 5fd54b9 Oct 5, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Closes apache#15348 from hvanhovell/SPARK-17758.