[SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting #15288

BryanCutler · 2016-09-28T23:06:22Z

What changes were proposed in this pull request?

When LogisticRegression.evaluate or GeneralizedLinearRegression.evaluate is called on a Dataset whose label column is not of DoubleType, the summary calculations pattern match the label against a Double and throw a MatchError. This fix casts the label column to DoubleType so that the match always succeeds.
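To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from this patch: `FakeRow`, `unsafeLabel`, and `castLabel` are hypothetical names, and a `Seq[Any]` merely stands in for a Spark `Row`. It only illustrates why a pattern that expects a `Double` rejects an integer label, and why converting the label first avoids the error.

```scala
object LabelMatchSketch {
  // Hypothetical stand-in for a Spark Row: (prediction, label) as untyped values.
  type FakeRow = Seq[Any]

  // Shape of the bug: the pattern accepts only a Double label, so any other
  // numeric type falls through and Scala throws scala.MatchError.
  def unsafeLabel(row: FakeRow): Double = row match {
    case Seq(_, label: Double) => label
  }

  // Shape of the fix: convert a numeric label to Double before matching.
  def castLabel(row: FakeRow): FakeRow = row match {
    case Seq(prediction, label: java.lang.Number) => Seq(prediction, label.doubleValue())
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // The label here is an Int, not a Double.
    val intLabelRow: FakeRow = Seq[Any](0.9, 1)
    // Matching directly fails; Try captures the MatchError.
    println(scala.util.Try(unsafeLabel(intLabelRow)))
    // Matching after the conversion succeeds.
    println(unsafeLabel(castLabel(intLabelRow)))
  }
}
```

In Spark itself the conversion is done on the column rather than on each row, which is what casting the label column to DoubleType achieves.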

How was this patch tested?

Added unit tests that call evaluate on a dataset whose label column has a numeric type other than Double.

BryanCutler · 2016-09-28T23:10:06Z

@jkbradley I checked the other algorithms for similar issues. ml.LinearRegression and ml.evaluation.* already cast labels to DoubleType, but GeneralizedLinearRegression also threw errors in some of its calculations, so I fixed that here as well.
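For readers unfamiliar with the casting pattern mentioned above, the following is a rough sketch of it against the public Spark SQL API. It is an illustration under assumptions, not the code in this PR: the object name, the column names `prediction` and `label`, and the sample data are all made up, and it needs Spark on the classpath to run.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object LabelCastSketch {
  // Returns (prediction, label) pairs. The label column is cast to DoubleType
  // before the Row pattern match, so integer labels still match `label: Double`.
  def predictionAndLabels(spark: SparkSession): Array[(Double, Double)] = {
    // Hypothetical data: Scala infers the label column as IntegerType.
    val df = spark.createDataFrame(Seq((0.9, 1), (0.2, 0))).toDF("prediction", "label")
    df.select(col("prediction"), col("label").cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("label-cast-sketch").getOrCreate()
    try predictionAndLabels(spark).foreach(println)
    finally spark.stop()
  }
}
```

Dropping the `.cast(DoubleType)` call from the `select` would leave the label as an integer, and the `Row` pattern would then fail in the way this PR describes.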

SparkQA · 2016-09-29T00:19:17Z

Test build #66072 has finished for PR 15288 at commit 5c50861.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-09-29T15:17:47Z

I'll take a look, thanks!

jkbradley

Thanks! Just a few tiny comments

jkbradley · 2016-09-29T17:36:18Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

+  test("evaluate with labels that are not doubles") {
+    // Evaluate a test set with Label that is a numeric type other than Double
+    val lr = new LogisticRegression()
+        .setMaxIter(10)


How about just 1 iteration to be a little faster? Also no need to set threshold.

indent 2 spaces, not 4

jkbradley · 2016-09-29T17:36:21Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+
+  test("evaluate with labels that are not doubles") {
+    // Evaluate with a dataset that contains Labels not as doubles to verify correct casting
+    val datasetWithWeight = Seq(


There's no need to have weights in this test, so it could be simplified a bit.

BryanCutler · 2016-09-29T18:09:04Z

Thanks for the review @jkbradley! I simplified the tests as you suggested

jkbradley · 2016-09-29T18:43:41Z

LGTM pending tests
Thanks @BryanCutler !

SparkQA · 2016-09-29T19:03:40Z

Test build #66113 has finished for PR 15288 at commit 640f84b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-09-29T23:31:05Z

Merging with master and branch-2.0

…ranch-2.0

[SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting

In calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate using a Dataset where the Label is not of a double type, calculations pattern match against a double and throw a MatchError. This fix casts the Label column to a DoubleType to ensure there is no MatchError.

Added unit tests to call evaluate with a dataset that has Label as other numeric types.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.

(cherry picked from commit 2f73956)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

jkbradley · 2016-09-29T23:57:28Z

OK done with cherry-pick to 2.0.

Also, I just noticed there are some other select() calls for labelCol in GeneralizedLinearRegression.scala without casts. Would you mind sending a follow-up PR for those? Thank you!

BryanCutler · 2016-09-30T17:48:47Z

Thanks @jkbradley. Are you referring to other select() calls like the one used in the devianceResiduals calculation here? Those labels seem to be cast internally and don't cause an error, which is why I left them out of this PR. Do you think there should still be a cast as an extra precaution?

jkbradley · 2016-10-01T18:02:11Z

Oh, you're right! I didn't realize that the UDF would handle casting automatically. I think it's fine then. I'll mark the JIRA as resolved. Thanks!
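The behaviour discussed here can be sketched roughly as follows. This is an illustration only, not the devianceResiduals code: the object name, the UDF, the column names, and the data are hypothetical, and it assumes Spark is on the classpath. The UDF declares `Double` parameters while the label column is an integer column, and Spark is expected to insert the cast when it analyzes the query.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfCastSketch {
  // Computes a squared error per row with a UDF that takes Double arguments,
  // applied to an integer label column without an explicit cast.
  def squaredErrors(spark: SparkSession): Array[Double] = {
    // Hypothetical data: Scala infers the label column as IntegerType.
    val df = spark.createDataFrame(Seq((0.9, 1), (0.2, 0))).toDF("prediction", "label")
    val squaredError = udf { (label: Double, prediction: Double) =>
      (label - prediction) * (label - prediction)
    }
    df.select(squaredError(col("label"), col("prediction")))
      .collect()
      .map(_.getDouble(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("udf-cast-sketch").getOrCreate()
    try squaredErrors(spark).foreach(println)
    finally spark.stop()
  }
}
```

The difference from the failing case is that a typed UDF tells Spark which input types it expects, whereas a pattern match on a raw `Row` sees whatever type the column happens to hold.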

BryanCutler added 2 commits September 28, 2016 15:15

added fix in LogisticRegression to evaluate with numeric types other than double

91f5be8

fixed same issue in GeneralizedLinearRegression

5c50861

jkbradley reviewed Sep 29, 2016

simplified unit tests

640f84b

asfgit closed this in 2f73956 Sep 29, 2016

BryanCutler deleted the binaryLOR-numericCheck-SPARK-17697 branch December 2, 2016 01:01
