Skip to content

Conversation

@BryanCutler
Copy link
Member

What changes were proposed in this pull request?

In calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate using a Dataset where the Label is not of a double type, calculations pattern match against a double and throw a MatchError. This fix casts the Label column to a DoubleType to ensure there is no MatchError.

How was this patch tested?

Added unit tests to call evaluate with a dataset that has Label as other numeric types.

@BryanCutler
Copy link
Member Author

@jkbradley I checked the other algorithms for similar issues. ml.LinearRegression and ml.evaluation.* already cast Labels to DoubleType, but GeneralizedLinearRegression did have errors also when making some of the calculations, so I fixed that here as well.

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66072 has finished for PR 15288 at commit 5c50861.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

I'll take a look, thanks!

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few tiny comments

test("evaluate with labels that are not doubles") {
// Evaluate a test set with Label that is a numeric type other than Double
val lr = new LogisticRegression()
.setMaxIter(10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just 1 iteration to be a little faster? Also no need to set threshold.

indent 2 spaces, not 4


test("evaluate with labels that are not doubles") {
// Evaulate with a dataset that contains Labels not as doubles to verify correct casting
val datasetWithWeight = Seq(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to have weights in this test, so it could be simplified a bit.

@BryanCutler
Copy link
Member Author

Thanks for the review @jkbradley! I simplified the tests as you suggested

@jkbradley
Copy link
Member

LGTM pending tests
Thanks @BryanCutler !

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66113 has finished for PR 15288 at commit 640f84b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merging with master and branch-2.0

@asfgit asfgit closed this in 2f73956 Sep 29, 2016
asfgit pushed a commit that referenced this pull request Sep 29, 2016
…ranch-2.0

[SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting

In calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate using a Dataset where the Label is not of a double type, calculations pattern match against a double and throw a MatchError.  This fix casts the Label column to a DoubleType to ensure there is no MatchError.

Added unit tests to call evaluate with a dataset that has Label as other numeric types.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.

(cherry picked from commit 2f73956)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@jkbradley
Copy link
Member

OK done with cherry-pick to 2.0.

Also, I just noticed there are some other select() calls for labelCol in GeneralizedLinearRegression.scala without casts. Would you mind sending a follow-up PR for those? Thank you!

@BryanCutler
Copy link
Member Author

Thanks @jkbradley. Are you referring to other select() calls like that used in devianceResiduals calc here? Those seem to be cast internally and don't cause an error, which is why I left them out here. Do you think there should still be a cast for extra precaution?

@jkbradley
Copy link
Member

Oh, you're right! I didn't realize that the UDF would handle casting automatically. I think it's fine then. I'll mark the JIRA as resolved. Thanks!

@BryanCutler BryanCutler deleted the binaryLOR-numericCheck-SPARK-17697 branch December 2, 2016 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants