-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22883] ML test for StructuredStreaming: spark.ml.feature, I-M #20964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| (10.0, 10.0), | ||
| (Double.NaN, 8.0), | ||
| (null, 8.0) | ||
| ).toDF("value", "expected_mean_value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the "value" column use java.lang.Double type ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's nullable
| assert(vector1.equals(vector2), s"MaxAbsScaler ut error: $vector2 should be $vector1") | ||
| testTransformer[(Vector, Vector)](df, model, "expected", "scaled") { | ||
| case Row(vector1: Vector, vector2: Vector) => | ||
| assert(vector1.equals(vector2), s"MaxAbsScaler error: $vector2 should be $vector1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector1 === vector2
and the error message looks strange. ==> s"scaled value $vector2 do not equal expected value "$vector1" ?
| output.foreach { | ||
| case hashOutput: Vector => assert(hashOutput.size === 1) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have "expected" column" here to compare with ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's necessary for testing that this works with structured streaming. (I can't see how streaming would mess up the correctness of the algorithm.)
| model.transform(df).select("expected", "scaled").collect() | ||
| .foreach { case Row(vector1: Vector, vector2: Vector) => | ||
| testTransformer[(Vector, Vector)](df, model, "expected", "scaled") { | ||
| case Row(vector1: Vector, vector2: Vector) => | ||
| assert(vector1.equals(vector2), "Transformed vector is different with expected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector1 === vector2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, === is more standard (though both are fine)
| @@ -84,7 +84,7 @@ class NGramSuite extends MLTest with DefaultReadWriteTest { | |||
|
|
|||
| def testNGram(t: NGram, dataFrame: DataFrame): Unit = { | |||
| testTransformer[(Seq[String], Seq[String])](dataFrame, t, "nGrams", "wantedNGrams") { | |||
| case Row(actualNGrams : Seq[String], wantedNGrams: Seq[String]) => | |||
| case Row(actualNGrams : Seq[_], wantedNGrams: Seq[_]) => | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why change Seq[String] to Seq[_], is it for fixing some potential issue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String is not actually checked because of erasure, so IntelliJ complained with a style warning before this change.
|
Test build #88851 has finished for PR 20964 at commit
|
|
LGTM. 👍 |
|
Thanks! I'll rerun tests since they are stale and merge after they pass. |
|
Test build #4151 has finished for PR 20964 at commit
|
6dff94e to
592670c
Compare
|
I rebased off of master because of the merge warning in the last tests. |
|
Test build #89174 has finished for PR 20964 at commit
|
|
Merging with master |
## What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: * IDF * Imputer * Interaction * MaxAbsScaler * MinHashLSH * MinMaxScaler * NGram ## How was this patch tested? It is a bunch of tests! Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#20964 from jkbradley/SPARK-22883-part2.
This backports #20964 to branch-2.3. ## What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: * IDF * Imputer * Interaction * MaxAbsScaler * MinHashLSH * MinMaxScaler * NGram ## How was this patch tested? It is a bunch of tests! Author: Joseph K. Bradley <josephdatabricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #21042 from jkbradley/SPARK-22883-part2-2.3backport.
What changes were proposed in this pull request?
Adds structured streaming tests using testTransformer for these suites:
How was this patch tested?
It is a bunch of tests!