[SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel` #32245

harupy · 2021-04-20T02:33:51Z

What changes were proposed in this pull request?

Fixes the incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

harupy · 2021-04-20T03:12:45Z

python/pyspark/ml/classification.py

                    predArray.append(x)
                return Vectors.dense(predArray)

-            rawPredictionUDF = udf(func)


Should I add a test here to ensure that the `rawPrediction` column is no longer a string?

spark/python/pyspark/ml/tests/test_algorithms.py

Lines 108 to 117 in 0494dc9

    def test_output_columns(self):
        df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8)),
                                         (1.0, Vectors.sparse(2, [], [])),
                                         (2.0, Vectors.dense(0.5, 0.5))],
                                        ["label", "features"])
        lr = LogisticRegression(maxIter=5, regParam=0.01)
        ovr = OneVsRest(classifier=lr, parallelism=1)
        model = ovr.fit(df)
        output = model.transform(df)
        self.assertEqual(output.columns, ["label", "features", "rawPrediction", "prediction"])

Yeah, I think we'd better add a test if possible.

Got it, added a test

@HyukjinKwon
Why does only `transformed_df.head()` trigger this error?
Does it indicate a bug in the PySpark SQL UDF?

Seems like `pred.show()` triggers an exception too? What does it return with other methods?

HyukjinKwon · 2021-04-20T03:14:39Z

ok to test

HyukjinKwon · 2021-04-20T03:14:44Z

add to whitelist

HyukjinKwon · 2021-04-20T03:14:51Z

cc @WeichenXu123 FYI

SparkQA · 2021-04-20T03:44:52Z

Test build #137665 has finished for PR 32245 at commit 3f75ab2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T04:00:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42193/

SparkQA · 2021-04-20T04:00:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42193/

SparkQA · 2021-04-20T04:02:06Z

Test build #137666 has finished for PR 32245 at commit 5e05b50.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T05:01:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42194/

SparkQA · 2021-04-20T05:01:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42194/

WeichenXu123 · 2021-04-20T05:38:49Z

CC @zhengruifeng

SparkQA · 2021-04-20T06:24:58Z

Test build #137668 has finished for PR 32245 at commit 3c2ac95.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T06:56:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42196/

SparkQA · 2021-04-20T06:56:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42196/

WeichenXu123

LGTM

python/pyspark/ml/tests/test_algorithms.py

SparkQA · 2021-04-21T02:20:31Z

Test build #137708 has finished for PR 32245 at commit b6fabb3.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-21T02:50:33Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42236/

SparkQA · 2021-04-21T04:25:28Z

Test build #137713 has finished for PR 32245 at commit ed26d2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-21T04:49:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42241/

SparkQA · 2021-04-21T04:54:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42241/

WeichenXu123 · 2021-04-21T07:33:30Z

LGTM

HyukjinKwon · 2021-04-21T07:34:42Z

Looks good. @harupy, would you mind filling in the PR description per the template?

HyukjinKwon · 2021-04-21T07:43:09Z

@viirya, are you preparing the Spark 2.4 RC now? This is supposed to be in Spark 2.4 too, but it isn't a regression, so it doesn't block. It's just a nice-to-have, so if you're already preparing, it should be fine not to backport.

viirya · 2021-04-21T07:48:27Z

> @viirya, are you preparing the Spark 2.4 RC now? This is supposed to be in Spark 2.4 too, but it isn't a regression, so it doesn't block. It's just a nice-to-have, so if you're already preparing, it should be fine not to backport.

#32256 was just merged, so I have not started a new RC yet. I can wait for this.

HyukjinKwon · 2021-04-21T07:55:43Z

BTW, the tests passed at https://github.com/harupy/spark/actions/runs/769366516. GitHub Actions didn't link that run properly for some reason.

I will leave it to @WeichenXu123 then.

…nUDF` in `OneVsRestModel`

### What changes were proposed in this pull request?

Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32245 from harupy/SPARK-35142.

Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
(cherry picked from commit b6350f5)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2021-04-21T08:31:40Z

@harupy

Backporting to branch-3.1 causes conflicts.
Could you create a PR against apache/spark branch-3.1?

++<<<<<<< HEAD
 +    def test_parallelism_doesnt_change_output(self):
++=======
+     def test_raw_prediction_column_is_of_vector_type(self):
+         # SPARK-35142: `OneVsRestModel` outputs raw prediction as a string column
+         df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8)),
+                                          (1.0, Vectors.sparse(2, [], [])),
+                                          (2.0, Vectors.dense(0.5, 0.5))],
+                                         ["label", "features"])
+         lr = LogisticRegression(maxIter=5, regParam=0.01)
+         ovr = OneVsRest(classifier=lr, parallelism=1)
+         model = ovr.fit(df)
+         row = model.transform(df).head()
+         self.assertIsInstance(row["rawPrediction"], DenseVector)
+ 
+     def test_parallelism_does_not_change_output(self):
++>>>>>>> b6350f5bb0... [SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel`

harupy · 2021-04-21T08:40:44Z

@WeichenXu123 Opened a PR: #32269

viirya · 2021-04-21T08:59:28Z

I don't see a backport to 2.4. Do you plan to backport it? @WeichenXu123 @harupy

harupy · 2021-04-21T09:04:23Z

@viirya Got it. I'll open another PR for 2.4.

Wait, does OneVsRestModel in 2.4 output the raw prediction column? Looks like it doesn't.

spark/python/pyspark/ml/classification.py

Lines 1964 to 2009 in 1630d64

    
    def _transform(self, dataset):
        # determine the input columns: these need to be passed through
        origCols = dataset.columns

        # add an accumulator column to store predictions of all the models
        accColName = "mbc$acc" + str(uuid.uuid4())
        initUDF = udf(lambda _: [], ArrayType(DoubleType()))
        newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]]))

        # persist if underlying dataset is not persistent.
        handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False)
        if handlePersistence:
            newDataset.persist(StorageLevel.MEMORY_AND_DISK)

        # update the accumulator column with the result of prediction of models
        aggregatedDataset = newDataset
        for index, model in enumerate(self.models):
            rawPredictionCol = model._call_java("getRawPredictionCol")
            columns = origCols + [rawPredictionCol, accColName]

            # add temporary column to store intermediate scores and update
            tmpColName = "mbc$tmp" + str(uuid.uuid4())
            updateUDF = udf(
                lambda predictions, prediction: predictions + [prediction.tolist()[1]],
                ArrayType(DoubleType()))
            transformedDataset = model.transform(aggregatedDataset).select(*columns)
            updatedDataset = transformedDataset.withColumn(
                tmpColName,
                updateUDF(transformedDataset[accColName], transformedDataset[rawPredictionCol]))
            newColumns = origCols + [tmpColName]

            # switch out the intermediate column with the accumulator column
            aggregatedDataset = updatedDataset\
                .select(*newColumns).withColumnRenamed(tmpColName, accColName)

        if handlePersistence:
            newDataset.unpersist()

        # output the index of the classifier with highest confidence as prediction
        labelUDF = udf(
            lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
            DoubleType())

        # output label and label metadata as prediction
        return aggregatedDataset.withColumn(
            self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)
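
To make the point about 2.4 concrete: the snippet above only accumulates one score per binary model, takes the index of the largest one as the prediction, and then drops the accumulator column, so no raw prediction column is emitted. A plain-Python sketch of that accumulate-then-argmax logic follows. The helper names `accumulate_scores` and `pick_label` and the sample numbers are made up for illustration, and each raw prediction is assumed to be a two-element list with the positive-class score at index 1.

```python
import operator


def accumulate_scores(per_model_raw_predictions):
    """Collect the index-1 score from each binary model, in model order."""
    scores = []
    for raw_prediction in per_model_raw_predictions:
        # Mirrors `predictions + [prediction.tolist()[1]]` in updateUDF.
        scores = scores + [raw_prediction[1]]
    return scores


def pick_label(scores):
    """Return the index of the highest score as a float, like labelUDF."""
    return float(max(enumerate(scores), key=operator.itemgetter(1))[0])


# Made-up raw predictions from three binary models for a single row.
raw = [[0.9, -0.9], [-1.2, 1.2], [0.3, -0.3]]
scores = accumulate_scores(raw)
label = pick_label(scores)
print(scores, label)
```

Here the second model has the highest score, so the predicted label is the float index of that model.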

HyukjinKwon · 2021-04-21T11:44:43Z

Okay, looks like we can skip Spark 2.4.

viirya · 2021-04-21T15:45:11Z

Thanks for confirming. @harupy @HyukjinKwon

…ictionUDF` in `OneVsRestModel`

### What changes were proposed in this pull request?

This PR backports #32245. Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32275 from harupy/backport-35142-3.0.

Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

specify return type for rawPredictionUDF

3f75ab2

github-actions bot added CORE ML PYTHON labels Apr 20, 2021

harupy force-pushed the SPARK-35142 branch from bb641f5 to 3f75ab2 Compare April 20, 2021 02:51

harupy marked this pull request as ready for review April 20, 2021 03:06

harupy added 2 commits April 20, 2021 12:24

Add test

383c84f

Fix incorrect variable name

5e05b50

import VectorUDT

3c2ac95

WeichenXu123 approved these changes Apr 21, 2021

harupy added 3 commits April 21, 2021 10:24

Create a separate test

98d241e

rename test

2f12765

add comment

b6fabb3

Fix test failure

ed26d2c

HyukjinKwon changed the title ~~[SPARK-35142][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel~~ [SPARK-35142][PYTHON][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel Apr 21, 2021

WeichenXu123 closed this in b6350f5 Apr 21, 2021

harupy mentioned this pull request Apr 21, 2021

[SPARK-35142][PYTHON][ML][3.1] Fix incorrect return type for rawPredictionUDF in OneVsRestModel #32269

Closed

harupy mentioned this pull request Apr 21, 2021

[SPARK-35142][PYTHON][ML][3.0] Fix incorrect return type for rawPredictionUDF in OneVsRestModel #32275

Closed
