
Conversation

@e-dorigatti
Contributor

SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts that change and fixes the `StopIteration` bug in the worker.

The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might silently stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:

  • In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
  • In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not-so-nice hack.
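As a rough illustration (a minimal sketch, not the exact Spark source), the `fail_on_stopiteration` wrapper described above can look like this; the error message is illustrative:

```python
import functools

def fail_on_stopiteration(f):
    """Wrap f so that a StopIteration escaping from user code surfaces
    as a RuntimeError instead of silently ending an enclosing for-loop."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            # Re-raise as RuntimeError so the task fails loudly and the
            # error is reported back to the user.
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task"
            ) from exc
    return wrapper
```

Because the wrapper preserves the call through `functools.wraps`, well-behaved functions pass through unchanged; only a leaked `StopIteration` is converted.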

@HyukjinKwon

Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by Spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s.

Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the Java exception's `toString`. Can you propose a better way?
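One workable (if blunt) pattern for the question above is plain substring matching on the error text. A minimal sketch, using a hypothetical stand-in class since the real `Py4JJavaError` requires a live JVM gateway:

```python
class FakePy4JJavaError(Exception):
    """Hypothetical stand-in for py4j's Py4JJavaError; in the real class,
    str(exc) embeds the Java exception's toString(), including any
    Python traceback forwarded from the worker."""


def error_text_contains(fn, message):
    """Run fn(); return True if it raises and `message` occurs in the
    exception's text, False otherwise."""
    try:
        fn()
    except Exception as exc:
        return message in str(exc)
    return False
```

In a real test, `fn` would trigger the Spark action and `message` would be the wrapper's `RuntimeError` message.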

This is my original work, licensed in the same way as Spark.

Author: e-dorigatti <emilio.dorigatti@gmail.com>

Closes apache#21383 from e-dorigatti/fix_spark_23754.

(cherry picked from commit 0ebb0c0)
… driver to executor

SPARK-23754 was fixed in apache#21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts that change and fixes the `StopIteration` bug in the worker.

The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might silently stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
 - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
 - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not-so-nice hack.

Same tests, plus tests for pandas UDFs

Author: edorigatti <emilio.dorigatti@gmail.com>

Closes apache#21467 from e-dorigatti/fix_udf_hack.
@viirya
Member

viirya commented Jun 12, 2018

@e-dorigatti Can you add [BACKPORT-2.3] in the PR title? Thanks.

@e-dorigatti e-dorigatti changed the title [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor [SPARK-23754][PYTHON][FOLLOWUP][BACKPORT-2.3] Move UDF stop iteration wrapping from driver to executor Jun 12, 2018
@HyukjinKwon
Member

add to whitelist

@SparkQA

SparkQA commented Jun 12, 2018

Test build #91701 has finished for PR 21538 at commit 217e730.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@e-dorigatti
Contributor Author

Seems like it skipped the pandas tests for both Python 2.7 and PyPy:

Will skip Pandas related features against Python executable  ...

@HyukjinKwon
Member

Yea, it's unfortunate... we should fix and set up the Jenkins env too.

@SparkQA

SparkQA commented Jun 12, 2018

Test build #91704 has finished for PR 21538 at commit 217e730.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


# make sure StopIteration's raised in the user code are not ignored
# when they are processed in a for loop, raise them as RuntimeError's instead
row_func = fail_on_stopiteration(row_func)
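For context, the hazard this wrapping guards against is easy to reproduce in plain Python: a `StopIteration` leaking out of a mapped function is indistinguishable from normal exhaustion to whatever loop is consuming the iterator (`flaky` below is a hypothetical buggy user function):

```python
def flaky(x):
    # Buggy user function: leaks StopIteration (e.g. from a bare next()
    # on an exhausted iterator inside the function body).
    if x == 2:
        raise StopIteration
    return x * 10

# map() forwards the StopIteration raised by flaky, and list() reads it
# as "iterator exhausted" -- the remaining rows are silently dropped.
result = list(map(flaky, [1, 2, 3]))
print(result)  # [10]: no error, but rows 2 and 3 vanished
```

Wrapping `flaky` so that the `StopIteration` is re-raised as a `RuntimeError` makes the same bug fail the task loudly instead.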
Member

@HyukjinKwon HyukjinKwon Jun 12, 2018

@e-dorigatti, I think it's fine to name it `func` as fixed in master. Let's reduce the diff so that other backports cause fewer conflicts in the future.

@SparkQA

SparkQA commented Jun 12, 2018

Test build #91716 has finished for PR 21538 at commit 612781a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

LGTM

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM too

@HyukjinKwon
Member

Merged to branch-2.3.

asfgit pushed a commit that referenced this pull request Jun 13, 2018
… wrapping from driver to executor

SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker.

The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might silently stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
 - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
 - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not-so-nice hack.

HyukjinKwon

Author: edorigatti <emilio.dorigatti@gmail.com>
Author: e-dorigatti <emilio.dorigatti@gmail.com>

Closes #21538 from e-dorigatti/branch-2.3.
@HyukjinKwon
Member

@e-dorigatti, this got merged into branch-2.3. Likewise, this also should be manually closed. Thanks for working on this.

@e-dorigatti
Contributor Author

@HyukjinKwon thank you so much for your patience :)
