[SPARK-3478] [PySpark] Profile the Python tasks #2351
Conversation
Review comment on python/pyspark/accumulators.py (now outdated):
Do you think it would be clearer to name this ProfilingStatsParam or PStatsParam?
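For context, the param in question merges per-task profile results into one pstats object on the driver. Below is a minimal sketch of such an accumulator param, assuming PySpark's AccumulatorParam interface and the standard pstats module; the class name and details are illustrative, not necessarily the patch's exact code:

```python
import pstats
from pyspark.accumulators import AccumulatorParam

class PStatsParam(AccumulatorParam):
    """Illustrative sketch: merges pstats.Stats objects from tasks."""

    def zero(self, value):
        # Start empty; the first non-None stats object becomes the base.
        return None

    def addInPlace(self, value1, value2):
        if value1 is None:
            return value2
        if value2 is not None:
            value1.add(value2)  # pstats.Stats.add merges another Stats
        return value1
```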
@JoshRosen I've addressed your comment and also added docs for the configs and tests. I realized that the profile results can also be shown interactively via rdd.show_profile(); I've updated the PR description accordingly.
Conflicts: docs/configuration.md
Conflicts: python/pyspark/worker.py
Conflicts: python/pyspark/worker.py
(I killed the test here so that I could re-run it with the newer commits.)
Test FAILed.

jenkins, retest this please
Test PASSed.

This looks good to me. Thanks!
I noticed that we don't have any automated tests for this:

```python
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.set("spark.python.profile", "true")
sc = SparkContext(appName="test", conf=conf)
count = sc.parallelize(range(10000)).count()
sc.show_profiles()
```

This results in an error. Can we add a test for this, too?
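A minimal sketch of the kind of regression test being requested, assuming the unittest-based harness used by python/pyspark/tests.py (the class and method names here are illustrative):

```python
import unittest
from pyspark import SparkConf, SparkContext

class ProfilerTests(unittest.TestCase):
    def setUp(self):
        conf = SparkConf().set("spark.python.profile", "true")
        self.sc = SparkContext(appName="ProfilerTests", conf=conf)

    def tearDown(self):
        self.sc.stop()

    def test_show_profiles(self):
        # Run a job with profiling enabled; show_profiles() should
        # print the merged stats rather than raise.
        self.sc.parallelize(range(100)).map(str).count()
        self.sc.show_profiles()

if __name__ == "__main__":
    unittest.main()
```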
@JoshRosen sorry for this mistake, fixed.

Test FAILed.
Is this still true? It looks like we now use a showed flag to detect whether they've been printed instead of clearing the profiles array.
I think it's true.
It looks like we clear _profile_stats when we perform manual dump_profiles() calls, but not when we call show_profiles(), so it seems like this is half-true (unless I've overlooked something).
If showed is true, it will not be displayed again, but it will still be dumped.
Ah, right. If it's been manually dumped, then it won't be dumped again when exiting. If it's been manually dumped or displayed, then it won't be displayed when exiting.
This makes sense; sorry for the confusion.
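To summarize the semantics settled in this thread, here is an illustrative sketch (not the actual Spark source) of how the showed flag and the clearing behavior interact, assuming each entry in _profile_stats is a (rdd_id, stats_accumulator, showed) tuple:

```python
import os

def show_profiles(self):
    # Print each profile at most once; mark it as shown but keep it
    # around so it can still be dumped later.
    for i, (rdd_id, acc, showed) in enumerate(self._profile_stats):
        if not showed:
            print("Profile of RDD<id=%d>" % rdd_id)
            acc.value.sort_stats("time", "cumulative").print_stats()
            self._profile_stats[i] = (rdd_id, acc, True)

def dump_profiles(self, path):
    # Dump all collected profiles and clear them, so a manual dump is
    # not repeated by the automatic dump when the driver exits.
    for rdd_id, acc, _ in self._profile_stats:
        acc.value.dump_stats(os.path.join(path, "rdd_%d.pstats" % rdd_id))
    self._profile_stats = []
```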
Thanks for reviewing this; your comments made it much better.
Whoops, looks like this failed unit tests and caused a build-break. I'm going to revert it to un-break the build while we investigate.
This patch adds profiling support for PySpark. It shows the profiling results before the driver exits; here is one example:
```
============================================================
Profile of RDD<id=3>
============================================================
5146507 function calls (5146487 primitive calls) in 71.094 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
5144576 68.331 0.000 68.331 0.000 statcounter.py:44(merge)
20 2.735 0.137 71.071 3.554 statcounter.py:33(__init__)
20 0.017 0.001 0.017 0.001 {cPickle.dumps}
1024 0.003 0.000 0.003 0.000 t.py:16(<lambda>)
20 0.001 0.000 0.001 0.000 {reduce}
21 0.001 0.000 0.001 0.000 {cPickle.loads}
20 0.001 0.000 0.001 0.000 copy_reg.py:95(_slotnames)
41 0.001 0.000 0.001 0.000 serializers.py:461(read_int)
40 0.001 0.000 0.002 0.000 serializers.py:179(_batched)
62 0.000 0.000 0.000 0.000 {method 'read' of 'file' objects}
20 0.000 0.000 71.072 3.554 rdd.py:863(<lambda>)
20 0.000 0.000 0.001 0.000 serializers.py:198(load_stream)
40/20 0.000 0.000 71.072 3.554 rdd.py:2093(pipeline_func)
41 0.000 0.000 0.002 0.000 serializers.py:130(load_stream)
40 0.000 0.000 71.072 1.777 rdd.py:304(func)
20 0.000 0.000 71.094 3.555 worker.py:82(process)
```
Users can also show the profile results manually with `sc.show_profiles()` or dump them to disk with `sc.dump_profiles(path)`, for example:
```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by setting "spark.python.profile=true".
Users can also have the results dumped to disk automatically for later analysis by setting "spark.python.profile.dump=path_to_dump".
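The dumped files are standard pstats dumps, so they can be inspected offline with the stock pstats module. A hedged example, assuming the dump directory was set to /tmp/profile; the per-RDD filename pattern shown is illustrative:

```python
import pstats

# Load a previously dumped profile and show the 10 most expensive calls.
stats = pstats.Stats("/tmp/profile/rdd_1.pstats")  # illustrative filename
stats.sort_stats("cumulative").print_stats(10)
```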
This is a bugfix of #2351. cc JoshRosen
Author: Davies Liu <davies.liu@gmail.com>
Closes #2556 from davies/profiler and squashes the following commits:
e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
858e74c [Davies Liu] compatitable with python 2.6
7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python