[SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark #12056
Test build #54496 has finished for PR 12056 at commit
This generalizes the idea of "Gini" importance to other losses,
following the explanation of Gini importance from "Random Forests" documentation
by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Each feature's importance is the average of its importance across all trees in the ensemble
Why is the doc changing? Is it incorrect?
There was some discussion on this here. Basically, since the random forest and GBT importance is just an average of the importances for single trees, we can just state that here and link to the doc for single trees, which explains how those are computed. Otherwise, we copy/paste the same explanation 6 times.
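The averaging described in this comment can be illustrated with a small plain-Python sketch. It is not Spark's implementation: the function names, the per-tree normalization step, and the sample gains are all assumptions made for illustration, and each tree is represented simply as a mapping from feature index to raw importance.

```python
from typing import Dict, List


def normalize(importances: Dict[int, float]) -> Dict[int, float]:
    """Scale importances so they sum to 1 (returned unchanged if all are zero)."""
    total = sum(importances.values())
    if total == 0.0:
        return dict(importances)
    return {feature: value / total for feature, value in importances.items()}


def ensemble_importances(per_tree: List[Dict[int, float]], num_features: int) -> List[float]:
    """Average per-tree importances, then normalize the average to sum to 1.

    Normalizing each tree before averaging is an assumption of this sketch.
    """
    totals = [0.0] * num_features
    if not per_tree:
        return totals
    for tree in per_tree:
        for feature, value in normalize(tree).items():
            totals[feature] += value
    averaged = {i: t / len(per_tree) for i, t in enumerate(totals)}
    normalized = normalize(averaged)
    return [normalized[i] for i in range(num_features)]


# Hypothetical raw importances (feature index -> gain) for two trees.
trees = [{0: 3.0, 1: 1.0}, {0: 1.0, 2: 1.0}]
result = ensemble_importances(trees, num_features=3)
print(result)
```

Because the ensemble value is only an average of single-tree values, the explanation of how a single tree's importances are computed needs to live in one place, which is the point being made about the docs.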
Fair enough - though the Scala doc should be updated likewise?
On Wed, 30 Mar 2016 at 17:21, Seth Hendrickson notifications@github.com
wrote:
In python/pyspark/ml/classification.py
#12056 (comment):
@@ -500,16 +500,12 @@ def featureImportances(self):
"""
Estimate of the importance of each feature.
This generalizes the idea of "Gini" importance to other losses,
following the explanation of Gini importance from "Random Forests" documentation
by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Each feature's importance is the average of its importance across all trees in the ensemble

There was some discussion on this here #11961. Basically, since the random forest and GBT importance is just an average of the importances for single trees, we can just state that here and link to the doc for single trees, which explains how those are computed. Otherwise, we copy/paste the same explanation 6 times.
It was updated in the PR I linked to above, so everything should be in sync.
Ah sorry I missed that.
On Wed, 30 Mar 2016 at 17:32, Seth Hendrickson notifications@github.com
wrote:
In python/pyspark/ml/classification.py
#12056 (comment):
@@ -500,16 +500,12 @@ def featureImportances(self):
"""
Estimate of the importance of each feature.
This generalizes the idea of "Gini" importance to other losses,
following the explanation of Gini importance from "Random Forests" documentation
by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Each feature's importance is the average of its importance across all trees in the ensemble

It was updated in the PR I linked to above, so everything should be in sync.
Test build #54548 has finished for PR 12056 at commit
LGTM
What changes were proposed in this pull request?
Feature importances are exposed in the Python API for GBTs.
Other changes:
How was this patch tested?
Python doc tests were updated to validate GBT feature importance.