[SPARK-11730][ML] Add feature importances for GBTs. #11961
Conversation
This doc explaining how the feature importances are computed was previously copied in the code about six times; I changed it to reduce the redundancy. Now that DTs have a public Scaladoc explaining how single-tree importance is computed, I thought it would be good to leave a reference to that doc here instead.
SGTM, but this should still note that the importances are averaged over trees and that this matches scikit-learn. I also like saying explicitly that the importances sum to 1.
Done.
cc @jkbradley whenever you get a chance, could you take a look? Thanks!

Test build #54186 has finished for PR 11961 at commit
Here too, I'd like to note that feature importances are calculated in the same way as in scikit-learn and the Friedman paper, and that they sum to 1.
Done.
Test build #54334 has finished for PR 11961 at commit
For RFs, I'd just say it follows the sklearn implementation (since the Friedman paper is for boosting).
I changed both GBT and RF to cite "Elements of Statistical Learning" which proposes the same averaging method for both boosting and bagging.
Nice, SGTM
(One response to an outdated diff above: "I did mean to move the implementation of feature importances to TreeEnsembleModel, just for organization.") Thanks for the updates!

Test build #54366 has finished for PR 11961 at commit

Test build #54368 has finished for PR 11961 at commit
LGTM
What changes were proposed in this pull request?
Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a featureImportances attribute to GBTClassifier and GBTRegressor, and adds tests for each. GBT feature importances here simply average the feature importances for each tree in the ensemble. This follows the implementation from scikit-learn. This method is also suggested by J. Friedman in this paper.
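As a rough, standalone sketch of the averaging described above (not the actual Spark implementation; the helper name and structure here are made up for illustration), per-tree importance vectors can be summed feature-wise and renormalized so the ensemble importances sum to 1:

```scala
// Hypothetical illustration of ensemble feature importances as the average
// of per-tree importances; this is not the Spark ml.tree code itself.
def ensembleImportances(perTreeImportances: Seq[Array[Double]]): Array[Double] = {
  val numFeatures = perTreeImportances.head.length
  // Sum the (already normalized) importance vector of each tree, feature by feature.
  val summed = perTreeImportances.foldLeft(Array.fill(numFeatures)(0.0)) { (acc, imp) =>
    acc.zip(imp).map { case (a, b) => a + b }
  }
  // Renormalize so the result sums to 1 (equivalent to averaging and then normalizing).
  val total = summed.sum
  if (total == 0.0) summed else summed.map(_ / total)
}
```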
How was this patch tested?
Unit tests were added to GBTClassifierSuite and GBTRegressorSuite to validate feature importances.
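For illustration, here is a minimal usage sketch of the new attribute; the column names, parameters, and `training` DataFrame are assumptions, not taken from the PR's test suites:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Assumes a DataFrame `training` with a "label" column and a "features" Vector column.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)

val model = gbt.fit(training)

// The new attribute: a Vector of per-feature importances that sum to 1.
println(model.featureImportances)
```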