Skip to content

Conversation

@sethah
Copy link
Contributor

@sethah sethah commented Mar 25, 2016

What changes were proposed in this pull request?

Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a featureImportances attribute to GBTClassifier and GBTRegressor and adds tests for each.

GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in this paper.

How was this patch tested?

Unit tests were added to GBTClassifierSuite and GBTRegressorSuite to validate feature importances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc explaining how the feature importances were computed was copied in the code ~6 times previously. I changed it to reduce the redundancy. Now that DTs have a public scala doc explaining how single tree importance is computed, I thought it would be good to leave a reference to that doc here instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, but this should still note that the importances are averaged over trees and that this matches scikit-learn. I also like saying explicitly that the importances sum to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sethah
Copy link
Contributor Author

sethah commented Mar 25, 2016

cc @jkbradley whenever you get a chance, could you take a look? Thanks!

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54186 has finished for PR 11961 at commit 080a506.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too, I'd like to note that feature importances are calculated in the same way as in scikit-learn and the Friedman paper, and that they sum to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@SparkQA
Copy link

SparkQA commented Mar 28, 2016

Test build #54334 has finished for PR 11961 at commit 879c97f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For RFs, I'd just say it follows the sklearn implementation (since the Friedman paper is for boosting).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed both GBT and RF to cite "Elements of Statistical Learning" which proposes the same averaging method for both boosting and bagging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, SGTM

@jkbradley
Copy link
Member

(one response to an outdated diff above: "I did mean to move the implementation of feature importances to TreeEnsembleModel, just for organization.")

Thanks for the updates!

@SparkQA
Copy link

SparkQA commented Mar 28, 2016

Test build #54366 has finished for PR 11961 at commit 2d726eb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 28, 2016

Test build #54368 has finished for PR 11961 at commit 083b037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants