[SPARK-11730][ML] Add feature importances for GBTs. #11961
Conversation
This doc explaining how the feature importances are computed was previously copied in the code about six times; I changed it to reduce the redundancy. Now that DTs have a public Scaladoc explaining how single-tree importance is computed, I thought it would be good to leave a reference to that doc here instead.
SGTM, but this should still note that the importances are averaged over trees and that this matches scikit-learn. I also like saying explicitly that the importances sum to 1.
Done.
cc @jkbradley whenever you get a chance, could you take a look? Thanks!

Test build #54186 has finished for PR 11961 at commit
Here too, I'd like to note that feature importances are calculated in the same way as in scikit-learn and the Friedman paper, and that they sum to 1.
Done.
Test build #54334 has finished for PR 11961 at commit
For RFs, I'd just say it follows the sklearn implementation (since the Friedman paper is for boosting).
I changed both GBT and RF to cite "Elements of Statistical Learning" which proposes the same averaging method for both boosting and bagging.
Nice, SGTM
(One response to an outdated diff above: "I did mean to move the implementation of feature importances to TreeEnsembleModel, just for organization.") Thanks for the updates!

Test build #54366 has finished for PR 11961 at commit

Test build #54368 has finished for PR 11961 at commit
LGTM
What changes were proposed in this pull request?
Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a featureImportances attribute to GBTClassifier and GBTRegressor, and adds tests for each. GBT feature importances here simply average the feature importances for each tree in the ensemble. This follows the implementation from scikit-learn. This method is also suggested by J. Friedman in this paper.
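As a rough, standalone sketch of the averaging described above (not the actual Spark implementation; the helper name and structure here are made up for illustration), per-tree importance vectors can be summed feature-wise and renormalized so the ensemble importances sum to 1:

```scala
// Hypothetical illustration of ensemble feature importances as the average
// of per-tree importances; this is not the Spark ml.tree code itself.
def ensembleImportances(perTreeImportances: Seq[Array[Double]]): Array[Double] = {
  val numFeatures = perTreeImportances.head.length
  // Sum the (already normalized) importance vector of each tree, feature by feature.
  val summed = perTreeImportances.foldLeft(Array.fill(numFeatures)(0.0)) { (acc, imp) =>
    acc.zip(imp).map { case (a, b) => a + b }
  }
  // Renormalize so the result sums to 1 (equivalent to averaging and then normalizing).
  val total = summed.sum
  if (total == 0.0) summed else summed.map(_ / total)
}
```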
How was this patch tested?
Unit tests were added to GBTClassifierSuite and GBTRegressorSuite to validate feature importances.
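For illustration, here is a minimal usage sketch of the new attribute; the column names, parameters, and `training` DataFrame are assumptions, not taken from the PR's test suites:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Assumes a DataFrame `training` with a "label" column and a "features" Vector column.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)

val model = gbt.fit(training)

// The new attribute: a Vector of per-feature importances that sum to 1.
println(model.featureImportances)
```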