Skip to content

Conversation

@sethah
Copy link
Contributor

@sethah sethah commented Mar 30, 2016

What changes were proposed in this pull request?

Feature importances are exposed in the python API for GBTs.

Other changes:

  • Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it.

How was this patch tested?

Python doc tests were updated to validate GBT feature importance.

@SparkQA
Copy link

SparkQA commented Mar 30, 2016

Test build #54496 has finished for PR 12056 at commit 892d301.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This generalizes the idea of "Gini" importance to other losses,
following the explanation of Gini importance from "Random Forests" documentation
by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Each feature's importance is the average of its importance across all trees in the ensemble
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the doc changing? Is it incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion on this here. Basically, since the random forest and GBT importance is just an average of the importances for single trees, we can just state that here and link to the doc for single trees, which explains how those are computed. Otherwise, we copy/paste the same explanation 6 times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - though the Scala doc should be updated likewise?
On Wed, 30 Mar 2016 at 17:21, Seth Hendrickson notifications@github.com
wrote:

In python/pyspark/ml/classification.py
#12056 (comment):

@@ -500,16 +500,12 @@ def featureImportances(self):
"""
Estimate of the importance of each feature.

  •    This generalizes the idea of "Gini" importance to other losses,
    
  •    following the explanation of Gini importance from "Random Forests" documentation
    
  •    by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
    
  •    Each feature's importance is the average of its importance across all trees in the ensemble
    

There was some discussion on this here
#11961. Basically, since the random
forest and GBT importance is just an average of the importances for single
trees, we can just state that here and link to the doc for single trees,
which explains how those are computed. Otherwise, we copy/paste the same
explanation 6 times.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12056/files/892d30123a192cd3796892c0f64a5cf2993e1f09#r57906908

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was update in the PR I linked to above, so everything should be in sync.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I missed that.
On Wed, 30 Mar 2016 at 17:32, Seth Hendrickson notifications@github.com
wrote:

In python/pyspark/ml/classification.py
#12056 (comment):

@@ -500,16 +500,12 @@ def featureImportances(self):
"""
Estimate of the importance of each feature.

  •    This generalizes the idea of "Gini" importance to other losses,
    
  •    following the explanation of Gini importance from "Random Forests" documentation
    
  •    by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
    
  •    Each feature's importance is the average of its importance across all trees in the ensemble
    

It was update in the PR I linked to above, so everything should be in sync.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12056/files/892d30123a192cd3796892c0f64a5cf2993e1f09#r57909234

@SparkQA
Copy link

SparkQA commented Mar 30, 2016

Test build #54548 has finished for PR 12056 at commit 0ccf1ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks!

@asfgit asfgit closed this in b11887c Mar 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants