[SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark #12056

sethah · 2016-03-30T05:00:28Z

What changes were proposed in this pull request?

Feature importances are exposed in the Python API for GBTs.

Other changes:

Update the random forest feature importance documentation to reference the decision tree docstring instead of repeating it.

How was this patch tested?

Python doc tests were updated to validate GBT feature importance.

SparkQA · 2016-03-30T05:17:01Z

Test build #54496 has finished for PR 12056 at commit 892d301.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-03-30T11:51:59Z

python/pyspark/ml/classification.py

-        This generalizes the idea of "Gini" importance to other losses,
-        following the explanation of Gini importance from "Random Forests" documentation
-        by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
+        Each feature's importance is the average of its importance across all trees in the ensemble


Why is the doc changing? Is it incorrect?

There was some discussion on this in #11961. Basically, since the random forest and GBT importance is just an average of the importances for single trees, we can state that here and link to the doc for single trees, which explains how those are computed. Otherwise, we copy/paste the same explanation 6 times.
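To make the averaging concrete, here is a rough pure-Python sketch, not Spark's actual implementation. The per-tree importance values are made up, and the helper names `normalize` and `ensemble_importances` are hypothetical. Normalizing each tree before averaging is an assumption of this sketch; the thread only states that the ensemble importance is an average over single trees.

```python
from typing import Dict, List


def normalize(importances: Dict[int, float]) -> Dict[int, float]:
    """Scale importances so they sum to 1 (returned unchanged if all are zero)."""
    total = sum(importances.values())
    if total == 0.0:
        return dict(importances)
    return {feature: value / total for feature, value in importances.items()}


def ensemble_importances(per_tree: List[Dict[int, float]]) -> Dict[int, float]:
    """Average per-tree importances (each normalized first), then renormalize.

    Assumption: every tree contributes equally. A feature missing from a
    tree's dict counts as importance 0 for that tree.
    """
    if not per_tree:
        return {}
    summed: Dict[int, float] = {}
    for tree in per_tree:
        for feature, value in normalize(tree).items():
            summed[feature] = summed.get(feature, 0.0) + value
    averaged = {feature: value / len(per_tree) for feature, value in summed.items()}
    return normalize(averaged)


# Hypothetical raw importances for two trees, keyed by feature index.
trees = [
    {0: 3.0, 1: 1.0},
    {0: 1.0, 2: 1.0},
]
result = ensemble_importances(trees)
```

Under these assumptions, feature 0 ends up with the largest share because both trees split on it, which is the behaviour the shortened docstring describes in one sentence.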

Fair enough - though the Scala doc should be updated likewise?

It was updated in the PR I linked to above, so everything should be in sync.

Ah sorry I missed that.

SparkQA · 2016-03-30T20:16:12Z

Test build #54548 has finished for PR 12056 at commit 0ccf1ed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-03-31T19:59:44Z

LGTM
Merging with master
Thanks!

adding feature importance to gbts in pyspark

892d301

MLnick reviewed Mar 30, 2016
add :py prefix to docstrings

0ccf1ed

asfgit closed this in b11887c Mar 31, 2016
