RandomForestFeatures.py

D. Levy-Booth 5/20/2016

#Description: Python implimentation of the Scikit-Learn Random Forest Classifier to select features (variables) associated with catagories or treatments. Random forests are ensemble classifiers of classification and regression trees. Use Boruta feature selection to permute the random forest classifier to calculate significance of associated features.

#This script has options to output:

A table (.csv) of your features (variables) with ranked Random Forest importance
A plot (.png) of your features with ranked Random Forest importance
A table of your features with ranked Boruta importance
A plot of your features with ranked Boruta importance

#Requirements:

Python 2.6 or higher (not Python 3): https://www.python.org/downloads/

Pandas: http://pandas.pydata.org/getpandas.html

Numpy: http://www.numpy.org/

SciKit-Learn 0.14.0 or higher: http://scikit-learn.org/stable/install.html

Boruta_py: https://github.com/danielhomola/boruta_py

Ensure SciKit-Learn is installed. If you've managed to get SciKit-Learn working, the other dependancies are probably in place. Download Boruta_py from github. Ensure that it either in your path, or in the same folder as RandomForestFeatures.py

#Usage:

In the terminal (or command line for weirdos who use windows):

RandomForestFeatures.py

Required arguments:

-i (--input_fp) : path to the input .csv file with columns for your data and one catagorical column (string)

-p (--predvar) : catagorical variable column against which features will be selected (string)

Optional arguments:

-g (--plot) : produce a feature plot? (boolean, default = true)

-o (--output) : output folder name (string, default = rf_output)

-f (--featdepth) : number of feature to plot (integer, default = 10)

-b (--boruta) : perform Boruta feature selection (boolean, IMPORTANT: default is FALSE)

#Example:

RandomForestFeatures.py -i dataToClassify.csv -p Catagory -o output_folder -f 20 -b true

Fungal OTUs associated with forest harvesting treatments. Input file (-i) was dataToClassify.csv, catagorial variable (-p) to classify was Catagory, a folder (-o) called output_folder was created with resulting tables and plots, and Boruta feature selection was used (-b).

#Advanced:

Aditional variables related to the Random Forest parameters and Boruta feature selection method can be altered in the script

Default Random Forest parameters:

rf_params = {'n_estimators': 1000, 'max_depth': 10, 'min_samples_split': 1}

Default Boruta method:

feat_selector = BorutaPy(rf, multi_corr_method='hommel', n_estimators='auto', max_iter = 100, verbose=1)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
img		img
README.md		README.md
RandomForestFeatures.py		RandomForestFeatures.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RandomForestFeatures.py

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RandomForestFeatures.py

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages