diff --git a/README.rst b/README.rst
index a615245..e7c2898 100644
--- a/README.rst
+++ b/README.rst
@@ -1,7 +1,6 @@
.. image:: https://github.com/PythonPredictions/cobra/raw/master/material/logo.png
- :width: 350
-
+ :width: 700
.. image:: https://img.shields.io/pypi/v/pythonpredictions-cobra.svg
:target: https://pypi.org/project/pythonpredictions-cobra/
@@ -54,7 +53,7 @@ The easiest way to install Cobra is using ``pip``: ::
Documentation and extra material
-=====================
+================================
- A `blog post `_ on the overall methodology.
@@ -64,7 +63,7 @@ Documentation and extra material
- A step-by-step `tutorial `_ for **logistic regression**.
-- A step-by-step `tutorial `_ for **linear regression**.
+- A step-by-step `tutorial `__ for **linear regression**.
- Check out the Data Science Leuven Meetup `talk `_ by one of the core developers (second presentation). His `slides `_ and `related material `_ are also available.
@@ -72,6 +71,6 @@ Contributing to Cobra
=====================
We'd love you to contribute to the development of Cobra! There are many ways in which you can contribute, the most common of which is to contribute to the source code or documentation of the project. However, there are many other ways you can contribute (report issues, improve code coverage by adding unit tests, ...).
-We use GitHub issue to track all bugs and feature requests. Feel free to open an issue in case you found a bug or in case you wish to see a new feature added.
+We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you find a bug or wish to see a new feature added.
-For more details, check our `wiki `_.
+For more details, check out our `wiki `_.
diff --git a/docs/Makefile b/docs/Makefile
index d0c3cbf..92f501f 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -1,5 +1,4 @@
# Minimal makefile for Sphinx documentation
-#
# You can set these variables from the command line, and also
# from the environment for the first two.
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 449fd5a..64d3d6f 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -1,5 +1,5 @@
# Configuration file for the Sphinx documentation builder.
-#
+
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
@@ -14,16 +14,14 @@
import sys
sys.path.insert(0, os.path.abspath('../../'))
-
# -- Project information -----------------------------------------------------
-project = 'cobra'
-copyright = '2020, Python Predictions'
+project = 'Cobra'
+copyright = '2021, Python Predictions'
author = 'Python Predictions'
# The full version, including alpha/beta/rc tags
-release = '1.0.0'
-
+release = '1.1.0'
# -- General configuration ---------------------------------------------------
@@ -72,12 +70,12 @@
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
-#
+
html_theme = 'sphinx_rtd_theme'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ['_static']
+# html_static_path = ['_static']  # uncomment once the _static directory exists; it currently doesn't
-html_favicon = 'images/cobra_icon.png'
+html_favicon = 'images/cobra-icon2.png'
diff --git a/docs/source/images/cobra_Icon.png b/docs/source/images/cobra_Icon.png
deleted file mode 100644
index d915c5c..0000000
Binary files a/docs/source/images/cobra_Icon.png and /dev/null differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index a887301..69b1fb1 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,30 +1,17 @@
-.. cobra documentation master file, created by
+.. Cobra documentation master file, created by
sphinx-quickstart on Thu Dec 3 11:55:07 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
*********************************
-Welcome to cobra's documentation!
+Welcome to Cobra's documentation!
*********************************
-.. include:: C:/Users/hendrik.dewinter/PycharmProjects/cobra/README.rst
-
-.. toctree::
- :maxdepth: 2
- :hidden:
- :caption: Contents:
+.. include:: ../../README.rst
.. toctree::
:maxdepth: 4
:hidden:
:caption: API Reference
- C:/Users/hendrik.dewinter/PycharmProjects/cobra/docs/source/docstring/modules.rst
-
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+ docstring/modules
diff --git a/docs/source/tutorial_outdated.rst b/docs/source/tutorial_outdated.rst
deleted file mode 100644
index fd95b83..0000000
--- a/docs/source/tutorial_outdated.rst
+++ /dev/null
@@ -1,224 +0,0 @@
-Tutorial
-========
-
-In this section we will walk you through all the required steps to build a predictive model using cobra. All classes and functions used here are well-documented. In case you want more information on a class or function, simply read the corresponding parts of the documentation or run the following Python snippet from e.g. a notebook:
-
-.. code-block:: python
-
- help(function_or_class_you_want_info_from)
-
-Building a good model involves three steps:
-
- - preprocessing: properly prepare the predictors (a synonym for "feature" or "variable" that we use throughout this tutorial) for modelling.
- - feature selection: automatically select the subset of predictors that contributes most to the target variable or output in which you are interested.
- - model evaluation: once a model has been built, a detailed evaluation can be performed by computing all sorts of evaluation metrics.
-
-In the examples below, we assume the data for model building is available in a pandas DataFrame called ``basetable``. This DataFrame should at least contain an ID column (e.g. "customernumber"), a target column (e.g. "TARGET") and a number of candidate predictors (features) to build a model with.
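Such a ``basetable`` could, for example, be mocked up as follows. This is a minimal sketch with made-up column names and random data; only the ID/target/predictor layout matters, and the two name variables are the ones assumed by the snippets further down:

```python
import numpy as np
import pandas as pd

# Illustrative only: column names, sizes and values are made up for this sketch.
rng = np.random.default_rng(42)
n = 1000

basetable = pd.DataFrame({
    "customernumber": np.arange(n),                            # ID column
    "TARGET": rng.integers(0, 2, size=n),                      # binary target
    "age": rng.integers(18, 80, size=n),                       # continuous predictor
    "region": rng.choice(["north", "south", "east"], size=n),  # categorical predictor
})

# Names referenced throughout the tutorial snippets:
id_column_name = "customernumber"
target_column_name = "TARGET"
```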
-
-Preprocessing
--------------
-
-The first part focuses on preparing the predictors for modelling by:
-
-- Splitting the dataset into training, selection and validation datasets.
-- Binning continuous variables into discrete intervals.
-- Replacing missing values of both categorical and continuous variables (which are now binned) with an additional "Missing" bin/category.
-- Regrouping categories into a new category "other".
-- Replacing bins/categories with their corresponding incidence rate.
-
-This is taken care of by the ``PreProcessor`` class, which has a scikit-learn-like interface (i.e. ``fit`` & ``transform``).
-
-.. code-block:: python
-
- import json
- from cobra.preprocessing import PreProcessor
-
- # Prepare data
- # create instance of PreProcessor from parameters
- # There are many options possible, see API reference, but here
- # we will use all the defaults
- preprocessor = PreProcessor.from_params()
-
- # split data into train-selection-validation set
- # in the result, an additional column "split" will be created
- # containing each of those values
- basetable = preprocessor.train_selection_validation_split(
- basetable,
- train_prop=0.6, selection_prop=0.2,
- validation_prop=0.2)
-
- # create list containing the column names of the discrete resp.
- # continuous variables
- continuous_vars = []
- discrete_vars = []
-
- # fit the pipeline
- preprocessor.fit(basetable[basetable["split"]=="train"],
- continuous_vars=continuous_vars,
- discrete_vars=discrete_vars,
- target_column_name=target_column_name)
-
- # store fitted preprocessing pipeline as a JSON file
- pipeline = preprocessor.serialize_pipeline()
-
- # I/O outside of PreProcessor to allow flexibility (e.g. upload to S3, ...)
- path = "path/to/store/preprocessing/pipeline/as/json/file/for/later/re-use.json"
- with open(path, "w") as file:
- json.dump(pipeline, file)
-
- # transform the data (e.g. perform discretisation, incidence replacement, ...)
- basetable = preprocessor.transform(basetable,
- continuous_vars=continuous_vars,
- discrete_vars=discrete_vars)
-
- # When you want to reuse the pipeline the next time, simply load it back in again
- # using the following snippet:
- # with open(path, "r") as file:
- # pipeline = json.load(file)
- # preprocessor = PreProcessor.from_pipeline(pipeline) and you're good to go!
-
-Feature selection
------------------
-
-Once the predictors are properly prepared, we can start building a predictive model, which boils down to selecting the right predictors from the dataset to train a model on. As a dataset typically contains many predictors, we can first perform a univariate preselection to rule out any predictor with little to no predictive power.
-
-This preselection is based on an AUC threshold of a univariate model evaluated on the train and selection datasets. Since the AUC only measures the quality of a ranking, any monotonic transformation of the scores (i.e. a transformation that does not alter the ranking itself) yields the same AUC. Hence, pushing a categorical variable (incl. a binned continuous variable) through a logistic regression produces exactly the same ranking as using target encoding: both yield a ranking of the categories on the training/selection set. Therefore, no univariate model is trained here; the target-encoded train and selection data are used directly as predicted scores to compute the AUC against the target.
-
-.. code-block:: python
-
- from cobra.model_building import univariate_selection
- from cobra.evaluation import plot_univariate_predictor_quality
- from cobra.evaluation import plot_correlation_matrix
-
- # Get list of predictor names to use for univariate_selection
- preprocessed_predictors = [col for col in basetable.columns if col.endswith("_enc")]
-
- # perform univariate selection on preprocessed predictors:
- df_auc = univariate_selection.compute_univariate_preselection(
- target_enc_train_data=basetable[basetable["split"] == "train"],
- target_enc_selection_data=basetable[basetable["split"] == "selection"],
- predictors=preprocessed_predictors,
- target_column=target_column_name,
- preselect_auc_threshold=0.53, # if auc_selection <= 0.53 exclude predictor
- preselect_overtrain_threshold=0.05 # if (auc_train - auc_selection) >= 0.05 --> overfitting!
- )
-
- # Plot df_auc to get a horizontal barplot:
- plot_univariate_predictor_quality(df_auc)
-
- # compute correlations between preprocessed predictors:
- df_corr = (univariate_selection
- .compute_correlations(basetable[basetable["split"] == "train"],
- preprocessed_predictors))
-
- # plot correlation matrix
- plot_correlation_matrix(df_corr)
-
- # get a list of predictors selected by the univariate selection
- preselected_predictors = (univariate_selection
- .get_preselected_predictors(df_auc))
-
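The ranking argument above — AUC is invariant under any monotonic transformation of the scores, which is why target-encoded values can stand in for a univariate model's predictions — can be verified with a small sketch (synthetic data; uses scikit-learn, which is assumed to be available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)  # binary target
scores = rng.random(500)          # e.g. target-encoded predictor values

# A strictly increasing transformation (here a shifted logistic map) leaves
# the ranking, and hence the AUC, unchanged.
auc_raw = roc_auc_score(y, scores)
auc_logit = roc_auc_score(y, 1 / (1 + np.exp(-(3 * scores - 1))))

assert np.isclose(auc_raw, auc_logit)
```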
-After an initial preselection on the predictors, we can start building the model itself using forward feature selection to choose the right set of predictors. Since we use target encoding on all our predictors, we will only consider models with positive coefficients (no sign flip should occur) as this makes the model more interpretable.
-
-.. code-block:: python
-
- from cobra.model_building import ForwardFeatureSelection
- from cobra.evaluation import plot_performance_curves
- from cobra.evaluation import plot_variable_importance
-
- forward_selection = ForwardFeatureSelection(max_predictors=30,
- pos_only=True)
-
- # fit the forward feature selection on the train data
- # has optional parameters to force and/or exclude certain predictors (see docs)
- forward_selection.fit(basetable[basetable["split"] == "train"],
- target_column_name,
- preselected_predictors)
-
- # compute model performance (e.g. AUC for train-selection-validation)
- performances = (forward_selection
- .compute_model_performances(basetable, target_column_name))
-
- # plot performance curves
- plot_performance_curves(performances)
-
-Based on the performance curves (AUC per model with a particular number of predictors, in the case of logistic regression), a final model can then be chosen and the variable importance can be plotted:
-
-.. code-block:: python
-
- # After plotting the performances and selecting the model,
- # we can extract this model from the forward_selection class:
- model = forward_selection.get_model_from_step(5)
-
- # Note that the chosen model has 6 variables (Python lists start at index 0),
- # which can be obtained as follows:
- final_predictors = model.predictors
- # We can also compute and plot the importance of each predictor in the model:
- variable_importance = model.compute_variable_importance(
- basetable[basetable["split"] == "selection"]
- )
- plot_variable_importance(variable_importance)
-
-**Note**: variable importance is based on correlation of the predictor with the *model scores* (and not the true labels!).
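As a rough sketch of that idea (illustrative only, not cobra's exact implementation), variable importance can be thought of as the correlation between each target-encoded predictor and the model scores:

```python
import numpy as np
import pandas as pd

# Hypothetical encoded predictors; in the tutorial these are the "_enc" columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age_enc": rng.random(200),
    "region_enc": rng.random(200),
})

# Hypothetical model scores; in the tutorial they come from model.score_model(...).
# Here "age_enc" dominates the scores by construction.
scores = 0.8 * df["age_enc"] + 0.1 * rng.random(200)

# Importance as absolute Pearson correlation with the scores (not the labels).
importance = {col: abs(df[col].corr(scores)) for col in df.columns}
assert importance["age_enc"] > importance["region_enc"]
```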
-
-Finally, we can again export the model to a dictionary to store it as JSON:
-
-.. code-block:: python
-
- model_dict = model.serialize()
-
- with open(path, "w") as file:
- json.dump(model_dict, file)
-
- # To reload the model again from a JSON file, run the following snippet:
- # from cobra.model_building import LogisticRegressionModel
- # with open(path, "r") as file:
- # model_dict = json.load(file)
- # model = LogisticRegressionModel()
- # model.deserialize(model_dict)
-
-Evaluation
-----------
-
-Now that we have built and selected a final model, it is time to evaluate it using various evaluation metrics:
-
-.. code-block:: python
-
- from cobra.evaluation import Evaluator
-
- # get numpy arrays of true target labels and predicted scores:
- y_true = basetable[basetable["split"] == "selection"][target_column_name].values
- y_pred = model.score_model(basetable[basetable["split"] == "selection"])
-
- evaluator = Evaluator()
- evaluator.fit(y_true, y_pred) # Automatically find the best cut-off probability
-
- # Get various scalar metrics such as accuracy, AUC, precision, recall, ...
- evaluator.scalar_metrics
-
- # Plot non-scalar evaluation metrics:
- evaluator.plot_roc_curve()
-
- evaluator.plot_confusion_matrix()
-
- evaluator.plot_cumulative_gains()
-
- evaluator.plot_lift_curve()
-
- evaluator.plot_cumulative_response_curve()
-
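One common way such a best cut-off can be found — shown here purely as an illustration of the idea, not necessarily ``Evaluator``'s exact criterion — is to scan candidate thresholds and keep the one maximising F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic labels and noisy scores, for illustration only.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=300)
y_pred = np.clip(y_true * 0.4 + rng.random(300) * 0.6, 0, 1)

# Scan a grid of thresholds and keep the one with the highest F1 score.
thresholds = np.linspace(0.05, 0.95, 19)
best_cutoff = max(thresholds, key=lambda t: f1_score(y_true, y_pred >= t))
```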
-Additionally, we can compute the output needed to plot so-called Predictor Insights Graphs (PIGs for short). These graphs represent the relationship between a single predictor (e.g. age) and the target (e.g. burnouts): the predictor is binned into groups, with group size shown as bars and group (target) incidence as a colored line. There is also an option to force the order of the predictor values.
-
-.. code-block:: python
-
- from cobra.evaluation import generate_pig_tables
- from cobra.evaluation import plot_incidence
-
- predictor_list = [col for col in basetable.columns
- if col.endswith("_bin") or col.endswith("_processed")]
- pig_tables = generate_pig_tables(basetable[basetable["split"] == "selection"],
- id_column_name=id_column_name,
- target_column_name=target_column_name,
- preprocessed_predictors=predictor_list)
- # Plot PIGs
- plot_incidence(pig_tables, 'predictor_name', predictor_order)
\ No newline at end of file
diff --git a/setup.py b/setup.py
index e4ec6ef..3653165 100644
--- a/setup.py
+++ b/setup.py
@@ -14,7 +14,7 @@
name="pythonpredictions-cobra",
version=__version__,
description=("A Python package to build predictive linear and logistic regression "
- "models focused on performance and interpretation"),
+ "models focused on performance and interpretation."),
long_description=README,
long_description_content_type="text/x-rst",
packages=find_packages(include=["cobra", "cobra.*"]),