Merged

48 commits
136c0a5
rename format bash script
roshankern Jul 1, 2022
eba8506
create ML module
roshankern Jul 1, 2022
e6aa716
confusion matrix, weight heatmap
roshankern Jul 8, 2022
6cb1245
black formatting
roshankern Jul 11, 2022
9dd89d0
data prep info
roshankern Jul 11, 2022
3ceb24e
update documentation
roshankern Jul 11, 2022
08445dc
add plots
roshankern Jul 11, 2022
2631e73
hypertuning
roshankern Jul 11, 2022
bd61949
order barcharts
roshankern Jul 11, 2022
02a5ef3
formatting
roshankern Jul 11, 2022
c2d12cd
update documentation
roshankern Jul 11, 2022
a3c027a
documentation
roshankern Jul 12, 2022
0b49fbc
documentation
roshankern Jul 12, 2022
5980093
final run
roshankern Jul 12, 2022
ec39df8
readme update
roshankern Jul 12, 2022
5300c46
renaming
roshankern Jul 12, 2022
a8395ba
shuffling info
roshankern Jul 12, 2022
49ea61b
Update README.md
roshankern Jul 12, 2022
846e62a
Update README.md
roshankern Jul 12, 2022
2b092f2
Update 2.ML_model/README.md
roshankern Jul 12, 2022
5ae379b
update readme
roshankern Jul 12, 2022
a747378
Update 2.ML_model/README.md
roshankern Jul 12, 2022
a218a54
rerun
roshankern Jul 13, 2022
087b188
final estimator
roshankern Jul 13, 2022
445acdf
baseline + save results
roshankern Jul 14, 2022
e7ee44d
baseline comparison
roshankern Jul 14, 2022
cd0752b
clarify shuffling
roshankern Jul 15, 2022
81387ef
shuffled baseline clarification
roshankern Jul 15, 2022
6fb05c8
column independent shuffling
roshankern Jul 15, 2022
b344839
export sklearn model
roshankern Jul 15, 2022
37138a4
DP_model update
roshankern Jul 15, 2022
6aa0b0e
split data
roshankern Jul 18, 2022
9f8710d
train model
roshankern Jul 19, 2022
85b1468
evaluation
roshankern Jul 20, 2022
294407f
number steps, utils file
roshankern Jul 22, 2022
3b419b3
evaluation
roshankern Jul 22, 2022
2395aa1
interpret model
roshankern Jul 22, 2022
3a87e81
rerun pipeline
roshankern Jul 25, 2022
8f8502b
intelligent holdout data
roshankern Jul 25, 2022
0915c99
full rerun
roshankern Jul 25, 2022
c1e32fc
documentation
roshankern Jul 26, 2022
c7349f5
stratified test split
roshankern Jul 26, 2022
92be301
final rerun
roshankern Jul 26, 2022
1306e2f
pipelinization
roshankern Jul 27, 2022
a81c27d
test html
roshankern Jul 27, 2022
c0315c5
restructure
roshankern Jul 28, 2022
dc0ae39
restructure 2
roshankern Jul 28, 2022
17fa898
Finalize
roshankern Jul 28, 2022
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
# Jupyter Notebook files
.ipynb_checkpoints
# Python files
__pycache__/
# vscode files
.vscode
2 changes: 1 addition & 1 deletion 1.format_data/README.md
Expand Up @@ -27,5 +27,5 @@ conda activate 1.format_training_data

```bash
# Run this script to preprocess training movies
bash 1.format_trainind_data.sh
bash 1.format_training_data.sh
```
12 changes: 12 additions & 0 deletions 2.ML_model/2.ML_model.sh
@@ -0,0 +1,12 @@
#!/bin/bash
# Step 0: Convert all notebooks to scripts
jupyter nbconvert --to=script \
--FilesWriter.build_directory=scripts \
notebooks/*.ipynb

# Step 1: Execute all notebooks and save the executed copies as HTML
jupyter nbconvert --to=html \
--FilesWriter.build_directory=html \
--ExecutePreprocessor.kernel_name=python3 \
--ExecutePreprocessor.timeout=10000000 \
--execute notebooks/*.ipynb
10 changes: 10 additions & 0 deletions 2.ML_model/2.machine_learning_env.yml
@@ -0,0 +1,10 @@
name: 2.ML_phenotypic_classification
channels:
- conda-forge
dependencies:
- conda-forge::python=3.8.13
- conda-forge::jupyter=1.0.0
- conda-forge::pandas=1.4.2
- conda-forge::scikit-learn=1.1.1
- conda-forge::matplotlib=3.5.2
- conda-forge::seaborn=0.11.2
114 changes: 114 additions & 0 deletions 2.ML_model/README.md
@@ -0,0 +1,114 @@
# 2. Machine Learning Model

In this module, we train and validate a machine learning model for phenotypic classification (mitotic stage) of nuclei based on nuclei features.

We use [Scikit-learn (sklearn)](https://scikit-learn.org/) for data manipulation, model training, and model evaluation.
Scikit-learn is described in [Pedregosa et al., JMLR 12, pp. 2825-2830, 2011](http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html) as a machine learning library for Python.
Its ease of implementation in a pipeline makes it ideal for our use case.

We consistently use the following parameters with many sklearn functions:

- `n_jobs=-1`: Use all CPU cores in parallel when completing a task.
- `random_state=0`: Use seed 0 when shuffling data or generating random numbers.
This makes "random" sklearn operations produce consistent results.
We also use `np.random.seed(0)` so that "random" numpy operations produce consistent results.
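
As a minimal sketch of this convention (the estimator here is only for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Seed numpy's global RNG so "random" numpy operations are reproducible
np.random.seed(0)

# The same two keyword arguments are passed to many sklearn estimators
# and utilities throughout this module
model = LogisticRegression(n_jobs=-1, random_state=0)
```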

We use [seaborn](https://seaborn.pydata.org/) for data visualization.
Seaborn is described in [Waskom, M.L., 2021](https://doi.org/10.21105/joss.03021) as a library for making statistical graphics in Python.

All parts of the following pipeline are completed for a "final" model (from training data) and a "shuffled baseline" model (from shuffled training data).
The shuffled baseline model provides a suitable point of comparison for the final model during evaluation.

### A. Data Preparation

First, we split the data into training, test, and holdout subsets in [0.split_data.ipynb](notebooks/0.split_data.ipynb).
The `get_representative_images()` function used to create the holdout dataset determines which images to hold out such that every phenotypic class is represented in the holdout images.
The test dataset is created by taking a random, class-stratified selection of samples from the dataset after the holdout images are removed.
The training dataset is the subset remaining after holdout/test samples are removed.
Sample indexes associated with training, test, and holdout subsets are stored in [0.data_split_indexes.tsv](results/0.data_split_indexes.tsv).
Sample indexes are used to load subsets from [training_data.csv.gz](../1.format_data/data/training_data.csv.gz).
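
A minimal sketch of the stratified test split on synthetic stand-in data (the variable names and `test_size` here are illustrative assumptions, not the notebook's exact values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the samples remaining after holdout images are removed
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 20))
labels = rng.integers(0, 5, size=1000)  # stand-in phenotypic class labels

# stratify=labels keeps the class distribution equal across train and test
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.15, stratify=labels, random_state=0
)
```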

We use [sklearn.utils.shuffle](https://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html) to shuffle the training data in a consistent way.
This function shuffles the order of the training data samples while keeping the phenotypic class labels aligned with their respective features.
In other words, this function shuffles entire rows of training data to remove any ordering scheme.
This is necessary because the data as labeled from MitoCheck tends to have phenotypic classes in groups, which can introduce bias into the model.
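
A minimal sketch of this row-wise shuffle on synthetic stand-in data:

```python
import numpy as np
from sklearn.utils import shuffle

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
y_train = rng.integers(0, 5, size=100)

# Entire rows are shuffled together, so each feature vector stays aligned
# with its phenotypic class label while any ordering scheme is removed
X_train, y_train = shuffle(X_train, y_train, random_state=0)
```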

We use [sklearn.model_selection.StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to create stratified training/test data sets for cross validation.
Each fold's training and test sets have the same distribution of classes, so every phenotypic class is proportionally represented across all data sets.
We use `n_splits=10` to create 10 folds for cross-validation.
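
A minimal sketch of the 10-fold stratified cross-validation loop on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 5, size=200)

# Each fold preserves the overall class distribution
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
    X_fold_train, X_fold_test = X[train_index], X[test_index]
    y_fold_train, y_fold_test = y[train_index], y[test_index]
```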

To create the shuffled baseline training dataset, we first load the training data as described above.
We then shuffle each column of the training data independently, which removes any correlation between features and phenotypic class label.
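
A minimal sketch of the column-independent shuffle on synthetic stand-in data:

```python
import numpy as np

np.random.seed(0)
X_train = np.random.normal(size=(100, 20))

# Permuting each feature column independently destroys any correlation
# between features and phenotypic class labels
shuffled_X = X_train.copy()
for column in shuffled_X.T:    # each row of the transpose is a column view
    np.random.shuffle(column)  # permute that column in place
```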

### B. Model Training

We train the model in [1.train_model.ipynb](notebooks/1.train_model.ipynb).
We use [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for our machine learning model.
We use the following parameters for our Logistic Regression model:

- `penalty='elasticnet'`: Use elasticnet as the penalty for our model.
Elastic-Net regularization is a combination of L1 and L2 regularization methods.
The mixing of these two methods is determined by the `l1_ratio` parameter which we optimize later.
- `solver='saga'`: We use the saga solver as this is the only solver that supports Elastic-Net regularization.
- `max_iter=100`: Set the maximum number of iterations for the solver to converge. 100 iterations let the solver reach its best performance without spending time on unnecessary iterations.
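
A minimal sketch of the estimator configured with these parameters (`l1_ratio=0.5` is a placeholder that the grid search below optimizes):

```python
from sklearn.linear_model import LogisticRegression

# penalty="elasticnet" requires an l1_ratio, so a placeholder is supplied here
log_reg = LogisticRegression(
    penalty="elasticnet",
    solver="saga",
    max_iter=100,
    l1_ratio=0.5,
    n_jobs=-1,
    random_state=0,
)
```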

We use [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to perform an exhaustive search over the parameters below.
This search selects the parameters that maximize the weighted F1 score of the model.
We optimize weighted F1 score because this metric measures model precision and recall and accounts for label imbalance (see [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) for more details).

- `l1_ratio`: Elastic-Net mixing parameter.
Used to combine L1 and L2 regularization methods.
We search over the following values: `[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]`
- `C`: Inversely proportional to regularization strength.
We search over the following values: `[1e-03, 1e-02, 1e-01, 1e+00, 1e+01, 1e+02, 1e+03]`
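
A minimal sketch of the grid search on synthetic stand-in data (for a classifier, `cv=10` gives 10 stratified folds, matching the `StratifiedKFold` setup above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
y_train = rng.integers(0, 5, size=200)

log_reg = LogisticRegression(
    penalty="elasticnet", solver="saga", max_iter=100, n_jobs=-1, random_state=0
)
param_grid = {
    "l1_ratio": np.linspace(0.0, 1.0, 11),  # the l1_ratio values listed above
    "C": np.logspace(-3, 3, 7),             # the C values listed above
}

# scoring="f1_weighted" selects parameters that maximize the weighted F1 score;
# with refit=True (the default) the best estimator is retrained on all the data
grid_search = GridSearchCV(log_reg, param_grid, cv=10, scoring="f1_weighted", n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
```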

The best parameters are used to train a final model on all of the training data.
This final model is saved in [1.log_reg_model.joblib](results/1.log_reg_model.joblib).
The model derived from shuffled training data is saved in [1.shuffled_baseline_log_reg_model.joblib](results/1.shuffled_baseline_log_reg_model.joblib).
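
The models can be written and restored with joblib, as in this minimal sketch (assuming `best_model` from the grid search sketch above; the module stores its models under [results](results/)):

```python
from joblib import dump, load

dump(best_model, "1.log_reg_model.joblib")    # persist the fitted estimator
final_model = load("1.log_reg_model.joblib")  # restore it for later evaluation
```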

### C. Model Evaluation

We evaluate the models in [2.evaluate_model.ipynb](notebooks/2.evaluate_model.ipynb).
We use the final model and shuffled baseline model to predict the labels of the training, testing, and holdout datasets.
These predictions are saved in [2.model_predictions.tsv](results/2.model_predictions.tsv) and [2.shuffled_baseline_model_predictions.tsv](results/2.shuffled_baseline_model_predictions.tsv) respectively.

We evaluate these 6 sets of predictions with a confusion matrix to see the true/false positives and negatives (see [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for more details).

We also evaluate these 6 sets of predictions with [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) to determine the final/shuffled baseline model's predictive performance on each subset.
F1 score measures the model's precision and recall for each phenotypic class.
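
A minimal sketch of evaluating one set of predictions (the labels here are toy stand-ins):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 1, 2, 2, 1, 0]  # stand-in true phenotypic classes
y_pred = [0, 2, 2, 2, 1, 0]  # stand-in model predictions

cm = confusion_matrix(y_true, y_pred)  # rows are true classes, columns are predictions
per_class_f1 = f1_score(y_true, y_pred, average=None)       # one score per class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # accounts for label imbalance
```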

### D. Model Interpretation

We interpret the models in [3.interpret_model.ipynb](notebooks/3.interpret_model.ipynb).
The final model and shuffled baseline model coefficients are loaded from [1.log_reg_model.joblib](results/1.log_reg_model.joblib) and [1.shuffled_baseline_log_reg_model.joblib](results/1.shuffled_baseline_log_reg_model.joblib) respectively.
These coefficients are interpreted with the following diagrams:

- We use [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) to display the coefficient values for each phenotypic class/feature.
- We use [seaborn.clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html) to display a hierarchically-clustered heatmap of coefficient values for each phenotypic class/feature.
- We use [seaborn.kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to display a density plot of coefficient values for each phenotypic class.
- We use [seaborn.barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html) to display a bar plot of average coefficient values per phenotypic class and feature.
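
A minimal sketch of these coefficient diagrams on a small synthetic model (`coef_` holds one row per phenotypic class and one column per feature):

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)  # stand-in phenotypic class labels
model = LogisticRegression(max_iter=500).fit(X, y)

coefficients = pd.DataFrame(model.coef_, index=model.classes_)
sns.heatmap(coefficients)         # coefficient value per class/feature
sns.clustermap(coefficients)      # hierarchically-clustered heatmap
sns.kdeplot(data=coefficients.T)  # coefficient density per phenotypic class
```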

## Step 1: Setup Machine Learning Environment

### Step 1a: Create Machine Learning Environment

```sh
# Run this command to create the conda environment for machine learning
conda env create -f 2.machine_learning_env.yml
```

### Step 1b: Activate Machine Learning Environment

```sh
# Run this command to activate the conda environment for machine learning
conda activate 2.ML_phenotypic_classification
```

## Step 2: Execute Machine Learning Pipeline

```bash
# Run this script to train, evaluate, and interpret the ML model
bash 2.ML_model.sh
```

**Note:** Running the pipeline will produce all intermediate files (located in [results](results/)).
The Jupyter notebooks (located in [notebooks](notebooks/)) will not be updated, but the executed notebooks (located in [html](html/)) will be.