Khalimat/JCIM

[Some name]: a pipeline to compare the performance of AL and non-AL models

Docker image

We created a Docker image (a modified version of an image published by DeepChem) to ensure reproducibility of the calculations. Use the following command to pull the image:

$ docker pull khalimat/jcim_f_holly

The directory with the RP should be mounted into the container:

$ sudo docker run -it --name jcim -v {directory_w_RP}:/root/mydir khalimat/jcim_f_holly

Run

Example command to run the pipeline:

$ python main.py -s_n 'N_SF_TTS'

I used a t-test for the means of two independent samples, since the AL and non-AL models are trained on different data (the AL training set is a subset of the non-AL training set).
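To illustrate the comparison described above, here is a minimal sketch using `scipy.stats.ttest_ind`. The score arrays are made-up placeholders, not results from the pipeline, and `equal_var=False` (Welch's t-test) is one reasonable choice when the two models' score variances may differ.

```python
# Illustrative only: compare per-split ROC AUC scores of an AL model and a
# non-AL model with a two-sample t-test. The scores below are placeholders.
import numpy as np
from scipy import stats

al_scores = np.array([0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.84, 0.79, 0.81, 0.80])
non_al_scores = np.array([0.76, 0.74, 0.78, 0.75, 0.77, 0.73, 0.76, 0.75, 0.74, 0.77])

# equal_var=False gives Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(al_scores, non_al_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```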

UML


Research Summary

Hypothesis: training data sampling can significantly improve the performance of SCAM classification models.

Measure of success: we define a data sampling strategy as improving performance if a machine learning model that includes the sampling strategy performs significantly better than the same machine learning model without it. To evaluate such pairs of models, we will:

  • keep all other model parameters and pre-processing steps consistent (e.g. dataset, train-test split, parameter-optimization strategy, descriptors, feature-selection strategy)
  • performance metrics: ROC AUC, MCC, F1
  • test set: 60-30% training-test split of the original data, while ensuring consistent class imbalance in the test set (stratified) and using scaffold-based group assignment
  • repeat the training-test split 10 times and use Bonferroni-corrected t-test p-values to ensure the differences are significant
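The repeated-split protocol above can be sketched as follows. The per-split training and scoring is stubbed out with simulated score distributions (the means, spreads, and sample sizes are invented for illustration); only the repetition loop and the Bonferroni correction reflect the protocol described here.

```python
# Sketch of the evaluation protocol: repeat the train-test split 10 times,
# collect one p-value per repetition, then apply a Bonferroni correction.
# The scores are simulated stand-ins for real model evaluations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_repeats = 10
alpha = 0.05

p_values = []
for _ in range(n_repeats):
    # Stand-in for "train both models on a fresh split and score them":
    al_scores = rng.normal(loc=0.82, scale=0.02, size=30)
    non_al_scores = rng.normal(loc=0.75, scale=0.02, size=30)
    _, p = stats.ttest_ind(al_scores, non_al_scores, equal_var=False)
    p_values.append(p)

# Bonferroni correction: multiply each p-value by the number of tests (cap at 1)
corrected = [min(p * n_repeats, 1.0) for p in p_values]
significant = [p < alpha for p in corrected]
print(sum(significant), "of", n_repeats, "splits significant after correction")
```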

To ensure that our hypothesis is generalizable and not limited to a single use case, we will explore several different scenarios.

Results

Results can be found here

Folder names

| Sampling | Dataset | Split |
| --- | --- | --- |
| N (no sampling) | SF (SCAMS_filtered.csv) | TTS (train_test_split) |
| SMOTE | SP1 (SCAMS_balanced_with_positive.csv) | B (split_with_butina) |
| ADASYN | SP2 (SCAMS_added_positives_653_1043.csv) | SS (split_with_scaffold_splitter) |
| CondensedNearestNeighbour (CNN) | __ | ANV (almost no validation) |
| InstanceHardnessThreshold (IHT) | __ | __ |

For example, N_SF_SS stands for a run with no sampling on SCAMS_filtered.csv using scaffold_splitter
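To make the naming scheme concrete, here is a hypothetical helper (not part of the repository) that decodes a run name such as N_SF_SS back into the table entries above:

```python
# Hypothetical helper (not in the repository) that decodes the run-name
# codes from the table above, e.g. "N_SF_SS" -> sampling/dataset/split.
SAMPLING = {
    "N": "No sampling",
    "SMOTE": "SMOTE",
    "ADASYN": "ADASYN",
    "CNN": "CondensedNearestNeighbour",
    "IHT": "InstanceHardnessThreshold",
}
DATASET = {
    "SF": "SCAMS_filtered.csv",
    "SP1": "SCAMS_balanced_with_positive.csv",
    "SP2": "SCAMS_added_positives_653_1043.csv",
}
SPLIT = {
    "TTS": "train_test_split",
    "B": "split_with_butina",
    "SS": "split_with_scaffold_splitter",
    "ANV": "almost no validation",
}

def decode_run_name(name: str) -> dict:
    """Split a run name like 'N_SF_SS' into its three coded components."""
    sampling, dataset, split = name.split("_")
    return {
        "sampling": SAMPLING[sampling],
        "dataset": DATASET[dataset],
        "split": SPLIT[split],
    }

print(decode_run_name("N_SF_SS"))
```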

Other approaches

SCAMs detective (SD)

There are two models presented with SD (cruzain and beta-lactamase). I made a mistake in the previous email and wrote that there were four models; the pbz2 files were not models.

Results

I trained the DC models and our models in parallel and visualized the results.

Study 1

  • Results on the test set Test

  • Results on the validation set Validation

Study 2

  • Results on the test set Test

  • Results on the validation set Validation

Visualization

Here is an example of a notebook I wrote to visualise the results.

Conceptually, there are three variants:

  • Violin plot + Scatter plot ViolinScatter

  • Violin plot Violin

  • Ridgeline plot (I consider it the most informative) Ridgeline
