We created a Docker image (modified from an image published by DeepChem) to ensure reproducibility of the calculations. Use the following command to pull the image:

```shell
$ docker pull khalimat/jcim_f_holly
```

The directory with the RP should be mounted into the container:

```shell
$ sudo docker run -it --name jcim -v {directory_w_RP}:/root/mydir khalimat/jcim_f_holly
```

Example command to run the pipeline:

```shell
$ python main.py -s_n 'N_SF_TTS'
```
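A minimal sketch of how the `-s_n` flag could be read by the entry point. The actual `main.py` is not shown in this README, so the parsing details (argument destination, help text) are assumptions; only the flag name and the example value come from the command above.

```python
import argparse

# Sketch of the pipeline's CLI entry point; the real main.py may differ.
parser = argparse.ArgumentParser(description="Run the SCAM classification pipeline")
parser.add_argument("-s_n", dest="scenario_name", required=True,
                    help="scenario name, e.g. 'N_SF_TTS' (sampling_dataset_split)")

# Stand-in for sys.argv, using the example from the command above.
args = parser.parse_args(["-s_n", "N_SF_TTS"])
print(args.scenario_name)  # N_SF_TTS
```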
I used a t-test for the means of two independent samples, since our AL and non-AL models are trained on different data (the AL training set is a subset of the non-AL training set).
**Hypothesis.** Training data sampling can significantly improve the performance of SCAM classification models.

**Measure of success.** We define a data sampling strategy as improving performance if a machine learning model that includes the sampling strategy performs significantly better than the same machine learning model without it. To evaluate such pairs of models, we will:
- keep all other model parameters and pre-processing steps consistent (e.g. dataset, train-test split, parameter optimization strategy, descriptors, feature selection strategy)
- performance metrics: ROC AUC, MCC, F1
- test set: 60-30% training-test split of the original data, ensuring a consistent class imbalance in the test set (stratified) and using scaffold-based group assignment
- repeat the training-test split 10 times and use Bonferroni-corrected t-test p-values to ensure the differences are significant
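The significance test described above can be sketched with SciPy's independent-samples t-test plus a manual Bonferroni correction. The metric values below are made-up placeholders, not our results, and the number of comparisons (one per metric) is an illustrative assumption.

```python
from scipy.stats import ttest_ind

# Hypothetical ROC AUC values from 10 repeated train-test splits
# (placeholders, not actual results).
auc_with_sampling    = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84, 0.86, 0.85, 0.83, 0.86]
auc_without_sampling = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.79, 0.80, 0.82, 0.81]

# t-test for the means of two independent samples.
t_stat, p_value = ttest_ind(auc_with_sampling, auc_without_sampling)

# Bonferroni correction: multiply the p-value by the number of
# comparisons (here 3, one per metric: ROC AUC, MCC, F1), cap at 1.
n_comparisons = 3
p_corrected = min(1.0, p_value * n_comparisons)
print(f"t = {t_stat:.2f}, Bonferroni-corrected p = {p_corrected:.2e}")
```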
To ensure that our hypothesis is generalizable and not limited to a single use case, we will explore the following scenarios:
- sampling strategies
  - ADASYN
  - SMOTE
  - CondensedNearestNeighbour
  - ActiveLearning
- dataset
  - small Shoichet dataset from the Excel sheet
  - larger Shoichet dataset: the Excel sheet plus a large set of positive data from AggAdvisor
  - dataset from Tropsha's SCAMDetective, based on PubChem
- descriptor
  - ECFP (Morgan)
  - RDKit Fingerprint
- feature processing
  - none
  - feature scaling
- models
  - TF MLP
  - DeepSCAMs
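To illustrate the oversampling side of the list above, here is a minimal SMOTE-style interpolation in plain NumPy. This is a sketch of the idea (synthesize minority points by interpolating toward a nearest minority neighbour), not the imbalanced-learn implementation the pipeline would actually use; the function name and parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours.
    Sketch of the SMOTE idea only, not a library implementation."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Distances to all other minority points (exclude the point itself).
        d = np.linalg.norm(X_minority - x, axis=1)
        d[i] = np.inf
        neighbours = np.argsort(d)[:k]
        x_nn = X_minority[rng.choice(neighbours)]
        # New point lies on the segment between x and its neighbour.
        synthetic.append(x + rng.random() * (x_nn - x))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=6, rng=42)
print(X_new.shape)  # (6, 2)
```

Interpolated points always stay inside the convex hull of the minority class, which is the core property that distinguishes SMOTE-style synthesis from plain duplication.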
Results can be found here. Run names are composed of the abbreviations in the table below:
| Sampling | Dataset | Split |
|---|---|---|
| N (No sampling) | SF (SCAMS_filtered.csv) | TTS (train_test_split) |
| SMOTE | SP1 (SCAMS_balanced_with_positive.csv) | B (split_with_butina) |
| ADASYN | SP2 (SCAMS_added_positives_653_1043.csv) | SS (split_with_scaffold_splitter) |
| CondensedNearestNeighbour (CNN) | __ | ANV (almost no validation) |
| InstanceHardnessThreshold (IHT) | __ | __ |
For example, N_SF_SS stands for a run with no sampling on SCAMS_filtered.csv using the scaffold splitter.
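The naming convention can be decoded mechanically. Below is a small helper built from the abbreviations in the table above; it is a convenience sketch for readers, not part of the pipeline code.

```python
# Abbreviation tables, copied from the table above.
SAMPLING = {"N": "No sampling", "SMOTE": "SMOTE", "ADASYN": "ADASYN",
            "CNN": "CondensedNearestNeighbour",
            "IHT": "InstanceHardnessThreshold"}
DATASET = {"SF": "SCAMS_filtered.csv",
           "SP1": "SCAMS_balanced_with_positive.csv",
           "SP2": "SCAMS_added_positives_653_1043.csv"}
SPLIT = {"TTS": "train_test_split", "B": "split_with_butina",
         "SS": "split_with_scaffold_splitter", "ANV": "almost no validation"}

def decode_run_name(name):
    """Split a run name like 'N_SF_SS' into (sampling, dataset, split)."""
    sampling, dataset, split = name.split("_", 2)
    return SAMPLING[sampling], DATASET[dataset], SPLIT[split]

print(decode_run_name("N_SF_SS"))
# ('No sampling', 'SCAMS_filtered.csv', 'split_with_scaffold_splitter')
```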
There are two models provided with SD (cruzain and beta-lactamase). I made a mistake in the previous email when I wrote that there were four models; the pbz2 files were not models.
I trained the DC models and our models in parallel and visualized the results. Here is an example of a notebook I wrote to visualize the results.
Conceptually, there are three variants:




