We created a Docker image (modified from an image published by DeepChem) to ensure reproducibility of the calculations. Use the following command to pull the image:

```shell
$ docker pull khalimat/jcim_f_holly
```

The directory with the RP should be mounted into the container:

```shell
$ sudo docker run -it --name jcim -v {directory_w_RP}:/root/mydir khalimat/jcim_f_holly
```

Example command to run the pipeline:

```shell
$ python main.py -s_n 'N_SF_TTS'
```
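A minimal sketch of how the `-s_n` flag could be read by the entry point. The actual `main.py` is not shown in this README, so the parsing details (argument destination, help text) are assumptions; only the flag name and the example value come from the command above.

```python
import argparse

# Sketch of the pipeline's CLI entry point; the real main.py may differ.
parser = argparse.ArgumentParser(description="Run the SCAM classification pipeline")
parser.add_argument("-s_n", dest="scenario_name", required=True,
                    help="scenario name, e.g. 'N_SF_TTS' (sampling_dataset_split)")

# Stand-in for sys.argv, using the example from the command above.
args = parser.parse_args(["-s_n", "N_SF_TTS"])
print(args.scenario_name)  # N_SF_TTS
```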
I used a t-test for the means of two independent samples, since our AL and non-AL models are trained on different data (the AL training set is a subset of the non-AL training set).
**Hypothesis.** Training data sampling can significantly improve the performance of SCAM classification models.

**Measure of success.** We define a data sampling strategy as improving performance if a machine learning model that includes the sampling strategy performs significantly better than the same machine learning model without it. To evaluate such pairs of models, we will:
- keep all other model parameters and pre-processing steps consistent (e.g. dataset, train-test split, parameter optimization strategy, descriptors, feature selection strategy)
- performance metrics: ROC AUC, MCC, F1
- test set: 60-30% training-test split of the original data, ensuring a consistent class imbalance in the test set (stratified) and using scaffold-based group assignment
- repeat the training-test split 10 times and use Bonferroni-corrected t-test p-values to ensure the differences are significant
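The significance test described above can be sketched with SciPy's independent-samples t-test plus a manual Bonferroni correction. The metric values below are made-up placeholders, not our results, and the number of comparisons (one per metric) is an illustrative assumption.

```python
from scipy.stats import ttest_ind

# Hypothetical ROC AUC values from 10 repeated train-test splits
# (placeholders, not actual results).
auc_with_sampling    = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84, 0.86, 0.85, 0.83, 0.86]
auc_without_sampling = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.79, 0.80, 0.82, 0.81]

# t-test for the means of two independent samples.
t_stat, p_value = ttest_ind(auc_with_sampling, auc_without_sampling)

# Bonferroni correction: multiply the p-value by the number of
# comparisons (here 3, one per metric: ROC AUC, MCC, F1), cap at 1.
n_comparisons = 3
p_corrected = min(1.0, p_value * n_comparisons)
print(f"t = {t_stat:.2f}, Bonferroni-corrected p = {p_corrected:.2e}")
```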
To ensure that our hypothesis is generalizable and not limited to a single use case, we will explore the following scenarios:
- sampling strategies
  - ADASYN
  - SMOTE
  - CondensedNearestNeighbour
  - ActiveLearning
- dataset
  - small Shoichet dataset from the Excel sheet
  - larger Shoichet dataset: the Excel sheet plus a large set of positive data from AggAdvisor
  - dataset from Tropsha's SCAMDetective, based on PubChem
- descriptor
  - ECFP (Morgan)
  - RDKit Fingerprint
- feature processing
  - none
  - feature scaling
- models
  - TF MLP
  - DeepSCAMs
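To illustrate the oversampling side of the list above, here is a minimal SMOTE-style interpolation in plain NumPy. This is a sketch of the idea (synthesize minority points by interpolating toward a nearest minority neighbour), not the imbalanced-learn implementation the pipeline would actually use; the function name and parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours.
    Sketch of the SMOTE idea only, not a library implementation."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Distances to all other minority points (exclude the point itself).
        d = np.linalg.norm(X_minority - x, axis=1)
        d[i] = np.inf
        neighbours = np.argsort(d)[:k]
        x_nn = X_minority[rng.choice(neighbours)]
        # New point lies on the segment between x and its neighbour.
        synthetic.append(x + rng.random() * (x_nn - x))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=6, rng=42)
print(X_new.shape)  # (6, 2)
```

Interpolated points always stay inside the convex hull of the minority class, which is the core property that distinguishes SMOTE-style synthesis from plain duplication.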
Results can be found here. Run names are composed of the abbreviations in the table below:
| Sampling | Dataset | Split |
|---|---|---|
| N (No sampling) | SF (SCAMS_filtered.csv) | TTS (train_test_split) |
| SMOTE | SP1 (SCAMS_balanced_with_positive.csv) | B (split_with_butina) |
| ADASYN | SP2 (SCAMS_added_positives_653_1043.csv) | SS (split_with_scaffold_splitter) |
| CondensedNearestNeighbour (CNN) | __ | ANV (almost no validation) |
| InstanceHardnessThreshold (IHT) | __ | __ |
For example, N_SF_SS stands for a run with no sampling on SCAMS_filtered.csv using the scaffold splitter.
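The naming convention can be decoded mechanically. Below is a small helper built from the abbreviations in the table above; it is a convenience sketch for readers, not part of the pipeline code.

```python
# Abbreviation tables, copied from the table above.
SAMPLING = {"N": "No sampling", "SMOTE": "SMOTE", "ADASYN": "ADASYN",
            "CNN": "CondensedNearestNeighbour",
            "IHT": "InstanceHardnessThreshold"}
DATASET = {"SF": "SCAMS_filtered.csv",
           "SP1": "SCAMS_balanced_with_positive.csv",
           "SP2": "SCAMS_added_positives_653_1043.csv"}
SPLIT = {"TTS": "train_test_split", "B": "split_with_butina",
         "SS": "split_with_scaffold_splitter", "ANV": "almost no validation"}

def decode_run_name(name):
    """Split a run name like 'N_SF_SS' into (sampling, dataset, split)."""
    sampling, dataset, split = name.split("_", 2)
    return SAMPLING[sampling], DATASET[dataset], SPLIT[split]

print(decode_run_name("N_SF_SS"))
# ('No sampling', 'SCAMS_filtered.csv', 'split_with_scaffold_splitter')
```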
There are two models provided with SD (cruzain and beta-lactamase). I made a mistake in the previous email when I wrote that there were four models; the pbz2 files were not models.
I trained the DC models and our models in parallel and visualized the results. Here is an example of a notebook I wrote to visualize the results.
Conceptually, there are three variants:




