Updated 17 June 2025: Latest Release on pip! v1.5.4
GenoML (Genomics + Machine Learning) is an automated machine learning (autoML) package for genomics data. In general, use Linux or macOS with Python 3.9-3.12 for best results. This repository and pip package are under active development!
This README is a brief look into how to structure arguments and what arguments are available at each phase for the GenoML CLI.
If you are using GenoML for your own work, please cite the following papers:
- Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Saffo, D., Sargent, L., ... & Faghri, F. (2021). GenoML: Automated Machine Learning for Genomics. arXiv preprint arXiv:2103.03221
- Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Sargent, L., Dadu, A., ... & Nalls, M. A. (2021). Multi-Modality Machine Learning Predicting Parkinson’s Disease. bioRxiv.
- Install this repository directly from GitHub (from source; master branch)
git clone https://github.com/GenoML/genoml2.git
- Install using pip or upgrade using pip
pip install genoml2
OR
pip install genoml2 --upgrade
- To install the examples/ directory (~315 KB), you can use SVN (pre-installed on most Macs)
svn export https://github.com/GenoML/genoml2.git/trunk/examples
Note: When you pip install this package, the examples/ folder is also downloaded! However, if you still want to download the directory separately and SVN is not pre-installed, you can install SVN via Homebrew (if you have Homebrew installed) using
brew install svn
- 16-JUN-2025: Addition of multiclass prediction functionality using the same base models that are used for the discrete module. We have additionally restructured the munging functionality to allow users to process training and testing data all at once to ensure they are munged under the same conditions, as well as to include multiple GWAS summary stats files for SNP filtering. We also upgraded from plink1.9 to plink2 for genomic data processing. Finally, we have added a log file in the output directory to facilitate full reproducibility of results. README updated to reflect these changes.
- 8-OCT-2024: Big changes to output file structure, so now output files go in subdirectories named for each step, and prefixes are not required. README updated to reflect these changes.
You can create a virtual environment to run GenoML, if you prefer. If you already have the Anaconda Distribution, this is fairly simple.
To create and activate a virtual environment:
# To create a virtual environment
conda create -n GenoML python=3.12
# To activate a virtual environment
conda activate GenoML
# To install requirements via pip
pip install -r requirements.txt
# If issues installing xgboost from requirements - (3 options)
# Option 1: use Homebrew to
# xcode-select --install
# brew install gcc@7
# or Option 2: conda install -c conda-forge xgboost
# or Option 3: pip install xgboost==2.0.3
# If issues installing umap
# pip install umap-learn
# If issues installing pytables/dependency issue
# conda install -c conda-forge pytables
# If issues with blosc
# conda install -c conda-forge tables blosc
## MISC
# To deactivate the virtual environment
# conda deactivate GenoML
# To delete your virtual environment
# conda env remove -n GenoML
To install GenoML in the user's path in a virtual environment, you can do the following:
# Install the package at this path
pip install .
# MISC
# To save out the environment requirements to a .txt file
# pip freeze > requirements.txt
# Removing a conda virtualenv
# conda remove --name GenoML --all
Note: The following examples are for discrete data, but if you substitute continuous or multiclass for discrete in the following commands, you can munge, harmonize, train, tune, and test your continuous/multiclass data!
Munging with GenoML will, at minimum, do the following:
- Prune your genotypes using PLINK v2 (if the --geno flag is used)
- Impute per column using median or mean (can be changed with the --impute flag)
- Z-scale features and remove columns with a std dev = 0
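To make the imputation and scaling steps concrete, here is a minimal sketch in plain Python. It is not GenoML's implementation: the function name `impute_and_scale`, the toy column names, and the use of the population standard deviation are assumptions made for illustration only.

```python
import statistics

def impute_and_scale(columns, strategy="median"):
    """Impute missing values (None) per column, drop constant columns, z-scale the rest.

    Hypothetical helper for illustration; GenoML's own code may differ.
    """
    fill = statistics.median if strategy == "median" else statistics.mean
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing to impute from
        fill_value = fill(observed)
        imputed = [fill_value if v is None else v for v in values]
        mean = statistics.mean(imputed)
        std = statistics.pstdev(imputed)  # assumption: population std dev
        if std == 0:
            continue  # a constant column carries no information
        result[name] = [(v - mean) / std for v in imputed]
    return result

# Toy data: one SNP with a missing call, one constant SNP, one numeric covariate.
toy = {
    "snp_a": [0, 1, None, 2, 1],
    "snp_b": [1, 1, 1, 1, 1],
    "age": [50.0, 60.0, 70.0, None, 80.0],
}
scaled = impute_and_scale(toy, strategy="median")
print(sorted(scaled))  # the constant column snp_b is dropped
```

After scaling, each remaining column has mean 0 and standard deviation 1, which keeps features measured on different scales comparable during training.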
Required arguments for GenoML munging are --prefix and --pheno
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use munge.
- --prefix: Where would you like your outputs to be saved?
- --pheno: Where is your phenotype file? This file only has 2 columns, ID in one, and PHENO in the other (0 for controls and 1 for cases when using the discrete module, 0, ..., n-1 when using the multiclass module for n distinct phenotypes, or numeric values when using the continuous module).
Be sure to have your files formatted the same as the examples, key points being:
- Your phenotype file consisting only of the "ID" and "PHENO" columns
- Your sample IDs matching across all files
- Your sample IDs not consisting only of integers (if they do, add a prefix or suffix to all sample IDs so that they are alphanumeric prior to running GenoML)
- Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)
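The formatting points above can be checked before running GenoML. The following is a minimal sketch using only the Python standard library; the function `check_pheno` is hypothetical (it is not part of GenoML) and it assumes a comma-separated phenotype file like the ones in the examples/ directory.

```python
import csv
import io

def check_pheno(handle):
    """Return a list of problems found in a phenotype CSV (an empty list means none found).

    Hypothetical pre-flight check, not a GenoML function.
    """
    reader = csv.reader(handle)
    header = next(reader, [])
    problems = []
    if header != ["ID", "PHENO"]:
        problems.append(f"expected header ['ID', 'PHENO'], got {header}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, got {len(row)}")
            continue
        sample_id = row[0].strip()
        if sample_id.isdigit():
            problems.append(
                f"line {line_no}: ID '{sample_id}' is integer-only; add a prefix or suffix"
            )
    return problems

# In-memory stand-ins for real files; with a file on disk you would pass open(path, newline="").
good = io.StringIO("ID,PHENO\nsample1,0\nsample2,1\n")
bad = io.StringIO("ID,PHENO\n1001,0\nsample2,1\n")
print(check_pheno(good))
print(check_pheno(bad))
```

The check only covers the phenotype file. Matching sample IDs across the genotype, phenotype, and addit files would need a separate comparison.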
If you would like to munge just with genotypes (in PLINK binary format), the simplest command is the following:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv
If you would like a more detailed log printed to your console, you may use the --verbose flag as follows:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file with a detailed log printed to the console
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--verbose
Note: The --verbose flag may be used like this for any GenoML command, not just munging.
To properly evaluate your model, it must be applied to a dataset it's never seen before (testing data). If you have both training and testing data, GenoML allows you to munge them together upfront. To do this with your training and testing phenotype/genotype data, the simplest command is the following:
# Running GenoML munging on discrete data using PLINK genotype binary files and phenotype files for both the training and testing datasets.
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv
If you would like to control the pruning stringency in genotypes:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--r2_cutoff 0.3 \
--pheno examples/discrete/training_pheno.csv
You can choose to skip pruning your SNPs at this stage by including the --skip_prune flag:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--skip_prune \
--pheno examples/discrete/training_pheno.csv
You can choose to impute on mean or median by modifying the --impute flag, like so (default is median):
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file and specifying impute
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--impute mean
If you suspect collinear variables, and think this will be a problem for training the model moving forward, you can use variance inflation factor (VIF) filtering:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while using VIF to remove multicollinearity
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--vif 5 \
--vif_iter 1
- The --vif flag specifies the VIF threshold you would like to use (5 is recommended)
- The number of iterations you'd like to run can be modified with the --vif_iter flag (if you have or anticipate many collinear variables, it's a good idea to increase the iterations)
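For readers unfamiliar with VIF, the sketch below shows the idea in plain Python: the VIF of each feature is the corresponding diagonal entry of the inverse correlation matrix, and on each iteration the feature with the highest VIF above the threshold is dropped. This illustrates the general technique under those assumptions; it is not GenoML's filtering code, and the helper names and toy data are hypothetical.

```python
import statistics

def correlation_matrix(columns):
    """Pearson correlation matrix of the columns, in dict order."""
    names = list(columns)
    size = len(columns[names[0]])
    z = {}
    for name in names:
        mean = statistics.mean(columns[name])
        std = statistics.pstdev(columns[name])
        z[name] = [(v - mean) / std for v in columns[name]]
    return [[sum(a * b for a, b in zip(z[i], z[j])) / size for j in names] for i in names]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (assumes a non-singular matrix)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return {name: inverse[i][i] for i, name in enumerate(columns)}

def vif_filter(columns, threshold=5.0, iterations=1):
    """Drop the worst feature above the threshold, once per iteration."""
    kept = dict(columns)
    for _ in range(iterations):
        if len(kept) < 2:
            break
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept

# Toy data: x2 is almost exactly 2 * x1, x3 is roughly independent of both.
toy = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 4.1, 5.9, 7.9, 9.9, 12.1],
    "x3": [5, 1, 4, 2, 6, 3],
}
print(vif(toy))
print(sorted(vif_filter(toy, threshold=5.0, iterations=1)))
```

In the toy data, x1 and x2 have very large VIFs because each is almost a linear function of the other, so one of them is removed while x3 is kept.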
Well, what if you had GWAS summary statistics handy, and would like to just use the same SNPs outlined in that file? You can do so by running the following:
# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv
Note: When using the --gwas flag, the PLINK binaries will be pruned to include only the SNPs that match those in the GWAS file.
And if you have more than one GWAS summary statistics file, we support that too! Just use the same --gwas flag for each of the files you would like to include, as follows:
# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and two GWAS summary statistics files
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv \
--gwas examples/discrete/example_GWAS_2.csv
Note: This is particularly helpful when using the multiclass module when you have multiple phenotypes of interest and would like to include SNPs that are relevant for each phenotype.
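As a rough illustration of combining several summary statistics files, the sketch below collects the union of SNP IDs across tables, optionally keeping only rows at or below a p-value threshold. The column names `SNP` and `P`, the function name, and the union behavior are all assumptions made for this example; check the files in examples/ for the format GenoML actually expects.

```python
import csv
import io

def snps_from_gwas(handles, snp_column="SNP", p_column="P", p_cutoff=None):
    """Union of SNP IDs across GWAS tables, optionally filtered by p-value.

    Hypothetical helper; column names are assumptions, not GenoML's required format.
    """
    selected = set()
    for handle in handles:
        for row in csv.DictReader(handle):
            if p_cutoff is not None and float(row[p_column]) > p_cutoff:
                continue  # skip SNPs that do not pass the threshold in this file
            selected.add(row[snp_column])
    return selected

# In-memory stand-ins for two summary statistics files.
gwas_1 = "SNP,P\nrs1,0.001\nrs2,0.2\n"
gwas_2 = "SNP,P\nrs2,0.03\nrs3,0.0004\n"

all_snps = snps_from_gwas([io.StringIO(gwas_1), io.StringIO(gwas_2)])
strict_snps = snps_from_gwas([io.StringIO(gwas_1), io.StringIO(gwas_2)], p_cutoff=0.01)
print(sorted(all_snps))
print(sorted(strict_snps))
```

A SNP is kept if it passes in any one file, which matches the use case of several phenotypes each contributing their own relevant variants.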
...and if you wanted to add a p-value cut-off...
# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file with a p-value cut-off
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv \
--p 0.01
Do you have additional data you would like to incorporate? Perhaps clinical, demographic, or transcriptomics data? If coded and all numerical, these can be added as an --addit file by doing the following:
# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv
You also have the option of not using PLINK binary files if you would like to just preprocess (and then, later train) on a phenotype and addit file by doing the following:
# Running GenoML munging on discrete data using only a phenotype file and an addit file (no PLINK genotype binary files)
genoml discrete supervised munge \
--prefix outputs \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv
Are you interested in selecting and ranking your features? If so, you can use the --feature_selection flag and specify it like so:
# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file, and running feature selection
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--feature_selection 50
The --feature_selection flag uses extraTrees (classifier for discrete data; regressor for continuous data) to output an approx_feature_importance.txt file with the features contributing most to your model at the top.
Do you have additional covariates and confounders you would like to adjust for in the munging step prior to training your model and/or would like to reduce your data? To adjust, use the --adjust_data flag with the following necessary flags:
- --target_features: A .txt file, one column, with a list of features to adjust (no header). These should correspond to features in the munged dataset
- --confounders: A .csv of confounders to adjust for, with ID column and header. Numeric, with no missing data, and the ID column is mandatory (this can be PCs, for example)
You may also include the following optional flag:
- --adjust_normalize: Would you like to normalize your final adjusted data?
To reduce your data prior to adjusting, use the --umap_reduce flag. This flag will also prompt you for whether you want to adjust your data, whether to normalize, and what your target features and confounders might be. We use Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) to reduce your data into 2D, adjust, and export a plot and an adjusted dataframe moving forward. This can be done by running the following:
# Running GenoML munging on discrete data using PLINK binary files, a phenotype file, using UMAP to reduce dimensions and account for features, and running feature selection
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--umap_reduce \
--adjust_data \
--adjust_normalize \
--target_features examples/discrete/to_adjust.txt \
--confounders examples/discrete/training_addit_confounder_example.csv
And if you are munging your training and testing data together, you must include a confounders file for your test dataset as well using the --confounders_test flag:
# Running GenoML munging on discrete data using PLINK binary files, a phenotype file, using UMAP to reduce dimensions and account for features, and running feature selection, for both the training and testing data together.
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv \
--addit_test examples/discrete/validation_addit.csv \
--umap_reduce \
--adjust_data \
--adjust_normalize \
--target_features examples/discrete/to_adjust.txt \
--confounders examples/discrete/training_addit_confounder_example.csv \
--confounders_test examples/discrete/validation_addit_confounder_example.csv
Here, the --confounders and --confounders_test flags take in datasets of features that should be accounted for. Each is a .csv file with the ID column and header included and is numeric with no missing data. The ID column is mandatory. The --target_features flag takes in a .txt with a list of features (column names) to adjust.
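One common way to adjust a feature for a confounder is to regress the feature on the confounder and keep the residuals. The sketch below shows that idea for a single confounder using ordinary least squares in plain Python. Whether GenoML adjusts in exactly this way is not stated in this README, so treat the function name, the single-confounder restriction, and the normalization step as assumptions.

```python
import statistics

def adjust_for_confounder(feature, confounder, normalize=False):
    """Residuals of a simple linear regression of feature on confounder.

    Hypothetical illustration of residual-based adjustment, not GenoML's code.
    """
    mean_x = statistics.mean(confounder)
    mean_y = statistics.mean(feature)
    sxx = sum((x - mean_x) ** 2 for x in confounder)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confounder, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(confounder, feature)]
    if normalize:
        # Residuals already have mean zero, so dividing by the std dev z-scales them.
        sd = statistics.pstdev(residuals)
        residuals = [r / sd for r in residuals]
    return residuals

# Toy data: the feature is driven by a principal component plus a small independent signal.
pc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
signal = [0.5, -0.5, 0.0, -0.5, 0.5]
feature = [10.0 + 3.0 * x + s for x, s in zip(pc1, signal)]

adjusted = adjust_for_confounder(feature, pc1)
print([round(v, 6) for v in adjusted])
```

In this toy case the adjusted values recover the independent signal, because the part of the feature explained by the confounder has been removed.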
GenoML allows you to munge your testing data separately from your training data as well using the harmonization feature. This is particularly helpful if you would like to apply a model pre-trained elsewhere on your own datasets. This will apply the same preprocessing and normalization parameters that were used during the original munging step to ensure your datasets are consistent with the original model inputs.
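The key idea is that scaling parameters are learned once, on the training data, and then reused unchanged on any new dataset. The sketch below illustrates this in plain Python; `fit_scaler` and `apply_scaler` are hypothetical names, and the sketch assumes zero-variance columns were already removed during munging.

```python
import statistics

def fit_scaler(train_columns):
    """Record the mean and std dev of each training feature."""
    return {
        name: (statistics.mean(values), statistics.pstdev(values))
        for name, values in train_columns.items()
    }

def apply_scaler(params, new_columns):
    """Scale new data with the stored training parameters (never refit on new data)."""
    missing = [name for name in params if name not in new_columns]
    if missing:
        raise KeyError(f"features missing from new data: {missing}")
    return {
        name: [(v - mean) / std for v in new_columns[name]]
        for name, (mean, std) in params.items()
    }

# Toy training data and a new cohort measured on the same features.
train = {"age": [40.0, 50.0, 60.0], "bmi": [20.0, 25.0, 30.0]}
new_cohort = {"age": [50.0, 70.0], "bmi": [25.0, 22.0]}

params = fit_scaler(train)
harmonized = apply_scaler(params, new_cohort)
print(harmonized["age"][0])  # a value equal to the training mean maps to 0.0
```

Refitting the scaler on the new cohort would shift its values relative to what the model saw during training, which is why the stored parameters are reused.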
Required arguments for GenoML harmonizing are the following:
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use harmonize.
- --prefix: Where would you like your outputs to be saved?
- --pheno: Where is your phenotype file? This file only has 2 columns: ID in one, and PHENO in the other (0 for controls and 1 for cases when using the discrete module, 0, ..., n-1 when using the multiclass module for n distinct phenotypes, or numeric values when using the continuous module).
Be sure to have your files formatted the same as the examples, key points being:
- Your phenotype file consisting only of the "ID" and "PHENO" columns
- Your sample IDs matching across all files
- Your sample IDs not consisting only of integers (if they do, add a prefix or suffix to all sample IDs so that they are alphanumeric prior to running GenoML)
- Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)
Note: The following examples are for discrete data, but if you substitute continuous or multiclass for discrete in the following commands, you can preprocess your continuous/multiclass data!
If you would like to harmonize just with genotypes (in PLINK binary format), the simplest command is the following:
# Running GenoML harmonization on discrete data using PLINK genotype binary files and a phenotype file
genoml discrete supervised harmonize \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv
Note: You must use the same --prefix that was used for training. This is how GenoML will know where to look for your munged data!
If the training data were adjusted by confounders, you must include a file with the same features to adjust your harmonized data accordingly. You can do this by providing a path to this file using the --confounders flag (see "1. Munging with GenoML" for further explanation) as follows:
# Running GenoML harmonization on discrete data using PLINK genotype binary files, a phenotype file, and a confounders file
genoml discrete supervised harmonize \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv \
--confounders examples/discrete/validation_addit_confounder_example.csv
Machine learning models require that your datasets include all of the features that were used to train the model. Because of this, we (and we cannot emphasize this enough) STRONGLY recommend that your harmonization dataset include every feature used in the model. However, if for some reason this is not possible and you would like to test a pre-trained model on your data anyway, we provide the option of adding each missing column in its entirety to your harmonization dataset. This will take the average value from the training dataset for each missing feature (mean or median, as determined by --impute) and use that same value for each of your participants. You may do so using the --force_impute flag as follows:
# Running GenoML harmonization on discrete data using PLINK genotype binary files and a phenotype file, while imputing any missing columns (ie, if an addit file was used during training and is not present for the harmonization participants).
genoml discrete supervised harmonize \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv \
--force_impute
Training with GenoML pits a number of different algorithms against each other and outputs the best algorithm based on a metric that can be changed using the --metric_max flag (default is AUC).
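To show what picking the best algorithm by a metric means in practice, the sketch below computes sensitivity, specificity, and balanced accuracy from binary predictions and picks the winner among two candidates. The model names and predictions are invented for illustration, and the code uses the standard textbook definitions rather than GenoML's internals.

```python
def discrete_metrics(y_true, y_pred):
    """Sensitivity, specificity, and balanced accuracy for binary labels (0/1).

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Balanced_Accuracy": (sensitivity + specificity) / 2,
    }

def pick_best(results, metric):
    """Name of the candidate with the highest value for the chosen metric."""
    return max(results, key=lambda name: results[name][metric])

# Hypothetical withheld labels and predictions from two competing models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 0, 1, 1, 0, 0],
    "model_b": [1, 1, 0, 0, 0, 0, 0, 0],
}
results = {name: discrete_metrics(y_true, pred) for name, pred in predictions.items()}

for metric in ("Sensitivity", "Specificity", "Balanced_Accuracy"):
    print(metric, "->", pick_best(results, metric))
```

The two toy models trade sensitivity against specificity, so the winner changes with the metric. This is why the choice of metric should reflect what matters for your study.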
Required arguments for GenoML training are the following:
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use train.
- --prefix: Where would you like your outputs to be saved?
The most basic command to train your model looks like the following:
# Running GenoML supervised training after munging on discrete data
genoml discrete supervised train \
--prefix outputs
Note: You must use the same --prefix that was used for munging. This is how GenoML will know where to look for the train_dataset.h5 file with your munged data!
If you would like to determine the best competing algorithm by something other than the AUC, you can do so by changing the --metric_max flag (Options include AUC, Balanced_Accuracy, Sensitivity, and Specificity for discrete and multiclass datasets, or Explained_Variance, Mean_Squared_Error, Median_Absolute_Error, and R-Squared_Error for continuous datasets):
# Running GenoML supervised training after munging on discrete data and specifying Sensitivity as the metric to optimize
genoml discrete supervised train \
--prefix outputs \
--metric_max Sensitivity
Tuning with GenoML applies fine-tuning with cross-validation using the trained model as a starting point to find the best set of hyperparameters for your datasets.
Required arguments for GenoML tuning are the following:
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use tune.
- --prefix: Where would you like your outputs to be saved?
The most basic command to tune your model looks like the following:
# Running GenoML supervised tuning after munging and training on discrete data
genoml discrete supervised tune \
--prefix outputs
Note: You must use the same --prefix that was used for training. This is how GenoML will know where to look for the train_dataset.h5 file with your munged data!
To change the number of iterations the tuning process goes through, modify --max_tune (default is 50); to change the number of cross-validation folds, modify --n_cv (default is 5). The command would look like this:
# Running GenoML supervised tuning after munging and training on discrete data, modifying the number of iterations and cross-validations
genoml discrete supervised tune \
--prefix outputs \
--max_tune 10 \
--n_cv 3
If you are interested in tuning on a metric other than AUC (the default), you can modify --metric_tune (options include AUC and Balanced_Accuracy for discrete datasets, AUC for multiclass datasets, or Explained_Variance, Mean_Squared_Error, Median_Absolute_Error, and R-Squared_Error for continuous datasets) by doing the following:
# Running GenoML supervised tuning after munging and training on discrete data, modifying the metric to tune by
genoml discrete supervised tune \
--prefix outputs \
--metric_tune Balanced_Accuracy
Testing/validation with GenoML applies your fully-tuned model on a new dataset to evaluate how well its performance generalizes beyond data it was trained on.
Required arguments for GenoML testing are the following:
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use test.
- --prefix: Where would you like your outputs to be saved?
# Running GenoML test
genoml discrete supervised test \
--prefix outputs
A step-by-step guide covering munging, training, tuning, and testing is listed below:
# 0. MUNGE THE REFERENCE DATASET
genoml discrete supervised munge \
--prefix outputs \
--pheno examples/discrete/training_pheno.csv \
--geno examples/discrete/training \
--addit examples/discrete/training_addit.csv \
--pheno_test examples/discrete/validation_pheno.csv \
--geno_test examples/discrete/validation \
--addit_test examples/discrete/validation_addit.csv \
--r2_cutoff 0.3 \
--impute mean \
--vif 10 \
--vif_iter 1 \
--gwas examples/discrete/example_GWAS.csv \
--gwas examples/discrete/example_GWAS_2.csv \
--p 0.05 \
--feature_selection 50 \
--adjust_data \
--adjust_normalize \
--umap_reduce \
--confounders examples/discrete/training_addit_confounder_example.csv \
--confounders_test examples/discrete/validation_addit_confounder_example.csv \
--target_features examples/discrete/to_adjust.txt \
--verbose
# Files made:
# outputs/log.txt
# outputs/Munge/approx_feature_importance.txt
# outputs/Munge/list_features.txt
# outputs/Munge/params.pkl
# outputs/Munge/p_threshold_variants.tab
# outputs/Munge/test_dataset.h5
# outputs/Munge/train_dataset.h5
# outputs/Munge/umap_clustering.joblib
# outputs/Munge/umap_data_reduction_test.txt
# outputs/Munge/umap_data_reduction_train.txt
# outputs/Munge/umap_plot_test.png
# outputs/Munge/umap_plot_train.png
# outputs/Munge/variants_and_alleles.tab
# outputs/Munge/variants.txt
# 1. TRAIN THE REFERENCE MODEL
genoml discrete supervised train \
--prefix outputs \
--metric_max Balanced_Accuracy
# Files made:
# outputs/model.joblib
# outputs/algorithm.txt
# outputs/Train/precision_recall.png
# outputs/Train/predictions.txt
# outputs/Train/probabilities.png
# outputs/Train/roc.png
# outputs/Train/train_predictions.txt
# outputs/Train/withheld_performance_metrics.txt
# Files updated:
# outputs/log.txt
# 2. OPTIONAL: TUNING YOUR REFERENCE MODEL
genoml discrete supervised tune \
--prefix outputs \
--max_tune 10 \
--n_cv 3 \
--metric_tune Balanced_Accuracy
# Files made:
# outputs/Tune/cv_summary.txt
# outputs/Tune/precision_recall.png
# outputs/Tune/predictions.txt
# outputs/Tune/probabilities.png
# outputs/Tune/roc.png
# outputs/Tune/tuning_summary.txt
# Files updated:
# outputs/model.joblib
# outputs/log.txt
# 3. TEST TUNED MODEL ON UNSEEN DATA
genoml discrete supervised test \
--prefix outputs
# Files made:
# outputs/Test/performance_metrics.txt
# outputs/Test/precision_recall.png
# outputs/Test/predictions.txt
# outputs/Test/probabilities.png
# outputs/Test/roc.png
# Files updated:
# outputs/log.txt
UNDER ACTIVE DEVELOPMENT
Planned experimental features include, but are not limited to:
- Support for unsupervised learning models
- Multilabel prediction
- GWAS QC and Pipeline
- Network analyses
- Multi-omic munging
- Meta-learning
- Federated learning
- Cross-silo checks for genetic duplicates
- Outlier detection
- ...?
