Skip to content
/ genoml2 Public

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data

License

Notifications You must be signed in to change notification settings

GenoML/genoml2

Repository files navigation

GenoML

Downloads

Updated 17 June 2025: Latest Release on pip! v1.5.4

How to Get Started with GenoML

Introduction

GenoML (Genomics + Machine Learning) is an automated Machine Learning (autoML) for genomics data. In general, use a Linux or Mac with Python 3.9-3.12 for best results. This repository and pip package are under active development!

This README is a brief look into how to structure arguments and what arguments are available at each phase for the GenoML CLI.

If you are using GenoML for your own work, please cite the following papers:

  • Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Saffo, D., Sargent, L., ... & Faghri, F. (2021). GenoML: Automated Machine Learning for Genomics. arXiv preprint arXiv:2103.03221
  • Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Sargent, L., Dadu, A., ... & Nalls, M. A. (2021). Multi-Modality Machine Learning Predicting Parkinson’s Disease. bioRxiv.

Installing + Downloading Example Data

  • Install this repository directly from GitHub (from source; master branch)

git clone https://github.com/GenoML/genoml2.git

  • Install using pip or upgrade using pip

pip install genoml2

OR

pip install genoml2 --upgrade

  • To install the examples/ directory (~315 KB), you can use SVN (pre-installed on most Macs)

svn export https://github.com/GenoML/genoml2.git/trunk/examples

Note: When you pip install this package, the examples/ folder is also downloaded! However, if you still want to download the directory and SVN is not pre-installed, you can download it via Homebrew if you have that installed using brew install svn

CHANGELOG

  • 16-JUN-2025: Addition of multiclass prediction functionality using the same base models that are used for the discrete module. We have additionally restructured the munging functionality to allow users to process training and testing data all at once to ensure they are munged under the same conditions, as well as including multiple GWAS summary stats files for SNP filtering. We also upgraded from plink1.9 to plink2 for genomic data processing. Finally, we have added a log file in the output directory to facilitate full reproducbility of results. README updated to reflect these changes.
  • 8-OCT-2024: Big changes to output file structure, so now output files go in subdirectories named for each step, and prefixes are not required. README updated to reflect these changes.

Table of Contents

0. [OPTIONAL] How to Set Up a Virtual Environment via Conda

You can create a virtual environment to run GenoML, if you prefer. If you already have the Anaconda Distribution, this is fairly simple.

To create and activate a virtual environment:

# To create a virtual environment
conda create -n GenoML python=3.12

# To activate a virtual environment
conda activate GenoML

# To install requirements via pip 
pip install -r requirements.txt
    # If issues installing xgboost from requirements - (3 options)
        # Option 1: use Homebrew to 
            # xcode-select --install
            # brew install gcc@7
        # or Option 2: conda install -c conda-forge xgboost 
        # or Option 3: pip install xgboost==2.0.3
    # If issues installing umap 
        # pip install umap-learn
    # If issues installing pytables/dependency issue 
        # conda install -c conda-forge pytables
    # If issues with blosc 
        # conda install -c conda-forge tables blosc

## MISC
# To deactivate the virtual environment
# conda deactivate GenoML

# To delete your virtual environment
# conda env remove -n GenoML

To install the GenoML in the user's path in a virtual environment, you can do the following:

# Install the package at this path
pip install .

# MISC
	# To save out the environment requirements to a .txt file
# pip freeze > requirements.txt

	# Removing a conda virtualenv
# conda remove --name GenoML --all 

Note: The following examples are for discrete data, but if you substitute following commands with continuous or multiclass instead of discrete, you can munge, harmonize, train, tune, and test your continuous/multiclass data!

1. Munging with GenoML

Munging with GenoML will, at minimum, do the following:

  • Prune your genotypes using PLINK v2 (if --geno flag is used)
  • Impute per column using median or mean (can be changed with the --impute flag)
  • Z-scaling of features and removing columns with a std dev = 0

Required arguments for GenoML munging are --prefix and --pheno

  • data : Are the data continuous, discrete, or multiclass?
  • method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
  • mode: would you like to munge, harmonize, train, tune, or test your model? Here, you will use munge.
  • --prefix : Where would you like your outputs to be saved?
  • --pheno : Where is your phenotype file? This file only has 2 columns, ID in one, and PHENO in the other (0 for controls and 1 for cases when using the discrete module, 0, ..., n-1 when using the multiclass module for n distinct phenotypes, or numeric values when using the continuous module).

Be sure to have your files formatted the same as the examples, key points being:

  • Your phenotype file consisting only of the "ID" and "PHENO" columns
  • Your sample IDs matching across all files
  • Your sample IDs not consisting with only integers (add a prefix or suffix to all sample IDs ensuring they are alphanumeric if this is the case prior to running GenoML)
  • Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)

If you would like to munge just with genotypes (in PLINK binary format), the simplest command is the following:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv

If you would like a more detailed log printed to your console, you may use the --verbose flag as follows:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file with a detailed log printed to the console

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--verbose

Note: The --verbose flag may be used like this for any GenoML command, not just munging.

To properly evaluate your model, it must be applied to a dataset it's never seen before (testing data). If you have both training and testing data, GenoML allows you to munge them together upfront. To do this with your training and testing phenotype/genotype data, the simplest command is the following:

# Running GenoML munging on discrete data using PLINK genotype binary files and phenotype files for both the training and testing datasets.

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv

If you would like to control the pruning stringency in genotypes:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--r2_cutoff 0.3 \
--pheno examples/discrete/training_pheno.csv

You can choose to skip pruning your SNPs at this stage by including the --skip_prune flag

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--skip_prune \
--pheno examples/discrete/training_pheno.csv

You can choose to impute on mean or median by modifying the --impute flag, like so (default is median):

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file and specifying impute

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--impute mean

If you suspect collinear variables, and think this will be a problem for training the model moving forward, you can use variance inflation factor (VIF) filtering:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while using VIF to remove multicollinearity 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--vif 5 \
--vif_iter 1
  • The --vif flag specifies the VIF threshold you would like to use (5 is recommended)
  • The number of iterations you'd like to run can be modified with the --vif_iter flag (if you have or anticipate many collinear variables, it's a good idea to increase the iterations)

Well, what if you had GWAS summary statistics handy, and would like to just use the same SNPs outlined in that file? You can do so by running the following:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv

Note: When using the GWAS flag, the PLINK binaries will be pruned to include matching SNPs to the GWAS file.

And if you have more than one GWAS summary statistics file, we support that too! Just use the same --gwas flag for each of the files you would like to include, as follows:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and two GWAS summary statistics files

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv \
--gwas examples/discrete/example_GWAS_2.csv

Note: This is particularly helpful when using the multiclass module when you have multiple phenotypes of interest and would like to include SNPs that are relevant for each phenotype.

...and if you wanted to add a p-value cut-off...

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file with a p-value cut-off 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv \
--p 0.01

Do you have additional data you would like to incorporate? Perhaps clinical, demographic, or transcriptomics data? If coded and all numerical, these can be added as an --addit file by doing the following:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv

You also have the option of not using PLINK binary files if you would like to just preprocess (and then, later train) on a phenotype and addit file by doing the following:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file

genoml discrete supervised munge \
--prefix outputs \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv

Are you interested in selecting and ranking your features? If so, you can use the --feature_selection flag and specify like so...:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and running feature selection 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--feature_selection 50

The --feature_selection flag uses extraTrees (classifier for discrete data; regressor for continuous data) to output a *.approx_feature_importance.txt file with the features most contributing to your model at the top.

Do you have additional covariates and confounders you would like to adjust for in the munging step prior to training your model and/or would like to reduce your data? To adjust, use the --adjust_data flag with the following necessary flags:

  • --target_features: A .txt file, one column, with a list of features to adjust (no header). These should correspond to features in the munged dataset
  • --confounders: A .csv of confounders to adjust for with ID column and header. Numeric, with no missing data and the ID column is mandatory (this can be PCs, for example)

You may also include the following optional flag:

  • --adjust_normalize: Would you like to normalize your final adjusted data?

To reduce your data prior to adjusting, use the --umap_reduce flag. This flag will also prompt you for if you want to adjust your data, normalize, and what your target features and confounders might be. We use the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) to reduce your data into 2D, adjust, and export a plot and an adjusted dataframe moving forward. This can be done by running the following:

# Running GenoML munging on discreate data using PLINK binary files, a phenotype file, using UMAP to reduce dimensions and account for features, and running feature selection

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--umap_reduce \
--adjust_data \
--adjust_normalize \
--target_features examples/discrete/to_adjust.txt \
--confounders examples/discrete/training_addit_confounder_example.csv 

And if you are munging your training and testing data together, you must include a confounders file for your test dataset as well using the --confounders_test flag:

# Running GenoML munging on discreate data using PLINK binary files, a phenotype file, using UMAP to reduce dimensions and account for features, and running feature selection, for both the training and testing data together.

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv \
--addit_test examples/discrete/validation_addit.csv \
--umap_reduce \
--adjust_data \
--adjust_normalize \
--target_features examples/discrete/to_adjust.txt \
--confounders examples/discrete/training_addit_confounder_example.csv \
--confounders_test examples/discrete/validation_addit_confounder_example.csv 

Here, the --confounders and --confounders_test flags take in datasets of features that should be accounted for. This is a .csv file with the ID column and header included and is numeric with no missing data. The ID column is mandatory. The --target_features flag takes in a .txt with a list of features (column names) you are adjusting for.

1b. Harmonizing with GenoML

GenoML allows you to munge your testing data separately from your training data as well using the harmonization feature. This is particularly helpful if you would like to apply a model pre-trained elsewhere on your own datasets. This will apply the same preprocessing and normalization parameters that were used during the original munging step to ensure your datasets are consistent with the original model inputs.

Required arguments for GenoML harmonizing are the following:

  • Are the data continuous, discrete, or multiclass?
  • method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
  • mode: would you like to munge, harmonize, train, tune, or test your model? Here, you will use harmonize.
  • --prefix : Where would you like your outputs to be saved?
  • --pheno : Where is your phenotype file? This file only has 2 columns: ID in one, and PHENO in the other (0 for controls and 1 for cases when using the discrete module, 0, ..., n-1 when using the multiclass module for n distinct phenotypes, or numeric values when using the continuous module).

Be sure to have your files formatted the same as the examples, key points being:

  • Your phenotype file consisting only of the "ID" and "PHENO" columns
  • Your sample IDs matching across all files
  • Your sample IDs not consisting with only integers (add a prefix or suffix to all sample IDs ensuring they are alphanumeric if this is the case prior to running GenoML)
  • Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)

Note: The following examples are for discrete data, but if you substitute following commands with continuous or multiclass instead of discrete, you can preprocess your continuous/multiclass data!

If you would like to harmonize just with genotypes (in PLINK binary format), the simplest command is the following:

# Running GenoML harmonization on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv

Note: You must use the same --prefix that was used for training. This is how GenoML will know where to look for your munged data!

If the training data were adjusted by confounders, you must include a file with the same features to adjust your harmonized data accordingly. You can do this by providing a path to this file using the --confounders flag (see "1. Munging with GenoML" for further explanation) as follows:

# Running GenoML harmonization on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv \
--confounders examples/discrete/validation_addit_confounder_example.csv

Machine learning models require that your datasets include all of the features that were used to train the model. Because of this, we (and we cannot emphasize this enough) STRONGLY recommend that your harmonization dataset include every feature used in the model. However, if for some reason this is not possible and you would like to test a pre-trained model on your data anyways, we provide the option of adding the entire column to your harmonization dataset. This will take the average value from the training dataset for each feature (as determined from --impute) and use that same value for each of your participants. You may do so using the --force_impute flag as follows:

# Running GenoML harmonization on discrete data using PLINK genotype binary files and a phenotype file, while imputing any missing columns (ie, if an addit file was used during training and is not present for the harmonization participants).

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/validation \
--pheno examples/discrete/validation_pheno.csv \
--force_impute

2. Training with GenoML

Training with GenoML competes a number of different algorithms and outputs the best algorithm based on a specific metric that can be tweaked using the --metric_max flag (default is AUC).

Required arguments for GenoML training are the following:

  • Are the data continuous, discrete, or multiclass?
  • method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
  • mode: would you like to munge, harmonize, train, tune, or test your model? Here, you will use train.
  • --prefix : Where would you like your outputs to be saved?

The most basic command to train your model looks like the following:

# Running GenoML supervised training after munging on discrete data

genoml discrete supervised train \
--prefix outputs

Note: You must use the same --prefix that was used for training. This is how GenoML will know where to look for the train_dataset.h5 file with your munged data!

If you would like to determine the best competing algorithm by something other than the AUC, you can do so by changing the --metric_max flag (Options include AUC, Balanced_Accuracy, Sensitivity, and Specificity for discrete and multiclass datasets, or Explained_Variance, Mean_Squared_Error, Median_Absolute_Error, and R-Squared_Error for continuous datasets):

# Running GenoML supervised training after munging on discrete data and specifying Sensitivity as the metric to optimize

genoml discrete supervised train \
--prefix outputs \
--metric_max Sensitivity

3. Tuning with GenoML

Tuning with GenoML applies fine-tuning with cross-validation using the trained model as a starting point to find the best set of hyperparameters for your datasets.

Required arguments for GenoML training are the following:

  • Are the data continuous, discrete, or multiclass?
  • method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
  • mode: would you like to munge, harmonize, train, tune, or test your model? Here, you will use tune.
  • --prefix : Where would you like your outputs to be saved?

The most basic command to tune your model looks like the following:

# Running GenoML supervised tuning after munging and training on discrete data

genoml discrete supervised tune \
--prefix outputs

Note: You must use the same --prefix that was used for training. This is how GenoML will know where to look for the train_dataset.h5 file with your munged data!

If you are interested in changing the number of iterations the tuning process goes through by modifying --max_tune (default is 50), or the number of cross-validations by modifying --n_cv (default is 5), this is what the command would look like:

# Running GenoML supervised tuning after munging and training on discrete data, modifying the number of iterations and cross-validations 

genoml discrete supervised tune \
--prefix outputs \
--max_tune 10 \
--n_cv 3

If you are interested in tuning on another metric other than AUC (default is AUC), you can modify --metric_tune (Options include AUC and Balanced_Accuracy for discrete datasets, AUC for multiclass datasets, or Explained_Variance, Mean_Squared_Error, Median_Absolute_Error, and R-Squared_Error for continuous datasets) by doing the following:

# Running GenoML supervised tuning after munging and training on discrete data, modifying the metric to tune by

genoml discrete supervised tune \
--prefix outputs \
--metric_tune Balanced_Accuracy

4. Testing/Validation with GenoML

Testing/validation with GenoML applies your fully-tuned model on a new dataset to evaluate how well its performance generalizes beyond data it was trained on.

Required arguments for GenoML training are the following:

  • Are the data continuous, discrete, or multiclass?
  • method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
  • mode: would you like to munge, harmonize, train, tune, or test your model? Here, you will use test.
  • --prefix : Where would you like your outputs to be saved?
# Running GenoML test

genoml discrete supervised test \
--prefix outputs

5. Full pipeline example

A step-by-step guide on how to achieve this is listed below:

# 0. MUNGE THE REFERENCE DATASET
genoml discrete supervised munge \
--prefix outputs \
--pheno examples/discrete/training_pheno.csv \
--geno examples/discrete/training \
--addit examples/discrete/training_addit.csv \
--pheno_test examples/discrete/validation_pheno.csv \
--geno_test examples/discrete/validation \
--addit_test examples/discrete/validation_addit.csv \
--r2_cutoff 0.3 \
--impute mean \
--vif 10 \
--vif_iter 1 \
--gwas examples/discrete/example_GWAS.csv \
--gwas examples/discrete/example_GWAS_2.csv \
--p 0.05 \
--feature_selection 50 \
--adjust_data \
--adjust_normalize \
--umap_reduce \
--confounders examples/discrete/training_addit_confounder_example.csv \
--confounders_test examples/discrete/validation_addit_confounder_example.csv \
--target_features examples/discrete/to_adjust.txt \
--verbose
# Files made: 
    # outputs/log.txt
    # outputs/Munge/approx_feature_importance.txt
    # outputs/Munge/list_features.txt
    # outputs/Munge/params.pkl
    # outputs/Munge/p_threshold_variants.tab
    # outputs/Munge/test_dataset.h5
    # outputs/Munge/train_dataset.h5
    # outputs/Munge/umap_clustering.joblib
    # outputs/Munge/umap_data_reduction_test.txt
    # outputs/Munge/umap_data_reduction_train.txt
    # outputs/Munge/umap_plot_test.png
    # outputs/Munge/umap_plot_train.png
    # outputs/Munge/variants_and_alleles.tab
    # outputs/Munge/variants.txt

# 1. TRAIN THE REFERENCE MODEL
genoml discrete supervised train \
--prefix outputs \
--metric_max Balanced_Accuracy
# Files made: 
    # outputs/model.joblib
    # outputs/algorithm.txt
    # outputs/Train/precision_recall.png
    # outputs/Train/predictions.txt
    # outputs/Train/probabilities.png
    # outputs/Train/roc.png
    # outputs/Train/train_predictions.txt
    # outputs/Train/withheld_performance_metrics.txt
# Files updated: 
    # outputs/log.txt

# 2. OPTIONAL: TUNING YOUR REFERENCE MODEL
genoml discrete supervised tune \
--prefix outputs \
--max_tune 10 \
--n_cv 3 \
--metric_tune Balanced_Accuracy
# Files made: 
    # outputs/Tune/cv_summary.txt
    # outputs/Tune/precision_recall.png
    # outputs/Tune/predictions.txt
    # outputs/Tune/probabilities.png
    # outputs/Tune/roc.png
    # outputs/Tune/tuning_summary.txt
# Files updated: 
    # outputs/model.joblib
    # outputs/log.txt

# 3. TEST TUNED MODEL ON UNSEEN DATA
genoml discrete supervised test \
--prefix outputs
# Files made: 
    # outputs/Test/performance_metrics.txt
    # outputs/Test/precision_recall.png
    # outputs/Test/predictions.txt
    # outputs/Test/probabilities.png
    # outputs/Test/roc.png
# Files updated: 
    # outputs/log.txt

6. Experimental Features

UNDER ACTIVE DEVELOPMENT

Planned experimental features include, but are not limited to:

  • Support for unsupervised learning models
  • Multiclass and multilabel prediction
  • GWAS QC and Pipeline
  • Network analyses
  • Multi-omic munging
  • Meta-learning
  • Federated learning
  • Cross-silo checks for genetic duplicates
  • Outlier detection
  • ...?

About

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 11

Languages