PheMART: Phenotypic Prediction of Missense Variants

This repository hosts the official implementation of PheMART, a method that predicts the phenotypic effects of missense variants (MV) via deep contrastive learning.
All the source data will be published on: https://doi.org/10.6084/m9.figshare.26036227 and https://doi.org/10.5281/zenodo.13138603.
We offer visualizations of high-confidence phenotypic predictions at: https://shiny.parse-health.org/PheMART/. These visualizations are categorized by phenotypes, genes and protein domains.

Introduction

PheMART is a computational framework designed for predict the clinical phenotypic effects of missense variants. Users can either:

Use a pre-trained model for quick inference.
Fine-tune or train the model using their own dataset.

This README provides step-by-step instructions for setting up, running, and interpreting the results.

System Requirements

To ensure smooth execution, we recommend the following system specifications:

Operating System: Ubuntu 20.04
Python Version: 3.7+
GPU (for training): NVIDIA GPU with at least 16GB VRAM
CUDA: 11.2+

Installation

1. Clone the Repository

git clone https://github.com/celehs/PheMART.git
cd PheMART

2. Create a Virtual Environment (Optional but Recommended)

conda create -n phemart_env python=3.7.4
conda activate phemart_env

3. Install Dependencies

pip install -r requirements.txt

Input Data

PheMART requires different input datasets depending on the use case.

1. Using the Pre-trained Model

Required Input: A list of missense variants in CSV format and a file containing the variant embeddings with *Numpy array format. Each row represents the embedding vector of a variant.

Example format (CSV):

variants
NM_002074.5(GNB1):c.230G>A (p.Gly77Asp)
NM_022787.4(NMNAT1):c.205A>G (p.Met69Val)

2. Fine-Tuning or Training a New Model

Variant Embeddings: Precomputed embeddings for missense variants.
Phenotype Embeddings: Precomputed phenotype representation. The LLM phenotype embeddings have been provided and the EHR embedding will be provided upon request.

Training Labels: Variant-phenotype annotations in CSV format.

Example format (CSV):

variant_id,gene,phenotype_CUI
NM_002074.5(GNB1):c.388G>A (p.Glu130Lys),GNB1,C3276355
NM_003036.4(SKI):c.68A>C (p.Gln23Pro),SKI,C1321551

Note: We also provide scripts to preprocess user-provided patient-level data for generating EHR embeddings.

Usage

Running Inference with Pre-trained Model

To run inference using our pre-trained model:

python predict.py --file_snp_prediction variants.csv  --dirr_results_main  result/ --dirr_pretrained_model data/model_pretrained/

Arguments

--file_snp_prediction: File containing the list of variants to predict (CSV).
--dirr_results_main: Path to save the predictions.
--dirr_pretrained_model: Path to the pre-trained model.

Fine-Tuning or Training a New Model

To fine-tune or train a model using your own dataset:

bash submit.sh --train --file_annotations /path/to/annotations --file_snps_labeled  /path/to/list of labeled variants  --file_snps_labeled_embedding  /path/to/embeddings of labeled variants  --dirr_results_main /path/to/results  --dirr_save_model  /path/to/saved model

Arguments

--train: Flag to indicate training mode.
--file_annotations: File containing the annotated variant-phenotype pairs.
--file_snps_labeled: File containing the list of annotated variants.
--file_snps_labeled_embedding: File containing the embedding vectors of the annotated variants.
--dirr_results_main: Path to save result files.
--dirr_save_model: Path to save the trained model.

Expected Outputs

PheMART generates different outputs depending on the mode of operation.

1. Inference Mode (Using Pre-trained Model)

variant_ID.csv: For each variant, the result file contains the scores to the 4,179 phenotypes;
For example, in rs5453.csv:
```
C1321551, 0.532
C1535926, 0.125
C1321557, 0.021
```
- Higher scores indicate a higher likelihood of pathogenicity to the phenotype.

2. Training Mode

results_validations.txt: Training loss,hyperparameters, model performance on the validation dataset.

Computation Steps

The submit.sh script automates the following computational steps:

Data Preprocessing
- Get variant embeddings.
- Get phenotype embeddings.
- Split data into training and validation sets.
Model Training / Fine-Tuning
- Loads the dataset and initializes the neural network.
- Runs training with mini-batch gradient descent.
- Performs validation and saves the models.
Prediction and Evaluation
- If in inference mode, loads the trained model and predicts the variant's relevance to all the phenotypes investigated.
- Saves predictions.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
data		data
README.md		README.md
model.py		model.py
predict.py		predict.py
requirements.txt		requirements.txt
setup.sh		setup.sh
snps_prediction.csv		snps_prediction.csv
submit.sh		submit.sh
utilization.py		utilization.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PheMART: Phenotypic Prediction of Missense Variants

Table of Contents

Introduction

System Requirements

Installation

1. Clone the Repository

2. Create a Virtual Environment (Optional but Recommended)

3. Install Dependencies

Input Data

1. Using the Pre-trained Model

2. Fine-Tuning or Training a New Model

Usage

Running Inference with Pre-trained Model

Arguments

Fine-Tuning or Training a New Model

Arguments

Expected Outputs

1. Inference Mode (Using Pre-trained Model)

2. Training Mode

Computation Steps

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PheMART: Phenotypic Prediction of Missense Variants

Table of Contents

Introduction

System Requirements

Installation

1. Clone the Repository

2. Create a Virtual Environment (Optional but Recommended)

3. Install Dependencies

Input Data

1. Using the Pre-trained Model

2. Fine-Tuning or Training a New Model

Usage

Running Inference with Pre-trained Model

Arguments

Fine-Tuning or Training a New Model

Arguments

Expected Outputs

1. Inference Mode (Using Pre-trained Model)

2. Training Mode

Computation Steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages