This repository hosts the official implementation of PheMART, a method that predicts the phenotypic effects of missense variants (MV) via deep contrastive learning.
All the source data will be published on: https://doi.org/10.6084/m9.figshare.26036227 and https://doi.org/10.5281/zenodo.13138603.
We offer visualizations of high-confidence phenotypic predictions at: https://shiny.parse-health.org/PheMART/. These visualizations are categorized by phenotypes, genes and protein domains.
PheMART is a computational framework designed to predict the clinical phenotypic effects of missense variants. Users can either:
- Use a pre-trained model for quick inference.
- Fine-tune or train the model using their own dataset.
This README provides step-by-step instructions for setting up, running, and interpreting the results.
To ensure smooth execution, we recommend the following system specifications:
- Operating System: Ubuntu 20.04
- Python Version: 3.7+
- GPU (for training): NVIDIA GPU with at least 16GB VRAM
- CUDA: 11.2+
To install PheMART, clone the repository, create a conda environment, and install the dependencies:

    git clone https://github.com/celehs/PheMART.git
    cd PheMART
    conda create -n phemart_env python=3.7.4
    conda activate phemart_env
    pip install -r requirements.txt

PheMART requires different input datasets depending on the use case.
- Required Input: A list of missense variants in CSV format and a file containing the variant embeddings in NumPy array format, where each row is the embedding vector of one variant.
- Example format (CSV):

      variants
      NM_002074.5(GNB1):c.230G>A (p.Gly77Asp)
      NM_022787.4(NMNAT1):c.205A>G (p.Met69Val)

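The variant names in the example follow the pattern `transcript(gene):cDNA change (protein change)`. The snippet below is a minimal sketch of how a variant list could be checked against that pattern before running PheMART. The regular expression and the helpers `parse_variant` and `load_variants` are assumptions made for illustration and are not part of the PheMART code; the header name `variants` is taken from the example above.

```python
import csv
import io
import re

# Pattern inferred from the example names, e.g.
# "NM_002074.5(GNB1):c.230G>A (p.Gly77Asp)". This is an assumption,
# not a full HGVS parser.
VARIANT_RE = re.compile(
    r"^(?P<transcript>[A-Z]{2}_\d+\.\d+)"  # transcript accession and version
    r"\((?P<gene>[^)]+)\)"                 # gene symbol in parentheses
    r":(?P<cdna>c\.\S+)"                   # cDNA-level change
    r"\s+\((?P<protein>p\.[^)]+)\)$"       # protein-level change
)


def parse_variant(name):
    """Return the parts of a variant name as a dict, or None if it does not match."""
    match = VARIANT_RE.match(name.strip())
    return match.groupdict() if match else None


def load_variants(handle):
    """Read a one-column CSV with a 'variants' header.

    Returns (well_formed_names, malformed_names).
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None or [cell.strip() for cell in header] != ["variants"]:
        raise ValueError("expected a single header column named 'variants'")
    names, malformed = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        name = row[0].strip()
        (names if parse_variant(name) else malformed).append(name)
    return names, malformed


if __name__ == "__main__":
    example = io.StringIO(
        "variants\n"
        "NM_002074.5(GNB1):c.230G>A (p.Gly77Asp)\n"
        "NM_022787.4(NMNAT1):c.205A>G (p.Met69Val)\n"
    )
    good, bad = load_variants(example)
    print(len(good), "well-formed,", len(bad), "malformed")
```

Because each row of the embedding file corresponds to one variant, the number of well-formed names should match the number of rows in the embedding array.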
- Variant Embeddings: Precomputed embeddings for missense variants.
- Phenotype Embeddings: Precomputed phenotype representations. The LLM-based phenotype embeddings are provided; the EHR-based embeddings are available upon request.
- Training Labels: Variant-phenotype annotations in CSV format.
- Example format (CSV):

      variant_id,gene,phenotype_CUI
      NM_002074.5(GNB1):c.388G>A (p.Glu130Lys),GNB1,C3276355
      NM_003036.4(SKI):c.68A>C (p.Gln23Pro),SKI,C1321551

Note: We also provide scripts to preprocess user-provided patient-level data for generating EHR embeddings.
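As a quick check of a training-label file, the following minimal sketch reads the three columns shown in the example (`variant_id`, `gene`, `phenotype_CUI`) and groups the phenotype CUIs by variant. The helper `load_annotations` is hypothetical and only illustrates the expected layout; it does not reproduce how PheMART loads its labels.

```python
import csv
import io
from collections import defaultdict

# Column names taken from the example format above.
REQUIRED_COLUMNS = ("variant_id", "gene", "phenotype_CUI")


def load_annotations(handle):
    """Group phenotype CUIs by variant from a variant-phenotype CSV."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    phenotypes_by_variant = defaultdict(set)
    for row in reader:
        variant = row["variant_id"].strip()
        phenotypes_by_variant[variant].add(row["phenotype_CUI"].strip())
    return dict(phenotypes_by_variant)


if __name__ == "__main__":
    example = io.StringIO(
        "variant_id,gene,phenotype_CUI\n"
        "NM_002074.5(GNB1):c.388G>A (p.Glu130Lys),GNB1,C3276355\n"
        "NM_003036.4(SKI):c.68A>C (p.Gln23Pro),SKI,C1321551\n"
    )
    for variant, cuis in load_annotations(example).items():
        print(variant, sorted(cuis))
```

A variant annotated with several phenotypes simply appears on several rows, so grouping by `variant_id` gives the full label set per variant.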
To run inference using our pre-trained model:

    python predict.py --file_snp_prediction variants.csv --dirr_results_main result/ --dirr_pretrained_model data/model_pretrained/

- --file_snp_prediction: File containing the list of variants to predict (CSV).
- --dirr_results_main: Path to save the predictions.
- --dirr_pretrained_model: Path to the pre-trained model.

To fine-tune or train a model using your own dataset:

    bash submit.sh --train \
        --file_annotations /path/to/annotations \
        --file_snps_labeled /path/to/list_of_labeled_variants \
        --file_snps_labeled_embedding /path/to/embeddings_of_labeled_variants \
        --dirr_results_main /path/to/results \
        --dirr_save_model /path/to/saved_model

- --train: Flag to indicate training mode.
- --file_annotations: File containing the annotated variant-phenotype pairs.
- --file_snps_labeled: File containing the list of annotated variants.
- --file_snps_labeled_embedding: File containing the embedding vectors of the annotated variants.
- --dirr_results_main: Path to save result files.
- --dirr_save_model: Path to save the trained model.

PheMART generates different outputs depending on the mode of operation.
- variant_ID.csv: For each variant, the result file contains the predicted scores for the 4,179 phenotypes.
  - For example, in rs5453.csv:

        C1321551, 0.532
        C1535926, 0.125
        C1321557, 0.021

  - Higher scores indicate a higher likelihood that the variant is pathogenic for the phenotype.
- results_validations.txt: Training loss, hyperparameters, and model performance on the validation dataset.
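To interpret a per-variant result file, it is often enough to rank the phenotypes by score. The sketch below assumes each line holds one `CUI, score` pair, as in the rs5453.csv example, and that any line whose second field is not a number (such as a header, if one exists) can be skipped. The helper `top_phenotypes` is hypothetical and not part of PheMART.

```python
import csv
import io


def top_phenotypes(handle, k=10):
    """Return the k highest-scoring (CUI, score) pairs from a result file."""
    scores = []
    # skipinitialspace handles the space after the comma in "C1321551, 0.532".
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or incomplete line
        try:
            scores.append((row[0].strip(), float(row[1])))
        except ValueError:
            continue  # non-numeric score, e.g. a header line
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


if __name__ == "__main__":
    example = io.StringIO(
        "C1535926, 0.125\n"
        "C1321551, 0.532\n"
        "C1321557, 0.021\n"
    )
    for cui, score in top_phenotypes(example, k=2):
        print(cui, score)
```

In practice the handle would come from opening a file in the results directory, for example `open("result/rs5453.csv", newline="")`; that path is illustrative only.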
The submit.sh script automates the following computational steps:
- Data Preprocessing
  - Get variant embeddings.
  - Get phenotype embeddings.
  - Split data into training and validation sets.
- Model Training / Fine-Tuning
  - Loads the dataset and initializes the neural network.
  - Runs training with mini-batch gradient descent.
  - Performs validation and saves the models.
- Prediction and Evaluation
  - If in inference mode, loads the trained model and predicts the variant's relevance to all the phenotypes investigated.
  - Saves predictions.
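The splitting and mini-batching steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic inside submit.sh: the README does not say how the split is made, so the sketch holds out whole variants (so that no variant appears in both sets), and the function names, the 20% validation fraction, and the batch size are all hypothetical.

```python
import random


def split_by_variant(pairs, validation_fraction=0.2, seed=0):
    """Split (variant, phenotype) pairs so each variant lands in exactly one set."""
    variants = sorted({variant for variant, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(variants)
    n_validation = int(round(len(variants) * validation_fraction))
    held_out = set(variants[:n_validation])
    train = [pair for pair in pairs if pair[0] not in held_out]
    validation = [pair for pair in pairs if pair[0] in held_out]
    return train, validation


def minibatches(items, batch_size, rng):
    """Yield shuffled mini-batches covering every item once (one epoch)."""
    order = list(items)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


if __name__ == "__main__":
    # Toy variant-phenotype pairs; the identifiers are placeholders.
    pairs = [("v%d" % i, "C%07d" % j) for i in range(5) for j in range(2)]
    train, validation = split_by_variant(pairs)
    print(len(train), "training pairs,", len(validation), "validation pairs")
    for batch in minibatches(train, batch_size=3, rng=random.Random(0)):
        print(len(batch))
```

Holding out whole variants is one common way to avoid leakage between training and validation; a split by gene or by phenotype would follow the same shape with a different grouping key.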