sethsh7/PRSedm

PRSedm

Overview

PRSedm (Polygenic Risk Score Extension for Diabetes Mellitus) is a flexible and extendable open-source package for efficient local and remote (All of Us, UK Biobank, etc.) generation of published Polygenic Risk Scores (PRS) for Diabetes Mellitus (DM) and related cardiometabolic phenotypes.

PRSedm introduces a new parallelized "one-liner" method to generate standardized PRS and partitioned polygenic scores (pPS) for DM that are robust to differences in genotyping method, quality control, and imputation panel.

Updates (v1.3.0)

  • Improved multiallelic SNP handling and fixed related bugs
  • Significant performance improvements from optimized SNP batching and parallelism
  • Simplified command-line interface and argument structure
  • Fixed bugs and syntax issues in the SNP database backend
  • PRS metadata is now fetched automatically alongside the SNP database
  • Per-PRS variant log files are now generated, capturing metrics such as INFO/R², missing variants, and allele frequency
  • Renamed "imputation" feature to "estimate" to distinguish from genotype imputation
  • Added support for custom proxy variant substitution via --proxy

New proxy feature

PRSedm supports optional variant substitution via a user-supplied proxy file (--proxy). Required format (whitespace-delimited):

target_rsid target_contig_id target_position target_effect_allele sub_rsid sub_contig_id sub_position sub_effect_allele
rs12345     chr1             1234567         A                     rs54321  chr1           1234999      A
rs23456     chr2             7654321         G                     rs65432  chr2           7654000      G
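
Before passing a proxy file to --proxy, it can help to check that it matches the layout above. The following is a minimal sketch using only the Python standard library; the helper name `parse_proxy_lines` is hypothetical and not part of PRSedm, and it assumes the header row shown above is present in the file.

```python
# Hypothetical validator for the proxy layout described above (not PRSedm code).
PROXY_COLUMNS = (
    "target_rsid", "target_contig_id", "target_position", "target_effect_allele",
    "sub_rsid", "sub_contig_id", "sub_position", "sub_effect_allele",
)

def parse_proxy_lines(lines):
    """Split whitespace-delimited lines and check them against PROXY_COLUMNS."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows or tuple(rows[0]) != PROXY_COLUMNS:
        raise ValueError("first row must be the eight expected column names")
    parsed = []
    for number, fields in enumerate(rows[1:], start=2):
        if len(fields) != len(PROXY_COLUMNS):
            raise ValueError(f"row {number}: expected 8 fields, got {len(fields)}")
        record = dict(zip(PROXY_COLUMNS, fields))
        # Positions must be integers; int() raises ValueError otherwise.
        record["target_position"] = int(record["target_position"])
        record["sub_position"] = int(record["sub_position"])
        parsed.append(record)
    return parsed

example = """\
target_rsid target_contig_id target_position target_effect_allele sub_rsid sub_contig_id sub_position sub_effect_allele
rs12345     chr1             1234567         A                     rs54321  chr1           1234999      A
rs23456     chr2             7654321         G                     rs65432  chr2           7654000      G
"""
proxies = parse_proxy_lines(example.splitlines())
```

For a file on disk, `parse_proxy_lines(open(path))` would work the same way, since a file object yields its lines.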

Installation

Dependencies

PRSedm requires the following packages:

  • Python (>=3.9), Joblib (>=1.3.2), Pandas (>=2.2.3), Pysam (>=0.22.0), Numpy* (2.x/1.x)

*Build with 1.x when deploying to RAP platforms with 1.x dependencies.

User Installation

PIP: pip install prsedm
Anaconda: conda install sethsh7::prsedm
Build from source: python -m build

Usage

PRSedm can be called from the command line:

prsedm --vcf <path_to_vcf_file> [options]

PRSedm can also be called from Python:

import prsedm
df = prsedm.gen_dm(
    vcf=vcf,
    ... 
)

Options

  • --vcf (required): Path to an indexed VCF or BCF file, or a text file mapping one VCF/BCF per contig.
  • --col: Genotype column to score (default: GT, options: GT for WGS, GP for imputed data).
  • --build: Genome build to use (default: hg38, options: hg19, hg38).
  • --scores (required): Comma-separated list of PRS to generate, e.g., t1dgrs2-luckett25,t2dp-udler18.
  • --estimate (optional): Path to indexed reference VCF/BCF, or text file mapping one VCF/BCF per contig. Used to estimate missing variants and enable normalization when variants are absent.
  • --ntasks (optional): Number of tasks to use (default: 1).
  • --batch-size (optional): Number of variants per batch (default: 5000).
  • --output: Path to save the output file (default: results.csv).
  • --full: Include individual variant scores with PRS name prepended.
  • --getsql: Download or locate the PRS SQL database (variants.db) and metadata JSON (prs_meta.json) and exit.
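
As an illustration of how the options above combine, the sketch below assembles a command line from Python. The helper `build_prsedm_command` and the file name `cohort.vcf.gz` are hypothetical. Treating --full as a switch that takes no value is an assumption based on its description. The sketch only builds and prints the command; it does not run PRSedm.

```python
import shlex

def build_prsedm_command(vcf, scores, build="hg38", col="GT", ntasks=1,
                         output="results.csv", estimate=None, full=False):
    """Assemble an argument list from the options documented above."""
    if build not in ("hg19", "hg38"):
        raise ValueError("build must be hg19 or hg38")
    if col not in ("GT", "GP"):
        raise ValueError("col must be GT or GP")
    cmd = ["prsedm", "--vcf", vcf, "--scores", ",".join(scores),
           "--build", build, "--col", col,
           "--ntasks", str(ntasks), "--output", output]
    if estimate is not None:
        cmd += ["--estimate", estimate]
    if full:
        cmd.append("--full")  # assumed to be a valueless switch
    return cmd

cmd = build_prsedm_command(
    "cohort.vcf.gz",                        # hypothetical input file
    ["t1dgrs2-luckett25", "t2dp-udler18"],  # score flags from the list of PRS
    ntasks=4,
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once the inputs exist.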

Single file per-chromosome loading

For --vcf and --estimate, you can instead supply a single text file that maps each contig to its VCF/BCF file, one pair per line (whitespace-delimited):

chr1   file1.chr1.vcf.gz
chr2   file2.chr2.vcf.gz
...
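
A mapping file like the one above can be generated rather than typed by hand. This is a minimal sketch; the helper `write_contig_map` and the `cohort.<contig>.vcf.gz` naming pattern are hypothetical, and the contig labels are assumed to match those used inside your VCF/BCF files.

```python
import tempfile
from pathlib import Path

def write_contig_map(pairs, out_path):
    """Write one 'contig<TAB>file' line per pair and return the lines written."""
    lines = [f"{contig}\t{path}" for contig, path in pairs]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines

# Autosomes chr1..chr22, each pointing at a per-chromosome file (hypothetical names).
pairs = [(f"chr{i}", f"cohort.chr{i}.vcf.gz") for i in range(1, 23)]

with tempfile.TemporaryDirectory() as tmp:
    map_path = Path(tmp) / "vcf_map.txt"
    lines = write_contig_map(pairs, map_path)
    written = map_path.read_text().splitlines()
```

The path to the written file is what would then be given to --vcf or --estimate.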

PRS Database and metadata

The database containing PRS designs and metadata is hosted at https://zenodo.org/records/17903390 and downloaded automatically. PRSedm will first check the environment variables PRSEDM_SQL_PATH and PRSEDM_META_PATH for custom databases.
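
The lookup order described above (environment variable first, default location otherwise) can be pictured with the sketch below. It is not PRSedm's actual implementation: `resolve_resource` is a hypothetical helper, and the fallback path stands in for wherever the automatic download is stored.

```python
import os
import tempfile
from pathlib import Path

def resolve_resource(env_var, default_path):
    """Prefer the file named by env_var if it exists, else fall back to default_path."""
    override = os.environ.get(env_var)
    if override and Path(override).is_file():
        return Path(override)
    return Path(default_path)

with tempfile.TemporaryDirectory() as tmp:
    custom_db = Path(tmp) / "variants.db"
    custom_db.touch()                          # stand-in for a custom database
    os.environ["PRSEDM_SQL_PATH"] = str(custom_db)
    chosen = resolve_resource("PRSEDM_SQL_PATH", "variants.db")
    del os.environ["PRSEDM_SQL_PATH"]          # undo the demo setting

# With the variable unset, the default location is returned.
fallback = resolve_resource("PRSEDM_SQL_PATH", "variants.db")
```

In practice, both variables would normally be exported in the shell or set in `os.environ` before PRSedm is invoked.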

Research Analysis Platforms (RAPs)

Deployment to remote Research Analysis Platforms (RAPs) is possible via notebook wrappers.

List of available PRS

Type 1 Diabetes

| Flag | Method | Variants | Description | PMID |
| --- | --- | --- | --- | --- |
| t1dgrs2-luckett25 | HLA Interaction + Partitioned | 67 | "GRS2x" updated PRS with widest compatibility and HLA-based risk pPS. | 40267362 |
| t1dgrs2-qu22 | HLA Interaction + Partitioned | 71 | Original "GRS2" PRS with the addition of 4 African ancestry SNPs from Onengut, proposed in Qu et al and utilized in eMERGE. | 34997821 |
| t1dgrs2-sharp21 | HLA Interaction + Partitioned | 67 | Version of "GRS2" PRS designed for "TOPMED-R2" from 2021 GitHub. | 35312757 |
| t1d-onengut19-afr | Additive | 6 | African-ancestry PRS proposed by Onengut in 2019, updated for modern compatibility. | 30659077 |
| t1dgrs2-sharp19 | HLA Interaction + Partitioned | 67 | Original 1000 Genomes version of "GRS2" PRS as published, with limited modern compatibility. | 30655379 |

Type 2 Diabetes

| Flag | Method | Variants | Description | PMID |
| --- | --- | --- | --- | --- |
| t2d-suzuki24-prscsx-ma | Additive + Partitioned | ~1 M | Full genome-wide multi-ancestry PRS for Suzuki (PRS-CSx meta). | 38374256 |
| t2d-suzuki24-prscsx-<ancestry> | Additive + Partitioned | >500k | Full genome-wide ancestry-specific PRS for Suzuki (PRS-CSx), where <ancestry> is one of eur, afr, eas, sas, safr, or his. | 38374256 |
| t2dp-suzuki24-ma | Additive + Partitioned | 1289 | Multiancestry weighted Suzuki T2D index variant PRS, and pPS from hard-clustering analyses. | 38374256 |
| t2dp-suzuki24-<ancestry> | Additive + Partitioned | 1128 - 1285 | As above but weighted for specific ancestries <eur/afr/safr/eas/sas/his>. | 38374256 |
| t2dp-smith24-ma | Additive + Partitioned | 353 | Multiancestry cluster-weighted Smith T2D index variant PRS, and pPS from soft-clustering analyses. | 38443691 |
| t2dp-smith24-<ancestry> | Additive + Partitioned | 25 - 490 | As above but from ancestry-specific soft clustering <eur/afr/eas/amr>. | 38443691 |
| t2d-mahajan22-ma | Additive | 338 | Older PRS from Mahajan et al composed of multiancestry index variants. | 35551307 |
| t2d-mahajan22-prscsx-eur | Additive + Partitioned | >500k | Genome-wide European ancestry PRS for Mahajan (PRS-CSx). | 38374256 |
| t2dp-udler18 | Additive + Partitioned | 67 | T2D pPS from first soft-clustering analysis. | 30240442 |

Other

| Flag | Phenotype | Method | Variants | Description | PMID |
| --- | --- | --- | --- | --- | --- |
| cdgrs-sharp25 | Celiac Disease | HLA Interaction + Partitioned | 42 | Modernized Celiac disease PRS and pPS with similar model to "GRS2x", utilized for combined screening. | 32790217 |

Features

HLA Interaction PRS (+GRS2x Update)

PRSedm features a complete algorithm for genetic risk scores (GRS) that incorporate HLA interaction terms, as previously published by us in T1D-GRS2 (or simply GRS2). Several advancements that improve the generation of HLA interaction terms have been added and are collectively described as GRS2x.

HLA Type Estimation and LD Tiebreak

HLA alleles can be estimated by proxy (or tag) single nucleotide polymorphisms alone and predictions are output e.g. (DR3-DQ2.5/DR3-DQ2.5). Due to imperfect proxy SNPs, >2 HLA calls can be made in interaction scores such as GRS2, and a probabilistic tiebreaker algorithm using HLA reference frequencies (Klitz et al) now resolves impossible numbers of calls without excluding any samples.
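
To make the tiebreak idea concrete, here is a deliberately simplified sketch. It is not the algorithm PRSedm implements: it assumes the tiebreak keeps the most probable pair of haplotypes under reference frequencies (weighting heterozygous pairs by 2, as under Hardy-Weinberg proportions). The haplotype labels and frequencies are invented placeholders, not values from Klitz et al.

```python
from itertools import combinations

def resolve_hla_calls(calls, ref_freq):
    """Reduce a list of called haplotype copies to the most probable pair.

    calls: one label per called copy, possibly more than two.
    ref_freq: reference frequency for each label (placeholder values here).
    """
    if len(calls) <= 2:
        return tuple(sorted(calls))
    best_pair, best_prob = None, -1.0
    for a, b in combinations(calls, 2):
        # Heterozygous pairs can arise in two orders, hence the factor of 2.
        prob = ref_freq[a] * ref_freq[b] * (1 if a == b else 2)
        if prob > best_prob:
            best_pair, best_prob = tuple(sorted((a, b))), prob
    return best_pair

# Placeholder labels and frequencies, for illustration only.
ref_freq = {"hapA": 0.5, "hapB": 0.3, "hapC": 0.2}
resolved = resolve_hla_calls(["hapA", "hapB", "hapC"], ref_freq)
print(resolved)
```

The point of resolving calls in this way, rather than dropping the sample, is that every individual still receives an interaction score.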

Missing variant mean effect estimation (optional)

PRSedm optionally uses Hardy-Weinberg Equilibrium with a reference VCF/BCF legend (ensure you have variant frequency coded as AF, genotypes not required) to estimate the mean effect size for missing SNPs, handle missing variants, and enable static normalization.
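
The arithmetic behind a Hardy-Weinberg mean effect can be sketched as follows. For an effect allele with frequency p, the genotype probabilities for 0, 1, and 2 copies are (1-p)², 2p(1-p), and p², so the expected contribution of a variant with per-allele weight β is 2pβ. This is a textbook illustration under the assumption that AF refers to the effect allele; the function names are hypothetical and the exact computation in PRSedm may differ.

```python
def hwe_genotype_probs(af):
    """Probabilities of carrying 0, 1, or 2 effect alleles under HWE."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return ((1 - af) ** 2, 2 * af * (1 - af), af ** 2)

def expected_effect(af, beta):
    """Mean score contribution of one variant: sum over dosages of dosage * P * beta."""
    probs = hwe_genotype_probs(af)
    return sum(dosage * prob * beta for dosage, prob in enumerate(probs))

# Example values chosen for easy arithmetic, not taken from any PRS.
mean_contribution = expected_effect(af=0.25, beta=0.4)
```

With these example values, the expected dosage is 2 × 0.25 = 0.5 alleles, so the mean contribution is 0.5 × 0.4 = 0.2.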

Minimum and Maximum Normalization

PRSedm hardcodes static normalization of minimum and maximum potential risk contribution (no risk alleles vs all risk alleles) creating a scale of 0-1. Static normalization with estimation ensures that PRS values translate to a common relative risk scale across datasets. If variants are missing and no estimation reference is supplied, normalization is skipped automatically and a warning is emitted.
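
For a purely additive score, static min-max normalization can be sketched as below. The bounds assume diploid dosages of 0 to 2 per variant; handling negative weights by taking the lower bound from them is an assumption of this sketch. HLA interaction scores have more complex bounds that are not modelled here. The function names are hypothetical.

```python
def score_bounds(weights):
    """Lowest and highest attainable additive score given per-allele weights."""
    lowest = sum(2 * min(w, 0.0) for w in weights)
    highest = sum(2 * max(w, 0.0) for w in weights)
    return lowest, highest

def normalize(raw_score, weights):
    """Map a raw additive score onto the fixed 0-1 scale defined by the design."""
    lowest, highest = score_bounds(weights)
    if highest == lowest:
        raise ValueError("all weights are zero; the scale is undefined")
    return (raw_score - lowest) / (highest - lowest)

# Example weights chosen for easy arithmetic, not taken from any PRS.
weights = [0.5, 0.25, 0.25]
bounds = score_bounds(weights)
scaled = normalize(1.0, weights)
```

Because the bounds depend only on the score design and not on the cohort, the same raw score maps to the same normalized value in every dataset, provided all variants are observed or estimated.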

Development

Developed and maintained by Seth A. Sharp (ssharp@stanford.edu) at the Translational Genomics of Diabetes, Stanford University, with collaboration from colleagues at the University of Exeter and MGH/Broad Institute. Lu Zhang and Han Sun contributed to v1.1.0 onwards.

Citation

If you use PRSedm in your research, please cite both the software release (10.5281/zenodo.17903985) and the accompanying article (10.2337/dc25-0142).

License

This project is licensed under the MIT License (Non-Commercial).

  • Academic, research, and personal use are allowed.
  • Commercial use is prohibited without prior permission.

See the LICENSE file for full details.
