PRSedm (Polygenic Risk Score Extension for Diabetes Mellitus) is a flexible and extendable open-source package for efficient local and remote (All of Us, UK Biobank, etc.) generation of published Polygenic Risk Scores (PRS) for Diabetes Mellitus (DM) and related cardiometabolic phenotypes.
PRSedm introduces a parallelized "one-liner" method for generating standardized PRS and partitioned polygenic scores (pPS) for DM that is robust to differences in genotyping method, quality control, and imputation panel.
- Improved multiallelic SNP handling and fixed related bugs
- Significant performance improvements from optimized SNP batching and parallelism
- Simplified command-line interface and argument structure
- Fixed bugs and syntax issues in the SNP database backend
- PRS metadata is now fetched automatically alongside the SNP database
- Per-PRS variant log files are now generated, capturing metrics such as INFO/R², missing variants, and allele frequency
- Renamed "imputation" feature to "estimate" to distinguish from genotype imputation
- Added support for custom proxy variant substitution via --proxy
PRSedm supports optional variant substitution via a user-supplied proxy file (--proxy).
Required format (whitespace-delimited):
target_rsid target_contig_id target_position target_effect_allele sub_rsid sub_contig_id sub_position sub_effect_allele
rs12345 chr1 1234567 A rs54321 chr1 1234999 A
rs23456 chr2 7654321 G rs65432 chr2 7654000 G
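Before passing a proxy file to --proxy, it can help to check that it matches the format above. The following is a minimal sketch using only the standard library, not PRSedm's own parser. It assumes the file starts with the header line shown above; `read_proxy_file` and `PROXY_COLUMNS` are hypothetical names introduced for this example.

```python
import io

# Column names copied from the header line of the required format.
PROXY_COLUMNS = [
    "target_rsid", "target_contig_id", "target_position", "target_effect_allele",
    "sub_rsid", "sub_contig_id", "sub_position", "sub_effect_allele",
]


def read_proxy_file(handle):
    """Parse a whitespace-delimited proxy file into a list of dicts.

    Assumes the first non-blank line is the header (an assumption of this
    sketch, not a documented PRSedm requirement).
    """
    lines = [line.split() for line in handle if line.strip()]
    if not lines or lines[0] != PROXY_COLUMNS:
        raise ValueError("proxy file must start with the expected header line")
    rows = []
    for fields in lines[1:]:
        if len(fields) != len(PROXY_COLUMNS):
            raise ValueError(f"expected {len(PROXY_COLUMNS)} fields, got: {fields}")
        row = dict(zip(PROXY_COLUMNS, fields))
        # Positions are genomic coordinates, so store them as integers.
        row["target_position"] = int(row["target_position"])
        row["sub_position"] = int(row["sub_position"])
        rows.append(row)
    return rows


example = io.StringIO(
    " ".join(PROXY_COLUMNS) + "\n"
    "rs12345 chr1 1234567 A rs54321 chr1 1234999 A\n"
    "rs23456 chr2 7654321 G rs65432 chr2 7654000 G\n"
)
proxies = read_proxy_file(example)
for row in proxies:
    print(row["target_rsid"], "->", row["sub_rsid"])
```

The same function works on a real file by passing an open file handle instead of the in-memory example.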
PRSedm requires the following packages:
- Python (>=3.9), Joblib (>=1.3.2), Pandas (>=2.2.3), Pysam (>=0.22.0), Numpy* (2.x/1.x)
*Build with 1.x when deploying to RAP platforms with 1.x dependencies.
PIP: pip install prsedm
Anaconda: conda install sethsh7::prsedm
Build from source: python -m build
PRSedm can be called from the command line:
prsedm --vcf <path_to_vcf_file> [options]
PRSedm can also be called from Python:
import prsedm
df = prsedm.gen_dm(
vcf=vcf,
...
)
- --vcf (required): Path to an indexed VCF or BCF file, or a text file mapping one VCF/BCF per contig.
- --col: Genotype column to score (default: GT, options: GT for WGS, GP for imputed data).
- --build: Genome build to use (default: hg38, options: hg19, hg38).
- --scores (required): Comma-separated list of PRS to generate, e.g., t1dgrs2-luckett25,t2dp-udler18.
- --estimate (optional): Path to indexed reference VCF/BCF, or text file mapping one VCF/BCF per contig. Used to estimate missing variants and enable normalization when variants are absent.
- --ntasks (optional): Number of tasks to use (default: 1).
- --batch-size (optional): Number of variants per batch (default: 5000).
- --output: Path to save the output file (default: results.csv).
- --full: Include individual variant scores with PRS name prepended.
- --getsql: Download or locate the PRS SQL database (variants.db) and metadata JSON (prs_meta.json) and exit.
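To show how the options above fit together, here is a small sketch that assembles a command line from Python. It only builds and prints the command; it does not run PRSedm. The helper `build_prsedm_command` and the input file name `cohort.vcf.gz` are hypothetical, while the flags and defaults are the ones listed above.

```python
import shlex


def build_prsedm_command(vcf, scores, col="GT", build="hg38", ntasks=1,
                         batch_size=5000, output="results.csv", estimate=None):
    """Assemble a prsedm argument list from the documented options."""
    cmd = [
        "prsedm",
        "--vcf", vcf,
        "--col", col,
        "--build", build,
        "--scores", ",".join(scores),  # comma-separated list of PRS flags
        "--ntasks", str(ntasks),
        "--batch-size", str(batch_size),
        "--output", output,
    ]
    if estimate is not None:
        # Optional reference VCF/BCF used to estimate missing variants.
        cmd += ["--estimate", estimate]
    return cmd


cmd = build_prsedm_command(
    vcf="cohort.vcf.gz",  # hypothetical imputed input file
    scores=["t1dgrs2-luckett25", "t2dp-udler18"],
    col="GP",
    ntasks=4,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once PRSedm is installed and the input file exists.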
For --vcf and --estimate, you can instead point to a single text file that maps each contig to its VCF/BCF file, one pair per line (whitespace-delimited):
chr1 file1.chr1.vcf.gz
chr2 file2.chr2.vcf.gz
...
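For cohorts split into one file per chromosome, the mapping file can be generated rather than typed by hand. This is a minimal sketch assuming the per-contig files follow a single naming pattern; the template `cohort.{contig}.vcf.gz` and the helper `contig_map_text` are hypothetical.

```python
def contig_map_text(template, contigs):
    """Return mapping-file text with one 'contig path' pair per line."""
    return "".join(f"{contig} {template.format(contig=contig)}\n" for contig in contigs)


# Autosomes chr1..chr22; adjust the list to match the contig names in your VCFs.
contigs = [f"chr{i}" for i in range(1, 23)]
map_text = contig_map_text("cohort.{contig}.vcf.gz", contigs)
print(map_text, end="")
```

Writing `map_text` to a file gives a path that can be supplied to --vcf or --estimate. Because the format is whitespace-delimited, file paths containing spaces should be avoided.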
The database containing PRS designs and metadata is hosted at
https://zenodo.org/records/17903390 and downloaded automatically.
PRSedm will first check the environment variables PRSEDM_SQL_PATH and PRSEDM_META_PATH for custom databases.
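When working offline or with a pinned copy of the database, the two environment variables can be set from Python before PRSedm is used. The sketch below is illustrative: the paths are hypothetical, and because this document does not state when PRSedm reads the variables, the safe assumption is to set them before importing the package.

```python
import os

# Hypothetical local copies of the files that --getsql would locate or download.
os.environ["PRSEDM_SQL_PATH"] = "/data/prsedm/variants.db"
os.environ["PRSEDM_META_PATH"] = "/data/prsedm/prs_meta.json"


def custom_db_paths(env=os.environ):
    """Report which custom database paths are currently configured."""
    names = ("PRSEDM_SQL_PATH", "PRSEDM_META_PATH")
    return {name: env.get(name) for name in names}


print(custom_db_paths())
```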
Deployment to remote Research Analysis Platforms (RAPs) is possible via notebook wrappers:
- All of Us (WGS) - Notebook Here
- UK Biobank (imputed WGS) - Notebook Here
- UK Biobank (imputed array) - Notebook Here
| Flag | Method | Variants | Description | PMID |
|---|---|---|---|---|
| t1dgrs2-luckett25 | HLA Interaction + Partitioned | 67 | "GRS2x" updated PRS with widest compatibility and HLA-based risk pPS. | 40267362 |
| t1dgrs2-qu22 | HLA Interaction + Partitioned | 71 | Original "GRS2" PRS with the addition of 4 African ancestry SNPs from Onengut, proposed in Qu et al and utilized in eMERGE. | 34997821 |
| t1dgrs2-sharp21 | HLA Interaction + Partitioned | 67 | Version of "GRS2" PRS designed for "TOPMED-R2" from 2021 GitHub. | 35312757 |
| t1d-onengut19-afr | Additive | 6 | African-ancestry PRS proposed by Onengut in 2019, updated for modern compatibility. | 30659077 |
| t1dgrs2-sharp19 | HLA Interaction + Partitioned | 67 | Original 1000 Genomes version of "GRS2" PRS as published, with limited modern compatibility. | 30655379 |
| Flag | Method | Variants | Description | PMID |
|---|---|---|---|---|
| t2d-suzuki24-prscsx-ma | Additive + Partitioned | ~1 M | Full genome-wide multi-ancestry PRS for Suzuki (PRS-CSx meta). | 38374256 |
| t2d-suzuki24-prscsx-<ancestry> | Additive + Partitioned | >500k | Full genome-wide ancestry-specific PRS for Suzuki (PRS-CSx), where <ancestry> is one of eur, afr, eas, sas, safr, or his. | 38374256 |
| t2dp-suzuki24-ma | Additive + Partitioned | 1289 | Multiancestry weighted Suzuki T2D index variant PRS, and pPS from hard-clustering analyses. | 38374256 |
| t2dp-suzuki24-<ancestry> | Additive + Partitioned | 1128 - 1285 | As above but weighted for specific ancestries <eur/afr/safr/eas/sas/his>. | 38374256 |
| t2dp-smith24-ma | Additive + Partitioned | 353 | Multiancestry cluster-weighted Smith T2D index variant PRS, and pPS from soft-clustering analyses. | 38443691 |
| t2dp-smith24-<ancestry> | Additive + Partitioned | 25 - 490 | As above but from ancestry-specific soft clustering <eur/afr/eas/amr>. | 38443691 |
| t2d-mahajan22-ma | Additive | 338 | Older PRS from Mahajan et al composed of multiancestry index variants. | 35551307 |
| t2d-mahajan22-prscsx-eur | Additive + Partitioned | >500k | Genome-wide European ancestry PRS for Mahajan (PRS-CSx). | 38374256 |
| t2dp-udler18 | Additive + Partitioned | 67 | T2D pPS from first soft-clustering analysis. | 30240442 |
| Flag | Phenotype | Method | Variants | Description | PMID |
|---|---|---|---|---|---|
| cdgrs-sharp25 | Celiac Disease | HLA Interaction + Partitioned | 42 | Modernized Celiac disease PRS and pPS with similar model to "GRS2x", utilized for combined screening. | 32790217 |
PRSedm features a complete algorithm for GRS that incorporate HLA interaction terms, as previously published by us, such as T1D-GRS2 (or just GRS2). Several advancements improve the generation of HLA interaction terms; the updated approach is described as GRS2x.
HLA alleles can be estimated from proxy (or tag) single nucleotide polymorphisms alone, and the predictions are included in the output, e.g. DR3-DQ2.5/DR3-DQ2.5. Because proxy SNPs are imperfect, more than two HLA calls can be made for a sample in interaction scores such as GRS2; a probabilistic tiebreaker algorithm using HLA reference frequencies (Klitz et al) now resolves impossible numbers of calls without excluding any samples.
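The idea of resolving surplus HLA calls with reference frequencies can be illustrated with a deliberately simplified sketch. This is not PRSedm's algorithm: it is a deterministic stand-in that keeps the two calls with the highest reference frequency, whereas the package describes a probabilistic tiebreaker. The function name, the generic labels HAP_B and HAP_C, and the frequency values are all placeholders invented for this example and are not taken from Klitz et al.

```python
def resolve_hla_calls(calls, ref_freq, max_calls=2):
    """Keep at most `max_calls` haplotype calls, preferring frequent haplotypes.

    Simplified, deterministic illustration only; calls missing from
    `ref_freq` are treated as having frequency 0.
    """
    if len(calls) <= max_calls:
        return list(calls)
    ranked = sorted(calls, key=lambda hap: ref_freq.get(hap, 0.0), reverse=True)
    return ranked[:max_calls]


# Placeholder frequencies for illustration, not real reference data.
placeholder_freq = {"DR3-DQ2.5": 0.3, "HAP_B": 0.1, "HAP_C": 0.2}

# Three proxy-based calls for one diploid sample is biologically impossible.
calls = ["DR3-DQ2.5", "HAP_B", "HAP_C"]
resolved = resolve_hla_calls(calls, placeholder_freq)
print("/".join(resolved))
```

The point of the sketch is that no sample is dropped: an impossible set of calls is reduced to a valid pair rather than excluded.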
PRSedm optionally uses Hardy-Weinberg Equilibrium with a reference VCF/BCF legend (ensure you have variant frequency coded as AF, genotypes not required) to estimate the mean effect size for missing SNPs, handle missing variants, and enable static normalization.
- dbSNP hg38 - TOPMED Bravo Freeze 8, or NCBI (AF field added) are recommended.
- dbSNP hg19 - 1000 Genomes, Haplotype Reference Consortium, NCBI GRCh37 recommended.
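The Hardy-Weinberg estimate for a missing variant can be worked through in a few lines. Under Hardy-Weinberg Equilibrium with effect allele frequency p, the genotype probabilities are (1-p)^2, 2p(1-p), and p^2 for 0, 1, and 2 copies, so the expected dosage is 2p. The sketch below assumes a simple additive variant and that AF refers to the effect allele; the function names are hypothetical, and PRSedm's own handling (for example of HLA interaction terms or allele orientation) may differ.

```python
def hwe_genotype_probs(af):
    """Probabilities of carrying 0, 1, or 2 effect alleles under HWE."""
    q = 1.0 - af
    return (q * q, 2.0 * af * q, af * af)


def expected_contribution(beta, af):
    """Expected additive score contribution of one variant with weight `beta`."""
    probs = hwe_genotype_probs(af)
    # enumerate yields the dosage (0, 1, 2) alongside its probability.
    return sum(beta * dosage * prob for dosage, prob in enumerate(probs))


# A missing variant with weight 0.5 and effect allele frequency 0.25.
estimate = expected_contribution(beta=0.5, af=0.25)
print(estimate)
```

Only the allele frequency is needed for this calculation, which is why a reference legend without genotypes is sufficient.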
PRSedm hardcodes static normalization of minimum and maximum potential risk contribution (no risk alleles vs all risk alleles) creating a scale of 0-1. Static normalization with estimation ensures that PRS values translate to a common relative risk scale across datasets. If variants are missing and no estimation reference is supplied, normalization is skipped automatically and a warning is emitted.
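The 0-1 scale described above amounts to min-max scaling against the theoretical score range. The following is a minimal sketch for a purely additive score, assuming diploid dosages of 0 to 2 per variant. The treatment of negative weights is an assumption of this example, and scores with HLA interaction terms would need their own bounds; the function names are hypothetical.

```python
def score_bounds(weights):
    """Theoretical minimum and maximum of an additive score.

    Each variant contributes between 0 and 2 copies of its effect allele.
    Negative weights lower the minimum (an assumption of this sketch).
    """
    lowest = sum(2.0 * min(w, 0.0) for w in weights)
    highest = sum(2.0 * max(w, 0.0) for w in weights)
    return lowest, highest


def normalize(raw_score, weights):
    """Rescale a raw score to the 0-1 range defined by its theoretical bounds."""
    lowest, highest = score_bounds(weights)
    if highest == lowest:
        raise ValueError("all weights are zero; the score range is empty")
    return (raw_score - lowest) / (highest - lowest)


weights = [0.5, 1.0, 0.25]  # illustrative per-allele weights
raw = 1.75                  # e.g. one copy of every effect allele
normalized = normalize(raw, weights)
print(normalized)
```

Because the bounds depend only on the score definition and not on the cohort, the same raw score maps to the same normalized value in every dataset, provided all variants are present or estimated.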
Developed and maintained by Seth A. Sharp (ssharp@stanford.edu) at the Translational Genomics of Diabetes, Stanford University, with collaboration from colleagues at the University of Exeter and MGH/Broad Institute. Lu Zhang and Han Sun contributed to v1.1.0 onwards.
If you use PRSedm in your research, please cite both the software release (10.5281/zenodo.17903985) and the accompanying article (10.2337/dc25-0142).
This project is licensed under the MIT License (Non-Commercial).
- Academic, research, and personal use are allowed.
- Commercial use is prohibited without prior permission.
See the LICENSE file for full details.
