Note
This README may be out of date. For the most up-to-date documentation, tutorials, and API reference, please visit our official documentation site at pyhealth.readthedocs.io.
Important
- Join our PyHealth Discord Community! We are actively looking for contributors and want to get to know our users better! Click here to join Discord
- Sign up for our mailing list! We will email you about significant upcoming PyHealth changes. Click here to subscribe
Yang, Chaoqi, Zhenbang Wu, Patrick Jiang, Zhen Lin, Junyi Gao, Benjamin P. Danek, and Jimeng Sun. 2023. "PyHealth: A Deep Learning Toolkit for Healthcare Applications." In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5788–89. KDD '23. New York, NY, USA: Association for Computing Machinery.
@inproceedings{pyhealth2023yang,
author = {Yang, Chaoqi and Wu, Zhenbang and Jiang, Patrick and Lin, Zhen and Gao, Junyi and Danek, Benjamin and Sun, Jimeng},
title = {{PyHealth}: A Deep Learning Toolkit for Healthcare Predictive Modeling},
url = {https://github.com/sunlabuiuc/PyHealth},
booktitle = {Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2023},
year = {2023}
}

PyHealth is a comprehensive deep learning toolkit for clinical predictive modeling, designed for both ML researchers and medical practitioners. It makes healthcare AI applications easier to develop, test, and deploy, as well as more flexible and customizable. [Tutorials]
Key Features
- Modular 5-stage pipeline for healthcare ML
- Healthcare-first: medical codes and clinical datasets (MIMIC, eICU, OMOP)
- 33+ pre-built models and production-ready trainer/metrics
- 10+ supported healthcare tasks and datasets
- Fast (~3x faster than pandas) data processing for quick experimentation
[News!] We are continuously implementing good papers and benchmarks in PyHealth; check out the [Planned List]. You are welcome to pick one from the list and send us a PR, or to add more influential and new papers to the planned list.
Python Version Recommendation
We recommend using Python 3.12 for optimal parallel processing and memory management performance. While PyHealth supports Python 3.8+, Python 3.12 provides significant improvements in these areas.
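A quick way to see where the running interpreter stands relative to these versions is sketched below. The helper name `check_python` and its return labels are illustrative and not part of PyHealth; only the two version thresholds come from the recommendation above.

```python
import sys

MINIMUM = (3, 8)       # oldest version PyHealth states it supports
RECOMMENDED = (3, 12)  # version recommended above


def check_python(version=None):
    """Classify a (major, minor) version against the thresholds above."""
    major, minor = (version or sys.version_info)[:2]
    if (major, minor) < MINIMUM:
        return "unsupported"
    if (major, minor) < RECOMMENDED:
        return "supported"
    return "recommended"  # 3.12 or newer


print(check_python())
```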
Recommended Installation (Alpha Version)
We recommend installing the latest alpha version from PyPI, which offers significant improvements in performance:
pip install pyhealth==2.0a13

This version includes optimized implementations and enhanced features compared to the legacy version.
Legacy Version
The older stable version is still available for backward compatibility:
pip install pyhealth

For Contributors and Developers
If you are contributing to PyHealth or need the latest development features, install from GitHub source:
git clone https://github.com/sunlabuiuc/PyHealth.git
cd PyHealth
pip install -e .

Note: PyHealth has multiple neural network based models implemented in PyTorch. However, PyHealth does NOT install these DL libraries for you. If you want to use neural-net based models, please make sure PyTorch is installed.
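One way to confirm that PyTorch is available before using the neural-network models is sketched below. The helper `is_installed` is a hypothetical name, not a PyHealth function; it relies only on the standard library and is meant for top-level module names such as `torch`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module `module_name` can be found."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("torch"):
    print("PyTorch not found; install it before using neural-network models.")
```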
pyhealth provides these functionalities (we are still enriching some modules):
You can use the following functions independently:
- Dataset: MIMIC-III, MIMIC-IV, eICU, OMOP-CDM, EHRShot, COVID19-CXR, SleepEDF, SHHS, ISRUC, customized EHR datasets, etc.
- Tasks: diagnosis-based drug recommendation, patient hospitalization and mortality prediction, readmission prediction, length of stay forecasting, sleep staging, etc.
- ML models: RNN, LSTM, GRU, Transformer, RETAIN, SafeDrug, GAMENet, MoleRec, AdaCare, ConCare, StageNet, GRASP, SparcNet, ContraWR, Deepr, TCN, Dr. Agent, etc.
Building a healthcare AI pipeline can be as short as 10 lines of code in PyHealth.
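All healthcare tasks in our package follow a five-stage pipeline: load a dataset (pyhealth.datasets), define a task (pyhealth.tasks), build a model (pyhealth.models), train it (pyhealth.trainer), and evaluate it (pyhealth.metrics).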
We try hard to make sure each stage is as separate as possible, so that people can customize their own pipeline by only using our data processing steps or the ML models.
pyhealth.datasets provides a clean structure for the dataset, independent from the tasks. We support MIMIC-III, MIMIC-IV, eICU, OMOP-CDM, and more. The output (mimic3base) is a multi-level dictionary structure (see illustration below).
from pyhealth.datasets import MIMIC3Dataset
mimic3base = MIMIC3Dataset(
# root directory of the dataset
root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
# raw CSV table name
tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
# map all NDC codes to CCS codes in these tables
code_mapping={"NDC": "CCSCM"},
)

pyhealth.tasks defines how to process each patient's data into a set of samples for the tasks. In the package, we provide several task examples, such as drug recommendation, mortality prediction, and readmission prediction. It is easy to customize your own tasks following our template.
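To make the idea of a task concrete, here is a minimal, library-independent sketch that turns one patient's visits into readmission samples. The record layout (`visits`, `admit_day`, `discharge_day`, `conditions`), the function name `readmission_samples`, and the 15-day window are all assumptions for illustration. They are not PyHealth's actual schema, nor the logic of `ReadmissionPredictionMIMIC3`.

```python
def readmission_samples(patient, window_days=15):
    """Build one sample per visit that has a following visit.

    The label is 1 if the next admission starts within `window_days`
    of the current discharge, else 0.
    """
    visits = sorted(patient["visits"], key=lambda v: v["admit_day"])
    samples = []
    for current, following in zip(visits, visits[1:]):
        gap = following["admit_day"] - current["discharge_day"]
        samples.append({
            "patient_id": patient["patient_id"],
            "visit_id": current["visit_id"],
            "conditions": current["conditions"],
            "label": int(gap <= window_days),
        })
    return samples


# hypothetical patient record; days are counted from an arbitrary origin
patient = {
    "patient_id": "p1",
    "visits": [
        {"visit_id": "v1", "admit_day": 0, "discharge_day": 4, "conditions": ["428.0"]},
        {"visit_id": "v2", "admit_day": 10, "discharge_day": 12, "conditions": ["401.9"]},
        {"visit_id": "v3", "admit_day": 90, "discharge_day": 95, "conditions": ["428.0"]},
    ],
}
samples = readmission_samples(patient)
print(samples)
```

The last visit produces no sample because there is no later admission to label it with.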
from pyhealth.tasks import ReadmissionPredictionMIMIC3
mimic3sample = mimic3base.set_task(ReadmissionPredictionMIMIC3())
mimic3sample[0] # show the information of the first sample
from pyhealth.datasets import split_by_patient, get_dataloader
train_ds, val_ds, test_ds = split_by_patient(mimic3sample, [0.8, 0.1, 0.1])
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)

pyhealth.models provides different ML models with very similar argument configs.
from pyhealth.models import Transformer
model = Transformer(
dataset=mimic3sample,
)

pyhealth.trainer lets you specify training arguments, such as epochs, optimizer, learning rate, etc. The trainer will automatically save the best model and output the path at the end.
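The best-model behaviour can be pictured with the small sketch below, which tracks the epoch whose monitored validation metric is best. The class `BestCheckpoint`, its `mode` argument, and the score values are hypothetical and only illustrate the idea; this is not the implementation of PyHealth's `Trainer`.

```python
class BestCheckpoint:
    """Remember the epoch whose monitored metric is best so far."""

    def __init__(self, monitor, mode="max"):
        self.monitor = monitor
        self.mode = mode
        self.best_score = None
        self.best_epoch = None

    def update(self, epoch, scores):
        """Return True when `scores[self.monitor]` improves on the best seen."""
        score = scores[self.monitor]
        improved = (
            self.best_score is None
            or (self.mode == "max" and score > self.best_score)
            or (self.mode == "min" and score < self.best_score)
        )
        if improved:
            self.best_score = score
            self.best_epoch = epoch
        return improved


# made-up validation scores, one dict per epoch
history = [{"pr_auc": 0.61}, {"pr_auc": 0.68}, {"pr_auc": 0.66}]
tracker = BestCheckpoint(monitor="pr_auc")
saved_epochs = []
for epoch, scores in enumerate(history):
    if tracker.update(epoch, scores):
        saved_epochs.append(epoch)  # a real trainer would write weights to disk here
print(tracker.best_epoch, tracker.best_score)
```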
from pyhealth.trainer import Trainer
trainer = Trainer(model=model)
trainer.train(
train_dataloader=train_loader,
val_dataloader=val_loader,
epochs=50,
monitor="pr_auc_samples",
)

pyhealth.metrics provides several common evaluation metrics (see the documentation for the full list).
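For intuition about what a metric such as `roc_auc` measures, the following pure-Python sketch computes ROC AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The function below is an illustrative stand-in written for this README, not the code behind `binary_metrics_fn`, and its quadratic loop is only suitable for small inputs.

```python
def roc_auc(y_true, y_prob):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_prob))  # 0.75
```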
# method 1
trainer.evaluate(test_loader)
# method 2
from pyhealth.metrics.binary import binary_metrics_fn
y_true, y_prob, loss = trainer.inference(test_loader)
binary_metrics_fn(y_true, y_prob, metrics=["pr_auc", "roc_auc"])

pyhealth.medcode provides two core functionalities. This module can be used independently.
- For code ontology lookup within one medical coding system (e.g., name, category, sub-concept);
from pyhealth.medcode import InnerMap
icd9cm = InnerMap.load("ICD9CM")
icd9cm.lookup("428.0")
# `Congestive heart failure, unspecified`
icd9cm.get_ancestors("428.0")
# ['428', '420-429.99', '390-459.99', '001-999.99']
atc = InnerMap.load("ATC")
atc.lookup("M01AE51")
# `ibuprofen, combinations`
atc.lookup("M01AE51", "drugbank_id")
# `DB01050`
atc.lookup("M01AE51", "description")
# Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID) derived ...
atc.lookup("M01AE51", "indication")
# Ibuprofen is the most commonly used and prescribed NSAID. It is very common over the ...

- For code mapping between two coding systems (e.g., ICD9CM to CCSCM).
from pyhealth.medcode import CrossMap
codemap = CrossMap.load("ICD9CM", "CCSCM")
codemap.map("428.0")
# ['108']
codemap = CrossMap.load("NDC", "RxNorm")
codemap.map("50580049698")
# ['209387']
codemap = CrossMap.load("NDC", "ATC")
codemap.map("50090539100")
# ['A10AC04', 'A10AD04', 'A10AB04']

pyhealth.tokenizer is used for transformations between string-based tokens and integer-based indices, based on the overall token space. We provide flexible functions to tokenize 1D, 2D and 3D lists. This module can be used independently.
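The mechanics can be illustrated with a small, library-independent sketch. The class `TinyTokenizer` and its method names are hypothetical. It assumes that special tokens take the first indices (`<pad>` is 0, `<unk>` is 1), that shorter rows are right-padded, and that padding is dropped on decoding; the actual `Tokenizer` may differ in details.

```python
class TinyTokenizer:
    """Map tokens to integer indices, with padding and unknown-token handling."""

    def __init__(self, tokens, pad="<pad>", unk="<unk>"):
        self.pad, self.unk = pad, unk
        vocab = [pad, unk] + list(tokens)  # special tokens take indices 0 and 1
        self.token_to_index = {tok: i for i, tok in enumerate(vocab)}
        self.index_to_token = vocab

    def encode_2d(self, batch):
        """Encode a list of token lists, right-padding rows to equal length."""
        width = max(len(row) for row in batch)
        unk_index = self.token_to_index[self.unk]
        pad_index = self.token_to_index[self.pad]
        return [
            [self.token_to_index.get(tok, unk_index) for tok in row]
            + [pad_index] * (width - len(row))
            for row in batch
        ]

    def decode_2d(self, batch):
        """Decode indices back to tokens, dropping padding."""
        pad_index = self.token_to_index[self.pad]
        return [
            [self.index_to_token[i] for i in row if i != pad_index]
            for row in batch
        ]


tok = TinyTokenizer(["A01A", "A02A", "A03C"])
encoded = tok.encode_2d([["A01A", "A03C", "A02A"], ["A02A", "B035"]])
print(encoded)                 # [[2, 4, 3], [3, 1, 0]]
print(tok.decode_2d(encoded))  # [['A01A', 'A03C', 'A02A'], ['A02A', '<unk>']]
```

Tokens outside the token space (here `B035`) map to the unknown index, so they cannot be recovered on decoding.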
from pyhealth.tokenizer import Tokenizer
# Example: we use a list of ATC3 codes as the tokens
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D',
'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A12B', 'A12C', 'A13A', 'A14A',
'A14B', 'A16A']
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
# 2d encode
tokens = [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', 'B035', 'C129']]
indices = tokenizer.batch_encode_2d(tokens)
# [[8, 9, 10, 11], [12, 1, 1, 0]]
# 2d decode
indices = [[8, 9, 10, 11], [12, 1, 1, 0]]
tokens = tokenizer.batch_decode_2d(indices)
# [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
# 3d encode
tokens = [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], \
[['A04A', 'B035', 'C129']]]
indices = tokenizer.batch_encode_3d(tokens)
# [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
# 3d decode
indices = [[[8, 9, 10, 11], [24, 25, 0, 0]], \
[[12, 1, 1, 0], [0, 0, 0, 0]]]
tokens = tokenizer.batch_decode_3d(indices)
# [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]

We provide the following tutorials to help users get started with pyhealth. Please bear with us as we update the documentation on how to use PyHealth 2.0.
Tutorial 0: Introduction to pyhealth.data [Video]
Tutorial 1: Introduction to pyhealth.datasets [Video (PyHealth 1.6)]
Tutorial 2: Introduction to pyhealth.tasks [Video (PyHealth 1.6)]
Tutorial 3: Introduction to pyhealth.models [Video]
Tutorial 4: Introduction to pyhealth.trainer [Video]
Tutorial 5: Introduction to pyhealth.metrics [Video]
Tutorial 6: Introduction to pyhealth.tokenizer [Video]
Tutorial 7: Introduction to pyhealth.medcode [Video]
The following tutorials will help users build their own task pipelines.
Pipeline 1: Chest Xray Classification [Video]
Pipeline 3: Medical Transcription Classification
Pipeline 4: Mortality Prediction
Pipeline 5: Readmission Prediction
We provide advanced tutorials for supporting various needs.
Advanced Tutorial 1: Fit your dataset into our pipeline [Video]
Advanced Tutorial 2: Define your own healthcare task
Advanced Tutorial 3: Adopt customized model into pyhealth [Video]
Advanced Tutorial 4: Load your own processed data into pyhealth and try out our ML models [Video]
We provide the processing files for the following open EHR datasets:
| Dataset | Module | Year | Information |
|---|---|---|---|
| MIMIC-III | pyhealth.datasets.MIMIC3Dataset | 2016 | MIMIC-III Clinical Database |
| MIMIC-IV | pyhealth.datasets.MIMIC4Dataset | 2020 | MIMIC-IV Clinical Database |
| eICU | pyhealth.datasets.eICUDataset | 2018 | eICU Collaborative Research Database |
| OMOP | pyhealth.datasets.OMOPDataset | | OMOP-CDM schema based dataset |
| EHRShot | pyhealth.datasets.EHRShotDataset | 2023 | Few-shot EHR benchmarking dataset |
| COVID19-CXR | pyhealth.datasets.COVID19CXRDataset | 2020 | COVID-19 chest X-ray image dataset |
| SleepEDF | pyhealth.datasets.SleepEDFDataset | 2018 | Sleep-EDF dataset |
| SHHS | pyhealth.datasets.SHHSDataset | 2016 | Sleep Heart Health Study dataset |
| ISRUC | pyhealth.datasets.ISRUCDataset | 2016 | ISRUC-SLEEP dataset |
Deep Learning Models
| Model | Year | Key Innovation |
|---|---|---|
| RETAIN | 2016 | Interpretable attention for clinical decisions |
| GAMENet | 2019 | Memory networks for drug recommendation |
| SafeDrug | 2021 | Molecular graphs for safe drug combinations |
| MoleRec | 2023 | Substructure-aware drug recommendation |
| AdaCare | 2020 | Scale-adaptive feature extraction |
| ConCare | 2020 | Transformer-based patient modeling |
| StageNet | 2020 | Disease progression stage modeling |
| GRASP | 2021 | Graph neural networks for patient clustering |
| MICRON | 2021 | Medication change prediction with recurrent residual networks |
Foundation Models
| Model | Year | Description |
|---|---|---|
| Transformer | 2017 | Attention-based sequence modeling |
| RNN/LSTM/GRU | 2011 | Recurrent neural networks for sequences |
| CNN | 1989 | Convolutional networks for structured data |
| TCN | 2018 | Temporal convolutional networks |
| MLP | 1986 | Multi-layer perceptrons for tabular data |
Specialized Models
| Model | Year | Specialization |
|---|---|---|
| ContraWR | 2021 | Biosignal analysis (EEG, ECG) |
| SparcNet | 2023 | Seizure detection and sleep staging |
| Deepr | 2017 | Electronic health records |
| Dr. Agent | 2020 | Reinforcement learning for clinical decisions |
The PyHealth Research Initiative is a year-round, open research program that brings together talented individuals from diverse backgrounds to conduct cutting-edge research in healthcare AI.
How to participate:
- Join our Discord server
- Submit a high-quality PR to the PyHealth repository
- Check the documentation for more details
Recent research from the initiative has been published at venues including ML4H 2025 and other top conferences.
We are the SunLab healthcare research team at UIUC.
Current Maintainers:
- Zhenbang Wu (Ph.D. Student @ UIUC)
- John Wu (Ph.D. Student @ UIUC)
- Junyi Gao (Ph.D. Student @ University of Edinburgh)
- Jimeng Sun (Professor @ UIUC)
Get in Touch:
- Discord Community (fastest response)
- GitHub Issues
- Mailing List



