30 changes: 23 additions & 7 deletions .github/workflows/tests.yml
@@ -7,6 +7,23 @@ on:
branches: [ master ]

jobs:
  devel:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        python-version: [3.6, 3.7]
        os: [ubuntu-latest, macos-latest]
    steps:
    - uses: actions/checkout@v1
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install package
      run: pip install .[dev]
    - name: make test-devel
      run: make test-devel

  unit:
    runs-on: ${{ matrix.os }}
    strategy:
@@ -24,7 +41,7 @@ jobs:
    - name: Test with pytest
      run: make test

-  devel:
+  readme:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
@@ -36,9 +53,8 @@
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
-    - name: Install package
-      run: pip install .[dev]
-    - name: Lint with flake8
-      run: make lint
-    - name: Test with pytest
-      run: make test
+    - name: Install package and dependencies
+      run: pip install rundoc .
+    - name: make test-readme
+      run: make test-readme

11 changes: 9 additions & 2 deletions Makefile
@@ -115,8 +115,17 @@ test: ## run tests quickly with the default Python
test-all: ## run tests on every Python version with tox
	tox

.PHONY: test-readme
test-readme: ## run the readme snippets
	rundoc run --single-session python3 -t python3 README.md


.PHONY: check-dependencies
check-dependencies: ## test if there are any broken dependencies
	pip check

.PHONY: test-devel
test-devel: check-dependencies lint docs ## test everything that needs development dependencies


.PHONY: coverage
@@ -131,9 +140,7 @@ coverage: clean-coverage ## check code coverage quickly with the default Python

.PHONY: docs
docs: clean-docs ## generate Sphinx HTML documentation, including API docs
	sphinx-apidoc --module-first --separate --no-toc --output-dir docs/api/ cardea
	$(MAKE) -C docs html
	touch docs/_build/html/.nojekyll

.PHONY: viewdocs
viewdocs: ## view the docs in a browser
187 changes: 161 additions & 26 deletions README.md
@@ -2,15 +2,6 @@
<img width=20% src="https://dai.lids.mit.edu/wp-content/uploads/2018/08/cardea.png" alt="Cardea" />
</p>

<p align="left">
<i>Cardea is a machine learning library built on top of FHIR schema. </I>
</p>

<p align="left">
<i>An open source project from Data to AI Lab at MIT </I>
</p>



[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
[![PyPi Shield](https://img.shields.io/pypi/v/cardea.svg)](https://pypi.python.org/pypi/cardea)
@@ -19,24 +10,168 @@

# Cardea

-This library is under development. Please contact dai-lab@mit.edu or any of the contributors for more information. We will announce our first release soon.
+*This library is under development. Please contact dai-lab@mit.edu or any of the contributors for more information.*

* License: [MIT](https://github.com/MLBazaar/Cardea/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
* Homepage: https://github.com/MLBazaar/Cardea
* Documentation: https://MLBazaar.github.io/Cardea

# Overview

Cardea is a machine learning library built on top of *schemas* that support electronic health records (EHR). The library uses a number of AutoML tools developed under [The Human Data Interaction Project](https://github.com/HDI-Project) at [Data to AI Lab at MIT](https://dai.lids.mit.edu/).


Our goal is to provide an easy-to-use library to develop machine learning models from electronic health records. A typical usage of this library will involve interacting with our API to develop prediction models.

![process](docs/images/cardea-process.png)

A series of sequential processes is applied to build a machine learning model. These processes are triggered through the following APIs:

* loading data with the automatic **data assembler**, which captures data from its raw format into an entityset representation.

* **data labeling**, where we create label times that generate (1) the time index indicating the timespan over which features are calculated for each instance and (2) the encoded labels of the prediction task. This step is essential for the feature engineering phase.

* **featurization**, where we automatically engineer features on our data to generate a feature matrix.

* lastly, building, training, and tuning the machine learning model with the **modeling** component.

To learn more about how we structure our machine learning process and our data structures, read our documentation [here](https://MLBazaar.github.io/Cardea).
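
Below is a condensed sketch of how these phases map onto the API, using the same calls that the Quickstart walks through step by step (the problem and pipeline names are the ones used there):

```python3
from cardea import Cardea

cardea = Cardea()
cardea.load_entityset()                                   # data assembler
label_times = cardea.select_problem('MissedAppointment')  # data labeling
feature_matrix = cardea.generate_features(label_times)    # featurization
cardea.select_pipeline('Random Forest')                   # modeling
```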

# Quickstart

## Install with pip


The easiest and recommended way to install **Cardea** is using [pip](https://pip.pypa.io/en/stable/):

```bash
pip install cardea
```

This will pull and install the latest stable release from [PyPI](https://pypi.org/).

## Quickstart

In this short tutorial, we will guide you through a series of steps to get started with Cardea.

First, load the core class to work with:

```python3
from cardea import Cardea

cardea = Cardea()
```

We then seamlessly plug in our data. In this example, we load a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments).
To use this dataset, download the data and unzip it in the root directory, or run the following command:

```bash
curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip kaggle.zip
```
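
If ``curl`` is not available, the following is a pure-Python equivalent of the command above (same URL, standard library only):

```python3
import io
import urllib.request
import zipfile

# Download the demo archive and extract it into the current directory.
url = 'https://dai-cardea.s3.amazonaws.com/kaggle.zip'
with urllib.request.urlopen(url) as response:
    zipfile.ZipFile(io.BytesIO(response.read())).extractall()
```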
To load the data, call ``load_entityset``:

```python3
cardea.load_entityset()
```
> :bulb: To load local data, use ``cardea.load_entityset(folder_path='kaggle')``.

To verify that the data has been loaded, you can view the entityset through ``cardea.es``, which should output the following:

```bash
Entityset: kaggle
  Entities:
    Address [Rows: 81, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Appointment [Rows: 110527, Columns: 5]
    CodeableConcept [Rows: 4, Columns: 2]
    Coding [Rows: 3, Columns: 2]
    Identifier [Rows: 227151, Columns: 1]
    Observation [Rows: 110527, Columns: 3]
    Patient [Rows: 6100, Columns: 4]
    Reference [Rows: 6100, Columns: 1]
  Relationships:
    Appointment_Participant.actor -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id
    CodeableConcept.coding -> Coding.object_id
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Patient.address -> Address.object_id
```

The output shown represents the entityset data structure: ``cardea.es`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html).
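
As a sketch of further inspection, and assuming ``cardea.es`` behaves like a standard Featuretools entityset, you can look at the dataframe behind any single entity:

```python3
# Indexing the entityset by entity name returns the entity; its `df`
# attribute exposes the underlying dataframe (Featuretools convention).
cardea.es['Patient'].df.head()
```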

From there, you can select the prediction problem you aim to solve by specifying the name of the class, which in turn gives us the ``label_times`` of the problem.

```python3
label_times = cardea.select_problem('MissedAppointment')
```

``label_times`` summarizes, for each instance in the dataset, (1) the instance's corresponding label and (2) the time index that indicates the cutoff for calculating features pertaining to that instance.

```bash
          cutoff_time  instance_id      label
0 2015-11-10 07:13:56      5030230     noshow
1 2015-12-03 08:17:28      5122866  fulfilled
2 2015-12-07 10:40:59      5134197  fulfilled
3 2015-12-07 10:42:42      5134220     noshow
4 2015-12-07 10:43:01      5134223     noshow
```

You can read more about ``label_times`` [here](https://mlbazaar.github.io/Cardea/basic_concepts/machine_learning_tasks.html).
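
Since ``label_times`` prints like a pandas dataframe, a quick way to check how balanced the labels are, assuming it is one (or wraps one), is:

```python3
# Count how many instances fall under each label before modeling.
label_times['label'].value_counts()
```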

Then, you can perform the AutoML steps and take advantage of Cardea.

Cardea extracts features through automated feature engineering. Supply the ``label_times`` pertaining to the problem you aim to solve:

```python3
feature_matrix = cardea.generate_features(label_times[:1000])
```
> :warning: Featurizing the data might take a while depending on the size of the data. For demonstration, we only featurize the first 1000 records.

Once we have the features, we can split the data into training and testing sets:

```python3
y = list(feature_matrix.pop('label'))

X = feature_matrix.values

X_train, X_test, y_train, y_test = cardea.train_test_split(
    X, y, test_size=0.2, shuffle=True)
```

Now that the feature matrix is properly divided, we can use it to train our machine learning pipeline, optimizing hyperparameters and finding the most suitable model:

```python3
cardea.select_pipeline('Random Forest')
cardea.fit(X_train, y_train)
y_pred = cardea.predict(X_test)
```
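
As a quick sanity check on the held-out predictions, you can score them directly; this is a sketch that assumes scikit-learn is available in the environment:

```python3
from sklearn.metrics import accuracy_score

# Compare the pipeline's predictions against the held-out labels.
print(accuracy_score(y_test, y_pred))
```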

-Cardea is a machine learning library built on top of the FHIR data schema. The library uses a number of automl tools developed under ["The Human Data Interaction Project"](https://github.com/HDI-Project) at [Data to AI lab at MIT](https://dai.lids.mit.edu/). Our goal is to provide an easy to use library to develop machine learning models from electronic health records. A typical usage of this library will involve:
Finally, you can evaluate the performance of the model:
```python3
cardea.evaluate(X, y, test_size=0.2, shuffle=True)
```
which returns scoring metrics that depend on the type of problem:
```bash
{'Accuracy': 0.75,
 'F1 Macro': 0.5098039215686274,
 'Precision': 0.5183001719479243,
 'Recall': 0.5123528436411872}
```

-* Installing the library available via pypi
-* Integrating their data in FHIR schema (whatever subset of data is available)
-* Following the API develop some pre specified prediction models (or specify new ones using our API) The model building process is parameterized but automatically does:
-   * data cleaning, auditing
-   * preprocessing
-   * feature engineering
-   * machine learning model search and tuning
-   * model evaluation
-   * model auditing
-* Testing the models using our API
-* Preparing and deploying the models
# Citation
If you use Cardea for your research, please consider citing the following paper:

-## License
-- Free software: MIT license
Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. [Cardea: An Open Automated Machine Learning Framework for Electronic Health Records](https://arxiv.org/abs/2010.00509). [IEEE DSAA 2020](https://ieeexplore.ieee.org/document/9260104).

-## Documentation
-- Documentation: https://mlbazaar.github.io/Cardea
```bash
@inproceedings{alnegheimish2020cardea,
title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records},
author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan},
booktitle={2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)},
pages={536--545},
year={2020},
organization={IEEE}
}
```
2 changes: 1 addition & 1 deletion cardea/__init__.py
@@ -8,7 +8,7 @@
import logging
import os

-from cardea.cardea import Cardea
+from cardea.core import Cardea

logging.getLogger('cardea').addHandler(logging.NullHandler())
