From 16a024290606353213979210fb8cb82a2555623a Mon Sep 17 00:00:00 2001
From: sarahmish
-Cardea is a machine learning library built on top of FHIR schema. -
- --An open source project from Data to AI Lab at MIT -
- - [](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha) [](https://pypi.python.org/pypi/cardea) @@ -19,24 +10,168 @@ # Cardea -This library is under development. Please contact dai-lab@mit.edu or any of the contributors for more information. We will announce our first release soon. +*This library is under development. Please contact dai-lab@mit.edu or any of the contributors for more information.* + +* License: [MIT](https://github.com/MLBazaar/Cardea/blob/master/LICENSE) +* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha) +* Homepage: https://github.com/MLBazaar/Cardea +* Documentation: https://MLBazaar.github.io/Cardea + +# Overview + +Cardea is a machine learning library built on top of *schemas* that support electronic health records (EHR). The library uses a number of AutoML tools developed under [The Human Data Interaction Project](https://github.com/HDI-Project) at [Data to AI Lab at MIT](https://dai.lids.mit.edu/). + + +Our goal is to provide an easy to use library to develop machine learning models from electronic health records. A typical usage of this library will involve interacting with our API to develop prediction models. + +  + +A series of sequential processes are applied to build a machine learning model. These processes are triggered using our following APIs to perform the following: + +* loading data using the automatic **data assembler**, where we capture data from its raw format into an entityset representation. + +* **data labeling** where we create label times that generates (1) the time index that indicates the timespan for which I create my features (2) the encoded labels of the prediction task. this is essential for our feature engineering phase. + +* **featurization** for which we automatically feature engineer our data to generate a feature matrix. + +* lastly, we build, train, and tune our machine learning model using the **modeling** component. 
+ +To learn more about how we structure our machine learning process and our data structures, read our documentation [here](https://MLBazaar.github.io/Cardea). + +# Quickstart + +## Install with pip + + +The easiest and recommended way to install **Cardea** is using [pip](https://pip.pypa.io/en/stable/): + +```bash +pip install cardea +``` + +This will pull and install the latest stable release from [PyPI](https://pypi.org/). + +## Quickstart + +In this short tutorial we will guide you through a series of steps that will help you get started with Cardea. + +First, load the core class to work with: + +```python3 +from cardea import Cardea + +cardea = Cardea() +``` + +We then plug in our data. In this example, we load a pre-processed version of the [Kaggle dataset: Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments). +To use this dataset, download the data from here, then unzip it in the root directory, or run the command: + +```bash +curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip kaggle.zip +``` +To load the data, call the loader using the following command. When no ``folder_path`` is supplied, it loads the Kaggle missed appointment dataset automatically: + +```python3 +cardea.load_entityset() +``` +> :bulb: To load local data, use ``cardea.load_entityset(folder_path='kaggle')``. 
+ +To verify that the data has been loaded, you can find the loaded entityset by viewing ``cardea.es`` which should output the following: + +```bash +Entityset: kaggle + Entities: + Address [Rows: 81, Columns: 2] + Appointment_Participant [Rows: 6100, Columns: 2] + Appointment [Rows: 110527, Columns: 5] + CodeableConcept [Rows: 4, Columns: 2] + Coding [Rows: 3, Columns: 2] + Identifier [Rows: 227151, Columns: 1] + Observation [Rows: 110527, Columns: 3] + Patient [Rows: 6100, Columns: 4] + Reference [Rows: 6100, Columns: 1] + Relationships: + Appointment_Participant.actor -> Reference.identifier + Appointment.participant -> Appointment_Participant.object_id + CodeableConcept.coding -> Coding.object_id + Observation.code -> CodeableConcept.object_id + Observation.subject -> Reference.identifier + Patient.address -> Address.object_id +``` + +The output shown represents the entityset data structure where ``cardea.es`` is composed of entities and relationships. You can read more about entitysets [here](https://mlbazaar.github.io/Cardea/basic_concepts/data_loading.html). + +From there, you can select the prediction problem you aim to solve by specifying the name of the class, which in return gives us the ``label_times`` of the problem. + +```python3 +label_times = cardea.select_problem('MissedAppointment') +``` + +``label_times`` summarizes for each instance in the dataset (1) what is its corresponding label of the instance and (2) what is the time index that indicates the timespan allowed for calculating features that pertain to each instance in the dataset. + +```bash + cutoff_time instance_id label +0 2015-11-10 07:13:56 5030230 noshow +1 2015-12-03 08:17:28 5122866 fulfilled +2 2015-12-07 10:40:59 5134197 fulfilled +3 2015-12-07 10:42:42 5134220 noshow +4 2015-12-07 10:43:01 5134223 noshow +``` + +You can read more about ``label_times`` [here](https://mlbazaar.github.io/Cardea/basic_concepts/machine_learning_tasks.html). 
+ +Then, you can perform the AutoML steps and take advantage of Cardea. + +Cardea extracts features through automated feature engineering when you supply the ``label_times`` pertaining to the problem you aim to solve: + +```python3 +feature_matrix = cardea.generate_features(label_times[:1000]) +``` +> :warning: Featurizing the data might take a while depending on the size of the data. For demonstration, we only featurize the first 1000 records. + +Once we have the features, we can split the data into training and testing sets: + +```python3 +y = list(feature_matrix.pop('label')) + +X = feature_matrix.values + +X_train, X_test, y_train, y_test = cardea.train_test_split( + X, y, test_size=0.2, shuffle=True) +``` + +Now that we have our feature matrix properly divided, we can use it for modeling: training our machine learning pipeline, optimizing hyperparameters, and finding the best model: + +```python3 +cardea.select_pipeline('Random Forest') +cardea.fit(X_train, y_train) +y_pred = cardea.predict(X_test) +``` -Cardea is a machine learning library built on top of the FHIR data schema. The library uses a number of automl tools developed under ["The Human Data Interaction Project"](https://github.com/HDI-Project) at [Data to AI lab at MIT](https://dai.lids.mit.edu/). Our goal is to provide an easy to use library to develop machine learning models from electronic health records. 
A typical usage of this library will involve: +Finally, you can evaluate the performance of the model +```python3 +cardea.evaluate(X, y, test_size=0.2, shuffle=True) +``` +which returns the scoring metric depending on the type of problem +```bash +{'Accuracy': 0.75, + 'F1 Macro': 0.5098039215686274, + 'Precision': 0.5183001719479243, + 'Recall': 0.5123528436411872} +``` -* Installing the library available via pypi -* Integrating their data in FHIR schema (whatever subset of data is available) -* Following the API develop some pre specified prediction models (or specify new ones using our API) The model building process is parameterized but automatically does: - * data cleaning, auditing - * preprocessing - * feature engineering - * machine learning model search and tuning - * model evaluation - * model auditing -* Testing the models using our API -* Preparing and deploying the models +# Citation +If you use Cardea for your research, please consider citing the following paper: -## License -- Free software: MIT license +Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. [Cardea: An Open Automated Machine Learning Framework for Electronic Health Records](https://arxiv.org/abs/2010.00509). [IEEE DSAA 2020](https://ieeexplore.ieee.org/document/9260104). 
-## Documentation -- Documentation: https://mlbazaar.github.io/Cardea +```bash +@inproceedings{alnegheimish2020cardea, + title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records}, + author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan}, + booktitle={2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)}, + pages={536--545}, + year={2020}, + organization={IEEE} +} +``` diff --git a/cardea/__init__.py b/cardea/__init__.py index 7e1c0fbf..18c15abf 100644 --- a/cardea/__init__.py +++ b/cardea/__init__.py @@ -8,7 +8,7 @@ import logging import os -from cardea.cardea import Cardea +from cardea.core import Cardea logging.getLogger('cardea').addHandler(logging.NullHandler()) diff --git a/cardea/cardea.py b/cardea/cardea.py deleted file mode 100644 index b239fb9b..00000000 --- a/cardea/cardea.py +++ /dev/null @@ -1,239 +0,0 @@ -import json -from inspect import isclass - -import featuretools as ft -import pandas as pd - -import cardea -from cardea.data_loader import EntitySetLoader -from cardea.featurization import Featurization -from cardea.modeling import Modeler -from cardea.problem_definition import ( - DiagnosisPrediction, LengthOfStay, MissedAppointmentProblemDefinition, MortalityPrediction, - ProlongedLengthOfStay, Readmission) - - -class Cardea(): - """An interface class that ties the end-to-end system together. - - Args: - es_loader (EntitySetLoader): - An entityset loader. - featurization (Featurization): - A featurization class. - modeler (Modeler): - A modeling class. - problems (list): - A list of currently available prediction problems. - chosen_problem (str): - The selected prediction problem or regression. - es (featuretools.EntitySet): - The loaded entityset. - target_entity (str): - The target entity for featurization. 
- """ - - def __init__(self): - - self.es_loader = EntitySetLoader() - self.featurization = Featurization() - self.modeler = Modeler() - - self.es = None - self.chosen_problem = None - self.target_entity = None - - def load_data_entityset(self, folder_path=None): - """Returns an entityset loaded with .csv files in folder_path. - - Load the given dataset within the folder path into an entityset. The dataset - must be in a FHIR structure format. If no folder_path is not passed, the - function will automatically load kaggle's missed appointment dataset. - - Args: - folder_path (str): - A directory of all .csv files that should be loaded. - - Returns: - featuretools.EntitySet: - An entityset with loaded data. - """ - - if folder_path: - self.es = self.es_loader.load_data_entityset(folder_path) - - else: - csv_s3 = "https://s3.amazonaws.com/dai-cardea/" - kaggle = ['Address', - 'Appointment_Participant', - 'Appointment', - 'CodeableConcept', - 'Coding', - 'Identifier', - 'Observation', - 'Patient', - 'Reference'] - - fhir = { - resource: pd.read_csv( - csv_s3 + resource + ".csv") for resource in kaggle} - self.es = self.es_loader.load_df_entityset(fhir) - - def list_problems(self): - """Returns a list of the currently available problems. - - Returns: - list: - A list of the available problems. - """ - - problems = set([]) - for attribute_string in dir(cardea.problem_definition): - attribute = getattr(cardea.problem_definition, attribute_string) - if isclass(attribute): - if attribute.__name__ and attribute.__name__ != 'ProblemDefinition': - problems.add(attribute.__name__) - - return problems - - def select_problem(self, selection, parameter=None): - """Select a prediction problem and extract information. - - Update the select_problem attribute and generate the cutoff times, - the target entity and update the entityset. - - Args: - selection (str): - Name of the chosen prediction problem. - parameters (dict): - Variables to change the default parameters, if any. 
- - Returns: - featuretools.EntitySet, str, pandas.DataFrame: - * An updated EntitySet if a new column is generated. - * A string indicating the selected target entity. - * A dataframe of cutoff times and their target labels. - """ - - # problem selection - if selection == 'LengthOfStay': - self.chosen_problem = LengthOfStay() - - elif selection == 'MortalityPrediction': - self.chosen_problem = MortalityPrediction() - - elif selection == 'MissedAppointmentProblemDefinition': - self.chosen_problem = MissedAppointmentProblemDefinition() - - elif selection == 'ProlongedLengthOfStay' and parameter: - self.chosen_problem = ProlongedLengthOfStay(parameter) - - elif selection == 'ProlongedLengthOfStay': - self.chosen_problem = ProlongedLengthOfStay() - - elif selection == 'Readmission' and parameter: - self.chosen_problem = Readmission(parameter) - - elif selection == 'Readmission': - self.chosen_problem = Readmission() - - elif selection == 'DiagnosisPrediction' and parameter: - self.chosen_problem = DiagnosisPrediction(parameter) - - elif selection == 'DiagnosisPrediction': - raise ValueError('unspecified diagnosis code') - - else: - raise ValueError('{} is not a defined problem'.format(selection)) - - # target label calculation - self.es, self.target_entity, cutoff = self.chosen_problem.generate_cutoff_times(self.es) - return cutoff - - def list_feature_primitives(self): - """Returns built-in primitive in Featuretools. - - Returns: - pandas.DataFrame: - A dataframe that lists and describes each built-in primitives. - """ - return ft.list_primitives() - - def generate_features(self, cutoff): - """Returns a the calculated feature matrix. - - Args: - es (featuretools.EntitySet): - An entityset that holds data. - cutoff (pandas.DataFrame): - A dataframe that indicates cutoff time for each instance. - - Returns: - pandas.DataFrame, list: - * The generated feature matrix. - * List of feature definitions in the feature matrix. 
- """ - - fm_encoded, _ = self.featurization.generate_feature_matrix( - self.es, self.target_entity, cutoff) - fm_encoded = fm_encoded.reset_index(drop=True) - return fm_encoded - - def execute_model(self, feature_matrix, target, primitives, - optimize=False, hyperparameters=None): - """Executes and predicts all of the pipelines. - - This method executes the given pipeline and returns a list for all the pipelines - with the result of each fold with its associated predicted values and actual values. - - Args: - data_frame (pandas.DataFrame or ndarray): - A dataframe which encapsulates all the feature matrix. - target (ndarray): - An array of labels for the target variable. - primitives_list (list): - A list of the primitives within a pipeline. - optimize (bool): - A boolean value which indicates whether to optimize the model or not. - hyperparameters (dict): - A dictionary of hyperparameters for each primitive. - - Returns: - dict: - A dictionary for all the executed pipelines and its result. - """ - - return self.modeler.execute_pipeline( - data_frame=feature_matrix, - target=target, - primitives_list=primitives, - problem_type=self.chosen_problem.prediction_type, - optimize=optimize, - hyperparameters=hyperparameters - ) - - def convert_to_json(X): - """Converts a given dictionary to json format. - - Args: - X (dict): - A dictionary of values to be coverted. - - Returns: - str: - A string in json format. - """ - return json.dumps(X) - - def convert_from_json(X): - """Converts a given json string to dictionary format. - - Args: - X (str): - A string of values to be coverted to json. - - Returns: - dict: - A parsed dictionary. - """ - return json.loads(X) diff --git a/cardea/core.py b/cardea/core.py new file mode 100644 index 00000000..b884a205 --- /dev/null +++ b/cardea/core.py @@ -0,0 +1,348 @@ +"""Cardea Core module. + +This module defines the Cardea Class, which is responsible for the +tying all components together, as well as the interact with them. 
+""" + +import logging +import os +import pickle +from inspect import isclass + +import featuretools as ft +import pandas as pd + +import cardea +from cardea.data_loader import EntitySetLoader +from cardea.featurization import Featurization +from cardea.modeling import Modeler +from cardea.problem_definition import ( + DiagnosisPrediction, LengthOfStay, MissedAppointment, MortalityPrediction, + ProlongedLengthOfStay, Readmission) + +LOGGER = logging.getLogger(__name__) + + +class Cardea(): + """An interface class that ties the end-to-end system together. + + Args: + es_loader (EntitySetLoader): + An entityset loader. + featurization (Featurization): + A featurization class. + modeler (Modeler): + A modeling class. + problems (list): + A list of currently available prediction problems. + chosen_problem (str): + The selected prediction problem or regression. + es (featuretools.EntitySet): + The loaded entityset. + target_entity (str): + The target entity for featurization. + """ + + def __init__(self): + + self.es_loader = EntitySetLoader() + self.featurization = Featurization() + + self.es = None + self.chosen_problem = None + self.target_entity = None + self.modeler = None + + def load_entityset(self, folder_path=None): + """Returns an entityset loaded with .csv files in folder_path. + + Load the given dataset within the folder path into an entityset. The dataset + must be in a FHIR structure format. If no folder_path is not passed, the + function will automatically load kaggle's missed appointment dataset. + + Args: + folder_path (str): + A directory of all .csv files that should be loaded. + + Returns: + featuretools.EntitySet: + An entityset with loaded data. 
+ """ + + if folder_path: + self.es = self.es_loader.load_data_entityset(folder_path) + + else: + csv_s3 = "https://s3.amazonaws.com/dai-cardea/" + kaggle = ['Address', + 'Appointment_Participant', + 'Appointment', + 'CodeableConcept', + 'Coding', + 'Identifier', + 'Observation', + 'Patient', + 'Reference'] + + fhir = { + resource: pd.read_csv( + csv_s3 + resource + ".csv") for resource in kaggle} + self.es = self.es_loader.load_df_entityset(fhir) + + def list_problems(self): + """Returns a list of the currently available problems. + + Returns: + list: + A list of the available problems. + """ + + problems = set([]) + for attribute_string in dir(cardea.problem_definition): + attribute = getattr(cardea.problem_definition, attribute_string) + if isclass(attribute): + if attribute.__name__ and attribute.__name__ != 'ProblemDefinition': + problems.add(attribute.__name__) + + return problems + + def select_problem(self, selection, parameter=None): + """Select a prediction problem and extract information. + + Update the select_problem attribute and generate the cutoff times, + the target entity and update the entityset. + + Args: + selection (str): + Name of the chosen prediction problem. + parameters (dict): + Variables to change the default parameters, if any. + + Returns: + featuretools.EntitySet, str, pandas.DataFrame: + * An updated EntitySet if a new column is generated. + * A string indicating the selected target entity. + * A dataframe of cutoff times and their target labels. 
+ """ + LOGGER.info("Selecting %s prediction problem", selection) + + # problem selection + if selection == 'LengthOfStay': + self.chosen_problem = LengthOfStay() + + elif selection == 'MortalityPrediction': + self.chosen_problem = MortalityPrediction() + + elif selection == 'MissedAppointment': + self.chosen_problem = MissedAppointment() + + elif selection == 'ProlongedLengthOfStay' and parameter: + self.chosen_problem = ProlongedLengthOfStay(parameter) + + elif selection == 'ProlongedLengthOfStay': + self.chosen_problem = ProlongedLengthOfStay() + + elif selection == 'Readmission' and parameter: + self.chosen_problem = Readmission(parameter) + + elif selection == 'Readmission': + self.chosen_problem = Readmission() + + elif selection == 'DiagnosisPrediction' and parameter: + self.chosen_problem = DiagnosisPrediction(parameter) + + elif selection == 'DiagnosisPrediction': + raise ValueError('unspecified diagnosis code') + + else: + raise ValueError('{} is not a defined problem'.format(selection)) + + # target label calculation + self.es, self.target_entity, cutoff = self.chosen_problem.generate_cutoff_times(self.es) + + # set default pipeline + if self.chosen_problem.prediction_type == "classification": + pipeline = "Random Forest" + else: + pipeline = "Random Forest Regressor" + + self.modeler = Modeler(pipeline, self.chosen_problem.prediction_type) + + return cutoff + + def list_feature_primitives(self): + """Returns built-in primitive in Featuretools. + + Returns: + pandas.DataFrame: + A dataframe that lists and describes each built-in primitives. + """ + return ft.list_primitives() + + def generate_features(self, cutoff): + """Returns a the calculated feature matrix. + + Args: + es (featuretools.EntitySet): + An entityset that holds data. + cutoff (pandas.DataFrame): + A dataframe that indicates cutoff time for each instance. + + Returns: + pandas.DataFrame, list: + * The generated feature matrix. + * List of feature definitions in the feature matrix. 
+ """ + + fm_encoded, _ = self.featurization.generate_feature_matrix( + self.es, self.target_entity, cutoff) + fm_encoded = fm_encoded.reset_index(drop=True) + return fm_encoded + + def select_pipeline(self, pipeline): + """Select a pipeline. + + Args: + pipeline (MLPipeline or str): + A pipeline instance or the name/path of a pipeline. + """ + LOGGER.info("Selecting %s pipeline", pipeline) + self.modeler = Modeler(pipeline, self.chosen_problem.prediction_type) + + def train_test_split(self, X, y, test_size, shuffle): + """Split the training dataset and the testing dataset. + + Args: + X (pandas.DataFrame or ndarray): + Inputs to the pipeline. + y (pandas.Series or ndarray): + Target values. + test_size (float): + The proportion of the dataset to include in the test dataset. + shuffle (bool): + Whether or not to shuffle the data before splitting. + + Returns: + list: + List containing the train-test split of the inputs and targets. + """ + return self.modeler.train_test_split(X, y, test_size, shuffle) + + def fit(self, X, y, tune=False, max_evals=10, scoring=None, verbose=False): + """Train the cardea pipeline. + + Args: + X (pandas.DataFrame or ndarray): + Inputs to the pipeline. + y (pandas.Series ndarray): + Target values. + tune (bool): + Whether to optimize hyper-parameters of the pipelines. + max_evals (int): + Maximum number of hyper-parameter optimization iterations. + scoring (str): + The name of the scoring function used in the hyper-parameter optimization. + verbose (bool): + Whether to log information during processing. + """ + self.modeler.fit(X, y, tune, max_evals, scoring, verbose) + + def predict(self, X): + """Get predictions from the cardea pipeline. + + Args: + X (pandas.DataFrame or ndarray): + Inputs to the pipeline. + + Returns: + ndarray: + Predictions to the input data. 
+ """ + return self.modeler.predict(X) + + def fit_predict(self, X, y, tune=False, max_evals=10, scoring=None, verbose=False): + """Train a cardea pipeline then make predictions. + + Args: + X (pandas.DataFrame or ndarray): + Inputs to the pipeline. + y (pandas.Series or ndarray): + Target values. + tune (bool): + Whether to optimize hyper-parameters of the pipelines. + max_evals (int): + Maximum number of hyper-parameter optimization iterations. + scoring (str): + The name of the scoring function used in the hyper-parameter optimization. + verbose (bool): + Whether to log information during processing. + + Returns: + ndarray: + Predictions to the input data. + """ + return self.modeler.fit_predict(X, y, tune, max_evals, scoring, verbose) + + def evaluate(self, X, y, test_size=0.2, shuffle=True, tune=False, max_evals=10, scoring=None, + metrics=None, verbose=False): + """Evaluate the cardea pipeline. + + Args: + X (pandas.DataFrame or ndarray): + Inputs to the pipeline. + y (pandas.Series or ndarray): + Target values. + test_size (float): + The proportion of the dataset to include in the test dataset. + shuffle (bool): + Whether or not to shuffle the data before splitting. + tune (bool): + Whether to optimize hyper-parameters of the pipelines. + max_evals (int): + Maximum number of hyper-parameter optimization iterations. + scoring (str): + The name of the scoring function used in the hyper-parameter optimization. + metrics (list): + A list of scoring function names. The scoring functions should be consistent + with the problem type. + verbose (bool): + Whether to log information during processing. + """ + return self.modeler.evaluate( + X, y, test_size, shuffle, tune, max_evals, scoring, metrics, verbose) + + def save(self, path): + """Save this object using pickle. + + Args: + path (str): + Path to the file where the serialization of + this object will be stored. 
+ """ + os.makedirs(os.path.dirname(path), exist_ok=True) + with open(path, 'wb') as pickle_file: + pickle.dump(self, pickle_file) + + @classmethod + def load(cls, path: str): + """Load a Cardea instance from a pickle file. + + Args: + path (str): + Path to the file where the instance has been + previously serialized. + + Returns: + Cardea: + A Cardea instance + + Raises: + ValueError: + If the serialized object is not a Cardea instance. + """ + with open(path, 'rb') as pickle_file: + cardea = pickle.load(pickle_file) + if not isinstance(cardea, cls): + raise ValueError('Serialized object is not a Cardea instance') + + return cardea diff --git a/cardea/problem_definition/__init__.py b/cardea/problem_definition/__init__.py index 68e6f123..9e39a925 100644 --- a/cardea/problem_definition/__init__.py +++ b/cardea/problem_definition/__init__.py @@ -5,7 +5,7 @@ from cardea.problem_definition.predicting_diagnosis import DiagnosisPrediction from cardea.problem_definition.prolonged_length_of_stay import ProlongedLengthOfStay from cardea.problem_definition.readmission import Readmission -from cardea.problem_definition.show_noshow_appointment import MissedAppointmentProblemDefinition +from cardea.problem_definition.show_noshow_appointment import MissedAppointment __all__ = ( "ProblemDefinition", @@ -14,5 +14,5 @@ "DiagnosisPrediction", "ProlongedLengthOfStay", "Readmission", - "MissedAppointmentProblemDefinition" + "MissedAppointment" ) diff --git a/cardea/problem_definition/show_noshow_appointment.py b/cardea/problem_definition/show_noshow_appointment.py index 7e6feb3f..6093802f 100644 --- a/cardea/problem_definition/show_noshow_appointment.py +++ b/cardea/problem_definition/show_noshow_appointment.py @@ -3,7 +3,7 @@ from cardea.problem_definition import ProblemDefinition -class MissedAppointmentProblemDefinition (ProblemDefinition): +class MissedAppointment(ProblemDefinition): """Defines the problem of missed appointment Predict whether the patient will show to the 
appointment or not. diff --git a/docs/api_reference/cardea.rst b/docs/api_reference/cardea.rst index 9b5e6766..9f968a44 100644 --- a/docs/api_reference/cardea.rst +++ b/docs/api_reference/cardea.rst @@ -12,11 +12,16 @@ Cardea :toctree: api/ Cardea - Cardea.load_data_entityset + Cardea.load_entityset Cardea.list_problems Cardea.select_problem Cardea.list_feature_primitives Cardea.generate_features - Cardea.execute_model - Cardea.convert_to_json - Cardea.convert_from_json \ No newline at end of file + Cardea.select_pipeline + Cardea.train_test_split + Cardea.fit + Cardea.predict + Cardea.fit_predict + Cardea.evaluate + Cardea.save + Cardea.load \ No newline at end of file diff --git a/docs/api_reference/modeling.rst b/docs/api_reference/modeling.rst index 471917a8..350c5331 100644 --- a/docs/api_reference/modeling.rst +++ b/docs/api_reference/modeling.rst @@ -12,4 +12,10 @@ Modeler :toctree: api/ Modeler - Modeler.execute_pipeline \ No newline at end of file + Modeler.train_test_split + Modeler.fit + Modeler.predict + Modeler.fit_predict + Modeler.evaluate + Modeler.save + Modeler.load diff --git a/docs/api_reference/problem_definition.rst b/docs/api_reference/problem_definition.rst index 190dc94d..7a109238 100644 --- a/docs/api_reference/problem_definition.rst +++ b/docs/api_reference/problem_definition.rst @@ -56,5 +56,5 @@ MissedAppointmentProblemDefinition .. autosummary:: :toctree: api/ - MissedAppointmentProblemDefinition - MissedAppointmentProblemDefinition.generate_cutoff_times + MissedAppointment + MissedAppointment.generate_cutoff_times diff --git a/docs/authors.rst b/docs/authors.rst deleted file mode 100644 index e122f914..00000000 --- a/docs/authors.rst +++ /dev/null @@ -1 +0,0 @@ -.. include:: ../AUTHORS.rst diff --git a/docs/basic_concepts/index.rst b/docs/basic_concepts/index.rst index 6d54c8d2..cf0ebc8d 100644 --- a/docs/basic_concepts/index.rst +++ b/docs/basic_concepts/index.rst @@ -1,3 +1,5 @@ +.. 
_concepts: + Basic Concepts ============== diff --git a/docs/basic_concepts/machine_learning_tasks.rst b/docs/basic_concepts/machine_learning_tasks.rst index fbcb1fc6..f58a3992 100644 --- a/docs/basic_concepts/machine_learning_tasks.rst +++ b/docs/basic_concepts/machine_learning_tasks.rst @@ -35,8 +35,8 @@ values in the **Missed Appointment** task: from cardea import Cardea cardea = Cardea() - cardea.load_data_entityset() - cardea.select_problem('MissedAppointmentProblemDefinition') + cardea.load_entityset() + cardea.select_problem('MissedAppointment') Current Prediction Problems --------------------------- diff --git a/docs/community/contributing.rst b/docs/community/contributing.rst index 09af707e..f73491fc 100644 --- a/docs/community/contributing.rst +++ b/docs/community/contributing.rst @@ -6,13 +6,13 @@ Contributing Guidelines Ready to contribute with your own code? Great! Before diving deeper into the contributing guidelines, please make sure to having read -the :ref:`concepts` section and to have gone through the :ref:`development` guide. +the :ref:`concepts` section and to have gone through the development guide. Afterwards, please make sure to read the following contributing guidelines carefully, and later on head to the step-by-step guides for each possible type of contribution. General Coding Guidelines -************************* +------------------------- Once you have set up your development environment, you are ready to start working on your python code. 
@@ -76,7 +76,7 @@ When doing so, make sure to follow these guidelines: Unit Testing Guidelines -*********************** +----------------------- If you are going to contribute Python code, we will ask you to write unit tests that cover your development, following these requirements: diff --git a/docs/getting_started/quickstart.rst b/docs/getting_started/quickstart.rst index 861be6cb..8011a7d3 100644 --- a/docs/getting_started/quickstart.rst +++ b/docs/getting_started/quickstart.rst @@ -20,59 +20,60 @@ following command: .. ipython:: python - cardea.load_data_entityset() + cardea.load_entityset() + cardea.es You can see the list of problem definitions and select one with the following commands: .. ipython:: python cardea.list_problems() - problem = cardea.select_problem('MissedAppointmentProblemDefinition') -Then, you can perform the AutoML steps and take advantage of Cardea: +From there, you can select the prediction problem you aim to solve by specifying the name of the class, which in return gives us the ``label_times`` of the problem. -1. Extracting features (automated feature engineering), using the following commands: +.. ipython:: python - .. ipython:: python + label_times = cardea.select_problem('MissedAppointment') + label_times.head() - feature_matrix = cardea.generate_features(problem[:1000]) # a subset - feature_matrix = feature_matrix.sample(frac=1) # shuffle - y = list(feature_matrix.pop('label')) - X = feature_matrix.values +Then, you can perform the AutoML steps and take advantage of Cardea. -2. Modeling, optimizing hyperparameters and finding the most optimal model using the following commands: +Cardea extracts features through automated feature engineering by supplying the ``label_times`` pertaining to the problem you aim to solve, using the following commands: - .. ipython:: python + .. 
ipython:: python + :okwarning: - pipeline = [ - ['sklearn.ensemble.RandomForestClassifier'], - ['sklearn.naive_bayes.MultinomialNB'], - ['sklearn.neighbors.KNeighborsClassifier'] - ] + feature_matrix = cardea.generate_features(label_times[:1000]) # a subset + feature_matrix.head() - result = cardea.execute_model( - feature_matrix=X, - target=y, - primitives=pipeline - ) +Once we have the features, we can now split the data into training and testing + .. ipython:: python + :okwarning: -Finally, you can see accuracy results using the following commands: + y = list(feature_matrix.pop('label')) + X = feature_matrix.values + + X_train, X_test, y_train, y_test = cardea.train_test_split( + X, y, test_size=0.2, shuffle=True) -.. ipython:: python - import pandas as pd - from sklearn.metrics import accuracy_score +Now that we have our feature matrix properly divided, we can use to train our machine learning pipeline, Modeling, optimizing hyperparameters and finding the most optimal model is done using the following commands: - y_test = [] - y_pred = [] + .. ipython:: python + :okwarning: + + cardea.select_pipeline('Random Forest') + cardea.fit(X_train, y_train) + y_pred = cardea.predict(X_test) + + +Finally, you can see accuracy results using the following commands: - for i in range(0, 10): - y_test.extend(result['pipeline0']['folds'][str(i)]['Actual']) - y_pred.extend(result['pipeline0']['folds'][str(i)]['predicted']) + .. ipython:: python + :okwarning: + + cardea.evaluate(X, y, test_size=0.2, metrics=['Accuracy', 'F1 Macro']) - y_test = pd.Categorical(pd.Series(y_test)).codes - y_pred = pd.Categorical(pd.Series(y_pred)).codes - accuracy_score(y_test, y_pred) .. 
_Medical Appointment No Shows: https://www.kaggle.com/joniarroba/noshowappointments diff --git a/docs/images/cardea-process.png b/docs/images/cardea-process.png new file mode 100644 index 0000000000000000000000000000000000000000..21d8ab341eb0dd29ed09c97c0818df756f36b9fd GIT binary patch literal 71017
[Base85-encoded binary data for docs/images/cardea-process.png omitted]