diff --git a/pipelines/introduction-to-generic-pipelines/Part 1 - Data Cleaning.ipynb b/pipelines/introduction-to-generic-pipelines/Part 1 - Data Cleaning.ipynb index b9e0f01..d55d8cf 100644 --- a/pipelines/introduction-to-generic-pipelines/Part 1 - Data Cleaning.ipynb +++ b/pipelines/introduction-to-generic-pipelines/Part 1 - Data Cleaning.ipynb @@ -2,1032 +2,155 @@ "cells": [ { "cell_type": "markdown", + "id": "34369de6-b412-4e7c-8f68-4a65781545d1", "metadata": { - "papermill": { - "duration": 0.092204, - "end_time": "2020-06-23T18:39:21.723375", - "exception": false, - "start_time": "2020-06-23T18:39:21.631171", - "status": "completed" - }, + "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ - "# Cleaning NOAA Weather Data of JFK Airport (New York)\n", - "\n", - "This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n", - "\n", - "In this notebook, we clean the raw dataset by:\n", - "* removing redundant columns and preserving only key numeric columns\n", - "* converting and cleaning data where required\n", - "* creating a fixed time interval between observations (this aids with later time-series analysis)\n", - "* filling missing values\n", - "* encoding certain weather features\n", - "\n", - "### Table of Contents:\n", - "* [1. Read the Raw Data](#cell1)\n", - "* [2. 
Clean the Data](#cell2)\n", - " * [2.1 Select data columns](#cell3)\n", - " * [2.2 Clean up precipitation column](#cell4)\n", - " * [2.3 Convert columns to numerical types](#cell5)\n", - " * [2.4 Reformat and process data](#cell6)\n", - " * [2.5 Create a fixed interval dataset](#cell7)\n", - " * [2.6 Feature encoding](#cell8)\n", - " * [2.7 Rename columns](#cell9)\n", - "* [3. Save the Cleaned Data](#cell10)\n", - "* [Authors](#authors)\n", - "\n", - "#### Import required modules\n", - "\n", - "Import and configure the required modules." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 8.307268, - "end_time": "2020-06-23T18:39:30.075908", - "exception": false, - "start_time": "2020-06-23T18:39:21.768640", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!pip3 install PyGithub pandas > /dev/null 2>&1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 1.643517, - "end_time": "2020-06-23T18:39:31.767117", - "exception": false, - "start_time": "2020-06-23T18:39:30.123600", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Define required imports\n", - "import numpy as np\n", - "import pandas as pd\n", - "import re\n", - "import sys\n", - "\n", - "# These set pandas max column and row display in the notebook\n", - "pd.set_option('display.max_columns', 50)\n", - "pd.set_option('display.max_rows', 50)" + "## Copyright 2018-2022 Elyra Authors" ] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.041501, - "end_time": "2020-06-23T18:39:31.845060", - "exception": false, - "start_time": "2020-06-23T18:39:31.803559", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "\n", - "### 1. 
Read the Raw Data\n", - "\n", - "We start by reading in the raw dataset, displaying the first few rows of the dataframe, and taking a look at the columns and column types present." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 12.521019, - "end_time": "2020-06-23T18:39:44.415882", - "exception": false, - "start_time": "2020-06-23T18:39:31.894863", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "raw_data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather.csv',\n", - " parse_dates=['DATE'])\n", - "raw_data.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.111785, - "end_time": "2020-06-23T18:39:44.607034", - "exception": false, - "start_time": "2020-06-23T18:39:44.495249", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "raw_data.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.05262, - "end_time": "2020-06-23T18:39:44.711391", - "exception": false, - "start_time": "2020-06-23T18:39:44.658771", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "\n", - "### 2. Clean the Data\n", - "\n", - "As you can see above, there are a lot of fields which are non-numerical - usually these will be fields that contain text or categorical data, e.g. `HOURLYSKYCONDITIONS`.\n", - "\n", - "There are also fields - such as the main temperature field of interest `HOURLYDRYBULBTEMPF` - that we expect to be numerical, but are instead `object` type. This often indicates that there may be missing (or `null`) values, or some other unusual readings that we may have to deal with (since otherwise the field would have been fully parsed as a numerical data type).\n", - "\n", - "In addition, some fields relate to hourly observations, while others relate to daily or monthly intervals. 
For purposes of later exploratory data analysis, we will restrict the dataset to a certain subset of numerical fields that relate to hourly observations.\n", - "\n", - "In this section, we refer to the [NOAA Local Climatological Data Documentation](https://data.noaa.gov/dataset/dataset/u-s-local-climatological-data-lcd/resource/ee7381ea-647a-434f-8cfa-81202b9b4c05) to describe the fields and meaning of various values.\n", - "\n", - "\n", - "#### 2.1 Select data columns\n", - "\n", - "First, we select only the subset of data columns of interest and inspect the column types." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.310215, - "end_time": "2020-06-23T18:39:45.071537", - "exception": false, - "start_time": "2020-06-23T18:39:44.761322", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Choose what columns to import from raw data\n", - "column_subset = [\n", - " 'DATE',\n", - " 'HOURLYVISIBILITY',\n", - " 'HOURLYDRYBULBTEMPF',\n", - " 'HOURLYWETBULBTEMPF',\n", - " 'HOURLYDewPointTempF',\n", - " 'HOURLYRelativeHumidity',\n", - " 'HOURLYWindSpeed',\n", - " 'HOURLYWindDirection',\n", - " 'HOURLYStationPressure',\n", - " 'HOURLYPressureTendency',\n", - " 'HOURLYSeaLevelPressure',\n", - " 'HOURLYPrecip',\n", - " 'HOURLYAltimeterSetting'\n", - "]\n", - "\n", - "# Filter dataset to relevant columns\n", - "hourly_data = raw_data[column_subset]\n", - "# Set date index\n", - "hourly_data = hourly_data.set_index(pd.DatetimeIndex(hourly_data['DATE']))\n", - "hourly_data.drop(['DATE'], axis=1, inplace=True)\n", - "hourly_data.replace(to_replace='*', value=np.nan, inplace=True)\n", - "hourly_data.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.075418, - "end_time": "2020-06-23T18:39:45.193224", - "exception": false, - "start_time": "2020-06-23T18:39:45.117806", - "status": "completed" - }, - "tags": [] - }, - 
"outputs": [], - "source": [ - "hourly_data.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.038803, - "end_time": "2020-06-23T18:39:45.296888", - "exception": false, - "start_time": "2020-06-23T18:39:45.258085", - "status": "completed" - }, - "tags": [] - }, + "id": "a3167354-5a83-4950-8e88-c4899d93737d", + "metadata": {}, "source": [ - "\n", - "#### 2.2 Clean up precipitation column\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", "\n", - "From the dataframe preview above, we can see that the column `HOURLYPrecip` - which is the hourly measure of precipitation levels - contains both `NaN` and `T` values. `T` specifies *trace amounts of precipitation*, while `NaN` means *not a number*, and is used to denote missing values.\n", - "\n", - "We can also inspect the unique values present for the field." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.075263, - "end_time": "2020-06-23T18:39:45.412578", - "exception": false, - "start_time": "2020-06-23T18:39:45.337315", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "hourly_data['HOURLYPrecip'].unique()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.050967, - "end_time": "2020-06-23T18:39:45.515397", - "exception": false, - "start_time": "2020-06-23T18:39:45.464430", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We can see that some values end with an `s` (indicating snow), while there is a strange value `0.020.01s` which appears to be an error of some sort. To deal with `T` values, we will set the observation to be `0`. We will also replace the erroneous value `0.020.01s` with `NaN`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.068251, - "end_time": "2020-06-23T18:39:45.642901", - "exception": false, - "start_time": "2020-06-23T18:39:45.574650", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Fix imported data\n", - "hourly_data['HOURLYPrecip'].replace(to_replace='T', value='0.00', inplace=True)\n", - "hourly_data['HOURLYPrecip'].replace('0.020.01s', np.nan, inplace=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.063395, - "end_time": "2020-06-23T18:39:45.772975", - "exception": false, - "start_time": "2020-06-23T18:39:45.709580", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "#### 2.3 Convert columns to numerical types\n", - "\n", - "Next, we will convert string columns that refer to numerical values to numerical types. For columns such as `HOURLYPrecip`, we first also drop the non-numerical parts of the value (the `s` character)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 2.690677, - "end_time": "2020-06-23T18:39:48.523767", - "exception": false, - "start_time": "2020-06-23T18:39:45.833090", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Set of columns to convert\n", - "messy_columns = column_subset[1:]\n", - "\n", - "# Convert columns to float32 datatype\n", - "for i in messy_columns:\n", - " hourly_data[i] = hourly_data[i].apply(\n", - " lambda x: re.sub('[^0-9,.-]', '', x)\n", - " if type(x) == str else x).replace('', np.nan).astype(('float32'))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.061545, - "end_time": "2020-06-23T18:39:48.670019", - "exception": false, - "start_time": "2020-06-23T18:39:48.608474", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We can now see that all fields have numerical data type." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.15388, - "end_time": "2020-06-23T18:39:48.876939", - "exception": false, - "start_time": "2020-06-23T18:39:48.723059", - "status": "completed" - }, - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "print(hourly_data.info())\n", - "print()\n", - "hourly_data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.060749, - "end_time": "2020-06-23T18:39:49.004064", - "exception": false, - "start_time": "2020-06-23T18:39:48.943315", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "#### 2.4 Reformat and process data\n", - "\n", - "Next, we will clean up some of the data columns to ensure their values fall within the parameters defined by the NOAA documentation (referred to above). 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.348362, - "end_time": "2020-06-23T18:39:49.415029", - "exception": false, - "start_time": "2020-06-23T18:39:49.066667", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Generate the summary statistics for each column\n", - "hourly_data.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.071245, - "end_time": "2020-06-23T18:39:49.539044", - "exception": false, - "start_time": "2020-06-23T18:39:49.467799", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "According to the documentation, the `HOURLYPressureTendency` field should be an integer value in the range `[0, 8]`. Let's check if this condition holds for this dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.112969, - "end_time": "2020-06-23T18:39:49.719340", - "exception": false, - "start_time": "2020-06-23T18:39:49.606371", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Check if categorical variable HOURLYPressureTendency ever has a non-integer\n", - "# entry outside the bounds of 0-8\n", - "cond =\\\n", - " len(hourly_data[~hourly_data['HOURLYPressureTendency'].isin(\n", - " list(range(0, 9)) + [np.nan])])\n", + "http://www.apache.org/licenses/LICENSE-2.0\n", "\n", - "print('Hourly Pressure Tendency should be between 0 and 8: {}'\n", - " .format(cond == 0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.064188, - "end_time": "2020-06-23T18:39:49.838842", - "exception": false, - "start_time": "2020-06-23T18:39:49.774654", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "The `HOURLYVISIBILITY` should be an integer in the range `[0, 10]`. Let's check this condition too." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.188961, - "end_time": "2020-06-23T18:39:50.114669", - "exception": false, - "start_time": "2020-06-23T18:39:49.925708", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Hourly Visibility should be between 0 and 10\n", - "hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)]" + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and\n", + "limitations under the License." ] }, { "cell_type": "markdown", + "id": "d113346a-57c6-4d24-a056-5399f4143d8e", "metadata": { - "papermill": { - "duration": 0.074652, - "end_time": "2020-06-23T18:39:50.305437", - "exception": false, - "start_time": "2020-06-23T18:39:50.230785", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We find that a couple of observations fall outside the range. These must be spurious data observations and we handle them by replacing them with `NaN`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.106781, - "end_time": "2020-06-23T18:39:50.494873", - "exception": false, - "start_time": "2020-06-23T18:39:50.388092", - "status": "completed" - }, "tags": [] }, - "outputs": [], "source": [ - "# Replace any hourly visibility figure outside these bounds with nan\n", - "hourly_data.loc[hourly_data['HOURLYVISIBILITY'] > 10, 'HOURLYVISIBILITY'] = np.nan\n", - "\n", - "# Hourly Visibility should be between 0 and 10\n", - "cond = len(hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)])\n", + "## Add a header row\n", "\n", - "print('Hourly Visibility should be between 0 and 10: {}'.format(cond == 0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.059017, - "end_time": "2020-06-23T18:39:50.621964", - "exception": false, - "start_time": "2020-06-23T18:39:50.562947", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Finally, we check if there are any duplicates with respect to our `DATE` index and check furthermore that our dates are in the correct order (that is, strictly increasing)." + "This tutorial notebook adds a header row to a csv file, if one isn't included yet." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.135424, - "end_time": "2020-06-23T18:39:50.813078", - "exception": false, - "start_time": "2020-06-23T18:39:50.677654", - "status": "completed" - }, - "tags": [] - }, + "id": "7a68dff7-cd67-4d8a-ae48-396734cf8d40", + "metadata": {}, "outputs": [], "source": [ - "cond = len(hourly_data[hourly_data.index.duplicated()].sort_index())\n", - "print('Date index contains no duplicate entries: {}'.format(cond == 0))" + "# Install pandas package, if it isn't already installed\n", + "# ! 
pip install pandas" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.119654, - "end_time": "2020-06-23T18:39:51.012609", - "exception": false, - "start_time": "2020-06-23T18:39:50.892955", - "status": "completed" - }, - "tags": [] - }, + "id": "8ccef880-e690-4f0c-8827-d98141b6b6fa", + "metadata": {}, "outputs": [], "source": [ - "# Make sure time index is sorted and increasing\n", - "print('Date index is strictly increasing: {}'\n", - " .format(hourly_data.index.is_monotonic_increasing))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.108063, - "end_time": "2020-06-23T18:39:51.209020", - "exception": false, - "start_time": "2020-06-23T18:39:51.100957", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "#### 2.5 Create a fixed interval dataset\n", + "from csv import Sniffer\n", + "import os\n", + "from pathlib import Path\n", "\n", - "Most time-series analysis requires (or certainly works much better with) data that has fixed measurement intervals. As you may have noticed from the various data samples above, the measurement intervals for this dataset are not exactly hourly. So, we will use `Pandas`' resampling functionality to create a dataset that has exact hourly measurement intervals." 
+ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.215191, - "end_time": "2020-06-23T18:39:51.541396", - "exception": false, - "start_time": "2020-06-23T18:39:51.326205", - "status": "completed" - }, - "tags": [] - }, + "id": "2ff485d7-e4e2-4e41-88dc-384fae5bc7b2", + "metadata": {}, "outputs": [], "source": [ - "# Resample (downsample) to hourly rows (we're shifting everything up by 9 minutes!)\n", - "hourly_data = hourly_data.resample('60min').last().shift(periods=1) # noqa Note: use resample('60min', base=51) to resample on the 51st of every hour" + "data_filename_in = os.getenv(\"DATASET_FILENAME\", \"iris.data\")" ] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.140495, - "end_time": "2020-06-23T18:39:51.763022", - "exception": false, - "start_time": "2020-06-23T18:39:51.622527", - "status": "completed" - }, - "tags": [] - }, + "id": "87e824a5-e237-40d2-8da3-4725d6d28b4c", + "metadata": {}, "source": [ - "We will now also replace missing values. For numerical values, we will linearly interpolate between the previous and next valid obvservations. For the categorical `HOURLYPressureTendency` field, we will replace missing values with the last valid observation." 
+ "Determine whether the dataset file already includes a header row" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.241565, - "end_time": "2020-06-23T18:39:52.094112", - "exception": false, - "start_time": "2020-06-23T18:39:51.852547", - "status": "completed" - }, - "tags": [] - }, + "id": "c7549520-168e-431f-a756-0bf82c3a3465", + "metadata": {}, "outputs": [], "source": [ - "hourly_data['HOURLYPressureTendency'] =\\\n", - " hourly_data['HOURLYPressureTendency'].fillna(method='ffill') # fill with last valid observation\n", - "hourly_data = hourly_data.interpolate(method='linear') # interpolate missing values\n", - "hourly_data.drop(hourly_data.index[0], inplace=True) # drop first row" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.184815, - "end_time": "2020-06-23T18:39:52.363119", - "exception": false, - "start_time": "2020-06-23T18:39:52.178304", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "print(hourly_data.info())\n", - "print()\n", - "hourly_data.head()" + "with open(data_filename_in) as file_in:\n", + " has_header_row = Sniffer().has_header(file_in.read(4096))" ] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.067926, - "end_time": "2020-06-23T18:39:52.505261", - "exception": false, - "start_time": "2020-06-23T18:39:52.437335", - "status": "completed" - }, - "tags": [] - }, + "id": "45d1fc78-06bf-4d89-862c-3a4b7ff904ce", + "metadata": {}, "source": [ - "\n", - "#### 2.6 Feature encoding\n", - "\n", - "The final pre-processing step we will perform will be to handle two of our columns in a special way in order to correctly encode these features. They are:\n", - "\n", - "1. `HOURLYWindDirection` - wind direction\n", - "2. 
`HOURLYPressureTendency` - an indicator of pressure changes\n",
- "\n",
- "For `HOURLYWindDirection`, we encode the raw feature value as two new values, which measure the cyclical nature of wind direction - that is, we are encoding the compass-point nature of wind direction measurements."
+ "Add a header row, if none was detected."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "papermill": {
- "duration": 0.130511,
- "end_time": "2020-06-23T18:39:52.708622",
- "exception": false,
- "start_time": "2020-06-23T18:39:52.578111",
- "status": "completed"
- },
- "tags": []
- },
+ "id": "73f5d089-ada6-4f73-8f2b-bc34a66ee078",
+ "metadata": {},
"outputs": [],
"source": [
- "# Transform HOURLYWindDirection into a cyclical variable using sin and cos transforms\n",
- "hourly_data['HOURLYWindDirectionSin'] = np.sin(hourly_data['HOURLYWindDirection'] * (2. * np.pi / 360))\n",
- "hourly_data['HOURLYWindDirectionCos'] = np.cos(hourly_data['HOURLYWindDirection'] * (2. * np.pi / 360))\n",
- "hourly_data.drop(['HOURLYWindDirection'], axis=1, inplace=True)"
+ "if has_header_row:\n",
+ " iris_df = pd.read_csv(data_filename_in)\n",
+ " # headers = list(iris_df.columns)\n",
+ "else:\n",
+ " headers = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\", \"class\"]\n",
+ " iris_df = pd.read_csv(data_filename_in, names=headers)"
]
},
{
"cell_type": "markdown",
- "metadata": {
- "papermill": {
- "duration": 0.079907,
- "end_time": "2020-06-23T18:39:52.855383",
- "exception": false,
- "start_time": "2020-06-23T18:39:52.775476",
- "status": "completed"
- },
- "tags": []
- },
+ "id": "54ecc313-2c10-4f85-a5fa-6608050e8f03",
+ "metadata": {},
"source": [
- "For `HOURLYPressureTendency`, the feature value is in fact a `categorical` feature with three levels:\n",
- "* `0-3` indicates an increase in pressure over the previous 3 hours\n",
- "* `4` indicates no change during the previous 3 hours\n",
- "* `5-8` indicates a decrease over the previous 3 
hours\n", - "\n", - "Hence, we encode this feature into 3 dummy values representing these 3 potential states." + "Save the dataset file, which now includes a header row" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.264309, - "end_time": "2020-06-23T18:39:53.195280", - "exception": false, - "start_time": "2020-06-23T18:39:52.930971", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Transform HOURLYPressureTendency into 3 dummy variables based on NOAA documentation\n", - "hourly_data['HOURLYPressureTendencyIncr'] =\\\n", - " [1.0 if x in [0, 1, 2, 3]\n", - " else 0.0 for x in hourly_data['HOURLYPressureTendency']] # noqa 0 through 3 indicates an increase in pressure over previous 3 hours\n", - "hourly_data['HOURLYPressureTendencyDecr'] =\\\n", - " [1.0 if x in [5, 6, 7, 8]\n", - " else 0.0 for x in hourly_data['HOURLYPressureTendency']] # noqa 5 through 8 indicates a decrease over previous 3 hours\n", - "hourly_data['HOURLYPressureTendencyConst'] =\\\n", - " [1.0 if x == 4\n", - " else 0.0 for x in hourly_data['HOURLYPressureTendency']] # noqa 4 indicates no change during previous 3 hours\n", - "hourly_data.drop(['HOURLYPressureTendency'], axis=1, inplace=True)\n", - "hourly_data['HOURLYPressureTendencyIncr'] =\\\n", - " hourly_data['HOURLYPressureTendencyIncr'].astype(('float32'))\n", - "hourly_data['HOURLYPressureTendencyDecr'] =\\\n", - " hourly_data['HOURLYPressureTendencyDecr'].astype(('float32'))\n", - "hourly_data['HOURLYPressureTendencyConst'] =\\\n", - " hourly_data['HOURLYPressureTendencyConst'].astype(('float32'))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.110513, - "end_time": "2020-06-23T18:39:53.410299", - "exception": false, - "start_time": "2020-06-23T18:39:53.299786", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "#### 2.7 Rename columns\n", - "\n", - "Before saving the dataset, we will 
rename the columns for readability." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.104134, - "end_time": "2020-06-23T18:39:53.581955", - "exception": false, - "start_time": "2020-06-23T18:39:53.477821", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "hourly_data.columns" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.094553, - "end_time": "2020-06-23T18:39:53.766644", - "exception": false, - "start_time": "2020-06-23T18:39:53.672091", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# define the new column names\n", - "columns_new_name = [\n", - " 'visibility',\n", - " 'dry_bulb_temp_f',\n", - " 'wet_bulb_temp_f',\n", - " 'dew_point_temp_f',\n", - " 'relative_humidity',\n", - " 'wind_speed',\n", - " 'station_pressure',\n", - " 'sea_level_pressure',\n", - " 'precip',\n", - " 'altimeter_setting',\n", - " 'wind_direction_sin',\n", - " 'wind_direction_cos',\n", - " 'pressure_tendency_incr',\n", - " 'pressure_tendency_decr',\n", - " 'pressure_tendency_const'\n", - "]\n", - "\n", - "columns_name_map =\\\n", - " {c: columns_new_name[i] for i, c in enumerate(hourly_data.columns)}\n", - "\n", - "hourly_data_renamed = hourly_data.rename(columns=columns_name_map)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.181441, - "end_time": "2020-06-23T18:39:54.052044", - "exception": false, - "start_time": "2020-06-23T18:39:53.870603", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "print(hourly_data_renamed.info())\n", - "print()\n", - "hourly_data_renamed.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "papermill": { - "duration": 0.136374, - "end_time": "2020-06-23T18:39:54.265600", - "exception": false, - "start_time": "2020-06-23T18:39:54.129226", - 
"status": "completed" - }, - "tags": [] - }, + "id": "09aa38de-7776-439b-a352-daef24e484ff", + "metadata": {}, "outputs": [], "source": [ - "# Explore some general information about the dataset\n", - "print('# of megabytes held by dataframe: {}'.format(\n", - " str(round(sys.getsizeof(hourly_data_renamed) / 1000000, 2))))\n", - "print('# of features: {}'.format(str(hourly_data_renamed.shape[1])))\n", - "print('# of observations: {}'.format(str(hourly_data_renamed.shape[0])))\n", - "print('Start date: {}'.format(str(hourly_data_renamed.index[0])))\n", - "print('End date: {}'.format(str(hourly_data_renamed.index[-1])))\n", - "print('# of days: {}'.format(\n", - " str((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days)))\n", - "print('# of months: {}'.format(\n", - " str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days / 30, 2))))\n", - "print('# of years: {}'.format(\n", - " str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days / 365, 2))))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.092822, - "end_time": "2020-06-23T18:39:54.484105", - "exception": false, - "start_time": "2020-06-23T18:39:54.391283", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "\n", - "### 3. Save the Cleaned Data\n", - "\n", - "Finally, we save the cleaned dataset as a Project asset for later re-use. You should see an output like the one below if successful:\n", - "\n", - "```\n", - "{'file_name': 'jfk_weather_cleaned.csv',\n", - " 'message': 'File saved to project storage.',\n", - " 'bucket_name': 'jfkweatherdata-donotdelete-pr-...',\n", - " 'asset_id': '...'}\n", - "```\n", - "\n", - "**Note**: In order for this step to work, your project token (see the first cell of this notebook) must have `Editor` role. By default this will overwrite any existing file." 
+ "data_filename_out = str(Path(data_filename_in).with_suffix(\".csv\"))\n", + "iris_df.to_csv(data_filename_out, index=False)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "papermill": { - "duration": 3.476652, - "end_time": "2020-06-23T18:39:58.038092", - "exception": false, - "start_time": "2020-06-23T18:39:54.561440", - "status": "completed" - }, - "tags": [] - }, + "id": "76cb6636-0527-4ad4-bf9b-88f854e2526e", + "metadata": {}, "outputs": [], "source": [ - "hourly_data_renamed.to_csv(\n", - " \"data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv\",\n", - " float_format='%g')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.076713, - "end_time": "2020-06-23T18:39:58.181900", - "exception": false, - "start_time": "2020-06-23T18:39:58.105187", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "#### Next steps\n", - "\n", - "- Close this notebook.\n", - "- Open the `Part 2 - Data Analysis` notebook to explore the cleaned dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.098392, - "end_time": "2020-06-23T18:39:58.364486", - "exception": false, - "start_time": "2020-06-23T18:39:58.266094", - "status": "completed" - }, - "tags": [] - }, - "source": [ - " \n", - "### Authors" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.069249, - "end_time": "2020-06-23T18:39:58.507345", - "exception": false, - "start_time": "2020-06-23T18:39:58.438096", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n", - "\n", - "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License." 
+ "print(f\"Saved dataset file as '{data_filename_out}'.\")" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -1041,20 +164,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" - }, - "papermill": { - "duration": 38.709418, - "end_time": "2020-06-23T18:39:58.863383", - "environment_variables": {}, - "exception": null, - "input_path": "Part 1 - Data Cleaning.ipynb", - "output_path": "Part 1 - Data Cleaning-output.ipynb", - "parameters": {}, - "start_time": "2020-06-23T18:39:20.153965", - "version": "2.1.1" + "version": "3.7.12" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/pipelines/introduction-to-generic-pipelines/Part 2 - Data Analysis.ipynb b/pipelines/introduction-to-generic-pipelines/Part 2 - Data Analysis.ipynb index 8b6bb4c..d1362f8 100644 --- a/pipelines/introduction-to-generic-pipelines/Part 2 - Data Analysis.ipynb +++ b/pipelines/introduction-to-generic-pipelines/Part 2 - Data Analysis.ipynb @@ -2,556 +2,276 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.026664, - "end_time": "2020-06-23T18:41:12.994224", - "exception": false, - "start_time": "2020-06-23T18:41:12.967560", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "# Exploratory Data Analysis of NOAA Weather Data \n", - "\n", - "This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n", - "\n", - "In this notebook we visualize and analyze the weather time-series dataset.\n", - "\n", - "### Table of Contents:\n", - "* [1. 
Read the Cleaned Data](#cell1)\n", - "* [2. Visualize the Data](#cell2)\n", - "* [3. Analyze Trends in the Data](#cell3)\n", - "* [Authors](#authors)\n", - "\n", - "#### Import required packages\n", - "\n", - "Install and import the required packages:\n", - "\n", - "* pandas\n", - "* matplotlib\n", - "* seaborn\n", - "* numpy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:13.057341Z", - "iopub.status.busy": "2020-06-23T18:41:13.055947Z", - "iopub.status.idle": "2020-06-23T18:41:39.978213Z", - "shell.execute_reply": "2020-06-23T18:41:39.977136Z" - }, - "papermill": { - "duration": 26.955318, - "end_time": "2020-06-23T18:41:39.978610", - "exception": false, - "start_time": "2020-06-23T18:41:13.023292", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Installing packages needed for data processing and visualization\n", - "!pip3 install pandas matplotlib seaborn numpy " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:40.369313Z", - "iopub.status.busy": "2020-06-23T18:41:40.367774Z", - "iopub.status.idle": "2020-06-23T18:41:44.347540Z", - "shell.execute_reply": "2020-06-23T18:41:44.348452Z" - }, - "papermill": { - "duration": 4.170177, - "end_time": "2020-06-23T18:41:44.348890", - "exception": false, - "start_time": "2020-06-23T18:41:40.178713", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], + "id": "702729a9-40d0-47a1-94e1-96ac60a96ce2", + "metadata": {}, "source": [ - "# Importing the packages\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "\n", - "plt.rcParams['figure.dpi'] = 160" + "## Copyright 2018-2022 Elyra Authors" ] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.128767, - "end_time": "2020-06-23T18:41:44.616934", - 
"exception": false, - "start_time": "2020-06-23T18:41:44.488167", - "status": "completed" - }, - "tags": [] - }, + "id": "b02ab100-90c5-4364-9e13-2c5e2b15dd58", + "metadata": {}, "source": [ - "\n", - "\n", - "### 1. Read the Cleaned Data\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", "\n", - "We start by reading in the cleaned dataset that was created in notebook `Part 1 - Data Cleaning`. \n", + "http://www.apache.org/licenses/LICENSE-2.0\n", "\n", - "*Note* if you haven't yet run that notebook, do that first otherwise the cells below will not work." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:44.870589Z", - "iopub.status.busy": "2020-06-23T18:41:44.869715Z", - "iopub.status.idle": "2020-06-23T18:41:47.222803Z", - "shell.execute_reply": "2020-06-23T18:41:47.223766Z" - }, - "papermill": { - "duration": 2.487078, - "end_time": "2020-06-23T18:41:47.224072", - "exception": false, - "start_time": "2020-06-23T18:41:44.736994", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv', parse_dates=['DATE'])\n", - "# Set date index\n", - "data = data.set_index(pd.DatetimeIndex(data['DATE']))\n", - "data.drop(['DATE'], axis=1, inplace=True)\n", - "data.head()" + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and\n", + "limitations under the License." 
] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.116207, - "end_time": "2020-06-23T18:41:47.455659", - "exception": false, - "start_time": "2020-06-23T18:41:47.339452", - "status": "completed" - }, - "tags": [] - }, + "id": "cb48220c-fb29-4b50-9868-a27d75523d9d", + "metadata": {}, "source": [ - "\n", - "\n", - "### 2. Visualize the Data\n", - "\n", - "In this section we visualize a few sections of the data, using `matplotlib`'s `pyplot` module. \n" + "## Analyze the cleaned dataset" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:47.682980Z", - "iopub.status.busy": "2020-06-23T18:41:47.681927Z", - "iopub.status.idle": "2020-06-23T18:41:47.689885Z", - "shell.execute_reply": "2020-06-23T18:41:47.688423Z" - }, - "papermill": { - "duration": 0.119332, - "end_time": "2020-06-23T18:41:47.690109", - "exception": false, - "start_time": "2020-06-23T18:41:47.570777", - "status": "completed" - }, - "tags": [] - }, + "execution_count": 7, + "id": "8ce06184-2db2-42c7-bc39-0d323c487981", + "metadata": {}, "outputs": [], "source": [ - "# Columns to visualize\n", - "plot_cols = ['dry_bulb_temp_f', 'relative_humidity', 'wind_speed', 'station_pressure', 'precip']" + "import os\n", + "import pandas as pd" ] }, { "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.087035, - "end_time": "2020-06-23T18:41:47.866770", - "exception": false, - "start_time": "2020-06-23T18:41:47.779735", - "status": "completed" - }, - "tags": [] - }, + "id": "6cb6e348-8297-41a5-b839-ba0c8db36109", + "metadata": {}, "source": [ - "#### Quick Peek at the Data\n", - "\n", - "We first visualize all the data we have to get a rough idea about how the data looks like. \n", - "\n", - "As we can see in the plot below, the hourly temperatures follow a clear seasonal trend. 
Wind speed, pressure, humidity and precipitation data seem to have much higher variance and randomness.\n", - "\n", - "It might be more meaningful to make a model to predict temperature, rather than some of the other more noisy data columns. " + "Load CSV file into Pandas DataFrame for analysis" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:48.119716Z", - "iopub.status.busy": "2020-06-23T18:41:48.117853Z", - "iopub.status.idle": "2020-06-23T18:41:52.078532Z", - "shell.execute_reply": "2020-06-23T18:41:52.079334Z" - }, - "papermill": { - "duration": 4.109965, - "end_time": "2020-06-23T18:41:52.079730", - "exception": false, - "start_time": "2020-06-23T18:41:47.969765", - "status": "completed" - }, - "tags": [] - }, + "execution_count": 13, + "id": "39f922b3-3c3f-474a-b7f5-af42db0cff3c", + "metadata": {}, "outputs": [], "source": [ - "# Quick overview of columns\n", - "plt.figure(figsize=(30, 12))\n", - "i = 1\n", - "for col in plot_cols:\n", - " plt.subplot(len(plot_cols), 1, i)\n", - " plt.plot(data[col].values)\n", - " plt.title(col)\n", - " i += 1\n", - "plt.subplots_adjust(hspace=0.5)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.140846, - "end_time": "2020-06-23T18:41:52.416480", - "exception": false, - "start_time": "2020-06-23T18:41:52.275634", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "#### Feature Dependencies\n", - "\n", - "Now we explore how the features (columns) of our data are related to each other. This helps in deciding which features to use when modelling a classifier or regresser. \n", - "We ideally want independent features to be classified independently and likewise dependent features to be contributing to the same model. 
\n", - "\n", - "We can see from the correlation plots how some features are somewhat correlated and could be used as additional data (perhaps for augmenting) when training a classifier. " + "data_filename_in = os.getenv(\"DATASET_FILENAME\", \"iris.csv\")\n", + "iris_df = pd.read_csv(data_filename_in)" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:52.778282Z", - "iopub.status.busy": "2020-06-23T18:41:52.776521Z", - "iopub.status.idle": "2020-06-23T18:41:53.792117Z", - "shell.execute_reply": "2020-06-23T18:41:53.786362Z" - }, - "papermill": { - "duration": 1.164636, - "end_time": "2020-06-23T18:41:53.792419", - "exception": false, - "start_time": "2020-06-23T18:41:52.627783", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], + "execution_count": 11, + "id": "734df1b4-2b48-4d69-9321-2440f8fdac95", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 149 entries, 0 to 148\n", + "Data columns (total 5 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 sepal length 149 non-null float64\n", + " 1 sepal_width 149 non-null float64\n", + " 2 petal_length 149 non-null float64\n", + " 3 petal_width 149 non-null float64\n", + " 4 class 149 non-null object \n", + "dtypes: float64(4), object(1)\n", + "memory usage: 5.9+ KB\n" + ] + } + ], "source": [ - "# Plot correlation matrix\n", - "f, ax = plt.subplots(figsize=(7, 7))\n", - "corr = data[plot_cols].corr()\n", - "sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool),\n", - " cmap=sns.diverging_palette(220, 10, as_cmap=True),\n", - " square=True, ax=ax)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.15953, - "end_time": "2020-06-23T18:41:54.149759", - "exception": false, - "start_time": "2020-06-23T18:41:53.990229", - "status": "completed" - }, - "tags": [] - }, - 
"source": [ - "Additionally we also visualize the joint distrubitions in the form of pairplots/scatter plots to see (qualitatively) the way in which these features are related in more detail over just the correlation.\n", - "They are essentially 2D joint distributions in the case of off-diagonal subplots and the histogram (an approximation to the probability distribution) in case of the diagonal subplots." + "iris_df.info()" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:41:54.421853Z", - "iopub.status.busy": "2020-06-23T18:41:54.420601Z", - "iopub.status.idle": "2020-06-23T18:42:22.529784Z", - "shell.execute_reply": "2020-06-23T18:42:22.531481Z" - }, - "papermill": { - "duration": 28.253543, - "end_time": "2020-06-23T18:42:22.531950", - "exception": false, - "start_time": "2020-06-23T18:41:54.278407", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Plot pairplots\n", - "sns.pairplot(data[plot_cols])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.211878, - "end_time": "2020-06-23T18:42:23.041456", - "exception": false, - "start_time": "2020-06-23T18:42:22.829578", - "status": "completed" - }, - "tags": [] - }, + "execution_count": 12, + "id": "098a72bc-4fb8-42f6-8d8c-eafee2a5486c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "class\n", + "Iris-versicolor 50\n", + "Iris-virginica 50\n", + "Iris-setosa 49\n", + "dtype: int64" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "\n", - "\n", - "### 3. Analyze Trends in the Data\n", - "\n", - "Now that we have explored the whole dataset and the features on a high level, let us focus on one particular feature - `dry_bulb_temp_f`, the dry bulb temperature in degrees Fahrenheit. This is what we mean when we refer to \"air temperature\". 
This is the most common feature used in temperature prediction, and here we explore it in further detail. \n", - "\n", - "We first start with plotting the data for all 9 years in monthly buckets then drill down to a single year to notice (qualitatively) the overall trend in the data. We can see from the plots that every year has roughly a sinousoidal nature to the temperature with some anomalies around 2013-2014. Upon further drilling down we see that each year's data is not the smooth sinousoid but rather a jagged and noisy one. But the overall trend still is a sinousoid. " + "iris_df.value_counts(\"class\")" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:42:23.463994Z", - "iopub.status.busy": "2020-06-23T18:42:23.463174Z", - "iopub.status.idle": "2020-06-23T18:42:25.738259Z", - "shell.execute_reply": "2020-06-23T18:42:25.735899Z" - }, - "papermill": { - "duration": 2.487948, - "end_time": "2020-06-23T18:42:25.738514", - "exception": false, - "start_time": "2020-06-23T18:42:23.250566", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "plt.figure(figsize=(15, 7))\n", - "\n", - "TEMP_COL = 'dry_bulb_temp_f'\n", - "# Plot temperature data converted to a monthly frequency\n", - "plt.subplot(1, 2, 1)\n", - "data[TEMP_COL].asfreq('M').plot()\n", - "plt.title('Monthly Temperature')\n", - "plt.ylabel('Temperature (F)')\n", - "\n", - "# Zoom in on a year and plot temperature data converted to a daily frequency\n", - "plt.subplot(1, 2, 2)\n", - "data['2017'][TEMP_COL].asfreq('D').plot()\n", - "plt.title('Daily Temperature in 2017')\n", - "plt.ylabel('Temperature (F)')\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.279776, - "end_time": "2020-06-23T18:42:26.263290", - "exception": false, - "start_time": "2020-06-23T18:42:25.983514", - "status": "completed" - }, - "tags": 
[] - }, - "source": [ - "Next, we plot the change (delta) in temperature and notice that it is lowest around the middle of the year. That is expected behaviour as the gradient of the sinousoid near it's peak is zero. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:42:26.788207Z", - "iopub.status.busy": "2020-06-23T18:42:26.787291Z", - "iopub.status.idle": "2020-06-23T18:42:29.121789Z", - "shell.execute_reply": "2020-06-23T18:42:29.123377Z" - }, - "papermill": { - "duration": 2.611371, - "end_time": "2020-06-23T18:42:29.123704", - "exception": false, - "start_time": "2020-06-23T18:42:26.512333", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], + "execution_count": 14, + "id": "0dc00a4c-ba2c-4142-8a92-eb64150daa41", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML table rendering of iris_df.head(10) elided; equivalent text/plain output follows]
" + ], + "text/plain": [ + " sepal length sepal_width petal_length petal_width class\n", + "0 4.9 3.0 1.4 0.2 Iris-setosa\n", + "1 4.7 3.2 1.3 0.2 Iris-setosa\n", + "2 4.6 3.1 1.5 0.2 Iris-setosa\n", + "3 5.0 3.6 1.4 0.2 Iris-setosa\n", + "4 5.4 3.9 1.7 0.4 Iris-setosa\n", + "5 4.6 3.4 1.4 0.3 Iris-setosa\n", + "6 5.0 3.4 1.5 0.2 Iris-setosa\n", + "7 4.4 2.9 1.4 0.2 Iris-setosa\n", + "8 4.9 3.1 1.5 0.1 Iris-setosa\n", + "9 5.4 3.7 1.5 0.2 Iris-setosa" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "plt.figure(figsize=(15, 7))\n", - "\n", - "# Plot percent change of daily temperature in 2017\n", - "plt.subplot(1, 2, 1)\n", - "data['2017'][TEMP_COL].asfreq('D').div(data['2017'][TEMP_COL].asfreq('D').shift()).plot()\n", - "plt.title('% Change in Daily Temperature in 2017')\n", - "plt.ylabel('% Change')\n", - "\n", - "# Plot absolute change of temperature in 2017 with daily frequency\n", - "plt.subplot(1, 2, 2)\n", - "data['2017'][TEMP_COL].asfreq('D').diff().plot()\n", - "plt.title('Absolute Change in Daily Temperature in 2017')\n", - "plt.ylabel('Temperature (F)')\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.286847, - "end_time": "2020-06-23T18:42:29.665508", - "exception": false, - "start_time": "2020-06-23T18:42:29.378661", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Finally we apply some smoothing to the data in the form of a rolling/moving average. This is the simplest form of de-noising the data. As we can see from the plots, the average (plotted in blue) roughly traces the sinousoid and is now much smoother. This can improve the accuracy of a regression model trained to predict temperatures within a reasonable margin of error. 
" + "iris_df.head(10)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T18:42:30.302485Z", - "iopub.status.busy": "2020-06-23T18:42:30.301288Z", - "iopub.status.idle": "2020-06-23T18:42:33.458643Z", - "shell.execute_reply": "2020-06-23T18:42:33.445922Z" - }, - "papermill": { - "duration": 3.513659, - "end_time": "2020-06-23T18:42:33.458941", - "exception": false, - "start_time": "2020-06-23T18:42:29.945282", - "status": "completed" - }, - "tags": [] - }, + "id": "371604d1-30e5-4035-944c-a6dbd8bf0426", + "metadata": {}, "outputs": [], - "source": [ - "plt.figure(figsize=(15, 7))\n", - "\n", - "# Plot rolling mean of temperature\n", - "plt.subplot(1, 2, 1)\n", - "data['2017'][TEMP_COL].rolling('5D').mean().plot(zorder=2) # Rolling average window is 5 days\n", - "data['2017'][TEMP_COL].plot(zorder=1)\n", - "plt.legend(['Rolling', 'Temp'])\n", - "plt.title('Rolling Avg in Hourly Temperature in 2017')\n", - "plt.ylabel('Temperature (F)')\n", - "\n", - "# Plot rolling mean of temperature\n", - "plt.subplot(1, 2, 2)\n", - "data['2017-01':'2017-03'][TEMP_COL].rolling('2D').mean().plot(zorder=2) # Rolling average window is 2 days\n", - "data['2017-01':'2017-03'][TEMP_COL].plot(zorder=1)\n", - "plt.legend(['Rolling', 'Temp'])\n", - "plt.title('Rolling Avg in Hourly Temperature in Winter 2017')\n", - "plt.ylabel('Temperature (F)')\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.202051, - "end_time": "2020-06-23T18:42:33.888267", - "exception": false, - "start_time": "2020-06-23T18:42:33.686216", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "#### Next steps\n", - "\n", - "- Close this notebook.\n", - "- Open the `Part 3 - Time Series Forecasting` notebook to create time-series models to forecast temperatures." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.218222, - "end_time": "2020-06-23T18:42:34.348671", - "exception": false, - "start_time": "2020-06-23T18:42:34.130449", - "status": "completed" - }, - "tags": [] - }, - "source": [ - " \n", - "### Authors\n", - "\n", - "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n", - "\n", - "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License." - ] + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -565,20 +285,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" - }, - "papermill": { - "duration": 82.979462, - "end_time": "2020-06-23T18:42:34.718798", - "environment_variables": {}, - "exception": null, - "input_path": "Part 2 - Data Analysis.ipynb", - "output_path": "Part 2 - Data Analysis-output.ipynb", - "parameters": {}, - "start_time": "2020-06-23T18:41:11.739336", - "version": "2.1.1" + "version": "3.7.12" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/pipelines/introduction-to-generic-pipelines/Part 3 - Data Classification.ipynb b/pipelines/introduction-to-generic-pipelines/Part 3 - Data Classification.ipynb new file mode 100644 index 0000000..20ccaf8 --- /dev/null +++ b/pipelines/introduction-to-generic-pipelines/Part 3 - Data Classification.ipynb @@ -0,0 +1,154 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "702729a9-40d0-47a1-94e1-96ac60a96ce2", + "metadata": {}, + "source": [ + "## Copyright 2018-2022 Elyra Authors" + ] + }, + { + "cell_type": "markdown", + "id": "b02ab100-90c5-4364-9e13-2c5e2b15dd58", + "metadata": {}, + "source": [ + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the 
License.\n", + "You may obtain a copy of the License at\n", + "\n", + "http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and\n", + "limitations under the License." + ] + }, + { + "cell_type": "markdown", + "id": "cb48220c-fb29-4b50-9868-a27d75523d9d", + "metadata": {}, + "source": [ + "## Classify data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c36d1686-5165-48ca-9906-a6bb99901b15", + "metadata": {}, + "outputs": [], + "source": [ + "# Install scikit-learn in the kernel if it is not already available\n", + "! pip install scikit-learn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ce06184-2db2-42c7-bc39-0d323c487981", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "from sklearn.model_selection import train_test_split" + ] + }, + { + "cell_type": "markdown", + "id": "6cb6e348-8297-41a5-b839-ba0c8db36109", + "metadata": {}, + "source": [ + "Load the processed CSV file into a Pandas DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39f922b3-3c3f-474a-b7f5-af42db0cff3c", + "metadata": {}, + "outputs": [], + "source": [ + "data_filename_in = os.getenv(\"DATASET_FILENAME\", \"iris.csv\")\n", + "iris_df = pd.read_csv(data_filename_in)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "540c6b4e-d0b1-4ea3-923a-f22fd19e31a7", + "metadata": {}, + "outputs": [], + "source": [ + "iris_df.shape" + ] + }, + { + "cell_type": "markdown", + "id": "535cc7df-fb5e-4e35-a965-feeb9ca93aa2", + "metadata": {}, + "source": [ + "Split the dataset into training and testing sets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id":
"371604d1-30e5-4035-944c-a6dbd8bf0426", + "metadata": {}, + "outputs": [], + "source": [ + "X = iris_df.drop([\"class\"], axis=1)\n", + "y = iris_df[\"class\"]\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + " X, y, test_size=0.35, random_state=5\n", + ")\n", + "\n", + "print(f\"Training feature {X_train.shape}\")\n", + "print(f\"Training outcome {y_train.shape}\")\n", + "print(f\"Test feature {X_test.shape}\")\n", + "print(f\"Test outcome {y_test.shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "1c2b6027-3f9d-4de9-9447-0555b62d7ed5", + "metadata": {}, + "source": [ + "Lorem Ipsum" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41590a00-68ff-4df2-8c90-b1819cfb7a20", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/pipelines/introduction-to-generic-pipelines/Part 3 - Time Series Forecasting.ipynb b/pipelines/introduction-to-generic-pipelines/Part 3 - Time Series Forecasting.ipynb deleted file mode 100644 index 56c064d..0000000 --- a/pipelines/introduction-to-generic-pipelines/Part 3 - Time Series Forecasting.ipynb +++ /dev/null @@ -1,1036 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.11289, - "end_time": "2020-06-23T19:09:28.704069", - "exception": false, - "start_time": "2020-06-23T19:09:28.591179", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "# Time Series Forecasting using the NOAA Weather Data of JFK Airport (New York)\n", - "\n", - "This notebook relates to the NOAA Weather Dataset - JFK 
Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n", - "\n", - "In this notebook we explore approaches to predicting future temperatures by using the time-series dataset.\n", - "\n", - "### Table of Contents:\n", - "* [1. Read the Cleaned Data](#cell1)\n", - "* [2. Explore Baseline Models](#cell2)\n", - "* [3. Train Statistical Time-series Analysis Models](#cell3)\n", - "* [Authors](#cell4)\n", - "\n", - "#### Import required modules\n", - "\n", - "Import and configure the required modules." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:09:28.847867Z", - "iopub.status.busy": "2020-06-23T19:09:28.818724Z", - "iopub.status.idle": "2020-06-23T19:10:35.430435Z", - "shell.execute_reply": "2020-06-23T19:10:35.429274Z" - }, - "papermill": { - "duration": 66.690879, - "end_time": "2020-06-23T19:10:35.430701", - "exception": false, - "start_time": "2020-06-23T19:09:28.739822", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!pip3 install statsmodels\n", - "!pip3 install sklearn\n", - "!pip3 install matplotlib" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:36.107462Z", - "iopub.status.busy": "2020-06-23T19:10:36.106220Z", - "iopub.status.idle": "2020-06-23T19:10:38.859278Z", - "shell.execute_reply": "2020-06-23T19:10:38.860272Z" - }, - "papermill": { - "duration": 3.115861, - "end_time": "2020-06-23T19:10:38.860579", - "exception": false, - "start_time": "2020-06-23T19:10:35.744718", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from matplotlib import pyplot as 
plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "from sklearn.metrics import mean_squared_error\n", - "from statsmodels.tsa.statespace.sarimax import SARIMAX" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.304667, - "end_time": "2020-06-23T19:10:39.610917", - "exception": false, - "start_time": "2020-06-23T19:10:39.306250", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "\n", - "### 1. Read the Cleaned Data\n", - "\n", - "We start by reading the cleaned dataset that was created in notebook `Part 1 - Data Cleaning`. \n", - "\n", - "**Note:** if you haven't yet run this notebook, run it first; otherwise the cells below will not work." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:40.252628Z", - "iopub.status.busy": "2020-06-23T19:10:40.247924Z", - "iopub.status.idle": "2020-06-23T19:10:42.107202Z", - "shell.execute_reply": "2020-06-23T19:10:42.106129Z" - }, - "papermill": { - "duration": 2.190699, - "end_time": "2020-06-23T19:10:42.107565", - "exception": false, - "start_time": "2020-06-23T19:10:39.916866", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv', parse_dates=['DATE'])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:42.780891Z", - "iopub.status.busy": "2020-06-23T19:10:42.779724Z", - "iopub.status.idle": "2020-06-23T19:10:44.067874Z", - "shell.execute_reply": "2020-06-23T19:10:44.066877Z" - }, - "papermill": { - "duration": 1.586342, - "end_time": "2020-06-23T19:10:44.068083", - "exception": false, - "start_time": "2020-06-23T19:10:42.481741", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Set date index\n", - "data = 
data.set_index(pd.DatetimeIndex(data['DATE']))\n", - "data.drop(['DATE'], axis=1, inplace=True)\n", - "data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.289625, - "end_time": "2020-06-23T19:10:44.726625", - "exception": false, - "start_time": "2020-06-23T19:10:44.437000", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "For purposes of time-series modeling, we will restrict our analysis to a 2-year sample of the dataset to avoid overly long model-training times. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:45.327525Z", - "iopub.status.busy": "2020-06-23T19:10:45.326317Z", - "iopub.status.idle": "2020-06-23T19:10:45.343228Z", - "shell.execute_reply": "2020-06-23T19:10:45.342339Z" - }, - "papermill": { - "duration": 0.321659, - "end_time": "2020-06-23T19:10:45.343406", - "exception": false, - "start_time": "2020-06-23T19:10:45.021747", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "sample = data['2016-01-01':'2018-01-01']\n", - "sample.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.293214, - "end_time": "2020-06-23T19:10:45.944510", - "exception": false, - "start_time": "2020-06-23T19:10:45.651296", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "#### Create Training/Validation/Test Splits\n", - "\n", - "Before we attempt any time-series analysis and prediction, we should split the dataset into training, validation and test sets. We use a portion of the data for training, and a portion of _future_ data for our validation and test sets.\n", - "\n", - "If we instead trained a model on the full dataset, the model would learn to be very good at making predictions on that particular dataset, essentially just copying the answers it knows. 
However, when presented with data the model _has not seen_ , it would perform poorly since it has not learned how to generalize its answers.\n", - "\n", - "By training on a portion of the dataset and testing the model's performance on another portion of the dataset (which data the model has not seen in training), we try to avoid our models \"over-fitting\" the dataset and make them better at predicting temperatures given unseen, future data. This process of splitting the dataset and evaluating a model's performance on the validation and test sets is commonly known as [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).\n", - "\n", - "By default here we use 80% of the data for the training set and 10% each for validation and test sets." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:46.594414Z", - "iopub.status.busy": "2020-06-23T19:10:46.591764Z", - "iopub.status.idle": "2020-06-23T19:10:46.617663Z", - "shell.execute_reply": "2020-06-23T19:10:46.600939Z" - }, - "papermill": { - "duration": 0.348951, - "end_time": "2020-06-23T19:10:46.620618", - "exception": false, - "start_time": "2020-06-23T19:10:46.271667", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "def split_data(data, val_size=0.1, test_size=0.1):\n", - " \"\"\"Splits data to training, validation and testing parts\n", - "\n", - " \"\"\"\n", - " ntest = int(round(len(data) * (1 - test_size)))\n", - " nval = int(round(len(data) * (1 - test_size - val_size)))\n", - "\n", - " df_train, df_val, df_test = data.iloc[:nval], data.iloc[nval:ntest], data.iloc[ntest:]\n", - "\n", - " return df_train, df_val, df_test\n", - "\n", - "\n", - "# Create data split\n", - "df_train, df_val, df_test = split_data(sample)\n", - "\n", - "print('Total data size: {} rows'.format(len(sample)))\n", - "print('Training set size: {} rows'.format(len(df_train)))\n", - "print('Validation 
set size: {} rows'.format(len(df_val)))\n", - "print('Test set size: {} rows'.format(len(df_test)))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.414963, - "end_time": "2020-06-23T19:10:47.400018", - "exception": false, - "start_time": "2020-06-23T19:10:46.985055", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "\n", - "\n", - "### 2. Explore Baseline Models\n", - "\n", - "In this section, we'll create a few simple predictive models of temperature, using shifting and rolling averages. These will serve as a baseline against which we can compare more sophisticated models.\n", - "\n", - "Using values at recent timesteps (such as the most recent timestep `t-1` and second-most recent timestep `t-2`) to predict the current value at time `t` is what's known as persistence modeling, or using the last observed value to predict the next value. These preceding timesteps are often referred to in time-series analysis as `lags`. So, the value at time `t-1` is known as the `1st lag` and the value at time `t-2` is the `2nd lag`.\n", - "\n", - "We can also create baselines based on rolling (or moving) averages. This is a time-series constructed by averaging each lagged value up to the selected lag. For example, a 6-period (or 6-lag) rolling average is the average of the previous 6 hourly lags `t-6` to `t-1`.\n", - "\n", - "Our baseline models will be:\n", - "1. `1st lag` - i.e. values at `t-1`\n", - "2. `2nd lag` - i.e. values at `t-2`\n", - "3. `6-lag` rolling average\n", - "4. 
`12-lag` rolling average" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:48.228674Z", - "iopub.status.busy": "2020-06-23T19:10:48.227122Z", - "iopub.status.idle": "2020-06-23T19:10:48.275661Z", - "shell.execute_reply": "2020-06-23T19:10:48.277404Z" - }, - "papermill": { - "duration": 0.47806, - "end_time": "2020-06-23T19:10:48.277812", - "exception": false, - "start_time": "2020-06-23T19:10:47.799752", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# define the column containing the data we wish to model - in this case Dry Bulb Temperature (F)\n", - "Y_COL = 'dry_bulb_temp_f'\n", - "\n", - "# Use shifting and rolling averages to predict Y_COL (t)\n", - "n_in = 2\n", - "n_out = 1\n", - "features = [Y_COL]\n", - "n_features = len(features)\n", - "\n", - "# create the baseline on the entire sample dataset.\n", - "# we will evaluate the prediction error on the validation set\n", - "baseline = sample[[Y_COL]].loc[:]\n", - "baseline['{} (t-1)'.format(Y_COL)] = baseline[Y_COL].shift(1)\n", - "baseline['{} (t-2)'.format(Y_COL)] = baseline[Y_COL].shift(2)\n", - "baseline['{} (6hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('6H').mean()\n", - "baseline['{} (12hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('12H').mean()\n", - "baseline.dropna(inplace=True)\n", - "baseline.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.346401, - "end_time": "2020-06-23T19:10:48.951816", - "exception": false, - "start_time": "2020-06-23T19:10:48.605415", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Next, we will plot data from our validation dataset to get a sense of how well these baseline models predict the next hourly temperature. Note that we only use a few days of data in order to make the plot easier to view." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:49.600346Z", - "iopub.status.busy": "2020-06-23T19:10:49.598924Z", - "iopub.status.idle": "2020-06-23T19:10:49.605543Z", - "shell.execute_reply": "2020-06-23T19:10:49.606223Z" - }, - "papermill": { - "duration": 0.339705, - "end_time": "2020-06-23T19:10:49.606457", - "exception": false, - "start_time": "2020-06-23T19:10:49.266752", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# plot first 7 days of the validation set, 168 hours\n", - "start = df_val.index[0]\n", - "end = df_val.index[167]\n", - "sliced = baseline[start:end]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:50.283534Z", - "iopub.status.busy": "2020-06-23T19:10:50.282583Z", - "iopub.status.idle": "2020-06-23T19:10:51.038055Z", - "shell.execute_reply": "2020-06-23T19:10:51.036719Z" - }, - "papermill": { - "duration": 1.113089, - "end_time": "2020-06-23T19:10:51.038337", - "exception": false, - "start_time": "2020-06-23T19:10:49.925248", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Plot baseline predictions sample\n", - "cols = ['dry_bulb_temp_f', 'dry_bulb_temp_f (t-1)', 'dry_bulb_temp_f (t-2)',\n", - " 'dry_bulb_temp_f (6hr rollavg)', 'dry_bulb_temp_f (12hr rollavg)']\n", - "sliced[cols].plot()\n", - "\n", - "plt.legend(['t', 't-1', 't-2', '6hr', '12hr'], loc=2, ncol=3)\n", - "plt.title('Baselines for First 7 Days of Validation Set')\n", - "plt.ylabel('Temperature (F)')\n", - "plt.tight_layout()\n", - "plt.rcParams['figure.dpi'] = 100\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.449742, - "end_time": "2020-06-23T19:10:51.905524", - "exception": false, - "start_time": "2020-06-23T19:10:51.455782", - "status": "completed" - }, - "tags": [] - 
}, - "source": [ - "#### Evaluate baseline models\n", - "\n", - "As you can perhaps see from the graph above, the _lagged_ baselines appear to do a better job of forecasting temperatures than the _rolling average_ baselines.\n", - "\n", - "In order to evaluate our baseline models more precisely, we need to answer the question _\"how well do our models predict future temperature?\"_. In regression problems involving prediction of a numerical value, we often use a measure of the difference between our predicted value and the actual value. This is referred to as an error measure or error metric. A common measure is the Mean Squared Error (MSE):\n", - "\n", - "\\begin{equation}\n", - "MSE = \\frac{1}{n} \\sum_{i=1}^{n}{(y_i - \\hat y_i)^{2}}\n", - "\\end{equation}\n", - "\n", - "This is the average of the squared differences between predicted values $ \\hat y $ and actual values $ y $.\n", - "\n", - "Because the MSE is in \"units squared\" it can be difficult to interpret, hence the Root Mean Squared Error (RMSE) is often used:\n", - "\n", - "\\begin{equation}\n", - "RMSE = \\sqrt {MSE} \n", - "\\end{equation}\n", - "\n", - "This is the square root of the MSE, and is in the same units as the values $ y $. We can compare the RMSE (and MSE) values for different models and say that the model that has the lower MSE is better at predicting temperatures, all else being equal. Note that MSE and RMSE grow quickly when the differences between predicted and actual values are large. This may or may not be a desired quality of your error measure. In this case, it is probably a good thing, since a model that makes large mistakes in temperature prediction will be much less useful than one which makes small mistakes.\n", - "\n", - "Next, we calculate the RMSE measure for each of our baseline models, on the full validation set."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:52.661332Z", - "iopub.status.busy": "2020-06-23T19:10:52.659720Z", - "iopub.status.idle": "2020-06-23T19:10:52.711124Z", - "shell.execute_reply": "2020-06-23T19:10:52.709941Z" - }, - "papermill": { - "duration": 0.442586, - "end_time": "2020-06-23T19:10:52.711390", - "exception": false, - "start_time": "2020-06-23T19:10:52.268804", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Calculate baseline RMSE values on the validation set\n", - "start_val = df_val.index[0]\n", - "end_val = df_val.index[-1]\n", - "baseline_val = baseline[start_val:end_val]\n", - "\n", - "baseline_y = baseline_val[Y_COL]\n", - "baseline_t1 = baseline_val['dry_bulb_temp_f (t-1)']\n", - "baseline_t2 = baseline_val['dry_bulb_temp_f (t-2)']\n", - "baseline_avg6 = baseline_val['dry_bulb_temp_f (6hr rollavg)']\n", - "baseline_avg12 = baseline_val['dry_bulb_temp_f (12hr rollavg)']\n", - "\n", - "rmse_t1 = np.sqrt(mean_squared_error(baseline_y, baseline_t1))\n", - "rmse_t2 = np.sqrt(mean_squared_error(baseline_y, baseline_t2))\n", - "rmse_avg6 = np.sqrt(mean_squared_error(baseline_y, baseline_avg6))\n", - "rmse_avg12 = np.sqrt(mean_squared_error(baseline_y, baseline_avg12))\n", - "\n", - "print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))\n", - "print('Baseline t-2 RMSE: {0:.3f}'.format(rmse_t2))\n", - "print('Baseline 6hr rollavg RMSE: {0:.3f}'.format(rmse_avg6))\n", - "print('Baseline 12hr rollavg RMSE: {0:.3f}'.format(rmse_avg12))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.311905, - "end_time": "2020-06-23T19:10:53.380315", - "exception": false, - "start_time": "2020-06-23T19:10:53.068410", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "The RMSE results confirm what we saw in the graph above. 
It is clear that the _rolling average_ baselines perform poorly. In fact, the `t-2` lagged baseline is also not very good. It appears that the best baseline model is to simply use the current hour's temperature to predict the next hour's temperature!\n", - "\n", - "Can we do better than this simple baseline using more sophisticated models?\n", - "\n", - "\n", - "\n", - "### 3. Train Statistical Time-series Analysis Models\n", - "\n", - "\n", - "In the previous section, we saw that a simple `lag-1` baseline model performed reasonably well at forecasting temperature for the next hourly time step. This is perhaps not too surprising, given what we know about hourly temperatures. Generally, the temperature in a given hour will be quite closely related to the temperature in the previous hour. This phenomenon is very common in time-series analysis and is known as [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation) - that is, the time series is _correlated_ with previous values of itself. More precisely, the values at time `t` are correlated with lagged values (which could be `t-1`, `t-2` and so on).\n", - "\n", - "Another thing we saw previously is the concept of _moving averages_. In this case the moving-average baseline was not that good at prediction. However, it is common in many time-series for a moving average to capture some of the underlying structure and be useful for prediction.\n", - "\n", - "In order to make our model better at predicting temperature, ideally we would want to take these aspects into account. Fortunately, the statistical community has a long history of analyzing time series and has created many different forecasting models.\n", - "\n", - "Here, we will explore one called SARIMAX - the **S**easonal **A**uto**R**egressive **I**ntegrated **M**oving **A**verage with e**X**ogenous regressors model. 
\n", - "\n", - "This sounds like a very complex name, but if we look at the components of the name, we see that it includes `autocorrelation` (this is what auto-regressive means) and `moving averages`, which are the components mentioned above. \n", - "\n", - "The SARIMAX model also allows including a _seasonal_ model component as well as handling *exogenous* variables, which are external to the time-series value itself. For example, for temperature prediction we may wish to take into account not just previous temperature values, but perhaps other weather features which may have an effect on temperature (such as humidity, rainfall, wind, and so on).\n", - "\n", - "For the purposes of this notebook, we will not explore modeling of seasonal components or exogenous variables.\n", - "\n", - "If we drop the \"S\" and \"X\" from the model, we are left with an [ARIMA model](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) (Auto-regressive Integrated Moving Average). This is a very commonly used model for time-series analysis and we will use it in this notebook by only specifying the relevant model components of the full SARIMAX model.\n", - "\n", - "#### 3.1 Replicating a baseline model\n", - "\n", - "As a starting point, we will see how we can use SARIMAX to create a simple model that in fact replicates one of the baselines we created previously. Auto-regression, as we have seen, means using values from preceding time periods to predict the current value. Recall that one of our baseline models was the `1st lag` or `t-1` model. In time-series analysis this is referred to as an **AR(1)** model, meaning an **A**uto-**R**egressive model for `lag 1`.\n", - "\n", - "Technically, the AR(1) model is not exactly the same as our baseline model. A statistical time series model like SARIMAX learns a set of `weights` to apply to each component of the model. These weights are set so as to best fit the dataset. 
We can think of our baseline as setting the `weight` for the `t-1` lag to be exactly `1`. In practice, our time-series model will not have a weight of exactly `1` (though it will likely be very close to that), hence the predictions will be slightly different.\n", - "\n", - "Now, let's fit our model to the dataset. First, we will set up the model inputs by taking the temperature column of our dataframe. We do this for the training and validation sets." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:54.051426Z", - "iopub.status.busy": "2020-06-23T19:10:54.049923Z", - "iopub.status.idle": "2020-06-23T19:10:54.054002Z", - "shell.execute_reply": "2020-06-23T19:10:54.055130Z" - }, - "papermill": { - "duration": 0.363962, - "end_time": "2020-06-23T19:10:54.055438", - "exception": false, - "start_time": "2020-06-23T19:10:53.691476", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "X_train = df_train[Y_COL]\n", - "X_val = df_val[Y_COL]\n", - "X_both = np.hstack((X_train, X_val))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.406823, - "end_time": "2020-06-23T19:10:54.810765", - "exception": false, - "start_time": "2020-06-23T19:10:54.403942", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Here we created a variable called `X_both` to cover both the training and validation data. This is required later when we forecast values for our SARIMAX model, in order to give the model access to all the datapoints for which it must create forecasts. Note that the forecasts themselves will only be based on the _model weights_ learned from the training data (this is important for avoiding over-fitting, as we have seen above)!\n", - "\n", - "The SARIMAX model takes an argument called `order`: this specifies the components of the model and itself has 3 parts: `(p, d, q)`. 
`p` denotes the lags for the AR model and `q` denotes the lags for the MA model. We will not cover the `d` parameter here. Taken together this specifies the parameters of the [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) model portion of SARIMAX.\n", - "\n", - "To create an AR(1) model, we set the `order` to be `(1, 0, 0)`. This sets up the AR model to be a `lag 1` model. Then, we fit our model on the training data and inspect a summary of the trained model. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:55.604704Z", - "iopub.status.busy": "2020-06-23T19:10:55.603585Z", - "iopub.status.idle": "2020-06-23T19:10:56.693475Z", - "shell.execute_reply": "2020-06-23T19:10:56.696932Z" - }, - "papermill": { - "duration": 1.443733, - "end_time": "2020-06-23T19:10:56.697419", - "exception": false, - "start_time": "2020-06-23T19:10:55.253686", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "order = (1, 0, 0)\n", - "model_ar1 = SARIMAX(X_train, order=order)\n", - "results_ar1 = model_ar1.fit()\n", - "results_ar1.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.344413, - "end_time": "2020-06-23T19:10:57.353941", - "exception": false, - "start_time": "2020-06-23T19:10:57.009528", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "There's quite a lot of information printed out in the model summary above. Much of it is related to the statistical properties of our model.\n", - "\n", - "The most important thing for now is to look at the second table, where we can see a `coef` value of `0.9996` for the weight `ar.L1`. This tells us the model has set a weight for the `1st lag` component of the AR model to be `0.9996`. 
This is almost `1` and hence we should expect the prediction results to indeed be close to our `t-1` baseline.\n", - "\n", - "Let's create our model forecast on the validation dataset. We will then plot a few data points like we did with our baseline models (using 7 days of validation data) and compute the RMSE value based on the full validation set." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:57.993230Z", - "iopub.status.busy": "2020-06-23T19:10:57.991479Z", - "iopub.status.idle": "2020-06-23T19:10:58.136887Z", - "shell.execute_reply": "2020-06-23T19:10:58.136120Z" - }, - "papermill": { - "duration": 0.442245, - "end_time": "2020-06-23T19:10:58.137195", - "exception": false, - "start_time": "2020-06-23T19:10:57.694950", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "full_data_ar1 = SARIMAX(X_both, order=order)\n", - "model_forecast_ar1 = full_data_ar1.filter(results_ar1.params)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:10:58.833429Z", - "iopub.status.busy": "2020-06-23T19:10:58.832088Z", - "iopub.status.idle": "2020-06-23T19:10:59.277366Z", - "shell.execute_reply": "2020-06-23T19:10:59.278452Z" - }, - "papermill": { - "duration": 0.805526, - "end_time": "2020-06-23T19:10:59.278758", - "exception": false, - "start_time": "2020-06-23T19:10:58.473232", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "start = len(X_train)\n", - "end = len(X_both)\n", - "forecast_ar1 = model_forecast_ar1.predict(start=start, end=end - 1, dynamic=False)\n", - "\n", - "# plot actual vs predicted values for the same 7-day window for easier viewing\n", - "plt.plot(sliced[Y_COL].values)\n", - "plt.plot(forecast_ar1[:168], color='r', linestyle='--')\n", - "plt.legend(['t', 'AR(1)'], loc=2)\n", - "plt.title('AR(1) Model Predictions for 
First 7 Days of Validation Set')\n", - "plt.ylabel('Temperature (F)')\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.32596, - "end_time": "2020-06-23T19:10:59.905046", - "exception": false, - "start_time": "2020-06-23T19:10:59.579086", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We can see that the plot looks almost identical to the plot above, for the `t` and `t-1 baseline` values.\n", - "\n", - "Next, we compute the RMSE values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:00.577864Z", - "iopub.status.busy": "2020-06-23T19:11:00.576393Z", - "iopub.status.idle": "2020-06-23T19:11:00.580976Z", - "shell.execute_reply": "2020-06-23T19:11:00.578703Z" - }, - "papermill": { - "duration": 0.307046, - "end_time": "2020-06-23T19:11:00.581216", - "exception": false, - "start_time": "2020-06-23T19:11:00.274170", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# compute and print RMSE values\n", - "rmse_ar1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar1))\n", - "print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))\n", - "print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.271662, - "end_time": "2020-06-23T19:11:01.146967", - "exception": false, - "start_time": "2020-06-23T19:11:00.875305", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We can see that the RMSE values for the validation set are also almost identical.\n", - "\n", - "#### 3.2 Create a more complex model\n", - "\n", - "One of our baseline models was a `lag 2` model, i.e. `t-2`. We saw that it performed a lot worse than the `t-1` baseline. Intuitively, this makes sense, since we are throwing away a lot of information about the most recent lag `t-1`. 
However, the `t-2` lag still provides some useful information. In fact, for temperature prediction it's likely that the last few hours can provide some value.\n", - "\n", - "Fortunately, our ARIMA model framework provides an easy way to incorporate further lag information. We can construct a model that includes _both_ the `t-1` and `t-2` lags. This is an **AR(2)** model (meaning an auto-regressive model up to lag `2`). We can specify this with the model order parameter `p=2`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:01.736464Z", - "iopub.status.busy": "2020-06-23T19:11:01.734470Z", - "iopub.status.idle": "2020-06-23T19:11:02.901803Z", - "shell.execute_reply": "2020-06-23T19:11:02.900812Z" - }, - "papermill": { - "duration": 1.472119, - "end_time": "2020-06-23T19:11:02.902042", - "exception": false, - "start_time": "2020-06-23T19:11:01.429923", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "order = (2, 0, 0)\n", - "model_ar2 = SARIMAX(X_train, order=order)\n", - "results_ar2 = model_ar2.fit()\n", - "results_ar2.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.3617, - "end_time": "2020-06-23T19:11:03.596315", - "exception": false, - "start_time": "2020-06-23T19:11:03.234615", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "This time, the results table indicates a weight for variable `ar.L1` _and_ `ar.L2`. Note the values are now quite different from `1` (or `0.5` say, for a simple equally-weighted model). Next, we compute the RMSE on the validation set. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:04.355345Z", - "iopub.status.busy": "2020-06-23T19:11:04.354117Z", - "iopub.status.idle": "2020-06-23T19:11:04.590484Z", - "shell.execute_reply": "2020-06-23T19:11:04.589051Z" - }, - "papermill": { - "duration": 0.633402, - "end_time": "2020-06-23T19:11:04.590648", - "exception": false, - "start_time": "2020-06-23T19:11:03.957246", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "full_data_ar2 = SARIMAX(X_both, order=order)\n", - "model_forecast_ar2 = full_data_ar2.filter(results_ar2.params)\n", - "\n", - "start = len(X_train)\n", - "end = len(X_both)\n", - "forecast_ar2 = model_forecast_ar2.predict(start=start, end=end - 1, dynamic=False)\n", - "\n", - "# compute and print RMSE values\n", - "rmse_ar2 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2))\n", - "print('AR(2) RMSE: {0:.3f}'.format(rmse_ar2))\n", - "print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))\n", - "print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.31576, - "end_time": "2020-06-23T19:11:05.274842", - "exception": false, - "start_time": "2020-06-23T19:11:04.959082", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We've improved the RMSE value by including information from the first two lags.\n", - "\n", - "In fact, you will see that if you continue to increase the `p` parameter value, the RMSE will continue to decrease, indicating that a few recent lags provide useful information to our model.\n", - "\n", - "#### 3.3 Incorporate moving averages\n", - "\n", - "Finally, what if we also include moving average information in our model? The ARIMA framework makes this easy to do, by setting the order parameter `q`. 
A value of `q=1` specifies an **MA(1)** model (including the first lag `t-1`), while `q=6` would include all the lags from `t-1` to `t-6`.\n", - "\n", - "Note that the moving average model component is a little different from the simple moving or rolling averages computed in the baseline models. The [definition of the MA model](https://en.wikipedia.org/wiki/Moving-average_model) is rather technical, but conceptually you can think of it as using a form of weighted moving average (compared to our baseline which would be a simple, unweighted average).\n", - "\n", - "Let's add an MA(1) component to our AR(2) model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:05.922543Z", - "iopub.status.busy": "2020-06-23T19:11:05.921432Z", - "iopub.status.idle": "2020-06-23T19:11:08.188730Z", - "shell.execute_reply": "2020-06-23T19:11:08.187498Z" - }, - "papermill": { - "duration": 2.562284, - "end_time": "2020-06-23T19:11:08.188977", - "exception": false, - "start_time": "2020-06-23T19:11:05.626693", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "order = (2, 0, 1)\n", - "model_ar2ma1 = SARIMAX(X_train, order=order)\n", - "results_ar2ma1 = model_ar2ma1.fit()\n", - "results_ar2ma1.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.422086, - "end_time": "2020-06-23T19:11:08.972375", - "exception": false, - "start_time": "2020-06-23T19:11:08.550289", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We see that the results table shows an additional weight value for `ma.L1`, our MA(1) component. Next, we compare the RMSE to the other models and finally plot all the model forecasts together - _note_ we use a much smaller 48-hour window to make the plot readable for illustrative purposes. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:09.576061Z", - "iopub.status.busy": "2020-06-23T19:11:09.574047Z", - "iopub.status.idle": "2020-06-23T19:11:09.932432Z", - "shell.execute_reply": "2020-06-23T19:11:09.931384Z" - }, - "papermill": { - "duration": 0.688404, - "end_time": "2020-06-23T19:11:09.932680", - "exception": false, - "start_time": "2020-06-23T19:11:09.244276", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "full_data_ar2ma1 = SARIMAX(X_both, order=order)\n", - "model_forecast_ar2ma1 = full_data_ar2ma1.filter(results_ar2ma1.params)\n", - "\n", - "start = len(X_train)\n", - "end = len(X_both)\n", - "forecast_ar2ma1 = model_forecast_ar2ma1.predict(start=start, end=end - 1, dynamic=False)\n", - "\n", - "# compute and print RMSE values\n", - "rmse_ar2ma1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2ma1))\n", - "print('AR(2) MA(1) RMSE: {0:.3f}'.format(rmse_ar2ma1))\n", - "print('AR(2) RMSE: {0:.3f}'.format(rmse_ar2))\n", - "print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))\n", - "print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2020-06-23T19:11:10.569884Z", - "iopub.status.busy": "2020-06-23T19:11:10.564385Z", - "iopub.status.idle": "2020-06-23T19:11:11.060827Z", - "shell.execute_reply": "2020-06-23T19:11:11.061781Z" - }, - "papermill": { - "duration": 0.80721, - "end_time": "2020-06-23T19:11:11.061997", - "exception": false, - "start_time": "2020-06-23T19:11:10.254787", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# plot actual vs predicted values for a smaller 2-day window for easier viewing\n", - "hrs = 48\n", - "plt.plot(sliced[Y_COL][:hrs].values)\n", - "plt.plot(forecast_ar1[:hrs], color='r', linestyle='--')\n", - "plt.plot(forecast_ar2[:hrs], color='g', 
linestyle='--')\n", - "plt.plot(forecast_ar2ma1[:hrs], color='c', linestyle='--')\n", - "plt.legend(['t', 'AR(1)', 'AR(2)', 'AR(2) MA(1)'], loc=2, ncol=1)\n", - "plt.title('ARIMA Model Predictions for First 48 hours of Validation Set')\n", - "plt.ylabel('Temperature (F)')\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.281868, - "end_time": "2020-06-23T19:11:11.631851", - "exception": false, - "start_time": "2020-06-23T19:11:11.349983", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We've again managed to reduce the RMSE value for our model, indicating that adding the MA(1) component has improved our forecast!\n", - "\n", - "Congratulations! You've applied the basics of time-series analysis for forecasting hourly temperatures. See if you can further improve the RMSE values by exploring the different values for the model parameters `p`, `q` and even `d`!\n", - "\n", - " \n", - "### Authors\n", - "\n", - "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n", - "\n", - "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - }, - "papermill": { - "duration": 105.745224, - "end_time": "2020-06-23T19:11:12.732957", - "environment_variables": {}, - "exception": null, - "input_path": "Part 3 - Time Series Forecasting.ipynb", - "output_path": "Part 3 - Time Series Forecasting-output.ipynb", - "parameters": {}, - "start_time": "2020-06-23T19:09:26.987733", - "version": "2.1.1" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/pipelines/introduction-to-generic-pipelines/README.md b/pipelines/introduction-to-generic-pipelines/README.md index 66ab075..2a640db 100644 --- a/pipelines/introduction-to-generic-pipelines/README.md +++ b/pipelines/introduction-to-generic-pipelines/README.md @@ -17,11 +17,11 @@ limitations under the License. --> ## Getting started with generic pipelines -A [pipeline](https://elyra.readthedocs.io/en/latest/user_guide/pipelines.html) comprises one or more nodes that are (in many cases) connected with each other to define execution dependencies. Each node is implemented by a [component](https://elyra.readthedocs.io/en/latest/user_guide/pipeline-components.html) and typically performs only a single task, such as loading data, processing data, training a model, or sending an email. +A [pipeline](https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html) comprises one or more nodes that are (in many cases) connected with each other to define execution dependencies. 
Each node is implemented by a [component](https://elyra.readthedocs.io/en/stable/user_guide/pipeline-components.html) and typically performs only a single task, such as loading data, processing data, training a model, or sending an email. ![A basic pipeline](doc/images/pipelines-nodes.png) -A _generic pipeline_ comprises nodes that are implemented using _generic components_. In the current release Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components have in common that they are supported in every Elyra pipelines runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow. +A [_generic pipeline_](https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html#generic-pipelines) comprises nodes that are implemented using [_generic components_](https://elyra.readthedocs.io/en/stable/user_guide/pipeline-components.html#generic-components). In the current release Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components have in common that they are supported in every Elyra pipelines runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow. ![Generic pipelines and supported runtime environments](doc/images/pipeline-runtimes-environments.png) @@ -35,9 +35,9 @@ In this introductory tutorial you will learn how to create a generic pipeline an ### Prerequisites -- [JupyterLab 3.x with the Elyra extension v3.x (or newer) installed](https://elyra.readthedocs.io/en/latest/getting_started/installation.html). +- [JupyterLab 3.x with the Elyra extension v3.13 (or newer) installed](https://elyra.readthedocs.io/en/stable/getting_started/installation.html). -> The tutorial instructions were last updated using Elyra version 3.0. +> The tutorial instructions were last updated using Elyra version 3.13. 
### Setup @@ -51,7 +51,7 @@ This tutorial uses the `introduction to generic pipelines` sample from the https ![Tutorial assets in File Browser](doc/images/tutorial-directory.png) - The cloned repository includes a set of files that download an open [weather data set from the Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/), cleanse the data, analyze the data, and perform time-series predictions. + The tutorial directory includes a set of Jupyter notebooks and Python scripts that download, process, and analyze the [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). You are ready to start the tutorial. @@ -65,7 +65,9 @@ You are ready to start the tutorial. ![Visual pipeline editor](doc/images/vpe.png) -1. In the JupyterLab _File Browser_ panel, right click on the untitled pipeline, and select ✎ _Rename_. +1. Click the _settings_ link on the canvas and review the pipeline editor configuration options that your Elyra installation supports. + +1. In the JupyterLab _File Browser_ panel, right click on the `untitled.pipeline` file, and select ✎ _Rename_. ![Rename pipeline](doc/images/rename-pipeline.png) @@ -77,19 +79,21 @@ You are ready to start the tutorial. ![Open the properties panel](doc/images/open-properties-panel.png) -1. Select the _Pipeline properties_ tab and enter a pipeline description. +1. Select the _Pipeline properties_ tab. Pipeline properties configure default property values that are applied to every applicable node and general metadata. Since generic pipelines may only include generic nodes, only the defaults in section _Generic Node Defaults_ are of interest in this tutorial. You'll learn more about these defaults in the next sections. + +1. Enter a pipeline description. ![Add pipeline description](doc/images/add-pipeline-description.png) 1. Close the properties panel. -Next, you'll add a file to the pipeline that downloads an open data set archive from public cloud storage. 
+Next, you'll add a file to the pipeline that downloads the Iris flower data file from the web. ### Add a notebook or script to the pipeline -This tutorial includes a Jupyter notebook `load_data.ipynb` and a Python script `load_data.py` that perform the same data loading task. +This tutorial includes a Jupyter notebook `load_data.ipynb` and a Python script `load_data.py`. -> For illustrative purposes the instructions use the notebook, but feel free to use the Python script. (The key takeaway is that you can mix and match notebooks and scripts, as desired.) +> For illustrative purposes the instructions use the Jupyter notebook, but feel free to use the Python script. (The key takeaway is that you can mix and match notebooks and scripts, as desired.) To add a notebook or script to the pipeline: @@ -101,10 +105,16 @@ To add a notebook or script to the pipeline: ![Component configuration error](doc/images/component-configuration-error.png) -1. Select the newly added node on the canvas, right click, and select _Open Properties_ from the context menu. +1. Select the newly added node on the canvas, right click, and select _Open Properties_ from the context menu. If you've customized the pipeline editor configuration, you can also double-click on the node. ![Open node properties](doc/images/open-node-properties.png) +Properties for generic nodes are divided into four sections: + - The _metadata_ section includes the component name, the component description, and the node label. + - The _inputs_ section defines component inputs, such as the Jupyter notebook or Python script name and local file dependencies. + - The _outputs_ section defines files that the Jupyter notebook or Python script produces and intends to make available to other pipeline nodes. + - The _additional properties_ section defines resources that modify the generic component. + 1. Configure the node properties. 
![Configure node properties](doc/images/configure-node-properties.png) @@ -137,7 +147,7 @@ To add a notebook or script to the pipeline: If desired, you can customize additional inputs by defining environment variables. -1. Click _refresh_ to scan the file for environment variable references. Refer to the [best practices for file-based pipeline nodes](https://elyra.readthedocs.io/en/latest/user_guide/best-practices-file-based-nodes.html#environment-variables) to learn more about how Elyra discovers environment variables in notebooks and scripts. +1. Click _refresh_ to scan the file for environment variable references. Refer to the [best practices for file-based pipeline nodes](https://elyra.readthedocs.io/en/stable/user_guide/best-practices-file-based-nodes.html#environment-variables) to learn more about how Elyra discovers environment variables in notebooks and scripts. ![Scan file for environment variables](doc/images/scan-file.png) @@ -229,9 +239,9 @@ You can access output artifacts from the _File Browser_. In the screen capture b ### Run a generic pipeline using the CLI -Elyra provides a [command line interface](https://elyra.readthedocs.io/en/latest/user_guide/command-line-interface.html) that you can use to manage metadata and work with pipelines. +Elyra provides a [command line interface](https://elyra.readthedocs.io/en/stable/user_guide/command-line-interface.html) that you can use to manage metadata and work with pipelines. -To run a pipeline locally using the [`elyra-pipeline`](https://elyra.readthedocs.io/en/latest/user_guide/command-line-interface.html#working-with-pipelines) CLI: +To run a pipeline locally using the [`elyra-pipeline`](https://elyra.readthedocs.io/en/stable/user_guide/command-line-interface.html#working-with-pipelines) CLI: 1. Open a terminal window that has access to the Elyra installation. 
@@ -269,7 +279,7 @@ Each of the notebooks can run in the `Pandas` container image and doesn't have a ### Resources -- [_Pipelines_ topic in the Elyra _User Guide_](https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html) +- [_Creating pipelines using the Visual Pipeline Editor_ topic in the Elyra _User Guide_](https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html#creating-pipelines-using-the-visual-pipeline-editor) - [_Pipeline components_ topic in the Elyra _User Guide_](https://elyra.readthedocs.io/en/stable/user_guide/pipeline-components.html) - [_Best practices for file-based pipeline nodes_ topic in the Elyra _User Guide_](https://elyra.readthedocs.io/en/stable/user_guide/best-practices-file-based-nodes.html) - [_Command line interface_ topic in the Elyra _User Guide_](https://elyra.readthedocs.io/en/stable/user_guide/command-line-interface.html) diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/add-pipeline-description.png b/pipelines/introduction-to-generic-pipelines/doc/images/add-pipeline-description.png index a596c49..4ccf3d9 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/add-pipeline-description.png and b/pipelines/introduction-to-generic-pipelines/doc/images/add-pipeline-description.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/completed-tutorial-pipeline.png b/pipelines/introduction-to-generic-pipelines/doc/images/completed-tutorial-pipeline.png index acf3f51..06aa4b3 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/completed-tutorial-pipeline.png and b/pipelines/introduction-to-generic-pipelines/doc/images/completed-tutorial-pipeline.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/configure-file-dependencies.png b/pipelines/introduction-to-generic-pipelines/doc/images/configure-file-dependencies.png index c9bf603..9a0c340 100644 Binary files 
a/pipelines/introduction-to-generic-pipelines/doc/images/configure-file-dependencies.png and b/pipelines/introduction-to-generic-pipelines/doc/images/configure-file-dependencies.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/configure-node-properties.png b/pipelines/introduction-to-generic-pipelines/doc/images/configure-node-properties.png index f310f6c..30f73b6 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/configure-node-properties.png and b/pipelines/introduction-to-generic-pipelines/doc/images/configure-node-properties.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/configure-resources.png b/pipelines/introduction-to-generic-pipelines/doc/images/configure-resources.png index ae1311c..99c1b8c 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/configure-resources.png and b/pipelines/introduction-to-generic-pipelines/doc/images/configure-resources.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/configure-runtime-image.png b/pipelines/introduction-to-generic-pipelines/doc/images/configure-runtime-image.png index e1beee2..8f8fe67 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/configure-runtime-image.png and b/pipelines/introduction-to-generic-pipelines/doc/images/configure-runtime-image.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/edit-node-label.png b/pipelines/introduction-to-generic-pipelines/doc/images/edit-node-label.png index 2e0b42a..d6022ff 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/edit-node-label.png and b/pipelines/introduction-to-generic-pipelines/doc/images/edit-node-label.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/empty-generic-pipeline.png b/pipelines/introduction-to-generic-pipelines/doc/images/empty-generic-pipeline.png index 50ad727..d214fb9 100644 Binary files 
a/pipelines/introduction-to-generic-pipelines/doc/images/empty-generic-pipeline.png and b/pipelines/introduction-to-generic-pipelines/doc/images/empty-generic-pipeline.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/jupyterlab-launcher.png b/pipelines/introduction-to-generic-pipelines/doc/images/jupyterlab-launcher.png index 4730cd2..0b438eb 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/jupyterlab-launcher.png and b/pipelines/introduction-to-generic-pipelines/doc/images/jupyterlab-launcher.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/open-properties-panel.png b/pipelines/introduction-to-generic-pipelines/doc/images/open-properties-panel.png index 20e4000..2a61f7f 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/open-properties-panel.png and b/pipelines/introduction-to-generic-pipelines/doc/images/open-properties-panel.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/rename-pipeline.png b/pipelines/introduction-to-generic-pipelines/doc/images/rename-pipeline.png index 21db361..6c840f9 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/rename-pipeline.png and b/pipelines/introduction-to-generic-pipelines/doc/images/rename-pipeline.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/select-file-to-run.png b/pipelines/introduction-to-generic-pipelines/doc/images/select-file-to-run.png index ed8f4e6..1eec32f 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/select-file-to-run.png and b/pipelines/introduction-to-generic-pipelines/doc/images/select-file-to-run.png differ diff --git a/pipelines/introduction-to-generic-pipelines/doc/images/vpe.png b/pipelines/introduction-to-generic-pipelines/doc/images/vpe.png index 787bb31..499ac95 100644 Binary files a/pipelines/introduction-to-generic-pipelines/doc/images/vpe.png and 
b/pipelines/introduction-to-generic-pipelines/doc/images/vpe.png differ diff --git a/pipelines/introduction-to-generic-pipelines/hello-generic-world.pipeline b/pipelines/introduction-to-generic-pipelines/hello-generic-world.pipeline new file mode 100644 index 0000000..138f08c --- /dev/null +++ b/pipelines/introduction-to-generic-pipelines/hello-generic-world.pipeline @@ -0,0 +1,92 @@ +{ + "doc_type": "pipeline", + "version": "3.0", + "json_schema": "http://api.dataplatform.ibm.com/schemas/common-pipeline/pipeline-flow/pipeline-flow-v3-schema.json", + "id": "elyra-auto-generated-pipeline", + "primary_pipeline": "primary", + "pipelines": [ + { + "id": "primary", + "nodes": [ + { + "id": "cd98c9f5-73d9-40c1-a11e-495ce699e8f0", + "type": "execution_node", + "op": "execute-notebook-node", + "app_data": { + "component_parameters": { + "dependencies": [], + "include_subdirectories": false, + "outputs": [], + "env_vars": [], + "kubernetes_pod_annotations": [], + "kubernetes_pod_labels": [], + "kubernetes_secrets": [], + "kubernetes_shared_mem_size": {}, + "kubernetes_tolerations": [], + "mounted_volumes": [], + "filename": "load_data.ipynb" + }, + "label": "Load tutorial dataset", + "ui_data": { + "label": "Load tutorial dataset", + "image": "/static/elyra/notebook.svg", + "x_pos": 144, + "y_pos": 100, + "description": "Run notebook file", + "decorations": [ + { + "id": "error", + "image": 
"data:image/svg+xml;utf8,%3Csvg%20focusable%3D%22false%22%20preserveAspectRatio%3D%22xMidYMid%20meet%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20fill%3D%22%23da1e28%22%20width%3D%2216%22%20height%3D%2216%22%20viewBox%3D%220%200%2016%2016%22%20aria-hidden%3D%22true%22%3E%3Ccircle%20cx%3D%228%22%20cy%3D%228%22%20r%3D%228%22%20fill%3D%22%23ffffff%22%3E%3C%2Fcircle%3E%3Cpath%20d%3D%22M8%2C1C4.2%2C1%2C1%2C4.2%2C1%2C8s3.2%2C7%2C7%2C7s7-3.1%2C7-7S11.9%2C1%2C8%2C1z%20M7.5%2C4h1v5h-1C7.5%2C9%2C7.5%2C4%2C7.5%2C4z%20M8%2C12.2%09c-0.4%2C0-0.8-0.4-0.8-0.8s0.3-0.8%2C0.8-0.8c0.4%2C0%2C0.8%2C0.4%2C0.8%2C0.8S8.4%2C12.2%2C8%2C12.2z%22%3E%3C%2Fpath%3E%3Cpath%20d%3D%22M7.5%2C4h1v5h-1C7.5%2C9%2C7.5%2C4%2C7.5%2C4z%20M8%2C12.2c-0.4%2C0-0.8-0.4-0.8-0.8s0.3-0.8%2C0.8-0.8%09c0.4%2C0%2C0.8%2C0.4%2C0.8%2C0.8S8.4%2C12.2%2C8%2C12.2z%22%20data-icon-path%3D%22inner-path%22%20opacity%3D%220%22%3E%3C%2Fpath%3E%3C%2Fsvg%3E", + "outline": false, + "position": "topRight", + "x_pos": -24, + "y_pos": -8 + } + ] + } + }, + "inputs": [ + { + "id": "inPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Input Port" + } + } + } + ], + "outputs": [ + { + "id": "outPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Output Port" + } + } + } + ] + } + ], + "app_data": { + "ui_data": { + "comments": [] + }, + "version": 8, + "properties": { + "name": "hello-generic-world", + "runtime": "Generic" + } + }, + "runtime_ref": "" + } + ], + "schemas": [] +} \ No newline at end of file diff --git a/pipelines/introduction-to-generic-pipelines/load_data.ipynb b/pipelines/introduction-to-generic-pipelines/load_data.ipynb index 463fed2..6f8f9af 100644 --- a/pipelines/introduction-to-generic-pipelines/load_data.ipynb +++ b/pipelines/introduction-to-generic-pipelines/load_data.ipynb @@ -2,243 +2,136 @@ "cells": [ { "cell_type": "markdown", - "id": "10e2d643", + "id": "42e8bbd2-ee3f-4e29-86a1-58fa9a8b313c", "metadata": { 
- "papermill": { - "duration": 0.008783, - "end_time": "2021-07-14T23:45:30.348760", - "exception": false, - "start_time": "2021-07-14T23:45:30.339977", - "status": "completed" - }, + "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ - "## Download a data set\n", - "\n", - "This notebook downloads a data set file from a public location. If the data set file is a compressed archive it will be decompressed. Upon completion the raw data set files are located in the `data\\` directory.\n", - "\n", - "This notebook requires the following environment variables:\n", - " - `DATASET_URL` Public data set URL, e.g. `https://dax-cdn.cdn.appdomain.cloud/dax-fashion-mnist/1.0.2/fashion-mnist.tar.gz`" + "## Copyright 2018-2022 Elyra Authors" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "8a2f3aab", - "metadata": { - "papermill": { - "duration": 0.070832, - "end_time": "2021-07-14T23:45:30.427472", - "exception": false, - "start_time": "2021-07-14T23:45:30.356640", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "id": "5d45a17f-02f6-419d-b61c-6facdb9e30ce", + "metadata": {}, "source": [ - "import glob\n", - "import json\n", - "import os\n", - "from pathlib import Path\n", - "import requests\n", - "import tarfile\n", - "from urllib.parse import urlparse" + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", + "\n", + "http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and\n", + "limitations under the License." 
] }, { "cell_type": "markdown", - "id": "0efbe465", + "id": "c2e2f18d-98bf-4e45-b4b0-4272ac411fdc", "metadata": { - "papermill": { - "duration": 0.005183, - "end_time": "2021-07-14T23:45:30.437749", - "exception": false, - "start_time": "2021-07-14T23:45:30.432566", - "status": "completed" - }, "tags": [] }, "source": [ - "Verify that the `DATASET_URL` environment variable is set. If it is not set, a RuntimeError is raised." + "## Download the data set\n", + "\n", + "This tutorial notebook downloads the Iris dataset from a public web resource. The download URL is configurable\n", + "using the environment variable `DATASET_URL`." ] }, { "cell_type": "code", "execution_count": null, - "id": "a4ffffc4", - "metadata": { - "papermill": { - "duration": 0.009866, - "end_time": "2021-07-14T23:45:30.452696", - "exception": false, - "start_time": "2021-07-14T23:45:30.442830", - "status": "completed" - }, - "tags": [] - }, + "id": "6e5ecdc6-cb07-47ee-922d-a8a9d4b2b934", + "metadata": {}, "outputs": [], "source": [ - "data_file = os.getenv('DATASET_URL',\n", - " 'https://dax-cdn.cdn.appdomain.cloud/'\n", - " 'dax-noaa-weather-data-jfk-airport/1.1.4/'\n", - " 'noaa-weather-data-jfk-airport.tar.gz')" + "import os\n", + "import requests" ] }, { "cell_type": "markdown", - "id": "85cfa27c", - "metadata": { - "papermill": { - "duration": 0.004601, - "end_time": "2021-07-14T23:45:30.462495", - "exception": false, - "start_time": "2021-07-14T23:45:30.457894", - "status": "completed" - }, - "tags": [] - }, + "id": "505efd47-9f40-444b-a6d7-c47b24ce0dcb", + "metadata": {}, "source": [ - "Download the data set from the location specified in `dataset_url`, extract it (if it is compressed) and store it in the directory identified by `data_dir_name`." + "Configure the dataset download URL: use the custom location if one was specified, or fall back to the default. 
A valid custom location is \"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"" ] }, { "cell_type": "code", "execution_count": null, - "id": "3aabd20a", - "metadata": { - "papermill": { - "duration": 1.085636, - "end_time": "2021-07-14T23:45:31.552664", - "exception": false, - "start_time": "2021-07-14T23:45:30.467028", - "status": "completed" - }, - "tags": [] - }, + "id": "b32246eb-ad22-4737-b0d4-2e0aaad0daca", + "metadata": {}, "outputs": [], "source": [ - "data_dir_name = 'data'\n", - "\n", - "print('Downloading data file {} ...'.format(data_file))\n", - "r = requests.get(data_file)\n", - "if r.status_code != 200:\n", - " raise RuntimeError('Could not fetch {}: HTTP status code {}'\n", - " .format(data_file, r.status_code))\n", - "else:\n", - " # extract data set file name from URL\n", - " data_file_name = Path((urlparse(data_file).path)).name\n", - " # create the directory where the downloaded file will be stored\n", - " data_dir = Path(data_dir_name)\n", - " data_dir.mkdir(parents=True, exist_ok=True)\n", - " downloaded_data_file = data_dir / data_file_name\n", - "\n", - " print('Saving downloaded file \"{}\" as ...'.format(data_file_name))\n", - " with open(downloaded_data_file, 'wb') as downloaded_file:\n", - " downloaded_file.write(r.content)\n", - "\n", - " if r.headers['content-type'] == 'application/x-tar':\n", - " print('Extracting downloaded file in directory \"{}\" ...'\n", - " .format(data_dir))\n", - " with tarfile.open(downloaded_data_file, 'r') as tar:\n", - " tar.extractall(data_dir)\n", - " print('Removing downloaded file ...')\n", - " downloaded_data_file.unlink()" + "data_file_url = os.getenv(\n", + " \"DATASET_URL\",\n", + " \"https://raw.githubusercontent.com/elyra-ai/examples/\"\n", + " \"main/pipelines/introduction-to-generic-pipelines/data/iris.data\",\n", + ")" ] }, { "cell_type": "markdown", - "id": "c0edca62", - "metadata": { - "papermill": { - "duration": 0.005151, - "end_time": "2021-07-14T23:45:31.564073", 
- "exception": false, - "start_time": "2021-07-14T23:45:31.558922", - "status": "completed" - }, - "tags": [] - }, + "id": "79365404-3bb1-4655-afdc-afc6800981b3", + "metadata": {}, "source": [ - "Display list of extracted data files" + "Download the dataset" ] }, { "cell_type": "code", "execution_count": null, - "id": "d7d542ea", - "metadata": { - "papermill": { - "duration": 0.011331, - "end_time": "2021-07-14T23:45:31.580464", - "exception": false, - "start_time": "2021-07-14T23:45:31.569133", - "status": "completed" - }, - "tags": [] - }, + "id": "c7ca0a32-2a99-4d63-8ba0-b93e8a8507c0", + "metadata": {}, "outputs": [], "source": [ - "for entry in glob.glob(data_dir_name + \"/**/*\", recursive=True):\n", - " print(entry)" + "print(f\"Downloading dataset file {data_file_url} ...\")\n", + "r = requests.get(data_file_url)\n", + "if r.status_code != 200:\n", + " raise RuntimeError(\n", + " f\"Error downloading {data_file_url}: HTTP status code {r.status_code}\"\n", + " )" ] }, { "cell_type": "markdown", - "id": "f9c7cb8d", - "metadata": { - "papermill": { - "duration": 0.005349, - "end_time": "2021-07-14T23:45:31.591143", - "exception": false, - "start_time": "2021-07-14T23:45:31.585794", - "status": "completed" - }, - "tags": [] - }, + "id": "c0f9750a-b4c8-4679-b252-dc9029173389", + "metadata": {}, "source": [ - "A notebook can produce output that is visualized in the Kubeflow Pipelines UI. For illustrative purposes we log the data set download URL. Refer to the [documentation](https://elyra.readthedocs.io/en/latest/recipes/visualizing-output-in-the-kfp-ui.html) to learn about supported visualization types and additional examples." 
+ "Save the dataset" ] }, { "cell_type": "code", "execution_count": null, - "id": "e0f2fcde", - "metadata": { - "papermill": { - "duration": 0.011146, - "end_time": "2021-07-14T23:45:31.607518", - "exception": false, - "start_time": "2021-07-14T23:45:31.596372", - "status": "completed" - }, - "tags": [] - }, + "id": "f291655a-f784-4a78-91b4-3bb95bad156b", + "metadata": {}, "outputs": [], "source": [ - "if os.environ.get('ELYRA_RUNTIME_ENV') == 'kfp':\n", - " # For information about Elyra environment variables refer to\n", - " # https://elyra.readthedocs.io/en/stable/user_guide/best-practices-file-based-nodes.html#proprietary-environment-variables # noqa: E501\n", - "\n", - " metadata = {\n", - " 'outputs': [\n", - " {\n", - " 'storage': 'inline',\n", - " 'source': '# Data archive URL: {}'\n", - " .format(data_file),\n", - " 'type': 'markdown',\n", - " }]\n", - " }\n", - "\n", - " with open('mlpipeline-ui-metadata.json', 'w') as f:\n", - " json.dump(metadata, f)" + "data_filename = \"iris.data\"\n", + "with open(data_filename, \"w\") as downloaded_file:\n", + " downloaded_file.write(r.text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5ae42ae-cf3a-4aee-bee7-cdea88e59e72", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Saved dataset file as '{data_filename}'.\")" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -252,18 +145,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" - }, - "papermill": { - "duration": 2.43342, - "end_time": "2021-07-14T23:45:31.825444", - "environment_variables": {}, - "exception": null, - "input_path": "/Users/patti/elyra_dev/update-examples/workspace/examples/pipelines/introduction-to-generic-pipelines/load_data.ipynb", - "output_path": 
"/Users/patti/elyra_dev/update-examples/workspace/examples/pipelines/introduction-to-generic-pipelines/load_data.ipynb", - "parameters": {}, - "start_time": "2021-07-14T23:45:29.392024", - "version": "2.1.1" + "version": "3.7.12" } }, "nbformat": 4, diff --git a/pipelines/introduction-to-generic-pipelines/load_data.py b/pipelines/introduction-to-generic-pipelines/load_data.py index d47a513..0d5d935 100644 --- a/pipelines/introduction-to-generic-pipelines/load_data.py +++ b/pipelines/introduction-to-generic-pipelines/load_data.py @@ -14,57 +14,24 @@ # limitations under the License. # import os -import tarfile -from pathlib import Path -from urllib.parse import urlparse - import requests - -def download_from_public_url(url): - - data_dir_name = 'data' - - print('Downloading data file {} ...'.format(url)) - r = requests.get(url) - if r.status_code != 200: - raise RuntimeError('Could not fetch {}: HTTP status code {}' - .format(url, r.status_code)) - else: - # extract data set file name from URL - data_file_name = Path((urlparse(url).path)).name - # create the directory where the downloaded file will be stored - data_dir = Path(data_dir_name) - data_dir.mkdir(parents=True, exist_ok=True) - downloaded_data_file = data_dir / data_file_name - - print('Saving downloaded file "{}" as ...'.format(data_file_name)) - with open(downloaded_data_file, 'wb') as downloaded_file: - downloaded_file.write(r.content) - - if r.headers['content-type'] == 'application/x-tar': - print('Extracting downloaded file in directory "{}" ...' - .format(data_dir)) - with tarfile.open(downloaded_data_file, 'r') as tar: - tar.extractall(data_dir) - print('Removing downloaded file ...') - downloaded_data_file.unlink() - - -if __name__ == "__main__": - - # This script downloads a compressed data set archive from a public - # location e.g. http://server/path/to/archive and extracts it. 
- # The archive location can be specified using the DATASET_URL environment - variable DATASET_URL=http://server/path/to/archive. - - # initialize download URL from environment variable - dataset_url = os.environ.get('DATASET_URL') - - # No data set URL was provided. - if dataset_url is None: - raise RuntimeError( - 'Cannot run script. A data set URL must be provided as input.') - - # Try to process the URL - download_from_public_url(dataset_url) +data_file_url = os.getenv( + "DATASET_URL", + "https://raw.githubusercontent.com/elyra-ai/examples/" + "main/pipelines/introduction-to-generic-pipelines/data/iris.data", +) + +# Download the dataset +print(f"Downloading data file {data_file_url} ...") +r = requests.get(data_file_url) +if r.status_code != 200: + raise RuntimeError( + f"Error downloading {data_file_url}: HTTP status code {r.status_code}" + ) + +# Save the dataset +data_filename = "iris.data" +with open(data_filename, "w") as downloaded_file: + downloaded_file.write(r.text) +print(f"Saved data file as '{data_filename}'.")
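Both the rewritten notebook and the rewritten `load_data.py` boil down to the same two building blocks: an environment-variable override with a hard-coded default (`DATASET_URL`), and a hard failure on any non-200 HTTP status. A minimal sketch of that pattern, factored into functions so it can be exercised without network access (the helper names `resolve_dataset_url` and `ensure_ok` are illustrative, not part of the source):

```python
import os

# Default taken from the tutorial's load_data.py; DATASET_URL overrides it.
DEFAULT_URL = (
    "https://raw.githubusercontent.com/elyra-ai/examples/"
    "main/pipelines/introduction-to-generic-pipelines/data/iris.data"
)


def resolve_dataset_url(env=None):
    """Return the custom download URL if DATASET_URL is set, else the default."""
    env = os.environ if env is None else env
    return env.get("DATASET_URL", DEFAULT_URL)


def ensure_ok(status_code, url):
    """Mirror the script's error handling: any status but 200 aborts the run."""
    if status_code != 200:
        raise RuntimeError(
            f"Error downloading {url}: HTTP status code {status_code}"
        )


if __name__ == "__main__":
    # Default applies when the variable is unset; an override wins when set.
    print(resolve_dataset_url({}))
    print(resolve_dataset_url({"DATASET_URL": "https://example.com/iris.data"}))
    ensure_ok(200, DEFAULT_URL)  # no-op on success
```

Keeping URL resolution and the status check as small pure functions makes this node's behavior easy to unit-test offline, which is convenient when the same file runs locally, on Kubeflow Pipelines, or on Apache Airflow.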