From 13d241f88ef66797aabe40095f2f81c16e7f01c5 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Tue, 10 Sep 2019 21:34:54 +0200 Subject: [PATCH 1/3] Add 1_table_oriented.ipynb from master --- notebooks/1_table_oriented.ipynb | 620 +++++++++++++++++++++++++++++++ 1 file changed, 620 insertions(+) create mode 100644 notebooks/1_table_oriented.ipynb diff --git a/notebooks/1_table_oriented.ipynb b/notebooks/1_table_oriented.ipynb new file mode 100644 index 0000000..3a195be --- /dev/null +++ b/notebooks/1_table_oriented.ipynb @@ -0,0 +1,620 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Objectives\n", + "\n", + "- Load the Python Data Analysis Library (Pandas).\n", + "- Describe what a DataFrame and a Series are\n", + "- Understand DataFrame attributes versus methods\n", + "\n", + "### Content to cover\n", + "\n", + "* import pandas as pd\n", + "* DataFrame/Series\n", + "* .index/.columns versus .head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas is table oriented" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to start using Pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To load the Pandas package and start working with it, import the package. The community agreed shortcut for pandas is `pd`, so loading Pandas as `pd` is assumed standard practice for all of the Pandas documentation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pandas table data representation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/01_table_dataframe.svg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and the cabin class (categories 1, 2 or 3) data." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | Name | Age | Pclass |
| 0 | Braund, Mr. Owen Harris | 22 | 3 |
| 1 | Allen, Mr. William Henry | 35 | 3 |
| 2 | Bonnell, Miss. Elizabeth | 58 | 1 |
\n", + "
" + ], + "text/plain": [ + " Name Age Pclass\n", + "0 Braund, Mr. Owen Harris 22 3\n", + "1 Allen, Mr. William Henry 35 3\n", + "2 Bonnell, Miss. Elizabeth 58 1" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe = pd.DataFrame({\n", + " 'Name': [\"Braund, Mr. Owen Harris\", \n", + " \"Allen, Mr. William Henry\", \n", + " \"Bonnell, Miss. Elizabeth\"], \n", + " 'Age': [22, 35, 58],\n", + " 'Pclass': pd.Categorical([3, 3, 1])}\n", + " )\n", + "my_dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categgorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.\n", + "\n", + "__Note:__ In most situations, data tables stored in a file format are the starting point of an analysis. The [next tutorial](2_read_write.ipynb) provides more insight to reading data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Attributes of a table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> What are the column names, row names and type of data in my data table?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each `DataFrame` has a number of attributes. These are characteristics of the table and can be requested by `.` in combination with the attribute name. 
FTo start with, the following attributes are worthwhile to remember:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The __column__ names of the `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Name', 'Age', 'Pclass'], dtype='object')" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The row labels are defined by the __index__ of a `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=3, step=1)" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe.index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The __shape__, the number of rows and columns, of a `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 3)" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The type of data (integers, float, characters, datetime,...) 
of the individual columns is expressed in the __dtypes__ attribute: " + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Name object\n", + "Age int64\n", + "Pclass category\n", + "dtype: object" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each column contains data from a single data type. `object` is Pandas terminology for character data.\n", + "\n", + "__To user guide:__ For an overview of the supported dtypes of Pandas, see :ref:`basics.dtypes`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Functionalities of a table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I'm interested in a short summary of my data table" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 3 entries, 0 to 2\n", + "Data columns (total 3 columns):\n", + "Name 3 non-null object\n", + "Age 3 non-null int64\n", + "Pclass 3 non-null category\n", + "dtypes: category(1), int64(1), object(1)\n", + "memory usage: 275.0+ bytes\n" + ] + } + ], + "source": [ + "my_dataframe.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output provides some information on the `DataFrame`:\n", + "\n", + "- It is indeed a `DataFrame`.\n", + "- Each row was assigned an index (row label) of 0 to N-1, where N is the number of rows in the `DataFrame`. Pandas will do this by default if an index is not specified. Don't worry, this can be changed later.\n", + "- There are 3 entries, i.e. 
rows.\n", + "- The table has 3 columns, each of them with all values provided, so no missing values.\n", + "- One of the columns consists of character data, one of integers and thhe latter is categorical data.\n", + "- The approximate amount of RAM used to hold the DataFrame is provided as well." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As illustrated by the `info()` method, you can _do_ things with a `DataFrame`. Pandas provides a lot of functionalities to work with `DataFrame`, each of them a _method_ you can apply to a `DataFrame`. As methods are functions, do not forget to use parenthesis `()`. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I'm interested in some basic statistics of the numerical data of my data table" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | Age |
| count | 3.000000 |
| mean | 38.333333 |
| std | 18.230012 |
| min | 22.000000 |
| 25% | 28.500000 |
| 50% | 35.000000 |
| 75% | 46.500000 |
| max | 58.000000 |
\n", + "
" + ], + "text/plain": [ + " Age\n", + "count 3.000000\n", + "mean 38.333333\n", + "std 18.230012\n", + "min 22.000000\n", + "25% 28.500000\n", + "50% 35.000000\n", + "75% 46.500000\n", + "max 58.000000" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_dataframe.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the `Name` and `PClass` columns are character and categorical data respectively, these are by default not taken into account by the `describe` method. \n", + "\n", + "__To user guide:__ check more options on `describe` :ref:`basics.describe`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pandas Series" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/01_table_series.svg)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 22\n", + "1 35\n", + "2 58\n", + "dtype: int64" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages = pd.Series([22, 35, 58])\n", + "ages" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A single column version of a `DataFrame` is a Pandas `Series`. 
It does not have columns names, but still has the row index:" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=3, step=1)" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages.index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similar to a Pandas `DataFrame`, you can _do_ things with a `Series` and apply a method:" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 3.000000\n", + "mean 38.333333\n", + "std 18.230012\n", + "min 22.000000\n", + "25% 28.500000\n", + "50% 35.000000\n", + "75% 46.500000\n", + "max 58.000000\n", + "dtype: float64" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ Why both `Series` and `DataFrame` are required, see :ref:`TODO` ([label](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#why-more-than-one-data-structure) to add in sphinx)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## REMEMBER\n", + "\n", + "- Working with Pandas always requires `import Pandas as pd`\n", + "- A table of data is stored as a Pandas `DataFrame`\n", + "- A single column version of a `DataFrame` is a `Series`\n", + "- A Pandas DataFrame and Series do have attributes (i.e. characteristics) and methods (i.e. actions on it).\n", + "- Methods require `()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ A more extended introduction to `DataFrame` and `Series` is provided in :ref:`dsintro`." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 3c1838b2245befb564f1fa63030d48d3100a3dc9 Mon Sep 17 00:00:00 2001 From: stijnvanhoey Date: Mon, 23 Sep 2019 12:55:33 +0200 Subject: [PATCH 2/3] Update from master --- notebooks/1_table_oriented.ipynb | 348 ++++++++++--------------------- 1 file changed, 106 insertions(+), 242 deletions(-) diff --git a/notebooks/1_table_oriented.ipynb b/notebooks/1_table_oriented.ipynb index 3a195be..07a4c9d 100644 --- a/notebooks/1_table_oriented.ipynb +++ b/notebooks/1_table_oriented.ipynb @@ -1,33 +1,5 @@ { "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Objectives\n", - "\n", - "- Load the Python Data Analysis Library (Pandas).\n", - "- Describe what a DataFrame and a Series are\n", - "- Understand DataFrame attributes versus methods\n", - "\n", - "### Content to cover\n", - "\n", - "* import pandas as pd\n", - "* DataFrame/Series\n", - "* .index/.columns versus .head()\n" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -44,7 +16,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -55,14 +27,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To load the Pandas package and start working with it, import the package. 
The community agreed shortcut for pandas is `pd`, so loading Pandas as `pd` is assumed standard practice for all of the Pandas documentation." + "To load the pandas package and start working with it, import the package. The community agreed shortcut for pandas is `pd`, so loading pandas as `pd` is assumed standard practice for all of the pandas documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Pandas table data representation" + "### Pandas data table representation" ] }, { @@ -81,7 +53,7 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -112,19 +84,19 @@ " \n", " \n", " \n", - " 0\n", + " 0\n", " Braund, Mr. Owen Harris\n", " 22\n", " 3\n", " \n", " \n", - " 1\n", + " 1\n", " Allen, Mr. William Henry\n", " 35\n", " 3\n", " \n", " \n", - " 2\n", + " 2\n", " Bonnell, Miss. Elizabeth\n", " 58\n", " 1\n", @@ -140,7 +112,7 @@ "2 Bonnell, Miss. Elizabeth 58 1" ] }, - "execution_count": 47, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } @@ -151,7 +123,8 @@ " \"Allen, Mr. William Henry\", \n", " \"Bonnell, Miss. Elizabeth\"], \n", " 'Age': [22, 35, 58],\n", - " 'Pclass': pd.Categorical([3, 3, 1])}\n", + " 'Pclass': pd.Categorical([3, 3, 1])},\n", + " index = [0, 1, 2]\n", " )\n", "my_dataframe" ] @@ -160,209 +133,197 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categgorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.\n", + "A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R. 
\n", "\n", - "__Note:__ In most situations, data tables stored in a file format are the starting point of an analysis. The [next tutorial](2_read_write.ipynb) provides more insight to reading data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Attributes of a table" + "- The table has 3 columns, each of them with a column label. The column labels are respectively `Name`, `Age` and `Pclass`.\n", + "- The column `Name` consists of textual data with each value a string, the column `Age` are numbers and the latter is categorical data (each category represents a cabin class).\n", + "\n", + "In spreadsheet software, the table representation of our data would look very similar:\n", + "\n", + "![](../schemas/01_table_spreadsheet.png)\n", + "\n", + "\n", + " \n", + "
\n", + " \n", + "__Note:__ You probably do not want to manually input the data of a DataFrame! In most situations, data tables stored in a file format are the starting point of an analysis. The [next tutorial](2_read_write.ipynb) provides more insight to reading data from a variety of data sources.\n", + "\n", + "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> What are the column names, row names and type of data in my data table?" + "# Each column in a `DataFrame` is a `Series`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Each `DataFrame` has a number of attributes. These are characteristics of the table and can be requested by `.` in combination with the attribute name. FTo start with, the following attributes are worthwhile to remember:" + "![](../schemas/01_table_series.svg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The __column__ names of the `DataFrame`:" + "> I'm just interested in working with the data in the column `Age`" ] }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "Index(['Name', 'Age', 'Pclass'], dtype='object')" + "0 22\n", + "1 35\n", + "2 58\n", + "Name: Age, dtype: int64" ] }, - "execution_count": 48, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe.columns" + "my_dataframe[\"Age\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The row labels are defined by the __index__ of a `DataFrame`:" + "When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`. To select the column, use the column label in between square brackets `[]`. Already wondering about other ways to select data, jump straight to [the tutorial on subsetting](3_subset_data.ipynb)." ] }, { - "cell_type": "code", - "execution_count": 49, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "RangeIndex(start=0, stop=3, step=1)" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "my_dataframe.index" + "
\n", + " \n", + "When you are familiar to Python :ref:`dictionaries `, the selection of a single column is very similar to selection of dictionary values base on the key.\n", + "\n", + "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The __shape__, the number of rows and columns, of a `DataFrame`:" + "You can create a `Series` from scratch as well:" ] }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(3, 3)" + "0 22\n", + "1 35\n", + "2 58\n", + "Name: Age, dtype: int64" ] }, - "execution_count": 50, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe.shape" + "ages = pd.Series([22, 35, 58], name = \"Age\")\n", + "ages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The type of data (integers, float, characters, datetime,...) of the individual columns is expressed in the __dtypes__ attribute: " - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Name object\n", - "Age int64\n", - "Pclass category\n", - "dtype: object" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "my_dataframe.dtypes" + "A pandas `Series` has no column labels, as it is just a single column of a `DataFrame`. A Series does have row labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Each column contains data from a single data type. 
`object` is Pandas terminology for character data.\n", - "\n", - "__To user guide:__ For an overview of the supported dtypes of Pandas, see :ref:`basics.dtypes`" + "# Do something with a DataFrame or Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Functionalities of a table" + "> I want to know the maximum Age of the passengers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> I'm interested in a short summary of my data table" + "We can do this on the `DataFrame` by selecting the `Age` column and applying `max()`:" ] }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 32, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 3 entries, 0 to 2\n", - "Data columns (total 3 columns):\n", - "Name 3 non-null object\n", - "Age 3 non-null int64\n", - "Pclass 3 non-null category\n", - "dtypes: category(1), int64(1), object(1)\n", - "memory usage: 275.0+ bytes\n" - ] + "data": { + "text/plain": [ + "58" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "my_dataframe.info()" + "my_dataframe[\"Age\"].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The output provides some information on the `DataFrame`:\n", - "\n", - "- It is indeed a `DataFrame`.\n", - "- Each row was assigned an index (row label) of 0 to N-1, where N is the number of rows in the `DataFrame`. Pandas will do this by default if an index is not specified. Don't worry, this can be changed later.\n", - "- There are 3 entries, i.e. rows.\n", - "- The table has 3 columns, each of them with all values provided, so no missing values.\n", - "- One of the columns consists of character data, one of integers and thhe latter is categorical data.\n", - "- The approximate amount of RAM used to hold the DataFrame is provided as well." 
+ "Or to the `Series`:" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "58" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "As illustrated by the `info()` method, you can _do_ things with a `DataFrame`. Pandas provides a lot of functionalities to work with `DataFrame`, each of them a _method_ you can apply to a `DataFrame`. As methods are functions, do not forget to use parenthesis `()`. " + "As illustrated by the `max()` method, you can _do_ things with a `DataFrame` or `Series`. Pandas provides a lot of functionalities each of them a _method_ you can apply to a `DataFrame` or `Series`. As methods are like functions, do not forget to use parenthesis `()`. Already looking forward to get more advanced summary statistics, go directly to the [tutorial on statistics](6_calculate_statistics.ipynb). Or rather want to do calculations with entire columns, go straight to [tutorial on calculating with columns](5_add_columns.ipynb)." 
] }, { @@ -374,7 +335,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -403,35 +364,35 @@ " \n", " \n", " \n", - " count\n", + " count\n", " 3.000000\n", " \n", " \n", - " mean\n", + " mean\n", " 38.333333\n", " \n", " \n", - " std\n", + " std\n", " 18.230012\n", " \n", " \n", - " min\n", + " min\n", " 22.000000\n", " \n", " \n", - " 25%\n", + " 25%\n", " 28.500000\n", " \n", " \n", - " 50%\n", + " 50%\n", " 35.000000\n", " \n", " \n", - " 75%\n", + " 75%\n", " 46.500000\n", " \n", " \n", - " max\n", + " max\n", " 58.000000\n", " \n", " \n", @@ -450,7 +411,7 @@ "max 58.000000" ] }, - "execution_count": 53, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } @@ -463,7 +424,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As the `Name` and `PClass` columns are character and categorical data respectively, these are by default not taken into account by the `describe` method. \n", + "The `describe` method provides quick overview of the numerical data in a `DataFrame`. As the `Name` and `PClass` columns are character and categorical data respectively, these are by default not taken into account by the `describe` method. 
\n", "\n", "__To user guide:__ check more options on `describe` :ref:`basics.describe`" ] @@ -472,107 +433,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Pandas Series" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![](../schemas/01_table_series.svg)" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 22\n", - "1 35\n", - "2 58\n", - "dtype: int64" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ages = pd.Series([22, 35, 58])\n", - "ages" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A single column version of a `DataFrame` is a Pandas `Series`. It does not have columns names, but still has the row index:" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "RangeIndex(start=0, stop=3, step=1)" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ages.index" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similar to a Pandas `DataFrame`, you can _do_ things with a `Series` and apply a method:" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "count 3.000000\n", - "mean 38.333333\n", - "std 18.230012\n", - "min 22.000000\n", - "25% 28.500000\n", - "50% 35.000000\n", - "75% 46.500000\n", - "max 58.000000\n", - "dtype: float64" - ] - }, - "execution_count": 56, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ages.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__To user guide:__ Why both `Series` and `DataFrame` are required, see :ref:`TODO` 
([label](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#why-more-than-one-data-structure) to add in sphinx)" + "
\n", + " \n", + "__Note:__ This is just a starting point. Besides the looks, also the data manipulations and calculations you would do in spreadsheet software are supported by Pandas. Continue reading the next tutorials to get you started!\n", + "\n", + "
" ] }, { @@ -581,11 +446,10 @@ "source": [ "## REMEMBER\n", "\n", - "- Working with Pandas always requires `import Pandas as pd`\n", + "- Import the package, aka `import Pandas as pd`\n", "- A table of data is stored as a Pandas `DataFrame`\n", - "- A single column version of a `DataFrame` is a `Series`\n", - "- A Pandas DataFrame and Series do have attributes (i.e. characteristics) and methods (i.e. actions on it).\n", - "- Methods require `()`" + "- Each column in a `DataFrame` is a `Series`\n", + "- You can do things by applying a method to a `DataFrame` or `Series`" ] }, { From 344cba96b89e38bccc2061367b96c489042c62c2 Mon Sep 17 00:00:00 2001 From: stijnvanhoey Date: Mon, 7 Oct 2019 18:46:00 +0200 Subject: [PATCH 3/3] update from master --- notebooks/1_table_oriented.ipynb | 94 +++++++++++++++----------------- 1 file changed, 43 insertions(+), 51 deletions(-) diff --git a/notebooks/1_table_oriented.ipynb b/notebooks/1_table_oriented.ipynb index 07a4c9d..432f6c6 100644 --- a/notebooks/1_table_oriented.ipynb +++ b/notebooks/1_table_oriented.ipynb @@ -16,7 +16,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -27,7 +27,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To load the pandas package and start working with it, import the package. The community agreed shortcut for pandas is `pd`, so loading pandas as `pd` is assumed standard practice for all of the pandas documentation." + "To load the pandas package and start working with it, import the package. The community agreed alias for pandas is `pd`, so loading pandas as `pd` is assumed standard practice for all of the pandas documentation." ] }, { @@ -48,12 +48,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and the cabin class (categories 1, 2 or 3) data." 
+ "> I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and sex (male/female) data." ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -79,7 +79,7 @@ " \n", " Name\n", " Age\n", - " Pclass\n", + " Sex\n", " \n", " \n", " \n", @@ -87,46 +87,45 @@ " 0\n", " Braund, Mr. Owen Harris\n", " 22\n", - " 3\n", + " male\n", " \n", " \n", " 1\n", " Allen, Mr. William Henry\n", " 35\n", - " 3\n", + " male\n", " \n", " \n", " 2\n", " Bonnell, Miss. Elizabeth\n", " 58\n", - " 1\n", + " female\n", " \n", " \n", "\n", "" ], "text/plain": [ - " Name Age Pclass\n", - "0 Braund, Mr. Owen Harris 22 3\n", - "1 Allen, Mr. William Henry 35 3\n", - "2 Bonnell, Miss. Elizabeth 58 1" + " Name Age Sex\n", + "0 Braund, Mr. Owen Harris 22 male\n", + "1 Allen, Mr. William Henry 35 male\n", + "2 Bonnell, Miss. Elizabeth 58 female" ] }, - "execution_count": 35, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe = pd.DataFrame({\n", - " 'Name': [\"Braund, Mr. Owen Harris\", \n", + "df = pd.DataFrame({\n", + " \"Name\": [\"Braund, Mr. Owen Harris\", \n", " \"Allen, Mr. William Henry\", \n", " \"Bonnell, Miss. Elizabeth\"], \n", - " 'Age': [22, 35, 58],\n", - " 'Pclass': pd.Categorical([3, 3, 1])},\n", - " index = [0, 1, 2]\n", + " \"Age\": [22, 35, 58],\n", + " \"Sex\": [\"male\", \"male\", \"female\"]}\n", " )\n", - "my_dataframe" + "df" ] }, { @@ -135,27 +134,19 @@ "source": [ "A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R. \n", "\n", - "- The table has 3 columns, each of them with a column label. 
The column labels are respectively `Name`, `Age` and `Pclass`.\n", - "- The column `Name` consists of textual data with each value a string, the column `Age` are numbers and the latter is categorical data (each category represents a cabin class).\n", + "- The table has 3 columns, each of them with a column label. The column labels are respectively `Name`, `Age` and `Sex`.\n", + "- The column `Name` consists of textual data with each value a string, the column `Age` are numbers and the column `Sex` is textual data.\n", "\n", "In spreadsheet software, the table representation of our data would look very similar:\n", "\n", - "![](../schemas/01_table_spreadsheet.png)\n", - "\n", - "\n", - " \n", - "
\n", - " \n", - "__Note:__ You probably do not want to manually input the data of a DataFrame! In most situations, data tables stored in a file format are the starting point of an analysis. The [next tutorial](2_read_write.ipynb) provides more insight to reading data from a variety of data sources.\n", - "\n", - "
" + "![](../schemas/01_table_spreadsheet.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Each column in a `DataFrame` is a `Series`" + "### Each column in a `DataFrame` is a `Series`" ] }, { @@ -174,7 +165,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -186,20 +177,20 @@ "Name: Age, dtype: int64" ] }, - "execution_count": 27, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe[\"Age\"]" + "df[\"Age\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`. To select the column, use the column label in between square brackets `[]`. Already wondering about other ways to select data, jump straight to [the tutorial on subsetting](3_subset_data.ipynb)." + "When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`. To select the column, use the column label in between square brackets `[]`." ] }, { @@ -208,7 +199,7 @@ "source": [ "
\n", " \n", - "When you are familiar to Python :ref:`dictionaries `, the selection of a single column is very similar to selection of dictionary values base on the key.\n", + "If you are familiar with Python :ref:`dictionaries `, the selection of a single column is very similar to the selection of dictionary values based on the key.\n", "\n", "
" ] @@ -222,7 +213,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -234,7 +225,7 @@ "Name: Age, dtype: int64" ] }, - "execution_count": 36, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -255,7 +246,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Do something with a DataFrame or Series" + "### Do something with a DataFrame or Series" ] }, { @@ -274,7 +265,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -283,13 +274,13 @@ "58" ] }, - "execution_count": 32, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe[\"Age\"].max()" + "df[\"Age\"].max()" ] }, { @@ -301,7 +292,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -310,7 +301,7 @@ "58" ] }, - "execution_count": 33, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -323,7 +314,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As illustrated by the `max()` method, you can _do_ things with a `DataFrame` or `Series`. Pandas provides a lot of functionalities each of them a _method_ you can apply to a `DataFrame` or `Series`. As methods are like functions, do not forget to use parenthesis `()`. Already looking forward to get more advanced summary statistics, go directly to the [tutorial on statistics](6_calculate_statistics.ipynb). Or rather want to do calculations with entire columns, go straight to [tutorial on calculating with columns](5_add_columns.ipynb)." + "As illustrated by the `max()` method, you can _do_ things with a `DataFrame` or `Series`. Pandas provides a lot of functionalities, each of them a _method_ you can apply to a `DataFrame` or `Series`. As methods are functions, do not forget to use parentheses `()`." 
] }, { @@ -335,7 +326,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -411,20 +402,21 @@ "max 58.000000" ] }, - "execution_count": 34, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "my_dataframe.describe()" + "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The `describe` method provides quick overview of the numerical data in a `DataFrame`. As the `Name` and `PClass` columns are character and categorical data respectively, these are by default not taken into account by the `describe` method. \n", + "The `describe` method provides a quick overview of the numerical data in a `DataFrame`. As the `Name` and `Sex` columns are textual data, these are by default not taken into account by the `describe` method. Many pandas operations return a `DataFrame` or a `Series`. Here, the `describe` method is an example of a pandas operation returning a pandas `DataFrame`.\n", + "\n", "\n", "__To user guide:__ check more options on `describe` :ref:`basics.describe`" ] @@ -435,7 +427,7 @@ "source": [ "
\n", " \n", - "__Note:__ This is just a starting point. Besides the looks, also the data manipulations and calculations you would do in spreadsheet software are supported by Pandas. Continue reading the next tutorials to get you started!\n", + "__Note:__ This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Apart from the representation, the data manipulations and calculations you would do in spreadsheet software are also supported by pandas. Continue reading the next tutorials to get started!\n", "\n", "
" ] @@ -447,7 +439,7 @@ "## REMEMBER\n", "\n", "- Import the package, aka `import pandas as pd`\n", - "- A table of data is stored as a Pandas `DataFrame`\n", + "- A table of data is stored as a pandas `DataFrame`\n", "- Each column in a `DataFrame` is a `Series`\n", "- You can do things by applying a method to a `DataFrame` or `Series`" ]