Merged
Binary file modified data/bake_sale.xlsx
Binary file not shown.
36 changes: 20 additions & 16 deletions databases.ipynb
@@ -17,7 +17,7 @@
"\n",
"### Prerequisites\n",
"\n",
"You will need the **pandas**, **SQLModel**, and **ibis** packages for this chapter. You probably already have **pandas** installed; to install **SQLModel** and **ibis** respectively run `uv add sqlmodel` and `uv add ibis-framework` on your computer's command line. First, let's bring in some general packages and turn off verbose warnings."
"You will need the **polars**, **SQLModel**, and **ibis** packages for this chapter. You probably already have **polars** installed; to install **SQLModel** and **ibis** respectively run `uv add sqlmodel` and `uv add ibis-framework` on your computer's command line. First, let's bring in some general packages and turn off verbose warnings."
]
},
{
@@ -39,10 +39,9 @@
"metadata": {},
"source": [
"## Database Basics\n",
"\n",
"At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology.\n",
"Like a **pandas** data frame, a database table is a collection of named columns, where every value in the column is the same type.\n",
"There are three high level differences between data frames and database tables:\n",
"At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology. \n",
"Like a **Polars** DataFrame, a database table is a collection of named columns, where every value in a column shares the same data type. \n",
"There are three high-level differences between data frames and database tables:\n",
"\n",
"- Database tables are stored on disk (i.e., in a file) and can be arbitrarily large.\n",
"  Data frames are stored in memory, and are fundamentally limited (although that limit is still big enough for many problems). You can think about the difference between on disk and in memory as being like the difference between long-term and short-term memory (and you have much more limited capacity in the latter).\n",
@@ -68,7 +67,7 @@
"\n",
"- You'll always use a database interface that provides a connection to the database, for example Python's built-in **sqlite3** package\n",
"\n",
"- You'll also use a package that pushes and/or pulls data to/from the database, for example **pandas**\n",
"- You'll also use a package that pushes and/or pulls data to/from the database, for example **polars**\n",
"\n",
"The precise details of the connection vary a lot from DBMS to DBMS, so unfortunately we can't cover them all here. The initial setup will often take a little fiddling (and maybe some research) to get right, but you'll generally only need to do it once. We'll do the best we can to cover some basics here.\n",
"\n",
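The connection step described above can be sketched with Python's built-in **sqlite3** package. This is a minimal, self-contained illustration: the in-memory database and the tiny `track` table are stand-ins for whatever your real database file contains.

```python
import sqlite3

# ":memory:" is a stand-in for a real file path, e.g. "data/database.db"
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Create and fill a tiny table so the query below returns something
cur.execute("CREATE TABLE track (trackid INTEGER, name TEXT)")
cur.executemany("INSERT INTO track VALUES (?, ?)", [(1, "Song A"), (2, "Song B")])

# Queries return rows as a list of tuples
rows = cur.execute("SELECT * FROM track").fetchall()
print(rows)
```

Swapping `":memory:"` for the path to a `.db` or `.sqlite` file is all that changes when the database lives on disk.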
@@ -112,7 +111,7 @@
"id": "2992b718",
"metadata": {},
"source": [
"Note that the output here is in the form a Python object called a tuple. If we wanted to put this into a **pandas** data frame, we can just pass it straight in:"
"Note that the output here is a list of Python tuples. If we want to convert this into a **Polars** DataFrame, we can pass it to `pl.DataFrame()`. When working with tuples, you may need to provide column names using the **schema** argument or specify **orient=\"row\"** so Polars correctly interprets the structure."
]
},
{
@@ -122,9 +121,11 @@
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import polars as pl\n",
"\n",
"df = pl.DataFrame(rows, orient=\"row\")  # pass schema=[...] to name the columns\n",
"\n",
"pd.DataFrame(rows)"
"df"
]
},
{
@@ -316,9 +317,9 @@
"source": [
"### Joins\n",
"\n",
"If you're familiar with joins in **pandas**, SQL joins are very similar. Let's see if we can join the 'album' and 'track' tables to find the *name* of the albums in the above query.\n",
"If you're familiar with joins in **polars**, SQL joins are very similar. Let's see if we can join the 'album' and 'track' tables to find the *name* of the albums in the above query.\n",
"\n",
"Note that as soon as we have the *same* column names in more than one table, we need to specify the table we are referring to when we use that column name. There are different options for joins (eg `INNER`, `LEFT`) that you can find out more about [here](https://en.wikipedia.org/wiki/Join_(SQL)).\n"
"In **polars**, you use the `df.join()` method, which defaults to an inner join. Note that if you have the same column names in both tables, Polars will often append a suffix (like `_right`) to the duplicate names to keep them distinct, unless you specify otherwise. There are different options for SQL joins (e.g. `INNER`, `LEFT`) that you can find out more about [here](https://en.wikipedia.org/wiki/Join_(SQL)).\n"
]
},
{
@@ -403,9 +404,9 @@
"id": "495f97e5",
"metadata": {},
"source": [
"## SQL with **pandas**\n",
"## SQL with **polars**\n",
"\n",
"**pandas** is well-equipped for working with SQL. We can simply push the query we just created straight through using its `read_sql()` function—but bear in mind we need to pass in the connection we created to the database too:"
"**polars** is well-equipped for working with SQL. We can simply push the query we just created straight through using its `read_database()` function—but bear in mind we need to pass in the connection we created to the database too:"
]
},
{
@@ -415,7 +416,10 @@
"metadata": {},
"outputs": [],
"source": [
"pd.read_sql(sql_join, con)"
"df = pl.read_database(\n",
" query=sql_join, # your SQL query (string)\n",
" connection=con, # your connection object (SQLAlchemy, psycopg2 cursor, etc.)\n",
")"
]
},
{
@@ -435,7 +439,7 @@
"source": [
"## SQL with **ibis**\n",
"\n",
"It's not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from **pandas** commands? You can't *quite* do that, but there's a package that gets you pretty close and it's called [**ibis**](https://ibis-project.org/). **ibis** is particularly useful when you are reading from a database and want to query it just like you would a **pandas** data frame.\n",
"It's not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from **polars** commands? You can't *quite* do that, but there's a package that gets you pretty close and it's called [**ibis**](https://ibis-project.org/). **ibis** is particularly useful when you are reading from a database and want to query it just like you would a **polars** data frame.\n",
"\n",
"**Ibis** can connect to local databases (e.g. a SQLite database), server-based databases (e.g. Postgres), or cloud-based databases (e.g. Google's BigQuery). The syntax to make a connection is, for example, `ibis.bigquery.connect`.\n",
"\n",
@@ -462,7 +466,7 @@
"id": "6dcd7d71",
"metadata": {},
"source": [
"Okay, now let's reproduce the following query: \"SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\". We'll use a groupby, a mutate (which you can think of like **pandas**' assign statement), a sort, and then `limit()` to only show the first five entries."
"Okay, now let's reproduce the following query: \"SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\". We'll use a `group_by`, a `mutate` (which you can think of as **polars**' `with_columns`), a sort, and then `limit()` to only show the first five entries."
]
},
{
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -6,6 +6,7 @@ readme = "README.md"
requires-python = ">=3.12.0,<3.13"
dependencies = [
"beautifulsoup4>=4.12.3",
"fastexcel>=0.19.0",
"graphviz>=0.20.3",
"ibis-framework[sqlite]>=9.5.0",
"ipykernel>=6.29.5",
@@ -36,6 +37,7 @@ dependencies = [
"toml>=0.10.2",
"watermark>=2.5.0",
"wbgapi>=1.0.14",
"xlsxwriter>=3.2.0",
"yfinance>=1.2.1",
]
