1_FD.ipynb
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# SQL in Python - Connecting to and retrieving data from PostgreSQL"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Previously, you learned how to connect to a SQL database using a SQL client such as DBeaver.\n",
"Apart from connecting to databases, DBeaver also allows you to run SQL queries against the database, create new tables, populate them with data, and retrieve the data.\n",
"\n",
"Populating tables with data that you have locally on your machine usually requires you to save it in a file, like a CSV, and import it using the DBeaver UI.\n",
"\n",
"Often, before you reach the final step of uploading your dataset, you will have performed data cleaning procedures to bring your data into shape. This means importing the data into Python, cleaning it, exporting it to a CSV file, importing it into DBeaver, and uploading the data into the database.\n",
"\n",
"This process requires multiple steps and more than one tool. Fortunately, we can reduce the steps by connecting to the database from Python directly, eliminating the need for a separate SQL client.\n",
"\n",
"**In this notebook you will see two ways to connect to SQL databases and export the data to a CSV file.**\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a connection to a PostgreSQL database with Python"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two Python packages that are the \"go-to\" choices when it comes to connecting to SQL databases: `psycopg2` and `sqlalchemy`.\n",
"\n",
"First, an example with `psycopg2`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd # to read sql data into a pandas dataframe\n",
"import psycopg2 # to connect to SQL database"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In order to create a connection to our PostgreSQL database we need the following information (usually posted by coaches during SQL days in Slack):\n",
"\n",
"- database = name of the database\n",
"- user = name of the user\n",
"- password = password of the user\n",
"- host = address of the machine the database is hosted on\n",
"- port = virtual gate number through which communication will be allowed\n",
"\n",
"Because we don't want the database credentials to be published on GitHub, we put them into a `.env` file, which is added to the `.gitignore`. \n",
"\n",
"In this kind of file you can store information that is not supposed to be published.\n",
"With the `dotenv` package you can read `.env` files and access the variables defined there.\n",
"The file was 'force added' to the repo using the `git add -f .env` command. Please follow the instructions inside the `.env` file to ensure you have the right credentials inside. \n",
"\n",
"Then, run the following code cell (no need to edit): "
]
},
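{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration, a `.env` file could look like this (all values below are made-up placeholders, not real credentials):\n",
"\n",
"```\n",
"DATABASE=my_database\n",
"USER_DB=my_user\n",
"PASSWORD=my_secret_password\n",
"HOST=db.example.com\n",
"PORT=5432\n",
"```"
]
},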
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os # provides functions for interacting with operating system\n",
"from dotenv import load_dotenv # reads key-value pairs from a .env file and can set them as environment variables\n",
"\n",
"load_dotenv() # takes environment variables from .env\n",
"\n",
"DATABASE = os.getenv('DATABASE')\n",
"USER_DB = os.getenv('USER_DB')\n",
"PASSWORD = os.getenv('PASSWORD')\n",
"HOST = os.getenv('HOST')\n",
"PORT = os.getenv('PORT')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The function from the psycopg2 package to create a connection is called `connect()`. \n",
"\n",
"`connect()` expects the parameters listed above as input in order to connect to the database. \n",
"\n",
">**Note**: If you edited your `.env` file correctly but still get an error when trying to connect, \"Restart\" your kernel. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Create / open connection object conn (no need to edit code)\n",
"conn = psycopg2.connect(\n",
" database=DATABASE,\n",
" user=USER_DB,\n",
" password=PASSWORD,\n",
" host=HOST,\n",
" port=PORT\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving data from the database\n",
"\n",
"Before we can use our connection to get data, we have to create a cursor. A cursor allows Python code to execute PostgreSQL commands in a database session.\n",
"A cursor is created with the `cursor()` method of our connection object `conn`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cur = conn.cursor() # create cursor for our opened connection in object conn"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can run SQL queries with `cur.execute('QUERY')` and then call `cur.fetchall()` to get the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cur.execute('SELECT * FROM datasets.kaggle_survey LIMIT 10') # executes given SQL query\n",
"cur.fetchall() # gets data called by query"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"With `conn.close()` you can close the connection again."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# close the connection\n",
"conn.close()"
]
},
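{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note (not needed for this exercise), `psycopg2` connections and cursors can also be used as context managers; a sketch of that pattern is shown below. Be aware that in `psycopg2`, exiting a `with conn:` block commits or rolls back the transaction but does **not** close the connection:\n",
"\n",
"```python\n",
"with psycopg2.connect(database=DATABASE, user=USER_DB, password=PASSWORD, host=HOST, port=PORT) as conn:\n",
"    with conn.cursor() as cur:  # the cursor is closed automatically on exit\n",
"        cur.execute('SELECT * FROM datasets.kaggle_survey LIMIT 10')\n",
"        rows = cur.fetchall()\n",
"conn.close()  # still needed: the with-block only ends the transaction\n",
"```"
]
},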
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"But we want to work with the data. The easiest way is to import it into a pandas dataframe. We can use `pd.read_sql_query`, `pd.read_sql_table`, or, for convenience, `pd.read_sql`.\n",
"\n",
"This function is a convenience wrapper around `read_sql_table` and `read_sql_query` (for backward compatibility). It will delegate to the specific function depending on the provided input: a SQL query will be routed to `read_sql_query`, while a database table name will be routed to `read_sql_table`. Note that the delegated functions might have more specific notes about their functionality which are not listed here. Find more in the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html).\n",
"\n",
">**Note**: Newer pandas versions may emit a `UserWarning` when `read_sql` is given a raw DBAPI connection (like the one psycopg2 creates) instead of a SQLAlchemy connectable. The query still runs; using a SQLAlchemy engine avoids the warning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Open connection again because we closed it (no need to edit code)\n",
"conn = psycopg2.connect(\n",
" database=DATABASE,\n",
" user=USER_DB,\n",
" password=PASSWORD,\n",
" host=HOST,\n",
" port=PORT\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import the data into a pandas dataframe\n",
"query_string = \"SELECT * FROM datasets.kaggle_survey LIMIT 10\" # define SQL query\n",
"df_psycopg = pd.read_sql(query_string, conn) # read queried data from SQL database into pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# close the connection\n",
"conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_psycopg.head() # look at first five lines of dataframe"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SQLAlchemy"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`sqlalchemy` works similarly. Here you have to create an engine from a database string (a URL that contains all the information we passed to the `conn` object)."
]
},
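{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For PostgreSQL, such a database string follows this pattern (the values below are placeholders):\n",
"\n",
"```\n",
"DB_STRING=postgresql://<user>:<password>@<host>:<port>/<database>\n",
"```"
]
},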
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import create_engine # for creating an engine\n",
"\n",
"# read the database string DB_STRING from the .env file\n",
"load_dotenv()\n",
"\n",
"DB_STRING = os.getenv('DB_STRING') # gets database string DB_STRING from .env file and assigns it as value for new variable DB_STRING\n",
"\n",
"db = create_engine(DB_STRING) # creates engine from database string DB_STRING"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you can use that engine, together with a query, to read the data into a pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd # if not done already"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import the data into a pandas dataframe\n",
"query_string = \"SELECT * FROM datasets.kaggle_survey\" # write SQL-query into variable query_string\n",
"df_sqlalchemy = pd.read_sql(query_string, db) # read queried data from SQL database into pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_sqlalchemy.head() # look at first five lines of dataframe"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we don't want to run the queries over and over again, we can export the data to a CSV file and import that file into our main Jupyter notebook: [Visualisation_Exercise](https://github.com/neuefische/ds-visualisation/blob/main/2_Visualisation_Exercise.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# export the data to a CSV file\n",
"df_sqlalchemy.to_csv('kaggle_survey.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "d7a35bba081246a577863ed5357213b0bf3e2936bc08045816acb79d76e359dd"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}