Merged
7 changes: 7 additions & 0 deletions .gitignore
@@ -84,6 +84,13 @@ target/
profile_default/
ipython_config.py

# Ignore Mac DS_Store files
.DS_Store

#data files
*.gz
*.csv

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
7 changes: 6 additions & 1 deletion binder/environment.yml
@@ -7,4 +7,9 @@ dependencies:
# JupyterLab extensions
- jupyterlab>=3
- dask-labextension
- ipywidgets
- ipywidgets
- graphviz
- python-graphviz
- scikit-learn
- dask-ml
- coiled
Empty file added data/.gitkeep
Empty file.
164 changes: 164 additions & 0 deletions notebooks/0_Dask_what_and_when.ipynb
@@ -0,0 +1,164 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fa397203",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal_no_pad.svg\"\n",
" width=\"30%\"\n",
" alt=\"Dask logo\\\" />\n",
"\n",
"\n",
"# What is it and when to use it? \n",
"\n",
"\n",
"If you ever heard of Dask you might have some form of these questions. If you have never heard of Dask but you want to know what it is and when/if you should use it, then you are in the right place. \n",
"\n",
"Before we give a short overview and attempt to answer these questions, we strongly recommend you to check the amazing documentation that the Dask community has in place. \n",
"\n",
"- Documentation: https://docs.dask.org\n",
"\n",
"Contribute to the project:\n",
"\n",
"- Github: https://github.com/dask/dask\n",
"\n",
"Engage with the community:\n",
"\n",
"- Slack: https://dask.slack.com/"
]
},
{
"cell_type": "markdown",
"id": "5c317728",
"metadata": {},
"source": [
"### What is Dask? \n",
"\n",
"Dask is a flexible library for parallel computing in Python, that follows the syntax of the PyData ecosystem. If you are familiar with Numpy, pandas and scikit-learn then think of Dask as their faster cousin. For example:\n",
"\n",
"```python\n",
"import pandas as pd import dask.dataframe as dd\n",
"df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv')\n",
"df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()\n",
"```\n",
"\n",
" Since they are all family, Dask allows you to scale your existing workflows with a small amount of changes. Dask enables you to accelerate computations and perform those that don't fit in memory. It works in your laptop but it also scales out to large clusters while providing a dashboard with great diagnostic tools. "
]
},
{
"cell_type": "markdown",
"id": "c70d4db3",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask-overview.svg\" \n",
" width=\"75%\"\n",
" alt=\"Dask overview\\\" />"
]
},
{
"cell_type": "markdown",
"id": "3a8359e2",
"metadata": {},
"source": [
"### Dask jurgon: Client, Scheduler and Workers \n",
"\n",
"- Client: The user-facing entry point for cluster users. In other words, the client lives where your python code lives, and it communicates to the scheduler, passing along the tasks to be executed.\n",
"- Scheduler: The task manager, it sends the tasks to the workers.\n",
"- Workers: The ones that compute the tasks.\n",
"\n",
"Note: The Scheduler and the Workers are on the same network, they could live in your laptop or on a separate cluster\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/coiled/pydata-global-dask/master/images/dask-cluster.svg\"\n",
" width=\"75%\"\n",
" alt=\"Dask cluster\\\">"
]
},
{
"cell_type": "markdown",
"id": "d78d7794",
"metadata": {
"tags": []
},
"source": [
"## When to use Dask?\n",
"\n",
"Before trying to use Dask, there are some questions to determine if Dask might be suitable for you. \n",
"\n",
"- Does your data fit in memory? \n",
" - Yes: Use pandas or numpy. \n",
" - No : Dask might be able to help. \n",
"- Do your computations take for ever?\n",
" - Yes: Dask might be able to help. \n",
" - No : Awesome.\n",
"- Do you have embarrassingly parallelizable code?\n",
" - Yes: Dask might be able to help.\n",
" - No?: If you are not sure here are some [examples](https://examples.dask.org/applications/embarrassingly-parallel.html) \n",
" - No: I'm sorry, although Dask might have some hope for you.\n",
" \n",
" \n",
"**Bottom Left:** You don't need Dask. \n",
"**Elsewhere:** Dask fair game.\n",
"\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/dask/dask-ml/main/docs/source/images/dimensions_of_scale.svg\"\n",
" width=\"65%\"\n",
" alt=\"Dask zones\">\n",
"\n",
"\n",
"**Disclaimers:**\n",
"\n",
"1. When we say \"Dask might be able to help\" it is because you should try first to accelerate your code with Numpy and or Numba, checking types used on your Dataframes, and then maybe consider Dask. Now even when using Dask, we can't guarantee that things will be faster, it depends on what is the code behind. \n",
"\n",
"2. Even when you have large datasets, at some point you want to double check if you have reduced things to a manageable level where going back to pandas or Numpy might be the best call.\n",
"\n",
"**Best practices:**\n",
"\n",
"The learning curve to use Dask can be a bit intimidating, that's why we want to point you out to some best practices links that will make the process smoother. We will go over some of these topics but we want to leave here these links for future reference\n",
"\n",
"- Are you working with arrays? Check this [array best practices](https://docs.dask.org/en/latest/array-best-practices.html)\n",
"- Dealing with DataFrames? Check this [DataFrames best practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)\n",
"- Are you trying to accelerate your code using `delayed`? Check this [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)\n",
"- For overall good practices check [Dask good practices](https://docs.dask.org/en/latest/best-practices.html)"
]
},
{
"cell_type": "markdown",
"id": "ea50907f",
"metadata": {},
"source": [
"## Why Dask? \n",
"\n",
"If you are interested in knowing why Dask might be a good option for you we recommend you to check the Dask documentation [Why Dask?](https://docs.dask.org/en/latest/why.html)\n",
"\n",
"But if you are already convinced that Dask is right for you and/or want to learn more about it. The topics that we will cover on this mini-tutorial are:\n",
"\n",
"1. Dask Delayed: How to parallelize existing Python code and your custom algorithms. \n",
"2. Schedulers: Single Machine vs Distributed, and the Dashboard. \n",
"3. From pandas to Dask: How to manipulate bigger-than-memory DataFrames using Dask. \n",
"4. Dask-ML: Scalable machine learning using Dask. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
742 changes: 742 additions & 0 deletions notebooks/1_Delayed.ipynb

Large diffs are not rendered by default.