diff --git a/boolean-data.ipynb b/boolean-data.ipynb index 931bbb2..d389d25 100644 --- a/boolean-data.ipynb +++ b/boolean-data.ipynb @@ -714,7 +714,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/categorical-data.ipynb b/categorical-data.ipynb index 5796a26..ab6ba30 100644 --- a/categorical-data.ipynb +++ b/categorical-data.ipynb @@ -378,7 +378,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/command-line.md b/command-line.md index 1079842..f9eeb20 100644 --- a/command-line.md +++ b/command-line.md @@ -1,6 +1,6 @@ # The Command Line {#sec-command-line} -In this chapter, you'll meet the *command line* and learn how to use it. Beyond a few key commands like `uv add ` you don't strictly need to know how to use the command line to follow the rest of this book. However, even a tiny bit of knowledge of the command line goes a long way in coding and will serve you well. +In this chapter, you'll meet the _command line_ and learn how to use it. Beyond a few key commands like `uv add ` you don't strictly need to know how to use the command line to follow the rest of this book. However, even a tiny bit of knowledge of the command line goes a long way in coding and will serve you well. To try out any of the commands in this chapter on your machine, you can select 'New Terminal' from the menu bar in Visual Studio Code (Mac and Linux), use the Windows Subsystem for Linux or git bash (Windows), or use a free [online terminal](https://cocalc.com/doc/terminal.html). @@ -8,17 +8,17 @@ This chapter has benefited from numerous sources, including absolutely excellent ## What is the command line? 
-The command line is a way to directly issue text-based commands to a computer one line at a time (as distinct from a graphical user interface, or GUI, that you navigate with a mouse). It goes under many names: shell, bash, terminal, CLI, and command line. These are actually different things but most people tend to use them to mean the same thing most of the time. The *shell* is the part of an operating system that you interact with but mostly people use shell to mean the command line. *bash* is the programming language that is used in the command line; it's actually a synonym for 'Born Again SHell'. The *terminal* is sometimes used to refer to the command line on Macs. Finally, a *CLI* is just an acronym for command line interface, and is often used in the context of an application; for example, uv has a command line interface because you run it on the command line to install packages (`uv add packagename`). +The command line is a way to directly issue text-based commands to a computer one line at a time (as distinct from a graphical user interface, or GUI, that you navigate with a mouse). It goes under many names: shell, bash, terminal, CLI, and command line. These are actually different things but most people tend to use them to mean the same thing most of the time. The _shell_ is the part of an operating system that you interact with but mostly people use shell to mean the command line. _bash_ is the programming language that is used in the command line; it's actually an acronym for 'Bourne Again SHell'. The _terminal_ is sometimes used to refer to the command line on Macs. Finally, a _CLI_ is just an acronym for command line interface, and is often used in the context of an application; for example, uv has a command line interface because you run it on the command line to install packages (`uv add packagename`). It's worth mentioning that there's a big difference between the command line on UNIX based systems (MacOS and Linux), and on Windows systems. 
Here, we'll only address the UNIX version. There is a command line on Windows but it's not widely used for coding. If you're on a Windows machine, you can access a UNIX command line using the Windows Subsystem for Linux. ## Why is the command line useful? -The command line has many uses. Graphical user interfaces are, generally, a bit easier to use *but* they're not very repeatable or scalable. Because the command line uses text-based instructions and can be programmed, it is both repeatable and scalable; properties that are very useful for research and analysis. +The command line has many uses. Graphical user interfaces are, generally, a bit easier to use _but_ they're not very repeatable or scalable. Because the command line uses text-based instructions and can be programmed, it is both repeatable and scalable; properties that are very useful for research and analysis. The broad reasons you might use the command line to issue instructions include: -- software functionality: some software *only* has a command line interface +- software functionality: some software _only_ has a command line interface - efficiency: your computer has limited memory, which graphical user interfaces use a lot of—the command line uses less @@ -71,7 +71,7 @@ The flags or options, such as `-n` in the example above, typically begin with a Spaces take on a special role when using the command line. For this reason, it's good practice to avoid spaces in file names. If you need to refer to a filename with spaces in, you’ll need to use quotes or escape the spaces in the file names using a `\`, for example `this is my file.txt` becomes `this\ is\ my\ file.txt` ::: -To run programmes from the command line, all you need is the name of the programme as the command: in fact, commands *are* programmes. The `date` command refers to an actual programme on your computer that you can find. And this also explains a bit of what's going on when you *run a script from the command line* (more on that later). 
+To run programmes from the command line, all you need is the name of the programme as the command: in fact, commands _are_ programmes. The `date` command refers to an actual programme on your computer that you can find. And this also explains a bit of what's going on when you _run a script from the command line_ (more on that later). Once you've run a few commands, you'll notice that you can't navigate around the command line like you can a text file or Python script. Here are some tips for navigating the command line: @@ -93,11 +93,11 @@ Once you've run a few commands, you'll notice that you can't navigate around the ### Navigating directories -While we're on navigating, it's useful to understand *where* in the computer you are when you open the command line. If you open a terminal pane within VS Code, you will start (by default at least) within the same folder as your project. Starting a terminal instance outside of VS Code will get you a terminal in a root directory for your computer; for example, on a Mac, opening a new terminal window starts you in `/Users/yourusername/`. +While we're on navigating, it's useful to understand _where_ in the computer you are when you open the command line. If you open a terminal pane within VS Code, you will start (by default at least) within the same folder as your project. Starting a terminal instance outside of VS Code will get you a terminal in a root directory for your computer; for example, on a Mac, opening a new terminal window starts you in `/Users/yourusername/`. To find out "where" you are when you open a terminal, you can use the `pwd` command, which stands for "print working directory". -The table below shows some useful commands for moving around your computer using the command line. Note that `cd` accepts a location *relative* to your current directory. +The table below shows some useful commands for moving around your computer using the command line. 
Note that `cd` accepts a location _relative_ to your current directory. | Command | What it does | | --------------------- | ------------------------------------------------------------ | @@ -124,7 +124,7 @@ Say you have a script called `analysis.py`, you can run it with Python on the co uv run python analysis.py ``` -which calls Python as a programme and gives it `analysis.py` as the argument. If you have multiple versions of Python, which you should do if you're following best practice and using a version per project, then you can see *which* version of Python is being used with +which calls Python as a programme and gives it `analysis.py` as the argument. If you have multiple versions of Python, which you should do if you're following best practice and using a version per project, then you can see _which_ version of Python is being used with ```bash which python @@ -134,39 +134,39 @@ which python Now we'll see some useful commands for the terminal. - | Command                                          | What it does | - | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | - | `man ` | Shows a manual for the given command | - | `touch ` | Creates an empty file named `` | - | `code ` | Open a file in VS Code (creating it, if it does not exist) | - | `mkdir ` | creates a new folder called `foldername` | - | `echo ` | Prints `` | - | `cat ` | Print the full contents of `` | - | `head ` | Print the start of a file | - | `tail ` | Print the end of a file | - | `> ` | 
Redirects output from screen to ``. For example, `echo "Hello World" > hello.txt` | - | `>> ` | Redirects output from screen to the end of ``, ie appends output rather than overwrites it | - | ` | ` | The pipe symbol: uses output from one command as input into another. For example, `head -n 10 data.csv | > hello_world.txt` would write the first 10 lines of data.csv into a file called hello_world.txt | - | `less ` | Print out the contents of a file in paginated form. Use `ctrl+v` and `Alt+v` (or `⌘+v` and `⌥+v` on Mac) to move up and down. Press `q` to quit. | - | `wc -l` | Returns number of lines in input, for example `cat | wc -l`. Use `wc` alone for word count. | - | `sort` | Arrange lines in a file in alphabetical order | - | `uniq` | Remove duplicate lines from input, for example `cat | uniq` or `uniq -d` to show duplicate files | - | `mv` | Move or rename a file; for example, `mv file1 file2` would rename `file1` to `file2` while `mv file1 ~` would move `file1` to the home directory | - | `cp` | Copy a file; for example, `cp file1 file2` would copy `file1` to `file2` while `cp file1 ~` would make a copy of `file1` in the home directory | - | `rm ` | Permanently remove a file | - | `rmdir ` | Permanently remove an empty directory | - | `rm -rf ` | ⚠ Permanently remove everything in a directory ⚠ | - | `grep ` | Search for a given term, for example `cat hello_world.txt | grep world` | - | `ls` | Basically, this means list stuff (files and folders) in the current directory | - | `ls -a` | List stuff in the current directory even if it's hidden | - | `ls -l` | List stuff in a more readable format and show permissions | - | `ls -S` | List stuff by size | - | `file ` | Give information on the file type of `` | - | `find` | Find specific files on your computer, can be piped into other commands for example `find *.md -size +5k -type f | xargs wc -l` will count the number of lines `wc -l` of all files, `-type f`, ending in `.md` that are greater than 5 kilobytes in 
size, `-size +5k`. | - | `diff -u ` | Show a single summary of the differences between two files. | + | Command                                          | What it does | + | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | + | `man ` | Shows a manual for the given command | + | `touch ` | Creates an empty file named `` | + | `code ` | Open a file in VS Code (creating it, if it does not exist) | + | `mkdir ` | creates a new folder called `foldername` | + | `echo ` | Prints `` | + | `cat ` | Print the full contents of `` | + | `head ` | Print the start of a file | + | `tail ` | Print the end of a file | + | `> ` | Redirects output from screen to ``. For example, `echo "Hello World" > hello.txt` | + | `>> ` | Redirects output from screen to the end of ``, ie appends output rather than overwrites it | + | ` | ` | The pipe symbol: uses output from one command as input into another. For example, `head -n 10 data.csv | > hello_world.txt` would write the first 10 lines of data.csv into a file called hello_world.txt | + | `less ` | Print out the contents of a file in paginated form. Use `ctrl+v` and `Alt+v` (or `⌘+v` and `⌥+v` on Mac) to move up and down. Press `q` to quit. | + | `wc -l` | Returns number of lines in input, for example `cat | wc -l`. Use `wc` alone for word count. 
| + | `sort` | Arrange lines in a file in alphabetical order | + | `uniq` | Remove duplicate lines from input, for example `cat | uniq` or `uniq -d` to show duplicate lines | + | `mv` | Move or rename a file; for example, `mv file1 file2` would rename `file1` to `file2` while `mv file1 ~` would move `file1` to the home directory | + | `cp` | Copy a file; for example, `cp file1 file2` would copy `file1` to `file2` while `cp file1 ~` would make a copy of `file1` in the home directory | + | `rm ` | Permanently remove a file | + | `rmdir ` | Permanently remove an empty directory | + | `rm -rf ` | ⚠ Permanently remove everything in a directory ⚠ | + | `grep ` | Search for a given term, for example `cat hello_world.txt | grep world` | + | `ls` | Basically, this means list stuff (files and folders) in the current directory | + | `ls -a` | List stuff in the current directory even if it's hidden | + | `ls -l` | List stuff in a more readable format and show permissions | + | `ls -S` | List stuff by size | + | `file ` | Give information on the file type of `` | + | `find` | Find specific files on your computer, can be piped into other commands for example `find *.md -size +5k -type f | xargs wc -l` will count the number of lines `wc -l` of all files, `-type f`, ending in `.md` that are greater than 5 kilobytes in size, `-size +5k`. | + | `diff -u ` | Show a single summary of the differences between two files. | ![More details of the grep command](https://pbs.twimg.com/media/DcPeD_CW0AEkSar?format=jpg&name=small) -*More details of the grep command, by [\@b0rk](https://twitter.com/b0rk).* +_More details of the grep command, by [\@b0rk](https://twitter.com/b0rk)._ You can write for loops in bash (remember, it's a language). The general structure is @@ -187,7 +187,7 @@ A more interesting example is giving the number of lines of text, number of word ```bash for i in $(ls *.csv) -do +do wc $i done ``` @@ -204,7 +204,7 @@ done A couple of new features appeared in the examples above. 
-`*` is a *wildcard character*, it tells bash to look for anything that ends in ".csv". This is not the only special case; `?` serves a similar purpose of standing in for any character but just *one* character rather than arbitrarily many. If you had a folder with `file1.csv`, `file2.csv`, etc., up to 9, then you could use `file?.csv` to refer to all of them but this would not pick up `file10.csv`. +`*` is a _wildcard character_, it tells bash to look for anything that ends in ".csv". This is not the only special case; `?` serves a similar purpose of standing in for any character but just _one_ character rather than arbitrarily many. If you had a folder with `file1.csv`, `file2.csv`, etc., up to 9, then you could use `file?.csv` to refer to all of them but this would not pick up `file10.csv`. Another special character we've already seen is the curly brace, `{}`. Whenever you have a common substring in a series of commands using curly braces tells the command line to expand what's in them automatically. In an example above, this is used on 1 to 5. But it can also be used in, for example, file names: @@ -272,7 +272,7 @@ You can find more of these special variables [here](https://tldp.org/LDP/abs/htm [**pandoc**](https://pandoc.org/) is absolutely brilliant: if you need to convert files containing text from one format to another, it really is a swiss-army knife. There isn't space here to list the ridiculous number of documents it can convert between, but, importantly, it can translate back and forth between all of the following: markdown, $\LaTeX$, Microsoft Word's docx, OpenOffice's ODT, HTML, and Jupyter Notebook. -It can also write from any of those formats (and more) in one direction *to* PDF, Microsoft Powerpoint, and $\LaTeX$ Beamer. +It can also write from any of those formats (and more) in one direction _to_ PDF, Microsoft Powerpoint, and $\LaTeX$ Beamer. 
To use **pandoc**, install it following the instructions on the website and then call it like this: @@ -284,9 +284,9 @@ This is an example where the input is a .tex document and the output, `-o`, is a You can get quite fancy with **pandoc**, for example you can translate a whole book's worth of latex into a Word doc complete with a Word style, a bibliography via biblatex, equations, and figures. Nothing can save Word from being painful to use, but **pandoc** certainly helps. -[**eza**](https://eza.rocks/) is an upgrade on the `ls` command. It is designed to be an improved file lister with more features and better defaults. It uses colours to distinguish file types and metadata. Follow the instructions on the website to install it on your operating system. To replace `ls` with `eza`, you can use a terminal *alias*. There's a good guide [available here](https://denisrasulev.medium.com/eza-the-best-ls-command-replacement-9621252323e). +[**eza**](https://eza.rocks/) is an upgrade on the `ls` command. It is designed to be an improved file lister with more features and better defaults. It uses colours to distinguish file types and metadata. Follow the instructions on the website to install it on your operating system. To replace `ls` with `eza`, you can use a terminal _alias_. There's a good guide [available here](https://denisrasulev.medium.com/eza-the-best-ls-command-replacement-9621252323e). -**nano** is a built-in text editor that runs *within* the terminal. This can be really useful if you're working on the cloud (but it's not got the rich features of a GUI-based text editor like VS Code). To open a file using **nano**, the command is `nano file.txt`. Nano displays instructions on how to navigate when it loads up but exiting is the hardest part: when you're done, hit `Ctrl+X`, then `y` to save, and then `enter` to exit. +**nano** is a built-in text editor that runs _within_ the terminal. 
This can be really useful if you're working on the cloud (but it's not got the rich features of a GUI-based text editor like VS Code). To open a file using **nano**, the command is `nano file.txt`. Nano displays instructions on how to navigate when it loads up but exiting is the hardest part: when you're done, hit `Ctrl+X`, then `y` to save, and then `enter` to exit. [**wget**](https://www.gnu.org/software/wget/) is a command-line utility for downloading files from the internet. It's very simple to use, the syntax is just `wget [options] [url]`. For example, to download the starwars csv file used in this book, the command is diff --git a/communicate-plots.ipynb b/communicate-plots.ipynb index fc14db8..0f0f567 100644 --- a/communicate-plots.ipynb +++ b/communicate-plots.ipynb @@ -1346,7 +1346,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/data-import.ipynb b/data-import.ipynb index c240a16..9852811 100644 --- a/data-import.ipynb +++ b/data-import.ipynb @@ -414,7 +414,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/data-tidy.ipynb b/data-tidy.ipynb index e0ba7a7..157081f 100644 --- a/data-tidy.ipynb +++ b/data-tidy.ipynb @@ -485,7 +485,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/data-transform.ipynb b/data-transform.ipynb index 83588d0..8dfc2c1 100644 --- a/data-transform.ipynb +++ b/data-transform.ipynb @@ -1104,7 +1104,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/data-visualise.ipynb b/data-visualise.ipynb index 488a1db..8f3d721 100644 --- a/data-visualise.ipynb +++ 
b/data-visualise.ipynb @@ -1119,7 +1119,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/data/bake_sale.xlsx b/data/bake_sale.xlsx index e122900..5de6f9a 100644 Binary files a/data/bake_sale.xlsx and b/data/bake_sale.xlsx differ diff --git a/databases.ipynb b/databases.ipynb index 661cfd1..b183959 100644 --- a/databases.ipynb +++ b/databases.ipynb @@ -797,7 +797,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/dates-and-times.ipynb b/dates-and-times.ipynb index 4fe0c68..47b53ad 100644 --- a/dates-and-times.ipynb +++ b/dates-and-times.ipynb @@ -1103,7 +1103,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/exploratory-data-analysis.ipynb b/exploratory-data-analysis.ipynb index 2bf323b..071a96b 100644 --- a/exploratory-data-analysis.ipynb +++ b/exploratory-data-analysis.ipynb @@ -1119,7 +1119,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/functions.ipynb b/functions.ipynb index 7d55cbc..5ed6b2a 100644 --- a/functions.ipynb +++ b/functions.ipynb @@ -506,7 +506,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/introduction.ipynb b/introduction.ipynb index 60bfa83..04c2ff9 100644 --- a/introduction.ipynb +++ b/introduction.ipynb @@ -187,7 +187,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/iteration.ipynb b/iteration.ipynb index fc1452a..b44f79a 
100644 --- a/iteration.ipynb +++ b/iteration.ipynb @@ -21,7 +21,7 @@ "\n", "One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated. Another tool for reducing duplication is *iteration*, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.\n", "\n", - "In this chapter you'll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for **pandas** data frames." + "In this chapter you'll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for **polars** data frames." ] }, { @@ -51,7 +51,7 @@ "source": [ "### Prerequisites\n", "\n", - "This chapter will use the **pandas** data analysis package." + "This chapter will use the **polars** data analysis package." ] }, { @@ -452,11 +452,13 @@ "id": "5ec0643e", "metadata": {}, "source": [ - "## Iteration with **pandas** Data Frames\n", + "## Iteration with **polars** Data Frames\n", "\n", - "For loops, while loops, and comprehensions all work on **pandas** data frames, but they are generally a bad way to get things done because they are slow and not memory efficient. To aid cases where iteration is needed, **pandas** has built-in methods for iteration depending on what you need to do.\n", + "For loops, while loops, and comprehensions can be used with data frames, but in **Polars**, they are even more strongly discouraged than in pandas. 
**Polars** is built on a columnar, vectorized, and expression-based engine, so row-by-row iteration breaks performance and prevents optimizations.\n", "\n", "These built-in methods for iteration have an overlap with what we've seen in @sec-data-transform but we'll dig a little deeper into `assign()`/assignment operations, `apply()`, and `eval()` here.\n", + "\n", + "Instead of iterating, **Polars** encourages you to use expressions and lazy evaluation, which are much faster and more memory efficient.\n", "\n" ] }, @@ -480,9 +482,9 @@ "outputs": [], "source": [ "import numpy as np\n", - "import pandas as pd\n", + "import polars as pl\n", "\n", - "df = pd.DataFrame(np.random.normal(size=(6, 4)), columns=[\"a\", \"b\", \"c\", \"d\"])\n", + "df = pl.DataFrame(np.random.normal(size=(6, 4)), schema=[\"a\", \"b\", \"c\", \"d\"])\n", "df" ] }, @@ -491,7 +493,7 @@ "id": "0e826ad5", "metadata": {}, "source": [ - "**pandas** has many built-in functions that are already built to iterate over rows and columns; for example, to compute the median of rows or columns respectively:" + "**polars** has built-in expressions designed to operate over columns and rows. For example, to compute the median:" ] }, { @@ -501,7 +503,7 @@ "metadata": {}, "outputs": [], "source": [ - "df.median(axis=\"rows\") # can also use axis=1" + "df.select(pl.all().median())" ] }, { @@ -511,7 +513,7 @@ "metadata": {}, "outputs": [], "source": [ - "df.median(axis=\"columns\") # can also use axis=0" + "df.select(pl.concat_list(pl.all()).list.median().alias(\"row_median\"))" ] }, { @@ -535,7 +537,7 @@ "def add_five_slow(df):\n", " for i in range(len(df)):\n", " for j in range(len(df.columns)):\n", - " df.iloc[i, j] = df.iloc[i, j] + 5\n", + " df[i, j] = df[i, j] + 5\n", "\n", "\n", "%timeit add_five_slow(df)" @@ -546,7 +548,7 @@ "id": "8246132e", "metadata": {}, "source": [ - "But to do this, every individual cell must be accessed and operated on—so it is very slow, taking milliseconds. 
**pandas** has far faster ways of performing the same operation. For simple operations on data frames with consistent type, you can simply add five to the whole data frame:" + "But to do this, every individual cell must be accessed and operated on—so it is very slow, taking milliseconds. **polars** has far faster ways of performing the same operation. For simple operations on data frames with consistent type, you can simply add five to the whole data frame:" ] }, { @@ -572,9 +574,9 @@ "id": "7313616e", "metadata": {}, "source": [ - "This also works on a per column basis, so you can do `df[\"a\"] = df[\"a\"] + 5` and so on.\n", + "This also works on a per column basis, so you can do `df.with_columns(pl.col(\"a\") + 5)` and so on.\n", "\n", - "These operations have equivalents using the `assign()` operator, which allows for *method chaining*; stringing multiple operations together. The `assign()` operator version of `df[\"new_a\"] = df[\"a\"] + 5` would be" + "These operations have equivalents using method chaining; stringing multiple operations together. The version of `df.with_columns(new_a = pl.col(\"a\") + 5)` would be:" ] }, { @@ -584,7 +586,7 @@ "metadata": {}, "outputs": [], "source": [ - "df = df.assign(new_a=lambda x: x[\"a\"] + 5)" + "df = df.with_columns(new_a=pl.col(\"a\") + 5)" ] }, { @@ -592,9 +594,11 @@ "id": "76aec162", "metadata": {}, "source": [ - "### Apply\n", + "### Expressions (Polars' Alternative to apply)\n", + "\n", + "What happens if you have a more complicated operation you want to perform? In pandas, you might reach for `apply()`. In **polars**, you almost never need an equivalent because its expression API is incredibly expressive.\n", "\n", - "What happens if you have a more complicated function you want to iterate over? This is where **pandas**' `apply()` comes in, and can be used with assignment. `apply()` can also be used across rows or columns. 
Like `assign()`, it can be combined with a lambda function and used with either the whole data frame or just a column (in which case no need to specify `axis=`)." + "Most \"complicated\" operations can be expressed directly using **polars'** built-in expressions:" ] }, { @@ -604,7 +608,16 @@ "metadata": {}, "outputs": [], "source": [ - "df.apply(lambda x: x[\"a\"] - x[\"new_a\"].mean() * x[\"c\"] / x[\"b\"], axis=1)" + "# Don't do this (slow, row-wise)\n", + "mean_new_a = df.select(pl.col(\"new_a\").mean()).item()\n", + "df.with_columns(\n", + " result=pl.struct([\"a\", \"b\", \"c\"]).map_elements(\n", + " lambda x: x[\"a\"] - mean_new_a * x[\"c\"] / x[\"b\"], return_dtype=pl.Float64\n", + " )\n", + ")\n", + "\n", + "# Do this instead (fast, vectorized)\n", + "df.with_columns(result=pl.col(\"a\") - pl.col(\"new_a\").mean() * pl.col(\"c\") / pl.col(\"b\"))" ] }, { @@ -612,24 +625,7 @@ "id": "78b558f4", "metadata": {}, "source": [ - "Note that this is just an example: you could still do this entire operation without using apply! But you will sometimes find yourself with cases where you do need to use it.\n", - "\n", - "Apply also works with functions, including user-defined functions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "136d435d", - "metadata": {}, - "outputs": [], - "source": [ - "def complicated_function(x):\n", - " return x - x.mean()\n", - "\n", - "\n", - "df = df.apply(complicated_function, axis=1)\n", - "df" + "The first expression would work, but it evaluates the computation row by row using a **python lambda**, which is slow and prevents **polars** from optimizing the query. The second approach uses native expressions, allowing **polars** to execute the computation efficiently in a fully vectorized and optimized manner." ] }, { @@ -637,9 +633,7 @@ "id": "171be2c9", "metadata": {}, "source": [ - "### Eval(uate)\n", - "\n", - "`eval()` evaluates a string describing operations on DataFrame columns to create new columns. 
It operates on columns only, not rows or elements. Here's an example:" + "In **polars**, there's no `eval()` — you use expressions directly instead:\n" ] }, { @@ -649,7 +643,7 @@ "metadata": {}, "outputs": [], "source": [ - "df[\"ratio\"] = df.eval(\"a / new_a\")\n", + "df = df.with_columns((pl.col(\"a\") / pl.col(\"new_a\")).alias(\"ratio\"))\n", "df" ] }, @@ -658,7 +652,18 @@ "id": "8b275b5b", "metadata": {}, "source": [ - "Evaluate can also be used to create new boolean columns using, for example, a string `\"a > 0.5\"` in the above example." + "You can also create boolean columns the same way:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f301c8cb", + "metadata": {}, + "outputs": [], + "source": [ + "df = df.with_columns((pl.col(\"a\") > 0.5).alias(\"a_gt_0.5\"))\n", + "df" ] } ], @@ -687,7 +692,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/joins.ipynb b/joins.ipynb index e58d2d3..d494d84 100644 --- a/joins.ipynb +++ b/joins.ipynb @@ -255,7 +255,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/missing-values.ipynb b/missing-values.ipynb index 262f015..67e95ef 100644 --- a/missing-values.ipynb +++ b/missing-values.ipynb @@ -596,7 +596,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/numbers.ipynb b/numbers.ipynb index 388db20..0b96fbb 100644 --- a/numbers.ipynb +++ b/numbers.ipynb @@ -792,7 +792,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/prerequisites.ipynb b/prerequisites.ipynb index 68bea6d..07119a8 100644 --- a/prerequisites.ipynb +++ 
b/prerequisites.ipynb @@ -305,7 +305,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.0" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/rectangling.ipynb b/rectangling.ipynb index e565c1d..2ddf375 100644 --- a/rectangling.ipynb +++ b/rectangling.ipynb @@ -628,7 +628,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/regex.ipynb b/regex.ipynb index a8055fc..520a8c8 100644 --- a/regex.ipynb +++ b/regex.ipynb @@ -233,7 +233,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" } }, "nbformat": 4, diff --git a/spreadsheets.ipynb b/spreadsheets.ipynb index 1c1f378..6d62900 100644 --- a/spreadsheets.ipynb +++ b/spreadsheets.ipynb @@ -449,7 +449,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/strings.ipynb b/strings.ipynb index e401876..b1a00fe 100644 --- a/strings.ipynb +++ b/strings.ipynb @@ -1089,7 +1089,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/vis-layers.ipynb b/vis-layers.ipynb index ba22bb2..6a92e84 100644 --- a/vis-layers.ipynb +++ b/vis-layers.ipynb @@ -995,7 +995,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/webscraping-and-apis.ipynb b/webscraping-and-apis.ipynb index 5171a4d..802151d 100644 --- a/webscraping-and-apis.ipynb +++ b/webscraping-and-apis.ipynb @@ -566,7 +566,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, 
diff --git a/whole-game.ipynb b/whole-game.ipynb index eee6c2c..c0b6e5f 100644 --- a/whole-game.ipynb +++ b/whole-game.ipynb @@ -107,7 +107,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/workflow-basics.ipynb b/workflow-basics.ipynb index a0a7e47..64621de 100644 --- a/workflow-basics.ipynb +++ b/workflow-basics.ipynb @@ -528,7 +528,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/workflow-style.ipynb b/workflow-style.ipynb index 6921e86..3de114b 100644 --- a/workflow-style.ipynb +++ b/workflow-style.ipynb @@ -271,7 +271,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/workflow-writing-code.ipynb b/workflow-writing-code.ipynb index 28e2edd..bb442d2 100644 --- a/workflow-writing-code.ipynb +++ b/workflow-writing-code.ipynb @@ -150,7 +150,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.0" + "version": "3.12.13" } }, "nbformat": 4,