diff --git a/command-line.md b/command-line.md index 1079842..2cdce9b 100644 --- a/command-line.md +++ b/command-line.md @@ -1,6 +1,6 @@ # The Command Line {#sec-command-line} -In this chapter, you'll meet the *command line* and learn how to use it. Beyond a few key commands like `uv add ` you don't strictly need to know how to use the command line to follow the rest of this book. However, even a tiny bit of knowledge of the command line goes a long way in coding and will serve you well. +In this chapter, you'll meet the _command line_ and learn how to use it. Beyond a few key commands like `uv add ` you don't strictly need to know how to use the command line to follow the rest of this book. However, even a tiny bit of knowledge of the command line goes a long way in coding and will serve you well. To try out any of the commands in this chapter on your machine, you can select 'New Terminal' from the menu bar in Visual Studio Code (Mac and Linux), use the Windows Subsystem for Linux or git bash (Windows), or use a free [online terminal](https://cocalc.com/doc/terminal.html). @@ -8,17 +8,17 @@ This chapter has benefited from numerous sources, including absolutely excellent ## What is the command line? -The command line is a way to directly issue text-based commands to a computer one line at a time (as distinct from a graphical user interface, or GUI, that you navigate with a mouse). It goes under many names: shell, bash, terminal, CLI, and command line. These are actually different things but most people tend to use them to mean the same thing most of the time. The *shell* is the part of an operating system that you interact with but mostly people use shell to mean the command line. *bash* is the programming language that is used in the command line; it's actually a synonym for 'Born Again SHell'. The *terminal* is sometimes used to refer to the command line on Macs. 
Finally, a *CLI* is just an acronym for command line interface, and is often used in the context of an application; for example, uv has a command line interface because you run it on the command line to install packages (`uv add packagename`). +The command line is a way to directly issue text-based commands to a computer one line at a time (as distinct from a graphical user interface, or GUI, that you navigate with a mouse). It goes under many names: shell, bash, terminal, CLI, and command line. These are actually different things but most people tend to use them to mean the same thing most of the time. The _shell_ is the part of an operating system that you interact with but mostly people use shell to mean the command line. _bash_ is the programming language that is used in the command line; it's actually an acronym for 'Bourne Again SHell'. The _terminal_ is sometimes used to refer to the command line on Macs. Finally, a _CLI_ is just an acronym for command line interface, and is often used in the context of an application; for example, uv has a command line interface because you run it on the command line to install packages (`uv add packagename`). It's worth mentioning that there's a big difference between the command line on UNIX based systems (MacOS and Linux), and on Windows systems. Here, we'll only address the UNIX version. There is a command line on Windows but it's not widely used for coding. If you're on a Windows machine, you can access a UNIX command line using the Windows Subsystem for Linux. ## Why is the command line useful? -The command line has many uses. 
Graphical user interfaces are, generally, a bit easier to use _but_ they're not very repeatable or scalable. Because the command line uses text-based instructions and can be programmed, it is both repeatable and scalable; properties that are very useful for research and analysis. The broad reasons you might use the command line to issue instructions include: -- software functionality: some software *only* has a command line interface +- software functionality: some software _only_ has a command line interface - efficiency: your computer has limited memory, which graphical user interfaces use a lot of—the command line uses less @@ -71,7 +71,7 @@ The flags or options, such as `-n` in the example above, typically begin with a Spaces take on a special role when using the command line. For this reason, it's good practice to avoid spaces in file names. If you need to refer to a filename with spaces in, you’ll need to use quotes or escape the spaces in the file names using a `\`, for example `this is my file.txt` becomes `this\ is\ my\ file.txt` ::: -To run programmes from the command line, all you need is the name of the programme as the command: in fact, commands *are* programmes. The `date` command refers to an actual programme on your computer that you can find. And this also explains a bit of what's going on when you *run a script from the command line* (more on that later). +To run programmes from the command line, all you need is the name of the programme as the command: in fact, commands _are_ programmes. The `date` command refers to an actual programme on your computer that you can find. And this also explains a bit of what's going on when you _run a script from the command line_ (more on that later). Once you've run a few commands, you'll notice that you can't navigate around the command line like you can a text file or Python script. 
Here are some tips for navigating the command line: @@ -93,20 +93,20 @@ Once you've run a few commands, you'll notice that you can't navigate around the ### Navigating directories -While we're on navigating, it's useful to understand *where* in the computer you are when you open the command line. If you open a terminal pane within VS Code, you will start (by default at least) within the same folder as your project. Starting a terminal instance outside of VS Code will get you a terminal in a root directory for your computer; for example, on a Mac, opening a new terminal window starts you in `/Users/yourusername/`. +While we're on navigating, it's useful to understand _where_ in the computer you are when you open the command line. If you open a terminal pane within VS Code, you will start (by default at least) within the same folder as your project. Starting a terminal instance outside of VS Code will get you a terminal in a root directory for your computer; for example, on a Mac, opening a new terminal window starts you in `/Users/yourusername/`. To find out "where" you are when you open a terminal, you can use the `pwd` command, which stands for "print working directory". -The table below shows some useful commands for moving around your computer using the command line. Note that `cd` accepts a location *relative* to your current directory. +The table below shows some useful commands for moving around your computer using the command line. Note that `cd` accepts a location _relative_ to your current directory. 
- | Command | What it does | - | --------------------- | ------------------------------------------------------------ | - | `pwd` | Shows current directory | - | `cd` | Change directory command | - | `cd ..` | Go up one level in the directory (`cd ../..` for two levels) | - | `cd ~` | Go to your home directory | - | `cd -` | Go to the previous directory | - | `cd documents/papers` | Go directly to a directory named 'papers' | +| Command | What it does | +| --------------------- | ------------------------------------------------------------ | +| `pwd` | Shows current directory | +| `cd` | Change directory command | +| `cd ..` | Go up one level in the directory (`cd ../..` for two levels) | +| `cd ~` | Go to your home directory | +| `cd -` | Go to the previous directory | +| `cd documents/papers` | Go directly to a directory named 'papers' | ## Using Python on the command line @@ -124,7 +124,7 @@ Say you have a script called `analysis.py`, you can run it with Python on the co uv run python analysis.py ``` -which calls Python as a programme and gives it `analysis.py` as the argument. If you have multiple versions of Python, which you should do if you're following best practice and using a version per project, then you can see *which* version of Python is being used with +which calls Python as a programme and gives it `analysis.py` as the argument. If you have multiple versions of Python, which you should do if you're following best practice and using a version per project, then you can see _which_ version of Python is being used with ```bash which python @@ -134,39 +134,39 @@ which python Now we'll see some useful commands for the terminal. 
- | Command                                          | What it does | - | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | - | `man ` | Shows a manual for the given command | - | `touch ` | Creates an empty file named `` | - | `code ` | Open a file in VS Code (creating it, if it does not exist) | - | `mkdir ` | creates a new folder called `foldername` | - | `echo ` | Prints `` | - | `cat ` | Print the full contents of `` | - | `head ` | Print the start of a file | - | `tail ` | Print the end of a file | - | `> ` | Redirects output from screen to ``. For example, `echo "Hello World" > hello.txt` | - | `>> ` | Redirects output from screen to the end of ``, ie appends output rather than overwrites it | - | ` | ` | The pipe symbol: uses output from one command as input into another. For example, `head -n 10 data.csv | > hello_world.txt` would write the first 10 lines of data.csv into a file called hello_world.txt | - | `less ` | Print out the contents of a file in paginated form. Use `ctrl+v` and `Alt+v` (or `⌘+v` and `⌥+v` on Mac) to move up and down. Press `q` to quit. | - | `wc -l` | Returns number of lines in input, for example `cat | wc -l`. Use `wc` alone for word count. 
| - | `sort` | Arrange lines in a file in alphabetical order | - | `uniq` | Remove duplicate lines from input, for example `cat | uniq` or `uniq -d` to show duplicate files | - | `mv` | Move or rename a file; for example, `mv file1 file2` would rename `file1` to `file2` while `mv file1 ~` would move `file1` to the home directory | - | `cp` | Copy a file; for example, `cp file1 file2` would copy `file1` to `file2` while `cp file1 ~` would make a copy of `file1` in the home directory | - | `rm ` | Permanently remove a file | - | `rmdir ` | Permanently remove an empty directory | - | `rm -rf ` | ⚠ Permanently remove everything in a directory ⚠ | - | `grep ` | Search for a given term, for example `cat hello_world.txt | grep world` | - | `ls` | Basically, this means list stuff (files and folders) in the current directory | - | `ls -a` | List stuff in the current directory even if it's hidden | - | `ls -l` | List stuff in a more readable format and show permissions | - | `ls -S` | List stuff by size | - | `file ` | Give information on the file type of `` | - | `find` | Find specific files on your computer, can be piped into other commands for example `find *.md -size +5k -type f | xargs wc -l` will count the number of lines `wc -l` of all files, `-type f`, ending in `.md` that are greater than 5 kilobytes in size, `-size +5k`. | - | `diff -u ` | Show a single summary of the differences between two files. 
| +| Command                                          | What it does | +| --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| `man ` | Shows a manual for the given command | +| `touch ` | Creates an empty file named `` | +| `code ` | Open a file in VS Code (creating it, if it does not exist) | +| `mkdir ` | creates a new folder called `foldername` | +| `echo ` | Prints `` | +| `cat ` | Print the full contents of `` | +| `head ` | Print the start of a file | +| `tail ` | Print the end of a file | +| `> ` | Redirects output from screen to ``. For example, `echo "Hello World" > hello.txt` | +| `>> ` | Redirects output from screen to the end of ``, ie appends output rather than overwrites it | +| ` | ` | The pipe symbol: uses output from one command as input into another. For example, `head -n 10 data.csv | > hello_world.txt` would write the first 10 lines of data.csv into a file called hello_world.txt | +| `less ` | Print out the contents of a file in paginated form. Use `ctrl+v` and `Alt+v` (or `⌘+v` and `⌥+v` on Mac) to move up and down. Press `q` to quit. | +| `wc -l` | Returns number of lines in input, for example `cat | wc -l`. Use `wc` alone for word count. 
| +| `sort` | Arrange lines in a file in alphabetical order | +| `uniq` | Remove duplicate lines from input, for example `cat | uniq`or`uniq -d` to show duplicate files | +| `mv` | Move or rename a file; for example, `mv file1 file2` would rename `file1` to `file2` while `mv file1 ~` would move `file1` to the home directory | +| `cp` | Copy a file; for example, `cp file1 file2` would copy `file1` to `file2` while `cp file1 ~` would make a copy of `file1` in the home directory | +| `rm ` | Permanently remove a file | +| `rmdir ` | Permanently remove an empty directory | +| `rm -rf ` | ⚠ Permanently remove everything in a directory ⚠ | +| `grep ` | Search for a given term, for example `cat hello_world.txt | grep world` | +| `ls` | Basically, this means list stuff (files and folders) in the current directory | +| `ls -a` | List stuff in the current directory even if it's hidden | +| `ls -l` | List stuff in a more readable format and show permissions | +| `ls -S` | List stuff by size | +| `file ` | Give information on the file type of `` | +| `find` | Find specific files on your computer, can be piped into other commands for example `find \*.md -size +5k -type f | xargs wc -l`will count the number of lines`wc -l`of all files,`-type f`, ending in `.md`that are greater than 5 kilobytes in size,`-size +5k`. | +| `diff -u ` | Show a single summary of the differences between two files. | ![More details of the grep command](https://pbs.twimg.com/media/DcPeD_CW0AEkSar?format=jpg&name=small) -*More details of the grep command, by [\@b0rk](https://twitter.com/b0rk).* +_More details of the grep command, by [\@b0rk](https://twitter.com/b0rk)._ You can write for loops in bash (remember, it's a language). The general structure is @@ -187,7 +187,7 @@ A more interesting example is giving the number of lines of text, number of word ```bash for i in $(ls *.csv) -do +do wc $i done ``` @@ -204,7 +204,7 @@ done A couple of new features appeared in the examples above. 
-`*` is a *wildcard character*, it tells bash to look for anything that ends in ".csv". This is not the only special case; `?` serves a similar purpose of standing in for any character but just *one* character rather than arbitrarily many. If you had a folder with `file1.csv`, `file2.csv`, etc., up to 9, then you could use `file?.csv` to refer to all of them but this would not pick up `file10.csv`. +`*` is a _wildcard character_, it tells bash to look for anything that ends in ".csv". This is not the only special case; `?` serves a similar purpose of standing in for any character but just _one_ character rather than arbitrarily many. If you had a folder with `file1.csv`, `file2.csv`, etc., up to 9, then you could use `file?.csv` to refer to all of them but this would not pick up `file10.csv`. Another special character we've already seen is the curly brace, `{}`. Whenever you have a common substring in a series of commands using curly braces tells the command line to expand what's in them automatically. In an example above, this is used on 1 to 5. But it can also be used in, for example, file names: @@ -272,7 +272,7 @@ You can find more of these special variables [here](https://tldp.org/LDP/abs/htm [**pandoc**](https://pandoc.org/) is absolutely brilliant: if you need to convert files containing text from one format to another, it really is a swiss-army knife. There isn't space here to list the ridiculous number of documents it can convert between, but, importantly, it can translate back and forth between all of the following: markdown, $\LaTeX$, Microsoft Word's docx, OpenOffice's ODT, HTML, and Jupyter Notebook. -It can also write from any of those formats (and more) in one direction *to* PDF, Microsoft Powerpoint, and $\LaTeX$ Beamer. +It can also write from any of those formats (and more) in one direction _to_ PDF, Microsoft Powerpoint, and $\LaTeX$ Beamer. 
To use **pandoc**, install it following the instructions on the website and then call it like this: @@ -284,9 +284,9 @@ This is an example where the input is a .tex document and the output, `-o`, is a You can get quite fancy with **pandoc**, for example you can translate a whole book's worth of latex into a Word doc complete with a Word style, a bibliography via biblatex, equations, and figures. Nothing can save Word from being painful to use, but **pandoc** certainly helps. -[**eza**](https://eza.rocks/) is an upgrade on the `ls` command. It is designed to be an improved file lister with more features and better defaults. It uses colours to distinguish file types and metadata. Follow the instructions on the website to install it on your operating system. To replace `ls` with `eza`, you can use a terminal *alias*. There's a good guide [available here](https://denisrasulev.medium.com/eza-the-best-ls-command-replacement-9621252323e). +[**eza**](https://eza.rocks/) is an upgrade on the `ls` command. It is designed to be an improved file lister with more features and better defaults. It uses colours to distinguish file types and metadata. Follow the instructions on the website to install it on your operating system. To replace `ls` with `eza`, you can use a terminal _alias_. There's a good guide [available here](https://denisrasulev.medium.com/eza-the-best-ls-command-replacement-9621252323e). -**nano** is a built-in text editor that runs *within* the terminal. This can be really useful if you're working on the cloud (but it's not got the rich features of a GUI-based text editor like VS Code). To open a file using **nano**, the command is `nano file.txt`. Nano displays instructions on how to navigate when it loads up but exiting is the hardest part: when you're done, hit `Ctrl+X`, then `y` to save, and then `enter` to exit. +**nano** is a built-in text editor that runs _within_ the terminal. 
This can be really useful if you're working on the cloud (but it's not got the rich features of a GUI-based text editor like VS Code). To open a file using **nano**, the command is `nano file.txt`. Nano displays instructions on how to navigate when it loads up but exiting is the hardest part: when you're done, hit `Ctrl+X`, then `y` to save, and then `enter` to exit. [**wget**](https://www.gnu.org/software/wget/) is a command-line utility for downloading files from the internet. It's very simple to use, the syntax is just `wget [options] [url]`. For example, to download the starwars csv file used in this book, the command is diff --git a/iteration.ipynb b/iteration.ipynb index fc1452a..3db7c82 100644 --- a/iteration.ipynb +++ b/iteration.ipynb @@ -21,12 +21,12 @@ "\n", "One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated. Another tool for reducing duplication is *iteration*, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.\n", "\n", - "In this chapter you'll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for **pandas** data frames." + "In this chapter you'll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for **polars** data frames." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "51a55374", "metadata": { "tags": [ @@ -51,7 +51,7 @@ "source": [ "### Prerequisites\n", "\n", - "This chapter will use the **pandas** data analysis package." + "This chapter will use the **polars** data analysis package." 
] }, { @@ -68,10 +68,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "a2bbd41c", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lovelace\n", + "Smith\n", + "Pigou\n", + "Babbage\n" + ] + } + ], "source": [ "name_list = [\"Lovelace\", \"Smith\", \"Pigou\", \"Babbage\"]\n", "\n", @@ -106,10 +117,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "835ebda7", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "hello\n", + "---end entry---\n" + ] + } + ], "source": [ "for entry in [\"hello\"]:\n", " print(entry)\n", @@ -126,10 +146,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "2a19ac2e", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "h\n", + "---end entry---\n", + "e\n", + "---end entry---\n", + "l\n", + "---end entry---\n", + "l\n", + "---end entry---\n", + "o\n", + "---end entry---\n" + ] + } + ], "source": [ "for entry in \"hello\":\n", " print(entry)\n", @@ -150,10 +187,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "239e133f", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The name in position 0 is Lovelace\n", + "The name in position 1 is Smith\n", + "The name in position 2 is Hopper\n", + "The name in position 3 is Babbage\n" + ] + } + ], "source": [ "name_list = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n", "\n", @@ -171,10 +219,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "b66c5c53", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The name in position 1 is Lovelace\n", + "The name in position 2 is Smith\n", + "The name in 
position 3 is Hopper\n", + "The name in position 4 is Babbage\n" + ] + } + ], "source": [ "for i, name in enumerate(name_list, start=1):\n", " print(f\"The name in position {i} is {name}\")" @@ -190,10 +249,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "010239fe", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "In Paris, the temperature is 28 degrees C today.\n", + "In London, the temperature is 22 degrees C today.\n", + "In Seville, the temperature is 36 degrees C today.\n", + "In Wellesley, the temperature is 29 degrees C today.\n" + ] + } + ], "source": [ "cities_to_temps = {\"Paris\": 28, \"London\": 22, \"Seville\": 36, \"Wellesley\": 29}\n", "\n", @@ -217,10 +287,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "8ea3efc5", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Ada Lovelace\n", + "Adam Smith\n", + "Grace Hopper\n", + "Charles Babbage\n" + ] + } + ], "source": [ "first_names = [\"Ada\", \"Adam\", \"Grace\", \"Charles\"]\n", "last_names = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n", @@ -257,10 +338,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "id": "7efed381", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "[51, 52, 53, 54, 55, 56, 57, 58, 59, 60]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "num_list = range(50, 60)\n", "[1 + num for num in num_list]" @@ -286,10 +378,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "722fda21", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39]\n" + ] + } + ], "source": [ "number_list = range(1, 40)\n", 
"divide_list = [x for x in number_list if x % 3 == 0]\n", @@ -308,10 +408,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "id": "b6e80d6b", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Adam Smith', 'leonara smith']\n" + ] + } + ], "source": [ "names_list = [\"Joe Bloggs\", \"Adam Smith\", \"Sandra Noone\", \"leonara smith\"]\n", "smith_list = [x for x in names_list if \"smith\" in x.lower()]\n", @@ -330,10 +438,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "f348bfb6", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Not Smith!', 'Adam Smith', 'Not Smith!', 'leonara smith']\n" + ] + } + ], "source": [ "names_list = [\"Joe Bloggs\", \"Adam Smith\", \"Sandra Noone\", \"leonara smith\"]\n", "smith_list = [x if \"smith\" in x.lower() else \"Not Smith!\" for x in names_list]\n", @@ -350,10 +466,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "74e4fcc7", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Ada Lovelace', 'Adam Smith', 'Grace Hopper', 'Charles Babbage']\n" + ] + } + ], "source": [ "first_names = [\"Ada\", \"Adam\", \"Grace\", \"Charles\"]\n", "last_names = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n", @@ -371,10 +495,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "2c82cf1f", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[['Ada Lovelace', 'Adam Lovelace'], ['Ada Smith', 'Adam Smith']]\n" + ] + } + ], "source": [ "first_names = [\"Ada\", \"Adam\"]\n", "last_names = [\"Lovelace\", \"Smith\"]\n", @@ -394,10 +526,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "acef16ae", "metadata": 
{}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'Ada': 'Lovelace', 'Adam': 'Smith'}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "{key: value for key, value in zip(first_names, last_names)}" ] @@ -420,10 +563,28 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "id": "3e47ba02", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "9\n", + "8\n", + "7\n", + "6\n", + "5\n", + "4\n", + "3\n", + "2\n", + "1\n", + "execution complete\n" + ] + } + ], "source": [ "n = 10\n", "while n > 0:\n", @@ -452,11 +613,13 @@ "id": "5ec0643e", "metadata": {}, "source": [ - "## Iteration with **pandas** Data Frames\n", + "## Iteration with **polars** Data Frames\n", "\n", - "For loops, while loops, and comprehensions all work on **pandas** data frames, but they are generally a bad way to get things done because they are slow and not memory efficient. To aid cases where iteration is needed, **pandas** has built-in methods for iteration depending on what you need to do.\n", + "For loops, while loops, and comprehensions can be used with data frames, but in **Polars**, they are even more strongly discouraged than in pandas. 
**Polars** is built on a columnar, vectorized, and expression-based engine, so row-by-row iteration breaks performance and prevents optimizations.\n", "\n", "These built-in methods for iteration have an overlap with what we've seen in @sec-data-transform but we'll dig a little deeper into `assign()`/assignment operations, `apply()`, and `eval()` here.\n", + "\n", + "Instead of iterating, **Polars** encourages you to use expressions and lazy evaluation, which are much faster and more memory efficient.\n", "\n" ] }, @@ -474,15 +637,48 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "id": "b3116809", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (6, 4)
a | b | c | d
f64 | f64 | f64 | f64
-0.103256 | 0.546434 | -1.007028 | 0.016913
-0.377048 | -0.256713 | 0.913368 | -1.584054
-0.009644 | 1.126097 | -0.401405 | 0.332419
1.551129 | -0.505919 | -0.217086 | 0.348191
0.705577 | 0.82609 | 0.445893 | 1.21817
0.342603 | 1.178953 | -0.004682 | 0.546398
" + ], + "text/plain": [ + "shape: (6, 4)\n", + "┌───────────┬───────────┬───────────┬───────────┐\n", + "│ a ┆ b ┆ c ┆ d │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ f64 ┆ f64 ┆ f64 ┆ f64 │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╡\n", + "│ -0.103256 ┆ 0.546434 ┆ -1.007028 ┆ 0.016913 │\n", + "│ -0.377048 ┆ -0.256713 ┆ 0.913368 ┆ -1.584054 │\n", + "│ -0.009644 ┆ 1.126097 ┆ -0.401405 ┆ 0.332419 │\n", + "│ 1.551129 ┆ -0.505919 ┆ -0.217086 ┆ 0.348191 │\n", + "│ 0.705577 ┆ 0.82609 ┆ 0.445893 ┆ 1.21817 │\n", + "│ 0.342603 ┆ 1.178953 ┆ -0.004682 ┆ 0.546398 │\n", + "└───────────┴───────────┴───────────┴───────────┘" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "import numpy as np\n", - "import pandas as pd\n", + "import polars as pl\n", "\n", - "df = pd.DataFrame(np.random.normal(size=(6, 4)), columns=[\"a\", \"b\", \"c\", \"d\"])\n", + "df = pl.DataFrame(np.random.normal(size=(6, 4)), schema=[\"a\", \"b\", \"c\", \"d\"])\n", "df" ] }, @@ -491,27 +687,88 @@ "id": "0e826ad5", "metadata": {}, "source": [ - "**pandas** has many built-in functions that are already built to iterate over rows and columns; for example, to compute the median of rows or columns respectively:" + "**polars** has built-in expressions designed to operate over columns and rows. For example, to compute the median:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "id": "ac909c2f", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (1, 4)
a | b | c | d
f64 | f64 | f64 | f64
0.166479 | 0.686262 | -0.110884 | 0.340305
" + ], + "text/plain": [ + "shape: (1, 4)\n", + "┌──────────┬──────────┬───────────┬──────────┐\n", + "│ a ┆ b ┆ c ┆ d │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ f64 ┆ f64 ┆ f64 ┆ f64 │\n", + "╞══════════╪══════════╪═══════════╪══════════╡\n", + "│ 0.166479 ┆ 0.686262 ┆ -0.110884 ┆ 0.340305 │\n", + "└──────────┴──────────┴───────────┴──────────┘" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "df.median(axis=\"rows\") # can also use axis=1" + "df.select(pl.all().median())" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "id": "96426002", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (6, 1)
" + ], + "text/plain": [ + "shape: (6, 1)\n", + "┌────────────┐\n", + "│ row_median │\n", + "│ --- │\n", + "│ f64 │\n", + "╞════════════╡\n", + "│ -0.043172 │\n", + "│ -0.316881 │\n", + "│ 0.161388 │\n", + "│ 0.065552 │\n", + "│ 0.765833 │\n", + "│ 0.4445 │\n", + "└────────────┘" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "df.median(axis=\"columns\") # can also use axis=0" + "df.select(pl.concat_list(pl.all()).list.median().alias(\"row_median\"))" ] }, { @@ -524,10 +781,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "id": "060b6815", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "246 μs ± 2.46 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" + ] + } + ], "source": [ "# Do not do this!\n", "\n", @@ -535,7 +800,7 @@ "def add_five_slow(df):\n", " for i in range(len(df)):\n", " for j in range(len(df.columns)):\n", - " df.iloc[i, j] = df.iloc[i, j] + 5\n", + " df[i, j] = df[i, j] + 5\n", "\n", "\n", "%timeit add_five_slow(df)" @@ -546,15 +811,23 @@ "id": "8246132e", "metadata": {}, "source": [ - "But to do this, every individual cell must be accessed and operated on—so it is very slow, taking milliseconds. **pandas** has far faster ways of performing the same operation. For simple operations on data frames with consistent type, you can simply add five to the whole data frame:" + "But to do this, every individual cell must be accessed and operated on, so it is slow: hundreds of microseconds even for this tiny data frame. **polars** has far faster ways of performing the same operation.
For simple operations on data frames with consistent type, you can simply add five to the whole data frame:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 21, "id": "1a48ae52", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "51.1 μs ± 395 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" + ] + } + ], "source": [ "%timeit df + 5" ] @@ -572,19 +845,19 @@ "id": "7313616e", "metadata": {}, "source": [ - "This also works on a per column basis, so you can do `df[\"a\"] = df[\"a\"] + 5` and so on.\n", + "This also works on a per-column basis, so you can do `df.with_columns(pl.col(\"a\") + 5)` and so on.\n", "\n", - "These operations have equivalents using the `assign()` operator, which allows for *method chaining*; stringing multiple operations together. The `assign()` operator version of `df[\"new_a\"] = df[\"a\"] + 5` would be" + "Because `with_columns()` returns a new data frame, these operations lend themselves to method chaining, that is, stringing multiple operations together. To add the result as a new column called `new_a` and keep it, assign the returned data frame back to `df`:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 22, "id": "f7391dc5", "metadata": {}, "outputs": [], "source": [ - "df = df.assign(new_a=lambda x: x[\"a\"] + 5)" + "df = df.with_columns(new_a=pl.col(\"a\") + 5)" ] }, { @@ -592,19 +865,63 @@ "id": "76aec162", "metadata": {}, "source": [ - "### Apply\n", + "### Expressions (Polars' Alternative to apply)\n", + "\n", + "What happens if you have a more complicated operation you want to perform? In **pandas**, you might reach for `apply()`. In **polars**, you almost never need an equivalent because its expression API is incredibly expressive.\n", "\n", - "What happens if you have a more complicated function you want to iterate over? This is where **pandas**' `apply()` comes in, and can be used with assignment. `apply()` can also be used across rows or columns.
Like `assign()`, it can be combined with a lambda function and used with either the whole data frame or just a column (in which case no need to specify `axis=`)." + "Most \"complicated\" operations can be expressed directly using **polars'** built-in expressions:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "id": "31adcb3f", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (6, 6)
" + ], + "text/plain": [ + "shape: (6, 6)\n", + "┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬───────────┐\n", + "│ a ┆ b ┆ c ┆ d ┆ new_a ┆ result │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │\n", + "╞══════════════╪══════════════╪══════════════╪══════════════╪══════════════╪═══════════╡\n", + "│ 40554.896744 ┆ 40555.546434 ┆ 40553.992972 ┆ 40555.016913 ┆ 40559.896744 ┆ -3.90117 │\n", + "│ 40554.622952 ┆ 40554.743287 ┆ 40555.913368 ┆ 40553.415946 ┆ 40559.622952 ┆ -6.898851 │\n", + "│ 40554.990356 ┆ 40556.126097 ┆ 40554.598595 ┆ 40555.332419 ┆ 40559.990356 ┆ -3.833543 │\n", + "│ 40556.551129 ┆ 40554.494081 ┆ 40554.782914 ┆ 40555.348191 ┆ 40561.551129 ┆ -4.089305 │\n", + "│ 40555.705577 ┆ 40555.82609 ┆ 40555.445893 ┆ 40556.21817 ┆ 40560.705577 ┆ -4.265744 │\n", + "│ 40555.342603 ┆ 40556.178953 ┆ 40554.995318 ┆ 40555.546398 ┆ 40560.342603 ┆ -3.8252 │\n", + "└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴───────────┘" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "df.apply(lambda x: x[\"a\"] - x[\"new_a\"].mean() * x[\"c\"] / x[\"b\"], axis=1)" + "# Don't do this (slow, row-wise)\n", + "mean_new_a = df.select(pl.col(\"new_a\").mean()).item()\n", + "df.with_columns(\n", + " result=pl.struct([\"a\", \"b\", \"c\"]).map_elements(\n", + " lambda x: x[\"a\"] - mean_new_a * x[\"c\"] / x[\"b\"], return_dtype=pl.Float64\n", + " )\n", + ")\n", + "\n", + "# Do this instead (fast, vectorized)\n", + "df.with_columns(result=pl.col(\"a\") - pl.col(\"new_a\").mean() * pl.col(\"c\") / pl.col(\"b\"))" ] }, { @@ -612,24 +929,7 @@ "id": "78b558f4", "metadata": {}, "source": [ - "Note that this is just an example: you could still do this entire operation without using apply! 
But you will sometimes find yourself with cases where you do need to use it.\n", - "\n", - "Apply also works with functions, including user-defined functions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "136d435d", - "metadata": {}, - "outputs": [], - "source": [ - "def complicated_function(x):\n", - " return x - x.mean()\n", - "\n", - "\n", - "df = df.apply(complicated_function, axis=1)\n", - "df" + "The first expression would work, but it evaluates the computation row by row using a **python lambda**, which is slow and prevents **polars** from optimizing the query. The second approach uses native expressions, allowing **polars** to execute the computation efficiently in a fully vectorized and optimized manner." ] }, { @@ -637,19 +937,50 @@ "id": "171be2c9", "metadata": {}, "source": [ - "### Eval(uate)\n", - "\n", - "`eval()` evaluates a string describing operations on DataFrame columns to create new columns. It operates on columns only, not rows or elements. Here's an example:" + "In **polars**, there's no `eval()` — you use expressions directly instead:\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, "id": "8d9defd3", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (6, 6)
" + ], + "text/plain": [ + "shape: (6, 6)\n", + "┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────┐\n", + "│ a ┆ b ┆ c ┆ d ┆ new_a ┆ ratio │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │\n", + "╞══════════════╪══════════════╪══════════════╪══════════════╪══════════════╪══════════╡\n", + "│ 40554.896744 ┆ 40555.546434 ┆ 40553.992972 ┆ 40555.016913 ┆ 40559.896744 ┆ 0.999877 │\n", + "│ 40554.622952 ┆ 40554.743287 ┆ 40555.913368 ┆ 40553.415946 ┆ 40559.622952 ┆ 0.999877 │\n", + "│ 40554.990356 ┆ 40556.126097 ┆ 40554.598595 ┆ 40555.332419 ┆ 40559.990356 ┆ 0.999877 │\n", + "│ 40556.551129 ┆ 40554.494081 ┆ 40554.782914 ┆ 40555.348191 ┆ 40561.551129 ┆ 0.999877 │\n", + "│ 40555.705577 ┆ 40555.82609 ┆ 40555.445893 ┆ 40556.21817 ┆ 40560.705577 ┆ 0.999877 │\n", + "│ 40555.342603 ┆ 40556.178953 ┆ 40554.995318 ┆ 40555.546398 ┆ 40560.342603 ┆ 0.999877 │\n", + "└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────┘" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "df[\"ratio\"] = df.eval(\"a / new_a\")\n", + "df = df.with_columns((pl.col(\"a\") / pl.col(\"new_a\")).alias(\"ratio\"))\n", "df" ] }, @@ -658,7 +989,51 @@ "id": "8b275b5b", "metadata": {}, "source": [ - "Evaluate can also be used to create new boolean columns using, for example, a string `\"a > 0.5\"` in the above example." + "You can also create boolean columns the same way:" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "f301c8cb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "shape: (6, 7)
" + ], + "text/plain": [ + "shape: (6, 7)\n", + "┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────┬──────────┐\n", + "│ a ┆ b ┆ c ┆ d ┆ new_a ┆ ratio ┆ a_gt_0.5 │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ bool │\n", + "╞══════════════╪══════════════╪══════════════╪══════════════╪══════════════╪══════════╪══════════╡\n", + "│ 40554.896744 ┆ 40555.546434 ┆ 40553.992972 ┆ 40555.016913 ┆ 40559.896744 ┆ 0.999877 ┆ true │\n", + "│ 40554.622952 ┆ 40554.743287 ┆ 40555.913368 ┆ 40553.415946 ┆ 40559.622952 ┆ 0.999877 ┆ true │\n", + "│ 40554.990356 ┆ 40556.126097 ┆ 40554.598595 ┆ 40555.332419 ┆ 40559.990356 ┆ 0.999877 ┆ true │\n", + "│ 40556.551129 ┆ 40554.494081 ┆ 40554.782914 ┆ 40555.348191 ┆ 40561.551129 ┆ 0.999877 ┆ true │\n", + "│ 40555.705577 ┆ 40555.82609 ┆ 40555.445893 ┆ 40556.21817 ┆ 40560.705577 ┆ 0.999877 ┆ true │\n", + "│ 40555.342603 ┆ 40556.178953 ┆ 40554.995318 ┆ 40555.546398 ┆ 40560.342603 ┆ 0.999877 ┆ true │\n", + "└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────┴──────────┘" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = df.with_columns((pl.col(\"a\") > 0.5).alias(\"a_gt_0.5\"))\n", + "df" ] } ], @@ -687,7 +1062,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.12" + "version": "3.12.13" }, "toc-showtags": true }, diff --git a/visualise.quarto_ipynb_1 b/visualise.quarto_ipynb_1 index 2ef6d2f..c54f104 100644 --- a/visualise.quarto_ipynb_1 +++ b/visualise.quarto_ipynb_1 @@ -28,7 +28,7 @@ "\n", "The first of the three kinds is *exploratory data visualisation*, and it's the kind that you do when you're looking and data and trying to understand it. Just plotting the data is a really good strategy for getting a feel for any issues there might be. 
This is perhaps most famously demonstrated by Anscombe's quartet: four different datasets with the same mean, standard deviation, and correlation but very different data distributions." ], - "id": "f3331573" + "id": "b9313f4b" }, { "cell_type": "code", @@ -96,7 +96,7 @@ "plt.suptitle(\"Anscombe's Quartet\")\n", "plt.show()" ], - "id": "64a0e7f6", + "id": "3de1a2d3", "execution_count": null, "outputs": [] }, @@ -120,7 +120,7 @@ "\n", "You can find more information on the topic of communicating via data visualisations in the @sec-communicate-plots chapter." ], - "id": "30b9ff30" + "id": "303b55a9" } ], "metadata": { diff --git a/workflow-help.quarto_ipynb_1 b/workflow-help.quarto_ipynb_1 index e7bebf8..8b23ba1 100644 --- a/workflow-help.quarto_ipynb_1 +++ b/workflow-help.quarto_ipynb_1 @@ -35,7 +35,7 @@ "\n", "- First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any packages you used and create all necessary objects. The easiest way to make sure you've done this is to use the [**watermark**](https://github.com/rasbt/watermark) package alongside whatever else you are doing:" ], - "id": "22b3f9e0" + "id": "91169e03" }, { "cell_type": "code", @@ -48,7 +48,7 @@ "print(watermark())\n", "print(watermark(iversions=True, globals_=globals()))" ], - "id": "a119501b", + "id": "bccd500b", "execution_count": null, "outputs": [] }, @@ -72,7 +72,7 @@ "\n", "- **Data**: as others won't be able to easily download the data you're working with, it's often best to create a small amount of data from code that still have the same problem as you're finding with your actual data. 
Between **numpy** and **pandas**, it's quite easy to generate data from code; here's an example:" ], - "id": "c4ac60b4" + "id": "4000f40e" }, { "cell_type": "code", @@ -87,7 +87,7 @@ "df[\"random_normal\"] = np.random.normal(size=6)\n", "df" ], - "id": "d1e4562c", + "id": "1187dba7", "execution_count": null, "outputs": [] }, @@ -99,7 +99,7 @@ "\n", "Finish by checking that you have actually made a reproducible example by starting a fresh Python session and copying and pasting your reprex in." ], - "id": "4b75e409" + "id": "abea1503" } ], "metadata": { diff --git a/workflow-packages-and-environments.quarto_ipynb_1 b/workflow-packages-and-environments.quarto_ipynb_1 index a5600ce..f61856c 100644 --- a/workflow-packages-and-environments.quarto_ipynb_1 +++ b/workflow-packages-and-environments.quarto_ipynb_1 @@ -104,7 +104,7 @@ "\n", "You may have noticed that a file called `pyproject.toml` has been created." ], - "id": "8b889898" + "id": "58431611" }, { "cell_type": "code", @@ -115,7 +115,7 @@ "\n", "print_json(data=toml.load(\"pyproject.toml\"))" ], - "id": "688f09f1", + "id": "4bb6211b", "execution_count": null, "outputs": [] }, @@ -133,7 +133,7 @@ "\n", "In the screenshot above, you can see the project-environment in two places: on the blue bar at the bottom of the screen, and (in 5), at the top right hand side of the interactive window. A similar top right indicator is present when you have a Jupyter Notebook open too." ], - "id": "148595b3" + "id": "76e4b027" } ], "metadata": {