diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 42274d2..109f016 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -9,8 +9,8 @@
   - [Mac](./chapter1/mac.md)
   - [Linux](./chapter1/linux.md)
   - [WSL](./chapter1/wsl.md)
-  - [M3 MASSIVE]()
-  - [Nectar Cloud]()
+  - [M3 MASSIVE](./chapter1/m3.md)
+  - [Nectar Cloud](./chapter1/nectar.md)
   - [Challenges](./chapter1/challenges.md)

- [Intro to C](./chapter2/intro-to-c.md)
@@ -44,16 +44,16 @@
   - [Spawning Processes & Threads]()
   - [Challenges](./chapter2/challenges.md)

-- [M3 & SLURM](./chapter3/chapter3.md)
+- [M3 & SLURM](./chapter5/chapter5.md)

-  - [Login - SSH & Strudel](./chapter3/login.md)
-  - [Batch vs. Stream Processing]()
-  - [Cluster Architectures]()
-  - [Schedmd's SLURM]()
-  - [M3 Interface & Usage]()
-  - [Job Scripting]()
+  - [Batch Processing vs. Cloud Computing](./chapter5/batch-cloud.md)
+  - [Parallel & Distributed Computing](./chapter5/parallel-distributed.md)
+  - [M3 Login - SSH & Strudel](./chapter5/login.md)
+  - [Schedmd's SLURM](./chapter5/slurm.md)
+  - [M3 Interface & Usage](./chapter5/m3-interface.md)
+  - [Job Scripting](./chapter5/job-scripting.md)
   - [Advanced SLURM]()
-  - [Challenges](./chapter3/challenges.md)
+  - [Challenges](./chapter5/challenges.md)

- [Virtualisation & DevOps]()

@@ -102,10 +102,10 @@
   - [Job Batching](./chapter10/job-batching.md)
   - [Challenges](./chapter10/challenges.md)

-- [Being a HPC Member]()
+- [Being a HPC Member](./chapter11/chapter11.md)

-  - [Expectations & Leadership]()
-  - [Project Workflow]()
-  - [Academic Supervisors & Papers]()
+  - [Expectations & Leadership](./chapter11/expectations-leadership.md)
+  - [Project Workflow](./chapter11/project-workflow.md)
+  - [Academic Supervisors & Papers](./chapter11/supervisors-papers.md)

[Acknowledgements](./acknowledgements.md)
diff --git a/src/chapter1/aaf.png b/src/chapter1/aaf.png
new file mode 100644
index 0000000..836a38d
Binary files /dev/null and b/src/chapter1/aaf.png differ
diff --git a/src/chapter1/hpcid.png b/src/chapter1/hpcid.png
new file mode 100644
index 0000000..97ef6ae
Binary files /dev/null and b/src/chapter1/hpcid.png differ
diff --git a/src/chapter1/join_project.png b/src/chapter1/join_project.png
new file mode 100644
index 0000000..f50e537
Binary files /dev/null and b/src/chapter1/join_project.png differ
diff --git a/src/chapter3/start.md b/src/chapter1/m3.md
similarity index 77%
rename from src/chapter3/start.md
rename to src/chapter1/m3.md
index ab22ab8..226a963 100644
--- a/src/chapter3/start.md
+++ b/src/chapter1/m3.md
@@ -1,14 +1,19 @@
-# Getting Started
+# M3 MASSIVE
+
+MASSIVE (Multi-modal Australian ScienceS Imaging and Visualisation Environment) is a HPC supercomputing cluster that you will have access to as an MDN member. On this page we will set you up with access before you learn how to use it in Chapter 5. Feel free to go through the docs to learn about the [hardware config](https://docs.massive.org.au/M3/m3users.html) of M3 (the 3rd version of MASSIVE) and its [institutional governance](https://massive.org.au/about/about.html#governance).
## Request an account

-In order to access M3, you will need to request an account. To do this, follow this link: [HPC ID](https://hpc.erc.monash.edu.au/karaage/aafbootstrap). This should take you to a page this this:
+In order to access M3, you will need to request an account. To do this, follow this link: [HPC ID](https://hpc.erc.monash.edu.au/karaage/aafbootstrap). This should take you to a page like this:
+

-![HPC ID](./imgs/aaf.png)
+![HPC ID](./aaf.png)

Type in Monash, as you can see here. Select Monash University, and tick the Remember my organisation box down the bottom. Once you continue to your organisation, it will take you to the Monash Uni SSO login page. You will need to login with your Monash credentials.

-You should now see something like this: ![HPC ID System](./imgs/hpcid.png)
+You should now see something like this:
+
+![HPC ID System](./hpcid.png)

Once you are here, there are a couple things you will need to do. The first, and most important is to set your HPC password. This is the password you will use to login to M3. To do this, go to home, then click on Change Linux Password. This will take you through the steps of setting your password.

@@ -16,7 +21,9 @@ Once you have done this, you can move on to requesting access to the MDN project

## Add to project

-To request to join the MDN project, again from the Home page click on Join Exiting Project. You should see a screen like this: ![Join Project](./imgs/join_project.png)
+To request to join the MDN project, again from the Home page click on Join Existing Project. You should see a screen like this:
+
+![Join Project](./join_project.png)

In the text box type `vf38` and click search. This is the project code for MDN. Then select the project and click submit. You will now have to wait for the project admins to approve your request. Once they have done this, you will be able to access the project. This should not take longer than a few days, and you will get an email telling you when you have access.

@@ -47,4 +54,4 @@
cat ~/.ssh/id_ed25519.pub

Then, go to your github account, go to settings, and click on the SSH and GPG keys tab. Click on New SSH key, and paste the key into the box. Give it a name, and click Add SSH key.

-You should now be able to clone repos using SSH. To do this, go to the repo you want to clone, but instead of copying the HTTP link, copy the SSH link, and then its regular git cloning.
+You should now be able to clone repos using SSH. To do this, go to the repo you want to clone, but instead of copying the HTTP link, copy the SSH link, and then it's regular git cloning.
\ No newline at end of file
diff --git a/src/chapter1/nectar-login.png b/src/chapter1/nectar-login.png
new file mode 100644
index 0000000..8f4fed0
Binary files /dev/null and b/src/chapter1/nectar-login.png differ
diff --git a/src/chapter1/nectar.md b/src/chapter1/nectar.md
new file mode 100644
index 0000000..53624a1
--- /dev/null
+++ b/src/chapter1/nectar.md
@@ -0,0 +1,16 @@
+# Nectar Cloud
+
+The ARDC Nectar Research Cloud (Nectar) is Australia’s national research cloud, specifically designed for research computing. As with M3, we will set you up with access now, before you learn about it in later chapters. [This webpage](https://ardc.edu.au/services/ardc-nectar-research-cloud/) explains what it is if you're curious.
+
+## Connect Monash Account to Nectar Cloud
+To create an [identity](https://medium.com/@ciente/identity-and-access-management-iam-in-cloud-computing-2777481525a4) (account) in Nectar Cloud, all you have to do is log in using your Monash student account. Click [this link](https://dashboard.rc.nectar.org.au) to access Nectar's landing page.
+
+You will see the following. Make sure to click "Login via AAF (Australia)".
+
+![nectar](./nectar-login.png)
+
+You will be redirected to enter your Monash credentials, after which you will see the Nectar Cloud dashboard for your trial project (your project name will be pt-xxxxx).
+
+## Cloud Starter Series
+
+ARDC has provided [this cloud starter tutorial series](https://tutorials.rc.nectar.org.au/cloud-starter/01-overview) for people new to Nectar Cloud. You should be able to follow these tutorials using your trial project. If you need more SUs (service units, a.k.a. cloud credits) in order to provision more cloud resources for MDN-related work, you should message your HPC Lead with that request.
\ No newline at end of file
diff --git a/src/chapter11/chapter11.md b/src/chapter11/chapter11.md
new file mode 100644
index 0000000..ea51daa
--- /dev/null
+++ b/src/chapter11/chapter11.md
@@ -0,0 +1,6 @@
+# Being a HPC Member
+
+Congratulations! You've completed all your technical new recruit training!
+
+At this point it's important to remember that technical skills are only a part of what's required to succeed as a HPC member.
+Without good teamwork skills and collaboration practices, we will not be able to work together effectively and achieve our goals. To that end, this chapter outlines some basic expectations of all HPC members, along with other non-technical information that you might find useful.
\ No newline at end of file
diff --git a/src/chapter11/expectations-leadership.md b/src/chapter11/expectations-leadership.md
new file mode 100644
index 0000000..e6e1e14
--- /dev/null
+++ b/src/chapter11/expectations-leadership.md
@@ -0,0 +1 @@
+# Expectations & Leadership
\ No newline at end of file
diff --git a/src/chapter11/project-workflow.md b/src/chapter11/project-workflow.md
new file mode 100644
index 0000000..1bd8c6c
--- /dev/null
+++ b/src/chapter11/project-workflow.md
@@ -0,0 +1 @@
+# Project Workflow
\ No newline at end of file
diff --git a/src/chapter11/supervisors-papers.md b/src/chapter11/supervisors-papers.md
new file mode 100644
index 0000000..49bf64d
--- /dev/null
+++ b/src/chapter11/supervisors-papers.md
@@ -0,0 +1 @@
+# Academic Supervisors & Papers
\ No newline at end of file
diff --git a/src/chapter3/challenges.md b/src/chapter3/challenges.md
index 39f5e4f..9358534 100644
--- a/src/chapter3/challenges.md
+++ b/src/chapter3/challenges.md
@@ -1,45 +1 @@
-# M3 Challenges
-
-## Challenge 1
-
-Navigate to your scratch directory and, using vim (or your chosen in-terminal editor) create a file called `hello.txt` that contains the text "Hello World". Once you have created the file, use the `cat` command to print the contents of the file to the screen.
-
-## Challenge 2
-
-Write a bash script that prints the contents of the above hello.txt file to the screen and run it locally (on your login node).
-
-## Challenge 3
-
-Submit the above script to the queue by writing another SLURM bash script. Check the status of the job using `squeue`. Once the job has finished, check the output using `cat`. You can find the output file in the directory you submitted the job from.
-
-## Challenge 4
-
-Request an interactive node and attach to it.
Once you have done this, install python 3.7 using conda. - -## Challenge 5 - -Clone and run [this](./dl_on_m3/alexnet_stl10.py) script. You will need to first install the dependencies for it. You don't need to wait for it to finish, just make sure it is working. You will know its working if it starts listing out the loss and accuracy for each epoch. You can stop it by pressing `ctrl + c`. - -Once you have confirmed that it is working, deactivate and delete the conda environment, and then end the interactive session. - -> Hint: I have included the dependencies and their versions (make sure you install the right version) in the `requirements.txt` file. You will need python 3.7 to run this script. - -## Challenge 6 - -Go back to the login node. Now you are going to put it all together. Write a bash script that does the following: - -- (1) requests a compute node -- (2) installs python using conda -- (3) clones and runs the above script - -Let this run fully. Check the output of the script to make sure it ran correctly. Does it match the output of the script you ran in challenge 5? -> Hint: You can check the output of the script at any time by `cat`ing the output file. The script does not need to have finished running for you to do this. - -## Challenge 7 - -Edit your submission script so that you get a gpu node, and run the script using the gpu. -> Hint: Use the m3h partition - -## Challenge 8 - -Now you want to clean up your working directory. First, push your solutions to your challenges repo. Then, delete the challenges directory, as well as the conda environment you created in challenge 6. +# Challenges diff --git a/src/chapter3/chapter3.md b/src/chapter3/chapter3.md deleted file mode 100644 index 16097dd..0000000 --- a/src/chapter3/chapter3.md +++ /dev/null @@ -1,7 +0,0 @@ -# M3 - -[M3](https://docs.massive.org.au/M3/index.html) is part of [MASSIVE](https://https://www.massive.org.au/), which is a High Performance Computing facility for Australian scientists and researchers. Monash University is a partner of MASSIVE, and provides as majority of the funding for it. M3 is made up of multiple different types of servers, with a total of 5673 cores, 63.2TB of RAM, 5.6PB of storage, and 1.7 million CUDA cores. - -M3 utilises the [Slurm](https://slurm.schedmd.com/) workload manager, which is a job scheduler that allows users to submit jobs to the cluster. We will learn a bit more about this later on. - -This book will take you through the basics of connecting to M3, submitting jobs, transferring data to and from the system and some other things. If you want to learn more about M3, you can read the [M3 documentation](https://docs.massive.org.au/M3/index.html). This will give you a more in-depth look at the system, and how to use it. diff --git a/src/chapter3/imgs/hpcid.png b/src/chapter3/imgs/hpcid.png deleted file mode 100644 index be747b6..0000000 Binary files a/src/chapter3/imgs/hpcid.png and /dev/null differ diff --git a/src/chapter3/imgs/join_project.png b/src/chapter3/imgs/join_project.png deleted file mode 100644 index 070d055..0000000 Binary files a/src/chapter3/imgs/join_project.png and /dev/null differ diff --git a/src/chapter3/linux-cmds.md b/src/chapter3/linux-cmds.md deleted file mode 100644 index 7057ccc..0000000 --- a/src/chapter3/linux-cmds.md +++ /dev/null @@ -1,47 +0,0 @@ -# Linux Commands - -Even if you are already familiar with linux, please read through all of these commands, as some are specific to M3. 
- -## Basic Linux Commands - -| Command | Function | -| --- | --- | -| `pwd` | prints current directory | -| `ls` | prints list of files / directories in current directory (add a `-a` to list everything, including hidden files/directories | -| `mkdir` | makes a directory | -| `rm ` | deletes *filename*. add `-r` to delete directory. add `-f` to force deletion (be really careful with that one) | -| `cd ` | move directory. | -| `vim` or `nano` | bring up a text editor | -| `cat ` | prints contents of file to terminal | -| `echo` | prints whatever you put after it | -| `chmod ` | changes permissions of file | -| `cp` | copy a file or directory| -| `mv ` | move or rename file or directory | - -> Note: `.` and `..` are special directories. `.` is the current directory, and `..` is the parent directory. These can be used when using any command that takes a directory as an argument. Similar to these, `~` is the home directory, and `/` is the root directory. For example, if you wanted to copy something from the parent directory to the home directory, you could do `cp ../ ~/`, without having to navigate anywhere. - -## Cluster Specific Commands - -| Command | Function | Flags -| --- | --- | --- | -| `show_job` | prints information about your jobs | -| `show_cluster` | prints information about the cluster | -| `user_info` | prints information about your account | -| `squeue` | prints information about your jobs | `-u ` to print information about a specific user | -| `sbatch ` | submit a job to the cluster | -| `scontrol show job ` | prints information about a specific job | -| `scancel ` | cancel a job | - -## M3 Specific Commands - -| Command | Function | -| --- | --- | -| `module load ` | load a module | -| `module unload ` | unload a module | -| `module avail` | list available modules | -| `module list` | list loaded modules | -| `module spider ` | search for a module | -| `module help ` | get help for a module | -| `module show ` | show details about a module | -| `module purge` | unload all modules | -| `module swap ` | swap two modules | \ No newline at end of file diff --git a/src/chapter3/shared-fs.md b/src/chapter3/shared-fs.md deleted file mode 100644 index f881310..0000000 --- a/src/chapter3/shared-fs.md +++ /dev/null @@ -1,79 +0,0 @@ -# M3's Shared Filesystem - -When we talk about a shared filesystem, what we mean is that the filesystem that M3 uses allows multiple users or systems to access, manage, and share files and directories over a network, concurrently. It enables users to collaborate on projects, share resources, and maintain a unified file structure across different machines and platforms. In addition to this, it enables the many different compute nodes in M3 to access data from a single source which users also have access to, simplifying the process of running jobs on M3. - -Very simply, the way it works is that the home, project and scratch directories are mounted on every node in the cluster, so they are accessible from any node. - -M3 has a unique filesystem consisting of three main important parts (for you). - -## Home Directory - -There is each user's personal directory, which only they have access to. This has a ~10GB allocation, and should store any hidden files, configuration files, or other files that you don't want to share with others. This is backed up nightly. - -## Project Directory - -This is the shared project directory, for all users in MDN to use. This has a ~1TB allocation, and should be used only for project specific files, scripts, and data. 
This is also backed up nightly, so in the case that you accidentally delete something important, it can be recovered. - -## Scratch Directory - -This is also shared with all users in MDN, and has more allocation (~3TB). You may use this for personal projects, but keep your usage low. In general it is used for temporary files, larger datasets, and should be used for any files that you don't need to keep for a long time. This is not backed up, so if you delete something, it's gone forever. - -## General Rules - -- Keep data usage to a minimum. If you have a large amount of data, consider moving it to the scratch directory. If it is not necessary to keep it, consider deleting it. -- Keep your home directory clean. -- In general, it is good practice to make a directory in the shared directory for yourself. Name this your username or name, to make it easily identifiable. This is where you should store your files for small projects or personal use. -- The project directory is not for personal use. Do not store files in the project directory that are not related to MDN. Use the scratch directory instead. - -## Copying files to and from M3 - -Copying files to and from M3 can be done in a few different ways. We will go over the basics of scp, as well as setting up FileZilla. - -A key thing to remember when copying files to and from M3 is that you shouldn't be using the regular ssh url. Instead, they have a dedicated SFTP url to use for file transfers. This is `m3-dtn.massive.org.au`. This is the url you will use when setting up FileZilla, and when using scp. - -### Using scp - -You can copy files to M3 using the `scp` command. This is a command line tool that is built into most linux distributions. If you are using Windows, you will need to install a tool like [Git Bash](https://gitforwindows.org/) to use this command. - -#### Linux / Mac - -To copy a file to M3, use the following command: - -```bash -scp @m3-dtn.massive.org.au: -``` - -For example, if I wanted to copy a file called `test.txt` to my home directory on M3, I would use the following command: - -```bash -scp test.txt jasparm@m3-dtn.massive.org.au:~ -``` - -To copy a file from M3 to your local machine, use the following command: - -```bash -scp @m3-dtn.massive.org.au: -``` - -So, to bring that same file back to my local machine, I would use the following command: - -```bash -scp jasparm@m3-dtn.massive.org.au:~/test.txt . -``` - -#### FileZilla - -FileZilla is a SFTP client that the M3 staff recommend using. You can download it [here](https://filezilla-project.org/download.php?show_all=1). - -Once installed, run the program and click on File -> Site Manager or `Ctrl-S`. This will open the site manager. Click on New Site, and enter the following details: - -- Protocol: SFTP -- Host: `m3-dtn.massive.org.au` -- Logon Type: Ask for password -- User: `` - -Don't change anything else. Leave password blank for now. - -It should look something like this: -![Add M3 as a site](./imgs/filezilla_connect_m3.png) -Click on Connect, and enter your password when prompted. You should now be connected to M3. You can now drag and drop files to and from M3. diff --git a/src/chapter3/software-tooling.md b/src/chapter3/software-tooling.md deleted file mode 100644 index a3ec5eb..0000000 --- a/src/chapter3/software-tooling.md +++ /dev/null @@ -1,113 +0,0 @@ -# Software and Tooling - -Software and development tooling is handled a little differently on M3 than you might be used to. 
In particular, because M3 is a shared file system, you do not have access to `sudo`, and you cannot install software on the system manually. Instead, you will need to use the `module` command to load software and development tools. - -## Module - -The `module` command is used kind of as an alternative to package managers like `apt` or `yum`, except it is managed by the M3 team. It allows you to load software and development tools into your environment, and is used to load software on M3. To see a comprehensive list of commands go [here](./linux-cmds.md#m3-specific-commands). - -In general, however, you will only really need to use `module load` and `module unload`. These commands are used to load and unload software and development tools into your environment. - -For most of the more popular software packages, like gcc, there are multiple different versions available. You will need to specify which version you want to load based on your needs. - -## C - -### GCC - -To load GCC, you can run the following command: - -```bash -module load gcc/10.2.0 -``` - -This will load GCC 10.2.0 into your environment, and you can use it to compile C/C++ programs as described in the [Intro to C](../chapter2/intro-to-c.md) chapter. To unload GCC, you can run the following command: - -```bash -module unload gcc/10.2.0 -``` - -## Python - -Python is a bit of a special case on M3. This is because of how many different versions there are, as well as how many different packages are available. To make things easier, it is recommended that you use miniconda or anaconda to manage your python environments instead of using the system python. - -These instructions are based off the M3 docs, which can be found [here](https://docs.massive.org.au/M3/software/pythonandconda/pythonandconda.html#pythonandconda). - -### Miniconda - -#### Installing Miniconda - -To install Miniconda on M3, there is a dedicated install script that you can use. This will install miniconda into your default scratch space, i.e. `/vf38_scratch//miniconda3`. To install miniconda, run the following command: - -```bash -module load conda-install - -# To install miniconda to the default location -conda-install - -# To install miniconda to a custom location -conda-install your/install/location -``` - -#### Activating Miniconda - -To activate the base conda environment, run the following command: - -```bash -source your/install/location/miniconda/bin/activate -``` - -You will notice that once activated, `(base)` will appear in the prompt before your username. - -To create and activate Python environments within Miniconda, follow these steps: - -```bash -# Create a new environment -# Change env-name to whatever you want to call your environment -conda create --name env-name python= - -# Activate the environment -conda activate env-name -``` - -#### Managing Python packages - -Use the following commands to install and manage Python packages: - -```bash -# Install a package -conda install package-name - -# Update a package -conda update package-name - -# You can also change the version of packages by adding a = and the version number - -# Remove a package -conda remove package-name -``` - -#### Deactivating Miniconda - -To deactivate the conda environment you are in, run `conda deactivate`. To exit conda entirely run `conda deactivate` again. You will know you have fully exited conda when `(base)` is no longer in the prompt. - -### VIM - -VIM is a terminal based text editor. You may have heard about it, or even tried using it before. 
If so, you might recognise the common meme of "how do I exit VIM???". This is because VIM uses a very different key binding system to other text editors, and it can be a little confusing to get used to. However, once you get used to it, it is actually a very powerful and efficient text editor. - -I will attemt to give a brief overview of VIM commands, however you should really check out the [VIM documentation](https://vimhelp.org/) if you want to learn more. - -VIM also has a built in tutorial that you can access by running `:help` while in VIM. - -To use VIM to edit a file, just type `vim ` into the terminal. This will open the file in VIM. If the file does not exist, it will create a new file with that name. - -VIM has three different modes. The first is the command mode, which is the default mode when you open a file. In this mode, you can navigate around the file, and perform other commands. The second is the insert mode, which is used to insert text into the file. The third is the visual mode, which is used to select text. - -To enter the insert mode, press `i`. To exit the insert mode, press `esc`. To enter the visual mode, press `v`. To exit the visual mode, press `esc`. - -In command mode, you move around using `h`, `j`, `k`, `l`. To move along words, press `w` or `b`. To move to the start or end of the line, press `0` or `$`. You can delete a line using `dd`, or delete a word using `dw`. You might be noticing some patterns here. In VIM, commands are made up of single or multiple characters that represent different things. For example, if I wanted to delete a word, I would press `d` to delete, and then `w` to delete a word. If I wanted to delete 3 words, I would press `d3w`. If I just wanted to change a word, I would press `c` instead of `d`. If I wanted to change 3 words, I would press `c3w`. If I wanted to change a line, I would press `cc`. Some other useful command mode commands are `u` to undo, `o` to insert a new line and go into insert mode, and `?` to search for a string. - -To get to insert mode, there are a lots of different ways, but the most common are `i` to insert text before the cursor, `a` to insert text after the cursor, and `o` to insert a new line. The capital versions of these also do things. `I` inserts text at the start of the line, `A` inserts text at the end of the line, and `O` inserts a new line above the current line. To exit insert mode, press `esc`. - -To get to visual mode, press `v`. In visual mode, you can select text using the same commands as in command mode. To delete the selected text, press `d`. To change the selected text, press `c`. To copy the selected text, press `y`. To paste press `p`. To exit visual mode, press `esc`. - -To exit VIM itself, enter command mode, and then press `:q!`. This will exit VIM without saving any changes. To save and exit, press `:wq`. To save without exiting, press `:w`. diff --git a/src/chapter3/strudel.md b/src/chapter3/strudel.md deleted file mode 100644 index 2b34a9f..0000000 --- a/src/chapter3/strudel.md +++ /dev/null @@ -1,31 +0,0 @@ -# Strudel - -STRUDEL is a web application used to connect to M3. There are two main benefits to this over regular ssh. Firstly, you are able to access a desktop session, so you can interact easier with M3, look at graphs, etc.. STRUDEL also enables the use of Jupyter notebooks, which are especially useful for data science and machine learning. - -## Accessing STRUDEL - -First, go to the [STRUDEL](https://beta.desktop.cvl.org.au/) website. 
You should see something like this:
-
-![strudel select cvl](imgs/strudel1.png)
-
-Select the CVL option, and you should be taken to another page, where you choose how to log in.
-
-![strudel login](imgs/strudel2.png)
-
-Select AAF. On the next page, search for and select Monash University.
-
-![AAF Login](imgs/aaf_strudel.png)
-
-You will now be taken to the Monash login page. Once you have logged in, it will show one last page, asking permission to use your details. Click allow, and you will be taken to the STRUDEL home page.
-
-![strudel home page](imgs/strudel_home.png)
-
-## Desktop Session
-
-To start a desktop session using STRUDEL, click on the **Desktop** tab on the side, select your desired options, and click launch. Once the session has started, you will be able to attach to it by clicking on the connect button in the *Pending / Running Desktops* section.
-
-## Jupyter Notebooks
-
-Similar to Desktops, if you want a basic Jupyter notebook, click on the **Jupyter Lab** tab, choose how much compute you want, and click launch.
-
-If you want to have a more customised Jupyter notebook, you can do this by first sshing into M3, and activate conda. Then activate the conda environment `jupyterlab`. Install you desired packages in this environment. Once you have done this, go back to STRUDEL, and launch a **Jupyter Lab - BYO** session.
\ No newline at end of file
diff --git a/src/chapter5/batch-cloud.md b/src/chapter5/batch-cloud.md
new file mode 100644
index 0000000..6919911
--- /dev/null
+++ b/src/chapter5/batch-cloud.md
@@ -0,0 +1,29 @@
+# Batch Processing vs. Cloud Computing
+
+You are all likely familiar with the definition of High Performance Computing. Here is one from IBM:
+
+> High-performance computing (HPC) is technology that uses clusters of powerful processors that work in parallel to process massive multi-dimensional data sets, also known as big data, and solve complex problems at extremely high speeds. HPC solves some of today’s most complex computing problems in real time.
+
+But the term HPC is not really used much outside the scientific research community. A lot of cloud systems involve a similar scale of hardware, parallel & distributed computing, similar computational workloads, data processing capacity and low-latency/high-throughput capability to HPC clusters. *So what exactly is the difference between a cloud system and a HPC cluster?*
+
+At the end of the day this comes down to semantics, but a key difference is that a HPC cluster implies a system primarily used for **batch processing**, whereas a cloud system would involve **interactive processing**.
+
+## Key Differences
+
+The vast majority of computer systems, and nearly 100% of the ones that the average person uses, are cloud-based interactive systems. Due to the nature of use cases specific to researchers, batch processing is a much more suitable choice for them.
+
+__Batch Processing:__
+- Jobs (code scripts) submitted are executed at a later time.
+- The user can't interact (or has only limited interaction).
+- The performance measure is **throughput**.
+- A snapshot of the output is used for debugging.
+
+![batch-image](./imgs/batch-processing.jpeg)
+
+__Interactive Processing:__
+- Jobs submitted are executed immediately.
+- The user can interact.
+- The performance measure is **response time**.
+- Interactive debugging.
+
+![interactive-image](./imgs/interactive-processing.png)
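+
+To make the batch model concrete, below is a minimal sketch of the kind of SLURM job script covered later in this chapter (the resource values and file names are illustrative only, not a recommended configuration):
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=demo        # name shown in the queue
+#SBATCH --ntasks=1             # a single task
+#SBATCH --time=00:05:00        # wall-time limit the scheduler plans around
+#SBATCH --output=demo-%j.out   # output is captured to a file, not a terminal
+
+echo "Hello from $(hostname)"
+```
+
+You hand this to the scheduler (e.g. `sbatch job.sh`) and walk away: it runs whenever resources become free, and you inspect the output file afterwards, which is exactly the "snapshot of output" debugging described above.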
diff --git a/src/chapter5/challenges.md b/src/chapter5/challenges.md
index e3c3cc5..39f5e4f 100644
--- a/src/chapter5/challenges.md
+++ b/src/chapter5/challenges.md
@@ -1,54 +1,45 @@
-# Distributed Computing Challenges
+# M3 Challenges
-## Overview
+## Challenge 1
-- [Distributed Computing Challenges](#distributed-computing-challenges)
-  - [Overview](#overview)
-  - [Pre-Tasks](#pre-tasks)
-  - [Task 1 - Multinode 'Hello, world!'](#task-1---multinode-hello-world)
-  - [Task 2 - Ping Pong](#task-2---ping-pong)
-  - [Task 3 - Multinode Sum](#task-3---multinode-sum)
-  - [Task 4 - Multinode Mergesort](#task-4---multinode-mergesort)
+Navigate to your scratch directory and, using vim (or your chosen in-terminal editor), create a file called `hello.txt` that contains the text "Hello World". Once you have created the file, use the `cat` command to print the contents of the file to the screen.
-## Pre-Tasks
+## Challenge 2
-For each task you will need to load MPICH using Spack from within your SLURM job script. There is a shared installation of Spack and MPICH within `vf38_scratch`. To load Spack and MPICH use the following to commands within you SLURM job script before any other command.
+Write a bash script that prints the contents of the above hello.txt file to the screen and run it locally (on your login node).
-```sh
-. ~/vf38_scratch/spack/share/spack/setup-env.sh
-spack load mpich
-```
+## Challenge 3
-A template SLURM job file is given at the root of the distributed challenges directory. Copy this for each challenge into their respective sub-directories as every challenge will require running a SLURM job. If want to do some more experimenting, create multiple job scripts that use different amounts of nodes and test the execution time.
+Submit the above script to the queue by writing another SLURM bash script. Check the status of the job using `squeue`. Once the job has finished, check the output using `cat`. You can find the output file in the directory you submitted the job from.
-You will also need to generate some input for the sum and mergesort challenges. This can be done by compiling and running the program in `generate.cpp`. Run the following commands to build an generate the inputs for your challenges.
+## Challenge 4
-```sh
-module load gcc/10.2.0
-g++ -std=c++20 -o bin/generate generate.cpp
-bin/generate 1000000000
-```
+Request an interactive node and attach to it. Once you have done this, install Python 3.7 using conda.
-> Note:
->
-> - You do not have to worry about how to read the numbers from the file, this is handled for you already but it is recommended to look at the read function in `read.h` and understand what it is doing.
-> - The expected output of the 'sum' challenge is found in the generated `output.txt` file within the challenges directory.
-> The expected output of the 'mergesort' challenge is found in the generated `sorted.txt` file within the challenges directory however this will contain a lot of values so a check function is provided that compares a resorted version of your input to your sorted output.
-> The sum and mergesort programs you will develop take a number as input. This is the size of the input data that you are performing your programs on. This should be the same number as the one used with the generator program. In the template programs for this challenge they are maked as an pointer to data called `input`.
> Given the above setup and configuration, the input data will contain ~8GB of data or ~8.0e9 bytes so make sure to allocate enough resources both in the programs an in the SLURM job scripts.
+## Challenge 5
-## Task 1 - Multinode 'Hello, world!'
+Clone and run [this](./dl_on_m3/alexnet_stl10.py) script. You will need to first install the dependencies for it. You don't need to wait for it to finish, just make sure it is working. You will know it's working if it starts listing out the loss and accuracy for each epoch. You can stop it by pressing `ctrl + c`.
-Your first task is to say 'Hello, world!' from different nodes on M3. This involves printing the nodes name, rank (ID) and the total number of nodes in the MPI environment.
+Once you have confirmed that it is working, deactivate and delete the conda environment, and then end the interactive session.
-## Task 2 - Ping Pong
+> Hint: I have included the dependencies and their versions (make sure you install the right version) in the `requirements.txt` file. You will need Python 3.7 to run this script.
-For this next task you will play a Ping-Pong game of sorts between two nodes. This will involve passing a count between the two nodes and incrementing the count for each send and receive. This should increment the count to 10 in the end.
+## Challenge 6
-## Task 3 - Multinode Sum
+Go back to the login node. Now you are going to put it all together. Write a bash script that does the following:
-Your next task is to sum the numbers in the generated `input.txt` file together across ten nodes. This will involve summing 1,000,000,000 floats together. The rough expected output is contained in the `output.txt` file. Remember the input array is already given in the template file.
+- (1) requests a compute node
-## Task 4 - Multinode Mergesort
+- (2) installs python using conda
+- (3) clones and runs the above script
-Your final task is to sort the numbers from the input file `unsorted.txt` using a distributed version of mergesort. This will involve ten nodes running their won mergesorts on chunks of the input data individually and then a final mergesort of the intermediate results. Remember the input array is already given in the template file.
+Let this run fully. Check the output of the script to make sure it ran correctly. Does it match the output of the script you ran in challenge 5?
+> Hint: You can check the output of the script at any time by `cat`ing the output file. The script does not need to have finished running for you to do this.
+## Challenge 7
+
+Edit your submission script so that you get a GPU node, and run the script using the GPU.
+> Hint: Use the m3h partition
+
+## Challenge 8
+
+Now you want to clean up your working directory. First, push your solutions to your challenges repo. Then, delete the challenges directory, as well as the conda environment you created in challenge 6.
diff --git a/src/chapter5/chapter5.md b/src/chapter5/chapter5.md
index 4d82439..16097dd 100644
--- a/src/chapter5/chapter5.md
+++ b/src/chapter5/chapter5.md
@@ -1,7 +1,7 @@
-# Distributed Computing
+# M3
-- [Refresher on Parallelism](parallel-refresher.md)
-- [What is Distributed Computing](distributed-computing.md)
-- [OpenMPI](openmpi.md)
-- [Message Passing](message-passing.md)
-- [Challenges](challenges.md)
+[M3](https://docs.massive.org.au/M3/index.html) is part of [MASSIVE](https://www.massive.org.au/), which is a High Performance Computing facility for Australian scientists and researchers. Monash University is a partner of MASSIVE, and provides the majority of the funding for it. M3 is made up of multiple different types of servers, with a total of 5673 cores, 63.2TB of RAM, 5.6PB of storage, and 1.7 million CUDA cores.
+
+M3 utilises the [Slurm](https://slurm.schedmd.com/) workload manager, which is a job scheduler that allows users to submit jobs to the cluster. We will learn a bit more about this later on.
+
+This book will take you through the basics of connecting to M3, submitting jobs, and transferring data to and from the system, among other things. If you want to learn more about M3, you can read the [M3 documentation](https://docs.massive.org.au/M3/index.html). This will give you a more in-depth look at the system, and how to use it.
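+
+As a quick taste of that workflow (covered properly in the Job Scripting section), interacting with Slurm from a login node looks roughly like this; the script name is just an example:
+
+```bash
+sbatch myjob.sh    # submit a job script to the queue
+squeue -u $USER    # check where your jobs sit in the queue
+scancel <jobid>    # cancel a job if you made a mistake
+```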
diff --git a/src/chapter5/distributed-computing.md b/src/chapter5/distributed-computing.md
deleted file mode 100644
index 7aa688e..0000000
--- a/src/chapter5/distributed-computing.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# What is Distributed Computing
-
-**Distributed computing is parallel execution on distributed memory architecture.**
-
-This essentially means it is a form of parallel computing, where the processing power is spread across multiple machines in a network rather than being contained within a single system. In this memory architecture, the problems are broken down into smaller parts, and each machine is assigned to work on a specific part.
-
-![distributed memory architecture](imgs/distributed_memory_architecture.png)
-
-## Distributed Memory Architecture
-
-Lets have a look at the distributed memory architecture in more details.
-
-- Each processor has its own local memory, with its own address space
-- Data is shared via a communications network using a network protocol, e.g Transmission Control Protocol (TCP), Infiniband etc..
-
-![Distributed Memory Architecture](imgs/distributed_memory_architecture_2.png)
-
-## Distributed vs Shared program execution
-
-The following diagram provides another way of looking at the differences between distributed and shared memory architecture and their program execution.
-
-![Distributed vs Shared](imgs/distributed_vs_shared.png)
-
-## Advantages of distributed computing
-
-There are number of benefits to distributed computing in particular it addresses some shortcomings of shared memory architecture.
-
-- No contention for shared memory since each machine has its own memory. Compare this to shared memory architecture where all the cpu's are sharing the same memory.
-- Highly scalable as we can add more machines and are not limited by RAM.
-- Effectively resulting in being able to handle large-scale problems
-
-The benefits above do not come without some drawbacks including network overhead.
-
-## Disadvantages of distributed computing
-
-- Network overload. Network can be overloaded by:
-  - Multiple small messages
-  - Very large data throughput
-  - Multiple all-to-all messages ($N^2$ growth of messages)
-- Synchronization failures
-  - Deadlock (processes waiting for an input from another process that never comes)
-  - Livelock (even worse as it’s harder to detect. All processes shuffling data around but not progressing in the algorithm )
-- More complex software architecture design.
-  - Can also be combined with threading-technologies as openMP/pthreads for optimal performance.
diff --git a/src/chapter3/imgs/aaf.png b/src/chapter5/imgs/aaf.png similarity index 100% rename from src/chapter3/imgs/aaf.png rename to src/chapter5/imgs/aaf.png diff --git a/src/chapter3/imgs/aaf_strudel.png b/src/chapter5/imgs/aaf_strudel.png similarity index 100% rename from src/chapter3/imgs/aaf_strudel.png rename to src/chapter5/imgs/aaf_strudel.png diff --git a/src/chapter3/imgs/auth_strudel.png b/src/chapter5/imgs/auth_strudel.png similarity index 100% rename from src/chapter3/imgs/auth_strudel.png rename to src/chapter5/imgs/auth_strudel.png diff --git a/src/chapter5/imgs/batch-processing.jpeg b/src/chapter5/imgs/batch-processing.jpeg new file mode 100644 index 0000000..b6eb6c9 Binary files /dev/null and b/src/chapter5/imgs/batch-processing.jpeg differ diff --git a/src/chapter3/imgs/filezilla_connect_m3.png b/src/chapter5/imgs/filezilla_connect_m3.png similarity index 100% rename from src/chapter3/imgs/filezilla_connect_m3.png rename to src/chapter5/imgs/filezilla_connect_m3.png diff --git a/src/chapter3/imgs/filezilla_sitemanager.png b/src/chapter5/imgs/filezilla_sitemanager.png similarity index 100% rename from src/chapter3/imgs/filezilla_sitemanager.png rename to src/chapter5/imgs/filezilla_sitemanager.png diff --git a/src/chapter3/imgs/gurobi.png b/src/chapter5/imgs/gurobi.png similarity index 100% rename from src/chapter3/imgs/gurobi.png rename to src/chapter5/imgs/gurobi.png diff --git a/src/chapter3/imgs/gurobi2.png b/src/chapter5/imgs/gurobi2.png similarity index 100% rename from src/chapter3/imgs/gurobi2.png rename to src/chapter5/imgs/gurobi2.png diff --git a/src/chapter5/imgs/interactive-processing.png b/src/chapter5/imgs/interactive-processing.png new file mode 100644 index 0000000..fdfb2e9 Binary files /dev/null and b/src/chapter5/imgs/interactive-processing.png differ diff --git a/src/chapter5/imgs/parallel-distributed.png b/src/chapter5/imgs/parallel-distributed.png new file mode 100644 index 0000000..2c7b8c2 Binary files /dev/null and b/src/chapter5/imgs/parallel-distributed.png differ diff --git a/src/chapter3/imgs/putty_key_not_cached.png b/src/chapter5/imgs/putty_key_not_cached.png similarity index 100% rename from src/chapter3/imgs/putty_key_not_cached.png rename to src/chapter5/imgs/putty_key_not_cached.png diff --git a/src/chapter3/imgs/putty_start.png b/src/chapter5/imgs/putty_start.png similarity index 100% rename from src/chapter3/imgs/putty_start.png rename to src/chapter5/imgs/putty_start.png diff --git a/src/chapter3/imgs/strudel1.png b/src/chapter5/imgs/strudel1.png similarity index 100% rename from src/chapter3/imgs/strudel1.png rename to src/chapter5/imgs/strudel1.png diff --git a/src/chapter3/imgs/strudel2.png b/src/chapter5/imgs/strudel2.png similarity index 100% rename from src/chapter3/imgs/strudel2.png rename to src/chapter5/imgs/strudel2.png diff --git a/src/chapter3/imgs/strudel_home.png b/src/chapter5/imgs/strudel_home.png similarity index 100% rename from src/chapter3/imgs/strudel_home.png rename to src/chapter5/imgs/strudel_home.png diff --git a/src/chapter3/bash.md b/src/chapter5/job-scripting.md similarity index 98% rename from src/chapter3/bash.md rename to src/chapter5/job-scripting.md index aada975..ad84c9e 100644 --- a/src/chapter3/bash.md +++ b/src/chapter5/job-scripting.md @@ -1,4 +1,6 @@ -# Bash Scripts +# Job Scripting + +## Bash Scripts Bash is both a command line interface and a scripting language. Linux commands are generally using Bash. Bash scripts are a series of commands that are executed in order. 
Bash scripts are useful for automating tasks that you do often, or for running a series of commands that you don't want to type out every time. In our case, Bash scripts are used for running jobs on M3.
diff --git a/src/chapter3/login.md b/src/chapter5/login.md
similarity index 65%
rename from src/chapter3/login.md
rename to src/chapter5/login.md
index c1b91e5..0d92da9 100644
--- a/src/chapter3/login.md
+++ b/src/chapter5/login.md
@@ -78,3 +78,35 @@ a ticket for your issue.
```

Once you are done and want to logout, just type `exit`. This will close the connection.
+
+# Strudel
+
+STRUDEL is a web application used to connect to M3. There are two main benefits to this over regular ssh. Firstly, you are able to access a desktop session, so you can interact with M3 more easily, look at graphs, etc. STRUDEL also enables the use of Jupyter notebooks, which are especially useful for data science and machine learning.
+
+## Accessing STRUDEL
+
+First, go to the [STRUDEL](https://beta.desktop.cvl.org.au/) website. You should see something like this:
+
+![strudel select cvl](imgs/strudel1.png)
+
+Select the CVL option, and you should be taken to another page, where you choose how to log in.
+
+![strudel login](imgs/strudel2.png)
+
+Select AAF. On the next page, search for and select Monash University.
+
+![AAF Login](imgs/aaf_strudel.png)
+
+You will now be taken to the Monash login page. Once you have logged in, it will show one last page, asking permission to use your details. Click allow, and you will be taken to the STRUDEL home page.
+
+![strudel home page](imgs/strudel_home.png)
+
+## Desktop Session
+
+To start a desktop session using STRUDEL, click on the **Desktop** tab on the side, select your desired options, and click launch. Once the session has started, you will be able to attach to it by clicking on the connect button in the *Pending / Running Desktops* section.
+
+## Jupyter Notebooks
+
+Similar to Desktops, if you want a basic Jupyter notebook, click on the **Jupyter Lab** tab, choose how much compute you want, and click launch.
+
+If you want to have a more customised Jupyter notebook, you can do this by first sshing into M3 and activating conda. Then activate the conda environment `jupyterlab`. Install your desired packages in this environment. Once you have done this, go back to STRUDEL, and launch a **Jupyter Lab - BYO** session.
diff --git a/src/chapter5/m3-interface.md b/src/chapter5/m3-interface.md
new file mode 100644
index 0000000..8cfe317
--- /dev/null
+++ b/src/chapter5/m3-interface.md
@@ -0,0 +1,243 @@
+# M3 Interface & Usage
+
+## Linux Commands
+
+Even if you are already familiar with linux, please read through all of these commands, as some are specific to M3.
+
+### Basic Linux Commands
+
+| Command | Function |
+| --- | --- |
+| `pwd` | prints current directory |
+| `ls` | prints list of files / directories in current directory (add `-a` to list everything, including hidden files/directories) |
+| `mkdir` | makes a directory |
+| `rm <filename>` | deletes *filename*. add `-r` to delete a directory. add `-f` to force deletion (be really careful with that one) |
+| `cd <directory>` | move directory. |
+| `vim` or `nano` | bring up a text editor |
+| `cat <filename>` | prints contents of file to terminal |
+| `echo` | prints whatever you put after it |
+| `chmod <permissions> <file>` | changes permissions of file |
+| `cp` | copy a file or directory |
+| `mv <source> <destination>` | move or rename file or directory |
+
+> Note: `.` and `..` are special directories. `.` is the current directory, and `..` is the parent directory. These can be used with any command that takes a directory as an argument. Similar to these, `~` is the home directory, and `/` is the root directory. For example, if you wanted to copy something from the parent directory to the home directory, you could do `cp ../<filename> ~/`, without having to navigate anywhere.
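+
+To make the table above concrete, here is a short illustrative session (all file and directory names are made up):
+
+```bash
+pwd                      # where am I?
+mkdir demo && cd demo    # make a directory and move into it
+echo "hello" > note.txt  # write a file
+cat note.txt             # prints: hello
+cp note.txt copy.txt     # duplicate the file
+mv copy.txt ../          # move the copy to the parent directory
+cd ..                    # go back up
+```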
+### Cluster Specific Commands
+
+| Command | Function | Flags |
+| --- | --- | --- |
+| `show_job` | prints information about your jobs | |
+| `show_cluster` | prints information about the cluster | |
+| `user_info` | prints information about your account | |
+| `squeue` | prints information about your jobs | `-u <username>` to print information about a specific user |
+| `sbatch <script>` | submit a job to the cluster | |
+| `scontrol show job <jobid>` | prints information about a specific job | |
+| `scancel <jobid>` | cancel a job | |
+
+### M3 Specific Commands
+
+| Command | Function |
+| --- | --- |
+| `module load <module>` | load a module |
+| `module unload <module>` | unload a module |
+| `module avail` | list available modules |
+| `module list` | list loaded modules |
+| `module spider <module>` | search for a module |
+| `module help <module>` | get help for a module |
+| `module show <module>` | show details about a module |
+| `module purge` | unload all modules |
+| `module swap <module1> <module2>` | swap two modules |
+
+## M3's Shared Filesystem
+
+When we talk about a shared filesystem, what we mean is that the filesystem that M3 uses allows multiple users or systems to access, manage, and share files and directories over a network, concurrently. It enables users to collaborate on projects, share resources, and maintain a unified file structure across different machines and platforms. In addition to this, it enables the many different compute nodes in M3 to access data from a single source which users also have access to, simplifying the process of running jobs on M3.
+
+Very simply, the way it works is that the home, project and scratch directories are mounted on every node in the cluster, so they are accessible from any node.
+
+M3's filesystem consists of three main parts that are important to you.
+
+### Home Directory
+
+This is each user's personal directory, which only they have access to. This has a ~10GB allocation, and should store any hidden files, configuration files, or other files that you don't want to share with others. This is backed up nightly.
+
+### Project Directory
+
+This is the shared project directory, for all users in MDN to use. This has a ~1TB allocation, and should be used only for project-specific files, scripts, and data. This is also backed up nightly, so in the case that you accidentally delete something important, it can be recovered.
+
+### Scratch Directory
+
+This is also shared with all users in MDN, and has a larger allocation (~3TB). You may use this for personal projects, but keep your usage low. In general it is used for temporary files and larger datasets, and should be used for any files that you don't need to keep for a long time. This is not backed up, so if you delete something, it's gone forever.
+
+### General Rules
+
+- Keep data usage to a minimum. If you have a large amount of data, consider moving it to the scratch directory. If it is not necessary to keep it, consider deleting it.
+- Keep your home directory clean.
+- In general, it is good practice to make a directory in the shared directory for yourself. Name this your username or name, to make it easily identifiable. This is where you should store your files for small projects or personal use. See the example after this list.
+- The project directory is not for personal use. Do not store files in the project directory that are not related to MDN. Use the scratch directory instead.
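+
+For example, a first-time setup of your own area in scratch might look like this (a sketch only; substitute your own username, and note that the `vf38_scratch` path follows the project paths used elsewhere in this chapter):
+
+```bash
+mkdir ~/vf38_scratch/<username>    # your personal folder in shared scratch
+du -sh ~/vf38_scratch/<username>   # check how much space it is using
+```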
+### Copying files to and from M3
+
+Copying files to and from M3 can be done in a few different ways. We will go over the basics of scp, as well as setting up FileZilla.
+
+A key thing to remember when copying files to and from M3 is that you shouldn't be using the regular ssh url. Instead, they have a dedicated SFTP url to use for file transfers. This is `m3-dtn.massive.org.au`. This is the url you will use when setting up FileZilla, and when using scp.
+
+#### Using scp
+
+You can copy files to M3 using the `scp` command. This is a command line tool that is built into most linux distributions. If you are using Windows, you will need to install a tool like [Git Bash](https://gitforwindows.org/) to use this command.
+
+##### Linux / Mac
+
+To copy a file to M3, use the following command:
+
+```bash
+scp <filename> <username>@m3-dtn.massive.org.au:<destination>
+```
+
+For example, if I wanted to copy a file called `test.txt` to my home directory on M3, I would use the following command:
+
+```bash
+scp test.txt jasparm@m3-dtn.massive.org.au:~
+```
+
+To copy a file from M3 to your local machine, use the following command:
+
+```bash
+scp <username>@m3-dtn.massive.org.au:<filename> <destination>
+```
+
+So, to bring that same file back to my local machine, I would use the following command:
+
+```bash
+scp jasparm@m3-dtn.massive.org.au:~/test.txt .
+```
+
+#### FileZilla
+
+FileZilla is a SFTP client that the M3 staff recommend using. You can download it [here](https://filezilla-project.org/download.php?show_all=1).
+
+Once installed, run the program and click on File -> Site Manager or `Ctrl-S`. This will open the site manager. Click on New Site, and enter the following details:
+
+- Protocol: SFTP
+- Host: `m3-dtn.massive.org.au`
+- Logon Type: Ask for password
+- User: `<username>`
+
+Don't change anything else. Leave password blank for now.
+
+It should look something like this:
+![Add M3 as a site](./imgs/filezilla_connect_m3.png)
+Click on Connect, and enter your password when prompted. You should now be connected to M3. You can now drag and drop files to and from M3.
+
+## Software and Tooling
+
+Software and development tooling is handled a little differently on M3 than you might be used to. In particular, because M3 is a shared system, you do not have access to `sudo`, and you cannot install software on the system manually. Instead, you will need to use the `module` command to load software and development tools.
+
+### Module
+
+The `module` command is used as a kind of alternative to package managers like `apt` or `yum`, except it is managed by the M3 team. It allows you to load software and development tools into your environment, and is used to load software on M3. For a comprehensive list of commands, see the [M3 Specific Commands](#m3-specific-commands) table above.
+
+In general, however, you will only really need to use `module load` and `module unload`. These commands are used to load and unload software and development tools into your environment.
+
+For most of the more popular software packages, like gcc, there are multiple different versions available. You will need to specify which version you want to load based on your needs.
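+
+For example, to see which versions of a package are available before loading one (both commands appear in the table above; the exact output format may vary):
+
+```bash
+module avail gcc    # list the gcc versions currently visible
+module spider gcc   # search all module trees for gcc
+```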
+## C
+
+### GCC
+
+To load GCC, you can run the following command:
+
+```bash
+module load gcc/10.2.0
+```
+
+This will load GCC 10.2.0 into your environment, and you can use it to compile C/C++ programs as described in the [Intro to C](../chapter2/intro-to-c.md) chapter. To unload GCC, you can run the following command:
+
+```bash
+module unload gcc/10.2.0
+```
+
+## Python
+
+Python is a bit of a special case on M3. This is because of how many different versions there are, as well as how many different packages are available. To make things easier, it is recommended that you use miniconda or anaconda to manage your python environments instead of using the system python.
+
+These instructions are based on the M3 docs, which can be found [here](https://docs.massive.org.au/M3/software/pythonandconda/pythonandconda.html#pythonandconda).
+
+### Miniconda
+
+#### Installing Miniconda
+
+To install Miniconda on M3, there is a dedicated install script that you can use. This will install miniconda into your default scratch space, i.e. `/vf38_scratch/<username>/miniconda3`. To install miniconda, run the following command:
+
+```bash
+module load conda-install
+
+# To install miniconda to the default location
+conda-install
+
+# To install miniconda to a custom location
+conda-install your/install/location
+```
+
+#### Activating Miniconda
+
+To activate the base conda environment, run the following command:
+
+```bash
+source your/install/location/miniconda/bin/activate
+```
+
+You will notice that once activated, `(base)` will appear in the prompt before your username.
+
+To create and activate Python environments within Miniconda, follow these steps:
+
+```bash
+# Create a new environment
+# Change env-name to whatever you want to call your environment
+conda create --name env-name python=<version>
+
+# Activate the environment
+conda activate env-name
+```
+
+#### Managing Python packages
+
+Use the following commands to install and manage Python packages:
+
+```bash
+# Install a package
+conda install package-name
+
+# Update a package
+conda update package-name
+
+# You can also change the version of packages by adding a = and the version number
+
+# Remove a package
+conda remove package-name
+```
+
+#### Deactivating Miniconda
+
+To deactivate the conda environment you are in, run `conda deactivate`. To exit conda entirely, run `conda deactivate` again. You will know you have fully exited conda when `(base)` is no longer in the prompt.
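+
+Putting the above together, a typical first session might look like this (a sketch only: the environment name and package are examples, and the activation path assumes the default install location mentioned above):
+
+```bash
+source /vf38_scratch/<username>/miniconda3/bin/activate   # activate base conda
+conda create --name demo python=3.10                      # create an environment
+conda activate demo
+conda install numpy                                       # add a package
+python -c "import numpy; print(numpy.__version__)"        # quick sanity check
+conda deactivate                                          # leave the environment
+```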
The second is insert mode, which is used to insert text into the file. The third is visual mode, which is used to select text. + +To enter insert mode, press `i`. To exit insert mode, press `esc`. To enter visual mode, press `v`. To exit visual mode, press `esc`. + +In command mode, you move around using `h`, `j`, `k`, `l`. To move along words, press `w` or `b`. To move to the start or end of the line, press `0` or `$`. You can delete a line using `dd`, or delete a word using `dw`. You might be noticing some patterns here. In VIM, commands are made up of single or multiple characters that represent different things. For example, if I wanted to delete a word, I would press `d` for delete, and then `w` for word. If I wanted to delete 3 words, I would press `d3w`. If I just wanted to change a word, I would press `c` instead of `d`. If I wanted to change 3 words, I would press `c3w`. If I wanted to change a whole line, I would press `cc`. Some other useful command-mode commands are `u` to undo, `o` to insert a new line below the cursor and go into insert mode, and `/` to search forward for a string (`?` searches backward). + +To get to insert mode, there are lots of different ways, but the most common are `i` to insert text before the cursor, `a` to insert text after the cursor, and `o` to insert a new line below the current one. The capital versions of these do related things: `I` inserts text at the start of the line, `A` inserts text at the end of the line, and `O` inserts a new line above the current line. To exit insert mode, press `esc`. + +To get to visual mode, press `v`. In visual mode, you can select text using the same movement commands as in command mode. To delete the selected text, press `d`. To change the selected text, press `c`. To copy the selected text, press `y`. To paste, press `p`. To exit visual mode, press `esc`. + +To exit VIM itself, enter command mode and type `:q!`. This will exit VIM without saving any changes. To save and exit, type `:wq`. To save without exiting, type `:w`. A short worked session tying these commands together is sketched below. diff --git a/src/chapter5/message-passing.md b/src/chapter5/message-passing.md deleted file mode 100644 index f6d8742..0000000 --- a/src/chapter5/message-passing.md +++ /dev/null @@ -1,11 +0,0 @@ -# Message Passing - -As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network. - -The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments and are implemented by different groups with the main goals being high performance, scalability, and portability. - -OpenMPI is one implementation of the MPI standard. It consists of a set of headers library functions that you call from your program. i.e. C, C++, Fortran etc. - -For C, you will need a header file for all the functions (mpi.h) and link in the relevant library functions. This is all handled by the mpicc program (or your compiler if you wanted to specify all the paths). - -In the next chapter we will look at how to implement message passing using OpenMPI.
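+ +Returning to the VIM overview above: to tie those commands together, here is a short worked session editing a hypothetical file `notes.txt` (everything after the first line is typed inside VIM, shown here as comments): + +```bash +# Open (or create) notes.txt in VIM +vim notes.txt + +# Inside VIM: +# i       enter insert mode, then type some text +# <Esc>   return to command mode +# dd      delete the current line +# u       undo that deletion +# :w      save the file +# :q      quit (or :wq to save and quit in one step) +```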
diff --git a/src/chapter5/openmpi.md b/src/chapter5/openmpi.md deleted file mode 100644 index 6e89875..0000000 --- a/src/chapter5/openmpi.md +++ /dev/null @@ -1,254 +0,0 @@ -# OpenMPI - -## Primary MPI Routines - -``` C -int MPI_Init(int * argc, char ** argv); -// initializes the MPI environment. -// Argc argv are the parameters come -// from main(argc,argv). The return value is an -// error code. 0 is OK. Non-zero is an error code -``` - -``` C -int MPI_Comm_size(MPI_Comm comm, int \* size); -// this functions gets the number of MPI processes -// i.e. the number you enter when you go mpirun -np \ myprog.exe -// *size is C syntax indicating that size will be modified to contain -// the value after the function returns. The return value is only used -// for error detection. printf(“MPI size is %d\n”,size); -int MPI_Comm_rank(MPI_Comm comm, int \* rank); -// this returns the rank of this particular process -// rank contains the value for that process- the function return value is an error code -``` - -![MPI routines](imgs/mpi_routines.png) - -### Point-to-Point communication - -These are blocking functions - they wait until the message is sent or received. Note that the CPU is actively polling the network interface when waiting for a message. This is opposite in behaviour to other C functions, i.e. c= getChar() (which causes a context switch and then a sleep in the OS). This is done for speed reasons. - -```C -int MPI_Send(void * buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm); -``` - -Sends a message from the calling process to another process - -INPUT PARAMETERS - -- ```buf``` - - Initial address of send buffer (choice). -- ```count``` - - Number of elements sent (non negative integer). -- ```type``` - - DataType of each send buffer element (handle). -- ```dest``` - - Rank of destination (integer). -- ```tag``` - - Message tag (integer). -- ```comm``` - - Communicator (handle). - -OUTPUT PARAMETER - -- ```IERROR``` - - Fortran only: Error status (integer). - -```c -int MPI_Recv(void * buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status * status); -``` - -Receives a message from another process - -INPUT PARAMETERS - -- ```count``` - - Maximum number of elements to receive (integer). -- ```type``` - - DataType of each receive buffer entry (handle). -- ```source``` - - Rank of source (integer). -- ```tag``` - - Message tag (integer). -- ```comm``` - - Communicator (handle). - -OUTPUT PARAMETERS - -- ```buf``` - - Initial address of receive buffer (choice). -- ```status``` - - Status object (status). -- ```IERROR``` - - Fortran only: Error status (integer). - -### Primary MPI Routines closing - -In a header file you will find - -``` C -int MPI_Finalize(void); -``` - -To call in your C or C++ program - -``` C -#include -MPI_Finalize(); -``` - -## General overview MPI program - -``` C -... -int MPI_Init(int argc, char ** argv); ---------------------------Parallel algorithm starts---------------------- -int MPI_Comm_size(MPI_Comm comm, int * size); -int MPI_Comm_rank(MPI_Comm comm, int * rank); -... -int MPI_Send(void * buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm); -int MPI_Recv(void * buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status * status); -… ---------------------------Parallel algorithm ends----------------------- -int MPI_Finalize(void); -... 
- -``` - -Use man pages to find out more about each routine - -When sending a Process it packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office) - -## Elementary MPI Data types - -MPI_Send and MPI_Recv utilize MPI Datatypes as a means to specify the structure of a message at a higher level. The data types defined in the table below are simple in nature and for custom data structures you will have to define the structure. - -| MPI datatype | C equivalent | -|-------------------------|------------------------| -| MPI_SHORT | short int | -| MPI_INT | int | -| MPI_LONG | long int | -| MPI_LONG_LONG | long long int | -| MPI_UNSIGNED_CHAR | unsigned char | -| MPI_UNSIGNED_SHORT | unsigned short int | -| MPI_UNSIGNED | unsigned int | -| MPI_UNSIGNED_LONG | unsigned long int | -| MPI_UNSIGNED_LONG_LONG | unsigned long long int | -| MPI_FLOAT | float | -| MPI_DOUBLE | double | -| MPI_LONG_DOUBLE | long double | -| MPI_BYTE | char | - -## Example of a simple program - -``` C - -/* - MPI Program, send ranks -*/ - -#include -#include - -#define MASTER 0 - -int main(int argc, char *argv[]) -{ - - int my_rank; - /* Also known as world size */ - int num_processes; - - /* Initialize the infrastructure necessary for communication */ - MPI_Init(&argc, &argv); - - /* Identify this process */ - MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); - - /* Find out how many total processes are active */ - MPI_Comm_size(MPI_COMM_WORLD, &num_processes); - - printf("Process %d: There is a total of %d \n", my_rank, num_processes); - - if (my_rank == MASTER) - { - int dest = 1; - int tag = 0; - int count = 1; - - MPI_Send(&my_rank, count, MPI_INT, dest, tag, MPI_COMM_WORLD); - - printf("Process %d: Sent my_rank to process %d \n", my_rank, dest); - } - else - { - int tag = 0; - int count = 1; - int buffer; - MPI_Recv(&buffer, count, MPI_INT, MASTER, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - printf("Process %d: Received %d from process %d \n", my_rank, buffer, MASTER); - } - - /* Tear down the communication infrastructure */ - MPI_Finalize(); - return 0; -} -``` - -## Compilation and Linking - -- Make sure you have the following packages installed and that they are in your $PATH: - - gcc - - OPENMPI or MPICH -- To compile and Link: - - ```mpicc -Wall -o ``` - - -Wall This enables all the warnings about questionable code. - - -o sets the output executable name. If you omit it, it defaults to a.out -- To run: - - ```mpirun -np ``` -- Behind the scenes: - - mpicc is just a wrapper around a C compiler. To see what it does type: - - ```mpicc –showme``` - -### sbatch to send job to compute nodes using SLURM - -``` bash -#!/bin/bash -#SBATCH --job-name=Vaccinator -#SBATCH --ntasks=4 -#SBATCH --ntasks-per-node=4 -#SBATCH –time=00:10:00 - -~/vf38/HPC_Training/spack/share/spack/setup-env.sh -spack load mpich - -mpirun -np 4 ./my-awesome-program -``` - - - -- ntasks Controls the number of tasks to be created for the job -- ntasks-per-node Controls the maximum number of tasks per allocated node -- cpus-per-task Controls the number of CPUs allocated per task - -## Measuring performance - -- ```htop``` to check the CPU usage. 
You need to run this command while the process is running -- If you are using SLURM, you will need to use ```squeue``` or ```scontrol``` to find the compute node it is running on and then ssh into it. -- ```time``` is a shell command to check the overall wall time , i.e. - - ```time mpirun -np 4 myProg.exe``` - - You can also use a MPI profiler - -There are some useful commands to check the parallelism of the code. -The command top or htop looks into a process. As you can see from the image below, it shows the CPU usages - -![htop](imgs/htop.png) - -- The command ```time``` checks the overall performance of the code - - By running this command, you get real time, user time and system time. - - Real is wall clock time - time from start to finish of the call. This includes the time of overhead -- User is the amount of CPU time spent outside the kernel within the process -- Sys is the amount of CPU time spent in the kernel within the process. - - User time +Sys time will tell you how much actual CPU time your process used. - -![time](imgs/time.png) diff --git a/src/chapter5/parallel-distributed.md b/src/chapter5/parallel-distributed.md new file mode 100644 index 0000000..3f66be4 --- /dev/null +++ b/src/chapter5/parallel-distributed.md @@ -0,0 +1,32 @@ +# Parallel & Distributed Computing + +Nearly all modern computer systems utilise parallel computing to speed up the execution of algorithms. To see how this works in practice, look at the diagram below. + +![parallel vs. distributed](imgs/parallel-distributed.png) + +As you can see, in a scenario where a program (job) takes 3 seconds to run and 3 independent jobs have to be executed by a system, running them serially on a single computer takes a total of 9 seconds, but running them simultaneously across 3 computers takes only 3 seconds, achieving a 3x speedup through parallel computing. + +This is the fundamental principle that High Performance Computing is based on. + +## What is Distributed Computing? + +**Distributed computing is parallel execution on a distributed memory architecture.** + +This essentially means it is a form of parallel computing, where the processing power is spread across multiple machines in a network rather than being contained within a single system. In this memory architecture, problems are broken down into smaller parts, and each machine is assigned a specific part to work on. + +![distributed memory architecture](imgs/distributed_memory_architecture.png) + +### Distributed Memory Architecture + +Let's have a look at the distributed memory architecture in more detail. + +- Each processor has its own local memory, with its own address space +- Data is shared via a communications network using a network protocol, e.g. Transmission Control Protocol (TCP), InfiniBand, etc. + +![Distributed Memory Architecture](imgs/distributed_memory_architecture_2.png) + +Each machine or **node** is connected to the HPC cluster via a network, typically one with high bandwidth and low latency. The fact that these are largely independent computers connected over a network, rather than a set of CPU/GPU cores in the same computer (as in parallel computing), presents a set of disadvantages. + +__Advantages of parallel & local computing:__ +- No **data transfer latency** or I/O throughput bottleneck. The system bus inside a machine has far higher bandwidth and lower latency than even the fastest computer networks.
+- \ No newline at end of file diff --git a/src/chapter5/parallel-refresher.md b/src/chapter5/parallel-refresher.md deleted file mode 100644 index c4dbc3c..0000000 --- a/src/chapter5/parallel-refresher.md +++ /dev/null @@ -1,31 +0,0 @@ -# Refresher on Parallelism - -## Task Parallelism - -We saw in the last chapter parallel computing can be used to solve problems by executing code in parallel as opposed to in series. - -![Task parallelism](imgs/task_parallelism.jpg) - -## Data Parallelism - -Note that not all programs can be broken down into independent tasks and we might instead data parallelism like the following. - -![Data parallelism](imgs/data_parallelism.jpg) - -## Parallel computing example - -Think back to the example below which was provided in the last chapter. We will look at the cost of memory transactions soon. - -![Parallel computing example](imgs/parallel_computing_arrays_eg.png) - -## Parallel Scalability - -The speed up achieved from parallelism is dictated by your algorithm. Notably the serial bits of your algorithm can not be sped up by increasing the number of processors. The diagram below looks at the benefits we can achieve from writing parallel code as the number of processes increases. - -![Parallel scalability](imgs/parallel_scalability.jpg) - -## Memory Architectures - -Lastly, the different memory architectures we looked at in the last section included shared memory, distributed memory and hybrid architectures. We have looked at shared memory in detail and now we will dive into distributed memory architecture. - -![Memory architectures](imgs/memory_architectures.jpg) diff --git a/src/chapter3/slurm.md b/src/chapter5/slurm.md similarity index 100% rename from src/chapter3/slurm.md rename to src/chapter5/slurm.md
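+ +As a closing aside on the speedup example in `parallel-distributed.md` above: you can see the serial-versus-parallel difference on any machine with a quick shell experiment. Here, three 3-second `sleep` commands stand in for the three independent jobs (an illustrative sketch only; real jobs contend for CPU, memory and I/O, so real speedups are rarely this clean): + +```bash +# Serial: run the three "jobs" one after another (~9 seconds) +time ( sleep 3; sleep 3; sleep 3 ) + +# Parallel: launch all three at once and wait for them (~3 seconds) +time ( sleep 3 & sleep 3 & sleep 3 & wait ) +```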