diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index ead2775..6d609d1 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -42,7 +42,8 @@
 - [Challenges](./chapter4/challenges.md)
 - [Distributed Computing](./chapter5/chapter5.md)
-  - [What is Distributed Computing](./chapter5/distrubuted-computing.md)
+  - [Refresher on Parallelism](./chapter5/parallel-refresher.md)
+  - [What is Distributed Computing](./chapter5/distributed-computing.md)
 - [Message Passing](./chapter5/message-passing.md)
 - [OpenMPI](./chapter5/openmpi.md)
 - [Challenges](./chapter5/challenges.md)
diff --git a/src/chapter5/challenges.md b/src/chapter5/challenges.md
index 9358534..8a6fc32 100644
--- a/src/chapter5/challenges.md
+++ b/src/chapter5/challenges.md
@@ -1 +1,52 @@
# Challenges

🚧 Under Construction! 🏗️

## Install MPI for the tasks

- ```source ~/vf38/HPC_Training/spack/share/spack/setup-env.sh``` # configure Spack
- ```spack load mpich``` # load MPICH through Spack
- ```module load gcc``` # load gcc
- ```cp -r ~/vf38/HPC_Training/MPI_examples ~``` # copy the tasks to your home directory

## Task 1: Hello World

1. Go to the ```hello``` folder
2. Modify the files
3. Compile mpi_hello_world.c using the makefile
4. Execute the file

## Task 2: Ping Pong

1. Go to the ```ping_pong``` folder
2. Modify the files
3. Compile the files
4. Run (the world size must be two for this file)

The output should be similar to the following. It may differ slightly due to process scheduling.

![Ping pong](../imgs/ping_pong.png)

## Task 3: Monte Carlo

- Run ```./calcPiSeq 100000000``` # make sure you use gcc to compile the serial code
- Modify calcPiMPI.c
- Run ```calcPiMPI 100000000``` with MPI and observe the difference. You can change the number of processes; however, please be mindful that you are on the login node!
Hint: #

## Task 4: Parallel computing task on compute nodes

- Submit your parallelised Monte Carlo task to the compute nodes with 8 tasks

## Task 5: Trapezoidal Rule Integration

- Run ```./seq_trap 10000000000```
- Modify calcPiMPI.c
- Run ```seq_MPI 100000000000``` with MPI and observe the difference. You can change the number of processes; however, please be mindful that you are on the login node!

## Task 6: Bonus: Merge Sort

- This task is quite challenging to parallelise yourself
- Please refer to the answer and check that you can understand it

Additional resources:
diff --git a/src/chapter5/chapter5.md b/src/chapter5/chapter5.md
index 4882183..4d82439 100644
--- a/src/chapter5/chapter5.md
+++ b/src/chapter5/chapter5.md
@@ -1 +1,7 @@
# Distributed Computing

- [Refresher on Parallelism](parallel-refresher.md)
- [What is Distributed Computing](distributed-computing.md)
- [Message Passing](message-passing.md)
- [OpenMPI](openmpi.md)
- [Challenges](challenges.md)
diff --git a/src/chapter5/distributed-computing.md b/src/chapter5/distributed-computing.md
new file mode 100644
index 0000000..84ba238
--- /dev/null
+++ b/src/chapter5/distributed-computing.md
@@ -0,0 +1,44 @@
# What is Distributed Computing

**Distributed computing is parallel execution on a distributed memory architecture.**

This essentially means it is a form of parallel computing where the processing power is spread across multiple machines in a network rather than being contained within a single system. In this memory architecture, the problems are broken down into smaller parts, and each machine is assigned to work on a specific part.

![distributed memory architecture](../imgs/distributed_memory_architecture.png)

## Distributed Memory Architecture

Let's have a look at the distributed memory architecture in more detail.

- Each processor has its own local memory, with its own address space
- Data is shared via a communications network using a network protocol, e.g. Transmission Control Protocol (TCP), InfiniBand, etc.

![Distributed Memory Architecture](../imgs/distributed_memory_architecture_2.png)

## Distributed vs Shared program execution

The following diagram provides another way of looking at the differences between distributed and shared memory architectures and their program execution.

![Distributed vs Shared](../imgs/distributed_vs_shared.png)

## Advantages of distributed computing

There are a number of benefits to distributed computing; in particular, it addresses some shortcomings of shared memory architecture.

- No contention for shared memory, since each machine has its own memory. Compare this to shared memory architecture, where all the CPUs share the same memory.
- Highly scalable, as we can add more machines and are not limited by RAM.
- Effectively, this means we can handle large-scale problems.

The benefits above do not come without drawbacks, including network overhead.

## Disadvantages of distributed computing

- Network overload. The network can be overloaded by:
  - Multiple small messages
  - Very large data throughput
  - Multiple all-to-all messages ($N^2$ growth of messages)
- Synchronization failures
  - Deadlock (processes waiting for an input from another process that never comes)
  - Livelock (even worse, as it is harder to detect: all processes keep shuffling data around but the algorithm never progresses)
- More complex software architecture design.
  - Can also be combined with threading technologies such as OpenMP/pthreads for optimal performance.
diff --git a/src/chapter5/distrubuted-computing.md b/src/chapter5/distrubuted-computing.md
deleted file mode 100644
index 58598e7..0000000
--- a/src/chapter5/distrubuted-computing.md
+++ /dev/null
@@ -1 +0,0 @@
-# What is Distributed Computing
diff --git a/src/chapter5/message-passing.md b/src/chapter5/message-passing.md
index f117715..f6d8742 100644
--- a/src/chapter5/message-passing.md
+++ b/src/chapter5/message-passing.md
@@ -1 +1,11 @@
# Message Passing

As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes: each process can communicate with one or more other processes by sending messages over a network.

MPI (Message Passing Interface) is a communication protocol standard that defines message passing between processes in distributed environments. It is implemented by different groups, with the main goals being high performance, scalability, and portability.

OpenMPI is one implementation of the MPI standard. It consists of a set of headers and library functions that you call from your program, e.g. from C, C++ or Fortran.

For C, you will need the header file that declares all the functions (mpi.h), and you will need to link in the relevant library functions. This is all handled by the mpicc program (or by your compiler directly, if you want to specify all the paths yourself).

In the next section we will look at how to implement message passing using OpenMPI.
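
As a quick preview, the sketch below shows the overall shape of an MPI program in C and, in the comments, how it would typically be compiled with ```mpicc``` and launched with ```mpirun```. This is a minimal illustrative sketch, not one of the course examples: the file name ```hello_mpi.c``` and the process count of 4 are placeholders. The individual routines used here are explained in the next section.

``` C
/*
  hello_mpi.c - a minimal MPI program (illustrative sketch).

  Compile:  mpicc -Wall -o hello_mpi hello_mpi.c
  Run:      mpirun -np 4 ./hello_mpi
*/
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* set up the MPI environment      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* which process am I?             */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* how many processes are running? */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                       /* tear down the MPI environment   */
    return 0;
}
```

Each process prints its own rank, so running with ```mpirun -np 4``` should produce four lines of output, in any order.
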
diff --git a/src/chapter5/openmpi.md b/src/chapter5/openmpi.md
index ea24d72..54344e6 100644
--- a/src/chapter5/openmpi.md
+++ b/src/chapter5/openmpi.md
@@ -1 +1,254 @@
# OpenMPI

## Primary MPI Routines

``` C
int MPI_Init(int * argc, char *** argv);
// initializes the MPI environment.
// argc and argv are pointers to the parameters that
// come from main(argc, argv). The return value is an
// error code: 0 is OK, non-zero indicates an error.
```

``` C
int MPI_Comm_size(MPI_Comm comm, int * size);
// this function gets the number of MPI processes,
// i.e. the number you enter when you run: mpirun -np <number of processes> myprog.exe
// *size is C syntax indicating that size will be modified to contain
// the value after the function returns. The return value is only used
// for error detection. printf("MPI size is %d\n", size);
int MPI_Comm_rank(MPI_Comm comm, int * rank);
// this returns the rank of this particular process;
// rank contains the value for that process. The function return value is an error code.
```

![MPI routines](../imgs/mpi_routines.png)

### Point-to-Point communication

These are blocking functions: they wait until the message is sent or received. Note that the CPU is actively polling the network interface while waiting for a message. This is the opposite behaviour to other blocking C functions, e.g. c = getchar() (which causes a context switch and then a sleep in the OS). This is done for speed reasons.

```C
int MPI_Send(void * buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm);
```

Sends a message from the calling process to another process.

INPUT PARAMETERS

- ```buf```
  - Initial address of send buffer (choice).
- ```count```
  - Number of elements sent (non-negative integer).
- ```type```
  - Datatype of each send buffer element (handle).
- ```dest```
  - Rank of destination (integer).
- ```tag```
  - Message tag (integer).
- ```comm```
  - Communicator (handle).

OUTPUT PARAMETER

- ```IERROR```
  - Fortran only: Error status (integer).

```c
int MPI_Recv(void * buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status * status);
```

Receives a message from another process.

INPUT PARAMETERS

- ```count```
  - Maximum number of elements to receive (integer).
- ```type```
  - Datatype of each receive buffer entry (handle).
- ```source```
  - Rank of source (integer).
- ```tag```
  - Message tag (integer).
- ```comm```
  - Communicator (handle).

OUTPUT PARAMETERS

- ```buf```
  - Initial address of receive buffer (choice).
- ```status```
  - Status object (status).
- ```IERROR```
  - Fortran only: Error status (integer).

### Primary MPI Routines closing

In the header file you will find

``` C
int MPI_Finalize(void);
```

To call it in your C or C++ program:

``` C
#include <mpi.h>

MPI_Finalize();
```

## General overview of an MPI program

``` C
...
int MPI_Init(int * argc, char *** argv);
--------------------------Parallel algorithm starts----------------------
int MPI_Comm_size(MPI_Comm comm, int * size);
int MPI_Comm_rank(MPI_Comm comm, int * rank);
...
int MPI_Send(void * buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void * buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status * status);
...
--------------------------Parallel algorithm ends-----------------------
int MPI_Finalize(void);
...
```

Use the man pages to find out more about each routine.

When sending, a process packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes, since the data is packed into a single message before transmission (similar to how letters are packed into envelopes before being taken to the post office).

## Elementary MPI Data types

MPI_Send and MPI_Recv utilize MPI Datatypes as a means to specify the structure of a message at a higher level. The data types defined in the table below are elementary; for custom data structures you will have to define your own derived datatypes.

| MPI datatype           | C equivalent           |
|------------------------|------------------------|
| MPI_SHORT              | short int              |
| MPI_INT                | int                    |
| MPI_LONG               | long int               |
| MPI_LONG_LONG          | long long int          |
| MPI_UNSIGNED_CHAR      | unsigned char          |
| MPI_UNSIGNED_SHORT     | unsigned short int     |
| MPI_UNSIGNED           | unsigned int           |
| MPI_UNSIGNED_LONG      | unsigned long int      |
| MPI_UNSIGNED_LONG_LONG | unsigned long long int |
| MPI_FLOAT              | float                  |
| MPI_DOUBLE             | double                 |
| MPI_LONG_DOUBLE        | long double            |
| MPI_BYTE               | char                   |

## Example of a simple program

``` C
/*
  MPI Program, send ranks
*/

#include <stdio.h>
#include <mpi.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    int my_rank;
    /* Also known as world size */
    int num_processes;

    /* Initialize the infrastructure necessary for communication */
    MPI_Init(&argc, &argv);

    /* Identify this process */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many total processes are active */
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    printf("Process %d: There is a total of %d \n", my_rank, num_processes);

    if (my_rank == MASTER)
    {
        int dest = 1;
        int tag = 0;
        int count = 1;

        MPI_Send(&my_rank, count, MPI_INT, dest, tag, MPI_COMM_WORLD);

        printf("Process %d: Sent my_rank to process %d \n", my_rank, dest);
    }
    else if (my_rank == 1) /* only process 1 expects a message from the master */
    {
        int tag = 0;
        int count = 1;
        int buffer;
        MPI_Recv(&buffer, count, MPI_INT, MASTER, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d: Received %d from process %d \n", my_rank, buffer, MASTER);
    }

    /* Tear down the communication infrastructure */
    MPI_Finalize();
    return 0;
}
```

## Compilation and Linking

- Make sure you have the following packages installed and that they are in your ```$PATH```:
  - gcc
  - OpenMPI or MPICH
- To compile and link:
  - ```mpicc -Wall -o <executable name> <source files>```
  - ```-Wall``` enables all the warnings about questionable code.
  - ```-o``` sets the output executable name. If you omit it, it defaults to ```a.out```.
- To run:
  - ```mpirun -np <number of processes> <executable name>```
- Behind the scenes:
  - mpicc is just a wrapper around a C compiler. To see what it does, type:
  - ```mpicc --showme```

### sbatch to send a job to the compute nodes using SLURM

``` bash
#!/bin/bash
#SBATCH --job-name=Vaccinator
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

source ~/vf38/HPC_Training/spack/share/spack/setup-env.sh
spack load mpich

mpirun -np 4 ./my-awesome-program
```

- ```ntasks``` controls the number of tasks to be created for the job
- ```ntasks-per-node``` controls the maximum number of tasks per allocated node
- ```cpus-per-task``` controls the number of CPUs allocated per task

## Measuring performance

- Use ```htop``` to check the CPU usage.
  You need to run this command while the process is running.
- If you are using SLURM, you will need to use ```squeue``` or ```scontrol``` to find the compute node the job is running on and then ssh into it.
- ```time``` is a shell command to check the overall wall time, e.g.
  - ```time mpirun -np 4 myProg.exe```
  - You can also use an MPI profiler.

There are some useful commands to check the parallelism of the code.
The command ```top``` or ```htop``` looks into a process. As you can see from the image below, it shows the CPU usage.

![htop](../imgs/htop.png)

- The command ```time``` checks the overall performance of the code.
  - By running this command, you get real time, user time and system time.
  - Real is the wall clock time: the time from start to finish of the call, including overhead.
  - User is the amount of CPU time spent outside the kernel within the process.
  - Sys is the amount of CPU time spent in the kernel within the process.
  - User time + Sys time will tell you how much actual CPU time your process used.

![time](../imgs/time.png)
diff --git a/src/chapter5/parallel-refresher.md b/src/chapter5/parallel-refresher.md
new file mode 100644
index 0000000..80ad6b6
--- /dev/null
+++ b/src/chapter5/parallel-refresher.md
@@ -0,0 +1,31 @@
# Refresher on Parallelism

## Task Parallelism

We saw in the last chapter that parallel computing can be used to solve problems by executing code in parallel as opposed to in series.

![Task parallelism](../imgs/task_parallelism.jpg)

## Data Parallelism

Note that not all programs can be broken down into independent tasks; for those we might instead use data parallelism, like the following.

![Data parallelism](../imgs/data_parallelism.jpg)

## Parallel computing example

Think back to the example below, which was provided in the last chapter. We will look at the cost of memory transactions soon.

![Parallel computing example](../imgs/parallel_computing_arrays_eg.png)

## Parallel Scalability

The speed-up achieved from parallelism is dictated by your algorithm. Notably, the serial parts of your algorithm cannot be sped up by increasing the number of processors. The diagram below looks at the benefits we can achieve from writing parallel code as the number of processes increases.

![Parallel scalability](../imgs/parallel_scalability.jpg)

## Memory Architectures

Lastly, the different memory architectures we looked at in the last section included shared memory, distributed memory and hybrid architectures. We have looked at shared memory in detail, and now we will dive into distributed memory architecture.

![Memory architectures](../imgs/memory_architectures.jpg)
diff --git a/src/imgs/data_parallelism.jpg b/src/imgs/data_parallelism.jpg
new file mode 100644
index 0000000..4538619
Binary files /dev/null and b/src/imgs/data_parallelism.jpg differ
diff --git a/src/imgs/distributed_memory_architecture.png b/src/imgs/distributed_memory_architecture.png
new file mode 100644
index 0000000..ad5ca1d
Binary files /dev/null and b/src/imgs/distributed_memory_architecture.png differ
diff --git a/src/imgs/distributed_memory_architecture_2.png b/src/imgs/distributed_memory_architecture_2.png
new file mode 100644
index 0000000..7f0ed51
Binary files /dev/null and b/src/imgs/distributed_memory_architecture_2.png differ
diff --git a/src/imgs/distributed_vs_shared.png b/src/imgs/distributed_vs_shared.png
new file mode 100644
index 0000000..e724102
Binary files /dev/null and b/src/imgs/distributed_vs_shared.png differ
diff --git a/src/imgs/htop.png b/src/imgs/htop.png
new file mode 100644
index 0000000..2efbc06
Binary files /dev/null and b/src/imgs/htop.png differ
diff --git a/src/imgs/memory_architectures.jpg b/src/imgs/memory_architectures.jpg
new file mode 100644
index 0000000..0927697
Binary files /dev/null and b/src/imgs/memory_architectures.jpg differ
diff --git a/src/imgs/mpi_datatypes.png b/src/imgs/mpi_datatypes.png
new file mode 100644
index 0000000..4cbf9a6
Binary files /dev/null and b/src/imgs/mpi_datatypes.png differ
diff --git a/src/imgs/mpi_routines.png b/src/imgs/mpi_routines.png
new file mode 100644
index 0000000..6e393ee
Binary files /dev/null and b/src/imgs/mpi_routines.png differ
diff --git a/src/imgs/parallel_computing_arrays_eg.png b/src/imgs/parallel_computing_arrays_eg.png
new file mode 100644
index 0000000..dd395ef
Binary files /dev/null and b/src/imgs/parallel_computing_arrays_eg.png differ
diff --git a/src/imgs/parallel_scalability.jpg b/src/imgs/parallel_scalability.jpg
new file mode 100644
index 0000000..357b51b
Binary files /dev/null and b/src/imgs/parallel_scalability.jpg differ
diff --git a/src/imgs/ping_pong.png b/src/imgs/ping_pong.png
new file mode 100644
index 0000000..1850bb3
Binary files /dev/null and b/src/imgs/ping_pong.png differ
diff --git a/src/imgs/task_parallelism.jpg b/src/imgs/task_parallelism.jpg
new file mode 100644
index 0000000..797da45
Binary files /dev/null and b/src/imgs/task_parallelism.jpg differ
diff --git a/src/imgs/time.png b/src/imgs/time.png
new file mode 100644
index 0000000..da640d6
Binary files /dev/null and b/src/imgs/time.png differ