10 changes: 5 additions & 5 deletions book/HPC.md
wait

```

There are three sections in the commands portion of the file. The first sets some important environment variables (`OMP_NUM_THREADS` and `MKL_NUM_THREADS`) that NumPy and other packages use to determine how many cores are available for multithreading. If these variables are not set, NumPy will by default try to use all of the available cores, which can lead to excessive *context-switching* that can actually cause the code to run slower. The second section runs the commands using `srun`, with settings that specify the number of tasks, nodes, and cores for each script. Ending each line with `&` runs it in the background, which allows the next job to start; otherwise the runner would wait for that line to complete before starting the next command. The final section includes a `wait` command, which tells Slurm to wait until the parallel jobs are complete before ending the job.
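The background-and-`wait` pattern itself is plain shell and can be sketched without a cluster. In this illustration `sleep` stands in for the `srun` invocations, and the task names and log files are made up:

```shell
# Pin thread counts so NumPy/MKL do not oversubscribe the cores
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# In a real Slurm script each of these would be an `srun ... &` line;
# `sleep` stands in for the actual work here.
(sleep 1; echo "task 1 done") > task1.log &
(sleep 1; echo "task 2 done") > task2.log &

# Block until both background tasks finish before the job ends
wait
cat task1.log task2.log
```

Without the trailing `wait`, the script would exit as soon as the last line was launched, and Slurm would tear down the job while the background tasks were still running.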

### Running a batch job

russpold 30751 57.1 0.0 150044 13388 ? S 11:05 0:11 /share/software
russpold 30754 56.9 0.0 149892 13296 ? S 11:05 0:11 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000001
```

Why are there two srun processes? It turns out that srun first starts a lead process, whose job is to communicate with the Slurm controller (in this case that's PID 30614 for the `-i 1000004` job). This is how the job can be cancelled if the user cancels it (using `scancel`) or when the allotted time expires. This process then starts a *helper* process (in this case PID 30634), which sets up the environment and actually runs the Python script, which is running in PID 30731. These processes are treated as part of a single group, which ensures that if the lead runner gets killed, the helper and the actual Python script are killed as well, preventing zombie processes from persisting on the compute node.
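The group-cleanup behavior can be sketched with plain shell: `setsid` starts a leader in its own process group (so the leader's PID equals the group ID), and signalling the *negated* group ID reaches the leader and its children in one shot. The sleep durations are arbitrary placeholders for real work:

```shell
# Start a leader in a fresh process group with two child workers;
# under setsid the leader's PID doubles as the process-group ID.
setsid sh -c 'sleep 9871 & sleep 9871 & wait' &
LEADER=$!
sleep 1

# Signalling the negated group ID kills the leader and both workers
# together -- the same mechanism that prevents zombie processes when
# a Slurm step is cancelled.
kill -TERM -- "-$LEADER"
```

This is why killing only the lead srun by PID would not be enough: the helper and the worker must be reached through the group.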

### Parametric sweeps

LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
python3 runmodel.py $LINE
```

Job arrays work well up to about 1000 jobs, beyond which schedulers often get unhappy; past that point it's worth reorganizing the work so that there are fewer jobs but each job does more work. It's very important to include a throttle on the job array (the 10 in `--array=1-100%10`) when the array is large, in order to prevent the filesystem from being overwhelmed if each job is reading data. It's also useful to create a file for each job that records that it completed successfully, which allows rerunning the array to retry any jobs that crashed without rerunning those that were successful; alternatively one might consider using Snakemake, which interoperates very well with Slurm.
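The completion-marker idea can be sketched as follows. `SLURM_ARRAY_TASK_ID` is faked here so the logic runs outside Slurm, and the `params.txt` contents and `runmodel.py` invocation are placeholders following the example above:

```shell
# Fake inputs so this runs outside Slurm; in a real array job,
# Slurm sets SLURM_ARRAY_TASK_ID for each task.
printf 'alpha 1\nbeta 2\ngamma 3\n' > params.txt
SLURM_ARRAY_TASK_ID=2

DONE="done_${SLURM_ARRAY_TASK_ID}"
if [ -f "$DONE" ]; then
    echo "task ${SLURM_ARRAY_TASK_ID} already finished, skipping"
else
    LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
    echo "running: python3 runmodel.py $LINE"
    # mark success only after the command has completed
    touch "$DONE"
fi
```

On a rerun of the array, tasks whose marker file exists exit immediately, so only the crashed tasks repeat their work.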

## Job dependencies

We could also set this as a default using `module save`, and the next time we log in we can restore the saved collection using `module restore`.

### Virtual environments

Throughout the book I have talked about the utility of virtual environments, and they are commonly used on HPC systems to gain access to packages or package versions that are not available as modules on the system. There is, however, one issue that should be kept in mind when using virtual environments in the HPC context. When we install a virtual environment, the environment folder contains all of the dependencies that are installed in the environment. For some projects this can end up being quite large, to the degree that one can run into disk quota issues if they are stored in the home directory. For example, the full Anaconda installation is almost 10GB, which would largely fill the 15 GB quota for my home directory on the local HPC system; for this reason, I always recommend using Miniconda, which is a more minimal installation. `uv` does a better job of caching, but its local cache directory can also get very large over many projects. For this reason, we generally install Conda-based environments outside of the home directory, on a filesystem that has a larger quota. When using `uv`, we generally set the `$UV_CACHE_DIR` environment variable to a location with a larger quota as well.
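A minimal sketch of both relocations, assuming the large-quota filesystem is reachable at a path held in `$SCRATCH` (a placeholder, not a real mount point on any particular system):

```shell
# Use a larger-quota filesystem for environment storage; the
# $SCRATCH default below is purely illustrative.
SCRATCH="${SCRATCH:-/tmp/scratch_demo}"
mkdir -p "$SCRATCH/uv_cache" "$SCRATCH/envs"

# uv honors UV_CACHE_DIR for its package cache
export UV_CACHE_DIR="$SCRATCH/uv_cache"

# conda can place an environment at an explicit prefix instead of
# under the home directory, e.g.:
#   conda create --prefix "$SCRATCH/envs/myproject" python=3.12
echo "uv cache: $UV_CACHE_DIR"
```

Exports like `UV_CACHE_DIR` are typically added to the shell startup file so they apply to every login and batch job.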

### Containers

Finally, I would create the Slurm script to run the full job at scale on the system.

## Distributed computing using MPI

So far I have focused on using HPC resources for jobs that are embarrassingly parallel, such that we can run a large number of jobs without having to coordinate between them. This is an increasingly common use case for HPC, particularly with the advent of "big data", but historically a major use case for HPC was the execution of massive computations across many nodes that require coordination between nodes to achieve parallelism. This is particularly the case for very large simulations, such as cosmological models, climate models, and dynamical models such as molecular or fluid dynamics. These applications commonly use a framework called the *Message Passing Interface* (MPI), which allows the computation to run simultaneously across many nodes and coordinates computations by sending messages between nodes, taking advantage of high-performance interconnects like InfiniBand when available. I'm not going to go into detail about MPI here since it has become relatively niche (I personally have never used it in my 15+ years of using HPC systems), but it is important to be aware of if you are working on a problem that exceeds the memory of a single node and requires intensive communication between nodes.

## Cloud computing
