From 50e8a05279434ee87bf9face6654b3d224627705 Mon Sep 17 00:00:00 2001 From: Russell Poldrack Date: Fri, 30 Jan 2026 11:43:39 -0800 Subject: [PATCH] spelling fixes --- book/HPC.md | 10 +++++----- book/performance.md | 22 +++++++++++----------- 2 files changed, 16 insertions(+), 16 deletions(-) diff --git a/book/HPC.md b/book/HPC.md index 968c3d2..f32a7c2 100644 --- a/book/HPC.md +++ b/book/HPC.md @@ -111,7 +111,7 @@ wait ``` -There are three sections in the commands portion fo the file, the first of which sets some important environment variables (`OMP_NUM_THREADS` and `MKL_NUM_THREADS`) that are used by Numpy and other packages to determine how many cores are available for multithreading. If these variables are not set then Numpy will by default try to use all of the available cores, which can lead to excessive *context-switching* that can actually cause the code to run slower. The second section runs the commands using the `srun` command, with settings that specify the number of tasks, nodes, and cores for each script. Ending each line with `&` causes it to be run in the background, which allows the next job to start; otherwise the runner would wait for that line to complete before starting the next command. The final section includes a `wait` command, which basically tells Slurm to wait until the parallel jobs are complete before ending the job. +There are three sections in the commands portion of the file, the first of which sets some important environment variables (`OMP_NUM_THREADS` and `MKL_NUM_THREADS`) that are used by Numpy and other packages to determine how many cores are available for multithreading. If these variables are not set then Numpy will by default try to use all of the available cores, which can lead to excessive *context-switching* that can actually cause the code to run slower. The second section runs the commands using the `srun` command, with settings that specify the number of tasks, nodes, and cores for each script. 
Ending each line with `&` causes it to be run in the background, which allows the next job to start; otherwise the runner would wait for that line to complete before starting the next command. The final section includes a `wait` command, which basically tells Slurm to wait until the parallel jobs are complete before ending the job. ### Running a batch job @@ -202,7 +202,7 @@ russpold 30751 57.1 0.0 150044 13388 ? S 11:05 0:11 /share/software russpold 30754 56.9 0.0 149892 13296 ? S 11:05 0:11 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000001 ``` -Why are there two srun processes? It turns out that srun first starts a lead process, whose job it is to communicate with the Slurm controller (in this case that's PID 30614 for the `-i 1000004` job). This is how the job can be cancelled if the user cancels it (using `scancel`) or when the alloted time expires. This process then starts a *helper* process (in this case PID 30634), which sets up the environment and actually runs the python script, which is running in PID 30731. These processes are treated as part of a single group, which ensures that if the lead runner gets killed, the helper and actual python script also get killed, preventing zombie processes from persisting on the compute node. +Why are there two srun processes? It turns out that srun first starts a lead process, whose job it is to communicate with the Slurm controller (in this case that's PID 30614 for the `-i 1000004` job). This is how the job can be cancelled if the user cancels it (using `scancel`) or when the allotted time expires. This process then starts a *helper* process (in this case PID 30634), which sets up the environment and actually runs the python script, which is running in PID 30731. 
These processes are treated as part of a single group, which ensures that if the lead runner gets killed, the helper and actual python script also get killed, preventing zombie processes from persisting on the compute node. ### Parametric sweeps @@ -280,7 +280,7 @@ LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt) python3 runmodel.py $LINE ``` -Job arrays work well up to about 1000 jobs, beyond which schedulers often get unhappy; it's often useful to think about reorganizing the work so that there are fewer jobs but each job does more work. It's very important to include a throttle on the job array (the 10 in `--array=1-100%10`) when the array is large in order to prevent the filesystem from being overwhelmed if one is reading data. It's also useful to create a file for each job that specifies that it completed succesfully, which then allows rerunning the array in order to rerun any jobs that crashed, without rerunning those that were successful; alternatively one might consider using Snakemake, which interoperates very well with Slurm. +Job arrays work well up to about 1000 jobs, beyond which schedulers often get unhappy; it's often useful to think about reorganizing the work so that there are fewer jobs but each job does more work. It's very important to include a throttle on the job array (the 10 in `--array=1-100%10`) when the array is large in order to prevent the filesystem from being overwhelmed if one is reading data. It's also useful to create a file for each job that specifies that it completed successfully, which then allows rerunning the array in order to rerun any jobs that crashed, without rerunning those that were successful; alternatively one might consider using Snakemake, which interoperates very well with Slurm. 
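The completion-marker pattern recommended above can be sketched in Python. This is a minimal illustration, not the book's code; the `done/` directory and `task_<id>.done` naming are hypothetical:

```python
import os


def run_task(task_id: int, done_dir: str = "done") -> bool:
    """Run one array task, skipping it if a completion marker already exists."""
    os.makedirs(done_dir, exist_ok=True)
    marker = os.path.join(done_dir, f"task_{task_id}.done")
    if os.path.exists(marker):
        return False  # completed in a previous run of the array; skip it
    # ... the actual work for this task would go here ...
    # The marker is only written if the work above finished without raising,
    # so crashed tasks remain unmarked and get rerun on the next submission.
    with open(marker, "w") as f:
        f.write("ok\n")
    return True


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for each element of a job array
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print("ran" if run_task(task_id) else "skipped")
```

Resubmitting the same array then executes only the tasks whose marker file is missing; this is a lightweight version of the file-based dependency tracking that Snakemake provides.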
## Job dependencies

@@ -399,7 +399,7 @@ We could also set this as a default using `module save`, and the next time we lo

### Virtual environments

-Throughout the book I have talked about the utility of virtual environments, and they are commonly used on HPC systems to can access to packages or package versions that are not available as modules on the system. There is, however, one issue that should be kept in mind when using virtual environments in the HPC context. When we install a virtual environment, the environment folder contains all of the dependencies that are installed in the environment. For some projects this can end up being quite large, to the degree that one can run into disk quota issues if they are stored in the home directory.For example, the full Anaconda installation is almost 10GB, which would largely fill the 15 GB quota for my home directory on the local HPC system; for this reason, I always recommend using miniconda which is a more minimal installation. `uv` does a better job of caching but its local cache directory can also get very large over many projects. For this reason, I we generally install Conda-based environments outside of the home diectory, on a filesystem that has a larger quota. When using `uv`, we generally set the `$UV_CACHE_DIR` environment variable to a location with a larger quota as well.
+Throughout the book I have talked about the utility of virtual environments, and they are commonly used on HPC systems to gain access to packages or package versions that are not available as modules on the system. There is, however, one issue that should be kept in mind when using virtual environments in the HPC context. When we install a virtual environment, the environment folder contains all of the dependencies that are installed in the environment. 
For some projects this can end up being quite large, to the degree that one can run into disk quota issues if they are stored in the home directory. For example, the full Anaconda installation is almost 10GB, which would largely fill the 15 GB quota for my home directory on the local HPC system; for this reason, I always recommend using miniconda, which is a more minimal installation. `uv` does a better job of caching but its local cache directory can also get very large over many projects. For this reason, we generally install Conda-based environments outside of the home directory, on a filesystem that has a larger quota. When using `uv`, we generally set the `$UV_CACHE_DIR` environment variable to a location with a larger quota as well.

### Containers

@@ -432,7 +432,7 @@ Finally, I would create the Slurm script to run the full job at scale on the sys

## Distributed computing using MPI

-So far I have focused on using HPC resources for jobs that are embarassingly parallel, such that we can run a large number of jobs without having to coordinate between them. This is an increasingly common use case for HPC, particularly with the advent of "big data", but historically a major use case for HPC was the execution of massive computations across many nodes that require coordination between nodes to acheive parallelism. This is particularly the case for very large simulations, such as cosmological models, climate models, and dynamical models such as molecular or fluid dynamics. These applications commonly use a framework called *message passing interface* (MPI), which allows the computation to run simultaneously across many nodes and coordinates computations by sending messages between nodes, taking advantage of high-performance interconnects like Infiniband when available. 
I'm not going to go into detail about MPI here since it has become relatively niche (I personally have never used it in my 15+ years of using HCP systems), but it is important to be aware of if you are working on a problem that exceeds the memory of a single node and requires intensive communication between nodes.
+So far I have focused on using HPC resources for jobs that are embarrassingly parallel, such that we can run a large number of jobs without having to coordinate between them. This is an increasingly common use case for HPC, particularly with the advent of "big data", but historically a major use case for HPC was the execution of massive computations across many nodes that require coordination between nodes to achieve parallelism. This is particularly the case for very large simulations, such as cosmological models, climate models, and dynamical models such as molecular or fluid dynamics. These applications commonly use a framework called *message passing interface* (MPI), which allows the computation to run simultaneously across many nodes and coordinates computations by sending messages between nodes, taking advantage of high-performance interconnects like Infiniband when available. I'm not going to go into detail about MPI here since it has become relatively niche (I personally have never used it in my 15+ years of using HPC systems), but it is important to be aware of if you are working on a problem that exceeds the memory of a single node and requires intensive communication between nodes.

## Cloud computing

diff --git a/book/performance.md b/book/performance.md
index 0294981..6946ab3 100644
--- a/book/performance.md
+++ b/book/performance.md
@@ -67,7 +67,7 @@ In addition to time complexity, it's also important to keep in mind the *memory

Complexity analysis tells us about the worst case performance of our code, but there are many reasons for slow code that are unrelated to complexity. 
*Profiling* is the activity of empirically analyzing the performance of our code in order to identify specific parts of the code that might cause poor performance. It's often the case that slow performance arises from specific portions of the code, which we refer to as *bottlenecks*. These bottlenecks can be difficult to intuit, which is why it's important to empirically analyze performance in order to identify the location of those bottlenecks, which can then help us focus our efforts. However, it's also important to keep complexity in mind when we analyze code; in particular, we should always profile the code using realistic input sizes, so that we will see any complexity-related slowdowns if they exist. -There are a couple of important points to know when profiling code. First, it's important to remember that profiling has overhead and can sometimes distort results. In particular, when code involves many repetitions of a very fast operation, the overhead due to profiling the operation can add up, making it seem worse than it is. The profiler can also compete with your code for memory and CPU time, potentially distorting results (e.g. for processes involving lots of memory). Second, it's important to keep in mind the distinction between *CPU time*, which refers to the time actually spent by the CPU doing processing, and *wall time*, which includes CPU time as well as time due to other sources such as input/output. If the wall time is much greater than the CPU time, then this suggests that optimizing the computations may not have much impact on the overall execution time. Also note that sometimes the CPU time can actually be *greater* than the wall time, if mulitple cores are used for the computation, since the time spent by each core is added together. +There are a couple of important points to know when profiling code. First, it's important to remember that profiling has overhead and can sometimes distort results. 
In particular, when code involves many repetitions of a very fast operation, the overhead due to profiling the operation can add up, making it seem worse than it is. The profiler can also compete with your code for memory and CPU time, potentially distorting results (e.g. for processes involving lots of memory). Second, it's important to keep in mind the distinction between *CPU time*, which refers to the time actually spent by the CPU doing processing, and *wall time*, which includes CPU time as well as time due to other sources such as input/output. If the wall time is much greater than the CPU time, then this suggests that optimizing the computations may not have much impact on the overall execution time. Also note that sometimes the CPU time can actually be *greater* than the wall time, if multiple cores are used for the computation, since the time spent by each core is added together.

### Function profiling

@@ -492,7 +492,7 @@ def fill_df_fast(random_data):

255 μs ± 885 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

-Again notice the difference in units between the two measurements. This method gives us an almost 500X speedup! The fundamental reason for this is that pandas data frames are not meant to serve as dynamic objects. Every time `.loc[i]` is called on a index value that doesn't exist, pandas may reallocate the entire data frame in memory to accomodate the new row. Since that allocation process scales with the input size, this turns an $O(N)$ operation into a potentially $O(N^2)$ operation.
+Again notice the difference in units between the two measurements. This method gives us an almost 500X speedup! The fundamental reason for this is that pandas data frames are not meant to serve as dynamic objects. Every time `.loc[i]` is called on an index value that doesn't exist, pandas may reallocate the entire data frame in memory to accommodate the new row. 
Since that allocation process scales with the input size, this turns an $O(N)$ operation into a potentially $O(N^2)$ operation. We can make it even faster by directly passing the NumPy array into pandas: @@ -521,7 +521,7 @@ Different types of objects in Python may perform better or worse for different t | List | 2975.12 | 3531.51 | | Set | 0.007625 | 0.00999999 | -Here we see that there are major differences between data types. Sets are by far the fastest, which is due to the fact that they use a hash table algorithm that has $O(1)$ complexity; the 7 nanoseconds for the set operation is basically the minimum time needed for Python to perform a single operation, requiring about 25 clock cycles on my 4 GHz CPU. Second, we see that NumPy/pandas are roughly 30 times faster than Python lists for integers; this likely reflects the fact that NumPy can load the entire dataset into contiguous memory, whereas the list requires a lookup and retreival of the value from an unpredictable location in memory for each item. Third, we see that strings are generally slower than integers, which reflects both the fact that the representation of strings is larger than the representation of integers and that comparing a string requires more operations than comparing integers. When timing matters, it's usually useful to do some prototype testing across different types of objects to find the most performant. The million-fold difference between the performance of sets and other data types on this problem is a great example of how algorithmic complexity is the single most important factor we should consider when optimizing code. +Here we see that there are major differences between data types. Sets are by far the fastest, which is due to the fact that they use a hash table algorithm that has $O(1)$ complexity; the 7 nanoseconds for the set operation is basically the minimum time needed for Python to perform a single operation, requiring about 25 clock cycles on my 4 GHz CPU. 
Second, we see that NumPy/pandas are roughly 30 times faster than Python lists for integers; this likely reflects the fact that NumPy can load the entire dataset into contiguous memory, whereas the list requires a lookup and retrieval of the value from an unpredictable location in memory for each item. Third, we see that strings are generally slower than integers, which reflects both the fact that the representation of strings is larger than the representation of integers and that comparing a string requires more operations than comparing integers. When timing matters, it's usually useful to do some prototype testing across different types of objects to find the most performant option. The million-fold difference between the performance of sets and other data types on this problem is a great example of how algorithmic complexity is the single most important factor we should consider when optimizing code.

### Unnecessary looping

@@ -617,7 +617,7 @@

Time saved: 54.33 seconds

Note that part of the difference in time is due to the `time.sleep()` commands, but that only amounts to 10 seconds in total, so the majority of the speedup is due to using a single optimized API call rather than many individual calls. This speedup reflects the high degree of overhead related to network processes that occurs for each API call.

-Batch processing is not simply about optimizing performance; it is also a feature of good Internet citizenship. Because lots of API calls can overload a server, many APIs will ban (either temporarily or permanently) IP addressses that send too many requests too quickly. Depending on the nature of the API calls it may be necessary to *batch* them, since many servers also have a limit on the number of queries that can be made in any call.
+Batch processing is not simply about optimizing performance; it is also a feature of good Internet citizenship. 
Because lots of API calls can overload a server, many APIs will ban (either temporarily or permanently) IP addresses that send too many requests too quickly. Depending on the nature of the API calls it may be necessary to *batch* them, since many servers also have a limit on the number of queries that can be made in any call. ### Slow I/O @@ -631,9 +631,9 @@ Input/output speed can sometimes vary greatly depending on how the files are sav - Small chunks (1 x 64 x 64) - Tiny chunks (1 x 32 x 32) -[#hdfchunking-fig] shows the loading times with each of these strategies for two different data access patterns. In one (framewise), we access a single one of the 1000 frames, whereas in the other (spatial access) pattern we extract data from a small region of interest across all frames. This shows clearly how these two factors interact; the framewise chunking strategy that is the fastest for framewise access becomes the slowest for spatial access. In some cases (like the spatial access pattern) performance with chunked data may not be much better than unchunked, but leaving the data in a *continguous* (unchunked) state has the drawback for HDF5 files that they cannot be compressed, so it's often common to use smaller chunks in order to obtain the advantage of compression. +[#hdfchunking-fig] shows the loading times with each of these strategies for two different data access patterns. In one (framewise), we access a single one of the 1000 frames, whereas in the other (spatial access) pattern we extract data from a small region of interest across all frames. This shows clearly how these two factors interact; the framewise chunking strategy that is the fastest for framewise access becomes the slowest for spatial access. 
In some cases (like the spatial access pattern) performance with chunked data may not be much better than unchunked, but leaving the data in a *contiguous* (unchunked) state has the drawback for HDF5 files that they cannot be compressed, so it's often common to use smaller chunks in order to obtain the advantage of compression.

-```{figure} images/hdf5_chunking_performancepng
+```{figure} images/hdf5_chunking_performance.png
:label: hdfchunking-fig
:align: center
:width: 700px

@@ -687,7 +687,7 @@ Total route-month combinations: 3,339,788

This call was able to extract data from 120 parquet files and summarize them in just over 2 seconds. This is blazingly fast; for comparison, loading all of the files individually took over 130 seconds! DuckDB achieves this performance via several mechanisms:

-- It uses *lazy execution* - that only peforms operations when they are needed, and it uses the entire query to develop an execution plan that is optimized for the specific problem.
+- It uses *lazy execution*, which only performs operations when they are needed, and it uses the entire query to develop an execution plan that is optimized for the specific problem.
- It only loads the columns that are needed for the computation (which is a function afforded by Parquet files).
- It operates in batches rather than single lines, and uses vectorized operations to speed processing over each batch. This also means that only the current batch needs to be kept in memory, greatly reducing the memory footprint.
- It parallelizes execution of the query across files and groups within files using multiple CPU cores.

@@ -749,7 +749,7 @@ Both DuckDB and Polars perform well at analyses of large tabular datasets. One r

## Interpreted versus compiled functions

-Python is well known for being much slower than other languages such as C/C++ and Rust. The major difference between these languages is that Python is *interpreted* whereas the others are *compiled*. 
An interpreted langauge is one where the program is executed statement-by-statement at runtime[^1], whereas a compiled language is first run through a *compiler* that produces a machine language program that can be run. Compiling allows a significant amount of optimization prior to running, which can make the execution substantially faster. Another difference is that Python is *dynamically typed*, which means that we don't have to specify what type a variable is (for example, a string or a 64-bit integer) when we generate it. On the other hand, C/C++ and Rust are *statically typed*, which means that the type of each variable is specified when it is created, which frees the CPU from having to determine what type the variable is at runtime. +Python is well known for being much slower than other languages such as C/C++ and Rust. The major difference between these languages is that Python is *interpreted* whereas the others are *compiled*. An interpreted language is one where the program is executed statement-by-statement at runtime[^1], whereas a compiled language is first run through a *compiler* that produces a machine language program that can be run. Compiling allows a significant amount of optimization prior to running, which can make the execution substantially faster. Another difference is that Python is *dynamically typed*, which means that we don't have to specify what type a variable is (for example, a string or a 64-bit integer) when we generate it. On the other hand, C/C++ and Rust are *statically typed*, which means that the type of each variable is specified when it is created, which frees the CPU from having to determine what type the variable is at runtime. [^1]: Note that Python code is actually compiled into *byte-code* by the CPython compiler before being passed to the Python interpreter, so it performs better than purely interpreted languages. 
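The footnote's point about byte-code can be seen directly with the standard `dis` module; a small illustrative sketch (the `add` function is my own example, not from the book):

```python
import dis


def add(a, b):
    return a + b


# CPython compiles add() to bytecode once; dis.dis prints the resulting
# instructions (e.g. LOAD_FAST followed by a binary-add operation).
dis.dis(add)

# Because Python is dynamically typed, the same bytecode runs for any
# operand types; the type dispatch happens at runtime on every call.
print(add(2, 3))        # prints 5 (integer addition)
print(add("ab", "cd"))  # prints abcd (string concatenation, same bytecode)
```

Statically typed, compiled languages can instead resolve the operand types at compile time and emit a single machine instruction for the addition, which is a large part of their speed advantage.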
@@ -1000,7 +1000,7 @@ The degree to which code can benefit from parallelization depends almost entirel The degree of expected speedup due to parallelization is often described in terms of *Amdahl's Law* [@Amdahl:1967aa], which states that the benefit of parallelization depends on the proportion of execution time that is spent on code that can be parallelized (i.e. that doesn't involve serial dependencies). This implies that before spending effort parallelizing a portion of code, it's important to understand how much time is spent on that portion of the code. It's also important to understand that parallelization comes with overhead, since the different processes need to be managed and the results combined after parallel execution. Memory limitations also become important here: If each process requires a large memory footprint, then memory limitations can quickly become the primary bottleneck. -The benefits of parallelization also depend upon keeping the available cores maximally busy for the entire computation time. This is easy when each parallel job takes roughly the same amount of time; all jobs will start and finish together, releasing all of the CPU's resources at the same time. However, in some cases the parallel jobs may vary in the time they take to complete, and in some cases this finishing time distribution can have a very long tail, which means that most of the CPU's resources will be sitting idle waiting for the slowest process to finish. This can occur, for example, in optimization problems where some runs quickly converage while others can get stuck searching for a very long time. It is possible to use sophisticated scheduling schemes that can reduce this problem somewhat, but for very long-tailed finishing times it's always going to an efficiency-killer. +The benefits of parallelization also depend upon keeping the available cores maximally busy for the entire computation time. 
This is easy when each parallel job takes roughly the same amount of time; all jobs will start and finish together, releasing all of the CPU's resources at the same time. However, in some cases the parallel jobs may vary in the time they take to complete, and this finishing time distribution can sometimes have a very long tail, which means that most of the CPU's resources will be sitting idle waiting for the slowest process to finish. This can occur, for example, in optimization problems where some runs quickly converge while others can get stuck searching for a very long time. It is possible to use sophisticated scheduling schemes that can reduce this problem somewhat, but for very long-tailed finishing times it's always going to be an efficiency-killer.

The greatest benefit from parallelization thus comes when:

@@ -1040,7 +1040,7 @@

$$

where $c$ is a point on the complex plane, and $z_0 = 0$. A point is defined as being in the Mandelbrot set if it remains bounded (which is defined as $|Z| \le 2$) within a finite number of time steps. This is an interesting problem for several reasons:

-- It's embarassingly parallel, as the points in space can be computed independently of one another
+- It's embarrassingly parallel, as the points in space can be computed independently of one another
- It's easily implemented in pure Python code
- Different points vary in the time that they take to verify whether the point is bounded; unbounded points are likely to escape quickly, whereas points within the radius will take the maximum number of iterations to confirm boundedness. This means that we need to think about load balancing across jobs of different lengths.

@@ -1140,7 +1140,7 @@ def run_serial(grid_size):

A simulation of the parallel acceleration on the Mandelbrot set problem with a chunking factor of 10. 
The left panel shows performance in terms of speedup factor as a function of the difficulty of the problem, where perfect acceleration would occur when the speedup factor is equal to the number of cores. The gray line shows serial performance, showing that parallelization on small problems can lead to worse performance than serial processing due to the overhead of parallelization. The right panel shows efficiency (that is, the proportion of perfect acceleration) achieved, demonstrating that there are decreasing gains with increasing numbers of cores.
```

-We also see that parallelization is more effective as the number of chunks increases. This reflects the wide disribution of finishing times in the dataset; when the number of chunks is the same as the number of cores, then each core does a single job, and the finishing time will reflect the longest job amongst the cores, such that many cores will be sitting idle for much of the time. Breaking the data into smaller chunks allows for a higher number of concurrent jobs, but also increases parallelization overhead. To see the effects of chunking, I ran a simulation of the Mandelbrot problem with a 4096 x 4096 image using chunking factors ranging from 1 (i.e. as many chunks as cores) to 100 (e.g. 1400 chunks for 14 cores). [](#chunking-fig) shows how parallel performance improves as a function of chunking up to a point, but at some point the overhead of parallel processing kicks in, such that speedup actually drops for high levels of chunking with many processors. This highlights the need for benchmarking to understand how parallel performance scales and what the optimal size of data chunks is for the problem at hand.
+We also see that parallelization is more effective as the number of chunks increases. 
This reflects the wide distribution of finishing times in the dataset; when the number of chunks is the same as the number of cores, then each core does a single job, and the finishing time will reflect the longest job amongst the cores, such that many cores will be sitting idle for much of the time. Breaking the data into smaller chunks allows for a higher number of concurrent jobs, but also increases parallelization overhead. To see the effects of chunking, I ran a simulation of the Mandelbrot problem with a 4096 x 4096 image using chunking factors ranging from 1 (i.e. as many chunks as cores) to 100 (e.g. 1400 chunks for 14 cores). [](#chunking-fig) shows how parallel performance improves as a function of chunking up to a point, but at some point the overhead of parallel processing kicks in, such that speedup actually drops for high levels of chunking with many processors. This highlights the need for benchmarking to understand how parallel performance scales and what the optimal size of data chunks is for the problem at hand. ```{figure} images/chunking_factor_speedup.png :label: chunking-fig