new approach to grid loops for exploration of OpenMP shared-memory parallelism #565
Closed
HomerReid wants to merge 20 commits into NanoComp:master
… be_quiet() routine for reduced console output
It would still be nice to merge something along these general lines.
This PR proposes a new approach to handling multidimensional loops over `ivec`s, replacing the existing paradigm of the `LOOP_OVER_IVECS` macro and its derivatives.

The primary motivation is to facilitate exploration of OpenMP directives for shared-memory parallelism of grid loops, particularly in operations like `step_db` and `update_eh` that account for most of the cost of timestepping. The hard-coded triply-nested loop in the `LOOP_OVER_IVECS` macro doesn't lend itself to simple OpenMP parallelization and doesn't offer much flexibility to rearrange the loop structure for better performance.

ivec_loop_counter

The basic idea is to replace the `LOOP_OVER_IVECS` macro with a C++ class called `ivec_loop_counter` that tracks the progress of the loop via internal class data fields and provides methods that report relevant information, such as grid-point indices or coordinates, as necessary on each loop iteration.

Single-threaded usage of ivec_loop_counter

If `gv` is a `grid_volume` and `is`, `ie` are `ivec`s specifying the minimum and maximum corners of a rectangular region, then the conventional paradigm for looping over the region is a `LOOP_OVER_IVECS` invocation wrapping the loop body.

The field-array index `idx` is declared and computed automatically for each loop iteration by the macro, while the grid point indices and/or coordinates `iloc` may optionally be fetched by invoking additional macros inside the loop body.

Note that, regardless of the actual dimensions of the grid region, `LOOP_OVER_IVECS` always expands to three nested `for` loops whose order is fixed by the geometry dimensions (`gv.dim`). For example, if the target region has zero width (`is.in_direction(d)==ie.in_direction(d)`) in one or more directions `d`, the macro still produces single-iteration `for` loops in those dimensions. This is one reason for the difficulty of applying OpenMP parallelization directives to `LOOP_OVER_IVECS`. (Another reason is that the loop initialization and increment statements are much more complicated than the bare-bones possibilities allowable in OpenMP-parallelized loops.)

The simplest invocation of `ivec_loop_counter` is to replace the above with a single `for` loop.

The first line of that replacement invokes the `ivec_loop_counter` constructor, whose inputs (`gv`, `is`, `ie`) match the first three arguments to `LOOP_OVER_IVECS`. The increment of the `for` loop invokes the `++` operator, which is overloaded by `ivec_loop_counter` to represent the operation of updating the internal state to move to the next grid point in the loop. The `complete` flag is `false` while the loop is in progress and switches to `true` upon increment from the last grid point.

Note that `idx` (index into field arrays) is computed and returned automatically by the `start` and `++` methods of `ivec_loop_counter`, while the integer indices and real-space coordinates of the current grid point may optionally be obtained via the `get_iloc` and `get_loc` class methods.

Indices into PML sigma arrays
In addition to the `idx` index, some grid loops require, for each grid point, the values of one or more integer indices into PML sigma arrays for given `direction`s. (These `direction`s are called `dsig`, `dsigu`, or `dsigw`, and the corresponding indices are called `k`, `ku`, or `kw`.)

In the conventional approach, this is handled by

(1) invoking the `KSTRIDE_DEF` macro for the `direction` in question before entering the loop, and then

(2) invoking the macros `DEF_k`, `DEF_ku`, and/or `DEF_kw` inside the loop.

The new paradigm updates this procedure as follows:

(1) the `direction`s in question are passed as arguments to `ivec_loop_counter::start()` when the loop is initiated

(2) values of `k` indices are obtained by calling `ivec_loop_counter.get_k`.

Loops with explicit endpoints

The above calling convention is a convenient abstraction that hides details of loop indices and iteration counts. On the other hand, OpenMP loops and other applications require loops whose length is known in advance. The `min_iter` and `max_iter` class fields of `ivec_loop_counter` allow the above grid loop to be written as a one-dimensional loop with fixed starting and end points.

Separating the innermost loop from outer loops
As a performance optimization, the innermost loop may be split off from the outer loops and written as an explicit `for` loop of its own.
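The splitting idea can be illustrated without any meep types. The following is a minimal sketch, not the PR's code: `toy_box`, `add_in_box`, and `INNER_LOOP_HINT` are hypothetical names, and the row-major layout is an assumption made for the example. The two outer directions are collapsed into one flat counter, and the innermost direction becomes a plain stride-1 loop carrying a no-dependency hint.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's IVDEP machinery: tells the compiler
// that iterations of the following loop do not depend on each other.
#if defined(__INTEL_COMPILER)
#define INNER_LOOP_HINT _Pragma("ivdep")
#elif defined(__GNUC__) && !defined(__clang__)
#define INNER_LOOP_HINT _Pragma("GCC ivdep")
#else
#define INNER_LOOP_HINT
#endif

// Toy rectangular region inside a row-major n[0] x n[1] x n[2] array.
// This is an assumption for illustration, not meep's grid_volume.
struct toy_box {
  int n[3];   // extents of the full array
  int lo[3];  // minimum corner (inclusive)
  int hi[3];  // maximum corner (inclusive)
};

// Adds delta to every point of the box and returns the number of points
// visited. Outer loop: flat counter over directions 0 and 1.
// Inner loop: stride-1 sweep over direction 2.
inline std::size_t add_in_box(std::vector<double> &f, const toy_box &b,
                              double delta) {
  assert(f.size() == std::size_t(b.n[0]) * b.n[1] * b.n[2]);
  for (int d = 0; d < 3; ++d)
    assert(0 <= b.lo[d] && b.lo[d] <= b.hi[d] && b.hi[d] < b.n[d]);

  const int n0 = b.hi[0] - b.lo[0] + 1;
  const int n1 = b.hi[1] - b.lo[1] + 1;
  const int n2 = b.hi[2] - b.lo[2] + 1;

  std::size_t visited = 0;
  for (int outer = 0; outer < n0 * n1; ++outer) {
    const int i0 = b.lo[0] + outer / n1;
    const int i1 = b.lo[1] + outer % n1;
    // Pointer to the first in-box element of this row.
    double *row = f.data() + (std::size_t(i0) * b.n[1] + i1) * b.n[2] + b.lo[2];
    INNER_LOOP_HINT
    for (int i2 = 0; i2 < n2; ++i2) row[i2] += delta;
    visited += std::size_t(n2);
  }
  return visited;
}
```

Keeping the inner loop free of index arithmetic is what gives the compiler a chance to vectorize it; the outer counter absorbs all the bookkeeping.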
In this construction, `_Pragma(IVDEP)` expands to `#pragma GCC ivdep` for GNU compilers and to `#pragma ivdep` for Intel compilers.

Stateful and stateless computational models

In all the examples above, instances of `ivec_loop_counter` maintain internal state that is updated as the loop progresses. `ivec_loop_counter` also offers an alternative stateless paradigm, in which the class stores only information about the geometry, maintaining no internal state. This allows loop iterations to be executed in arbitrary order.

In the stateless approach, the `iter`th iteration of the loop is executed by first calling the `niter_to_narray()` class method of `ivec_loop_counter` to convert the overall iteration number into an array of loop indices, which may then be passed to class methods like `get_idx` or `get_iloc` to fetch information on the `iter`th grid point in the loop. For example, `idx` and `iloc` for grid point #54 are fetched by converting iteration number 54 with `niter_to_narray()` and passing the resulting array to `get_idx` and `get_iloc`.

The same mechanism allows the entire grid loop to be executed backwards.
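To make the stateless idea concrete, here is a minimal self-contained sketch. It does not use the PR's class: `toy_stateless_counter`, `iter_to_indices`, `index_of`, and `visit_backwards` are hypothetical names, and the row-major layout with the last direction varying fastest is an assumption. The object holds only geometry, so any iteration number can be converted to loop indices on demand, and the loop can run in any order, including backwards.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stateless counter: stores geometry only, no loop position.
struct toy_stateless_counter {
  std::array<int, 3> n;    // extents of the full row-major array
  std::array<int, 3> lo;   // minimum corner of the region (inclusive)
  std::array<int, 3> cnt;  // number of region points in each direction

  toy_stateless_counter(std::array<int, 3> n_, std::array<int, 3> lo_,
                        std::array<int, 3> hi_)
      : n(n_), lo(lo_) {
    for (int d = 0; d < 3; ++d) {
      assert(0 <= lo_[d] && lo_[d] <= hi_[d] && hi_[d] < n_[d]);
      cnt[d] = hi_[d] - lo_[d] + 1;
    }
  }

  long num_iters() const { return long(cnt[0]) * cnt[1] * cnt[2]; }

  // Flat iteration number -> per-direction loop indices
  // (direction 2 varies fastest).
  std::array<int, 3> iter_to_indices(long iter) const {
    assert(0 <= iter && iter < num_iters());
    std::array<int, 3> a;
    a[2] = int(iter % cnt[2]);
    iter /= cnt[2];
    a[1] = int(iter % cnt[1]);
    iter /= cnt[1];
    a[0] = int(iter);
    return a;
  }

  // Loop indices -> index into the full field array.
  std::size_t index_of(const std::array<int, 3> &a) const {
    return (std::size_t(lo[0] + a[0]) * n[1] + (lo[1] + a[1])) * n[2] +
           (lo[2] + a[2]);
  }
};

// Runs the loop from the last iteration to the first and records the
// field-array index seen at each step.
inline std::vector<std::size_t> visit_backwards(const toy_stateless_counter &c) {
  std::vector<std::size_t> order;
  for (long iter = c.num_iters() - 1; iter >= 0; --iter)
    order.push_back(c.index_of(c.iter_to_indices(iter)));
  return order;
}
```

Because every method is `const` and nothing is cached between calls, several threads could query one such object at the same time without synchronization.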
Because the stateless approach allows multiple simultaneous threads to use a single instance of `ivec_loop_counter` without contention, it offers one possible approach to OpenMP parallelization.

However, `ivec_loop_counter` supports an alternative parallelization strategy that I think is likely to yield better performance, as discussed below.

Multi-threaded usage of ivec_loop_counter

`ivec_loop_counter` is designed to facilitate shared-memory parallelism of grid loops. To this end, in addition to the `gv`, `is`, `ie` arguments, the `ivec_loop_counter` constructor accepts optional arguments `nt` and `NT` (with `0 <= nt < NT`) indicating that the counter should run over just the `nt`th of `NT` equal-size subdivisions of the loop, each to be executed in its own thread.

Thus, an alternative way to execute a grid loop simultaneously on `NT` threads is to have each thread construct its own counter for its own subdivision.

Shared-memory-parallelized grid-loop macros
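Parallel grid loops of this kind ultimately hand each thread its own share of the iterations. As a rough illustration, the sketch below splits a flat iteration range into `NT` contiguous chunks and lets thread `nt` process only its chunk. The names `thread_range` and `count_visits` are hypothetical, and both the contiguity of the chunks and the handling of a remainder (chunk sizes differing by at most one) are assumptions, not details taken from the PR.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Half-open range [first, last) of flat iteration numbers owned by thread
// nt out of NT. Chunks are contiguous and differ in size by at most one.
inline std::pair<long, long> thread_range(long num_iters, int nt, int NT) {
  assert(NT > 0 && 0 <= nt && nt < NT && num_iters >= 0);
  const long base = num_iters / NT;
  const long extra = num_iters % NT;  // the first `extra` chunks get one more
  const long first = nt * base + (nt < extra ? nt : extra);
  const long last = first + base + (nt < extra ? 1 : 0);
  return {first, last};
}

// Each "thread" increments the entries in its own chunk. With OpenMP
// enabled the chunks run concurrently; otherwise the loop is serial.
// Chunks never overlap, so no two threads write the same element.
inline std::vector<int> count_visits(long num_iters, int NT) {
  std::vector<int> visits(num_iters, 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int nt = 0; nt < NT; ++nt) {
    const std::pair<long, long> r = thread_range(num_iters, nt, NT);
    for (long iter = r.first; iter < r.second; ++iter) visits[iter] += 1;
  }
  return visits;
}
```

If every entry of the returned vector equals 1, the chunks cover the whole range exactly once, which is the property a per-thread subdivision has to guarantee.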
The file `step_generic.cpp` in the existing meep codebase invokes macros like `LOOP_OVER_VOL_OWNED` to define custom-tuned versions of the full-grid loops needed for timestepping. `step_generic_stride1.cpp` is an automatically-generated version of this file with additional compiler optimizations for stride-1 loops.

This PR includes a new file `step_multithreaded.cpp` that replaces `LOOP_OVER_VOL_OWNED` with a new series of macros with names like `PLOOP_OVER_IVECS` (defined in `multithreading.hpp`) that expand to OpenMP-parallelized loop structures based on `ivec_loop_counter`.

Running multithreaded meep calculations

Multithreading is disabled by default. To enable it, set the environment variable `MEEP_NUM_THREADS` to the number of threads to use.

    % export MEEP_NUM_THREADS=6
    % python meep_script.py

Initial performance results

Here's a plot of execution time versus number of threads for just the full-grid loops invoked by `step_db` and `update_eh` in `step.cpp`.