
new approach to grid loops for exploration of OpenMP shared-memory parallelism#565

Closed
HomerReid wants to merge 20 commits into NanoComp:master from homerredux:ivec_loop_counter

Conversation

@HomerReid
Contributor

This PR proposes a new approach to handling multidimensional loops over ivecs, replacing the existing paradigm of the LOOP_OVER_IVECS macro and its derivatives.

The primary motivation is to facilitate exploration of OpenMP directives for shared-memory parallelism of grid loops, particularly in operations like step_db and update_eh that account for most of the cost of timestepping. The hard-coded triply-nested loop in the LOOP_OVER_IVECS macro doesn't lend itself to simple OpenMP parallelization and doesn't offer much flexibility to rearrange the loop structure for better performance.

ivec_loop_counter

The basic idea is to replace the LOOP_OVER_IVECS macro with a C++ class called ivec_loop_counter that tracks the progress of the loop via internal class data fields and provides methods that report relevant information, such as grid-point indices or coordinates, as necessary on each loop iteration.

Single-threaded usage of ivec_loop_counter

If gv is a grid_volume and is, ie are ivecs specifying the minimum and maximum corners of a rectangular region, then the conventional paradigm for looping over the region looks like:

   LOOP_OVER_IVECS(gv, is, ie, idx)
    { 
      // idx = ptrdiff_t-valued index into field array
      IVEC_LOOP_ILOC(gv, iloc);  // iloc = (integer) indices of current grid point
      IVEC_LOOP_LOC(gv, loc);    //  loc = (real-valued) spatial coordinates

    }

The field-array index idx is declared and computed automatically for each loop iteration by the macro, while the grid-point indices (iloc) and spatial coordinates (loc) may optionally be fetched by invoking the additional macros shown in the snippet.

Note that, regardless of the actual dimensions of the grid region, LOOP_OVER_IVECS always expands to three nested for loops whose order is fixed by the geometry dimensions (gv.dim). For example, if the target region has zero width (is.in_direction(d)==ie.in_direction(d)) in one or more directions d, the macro still produces single-iteration for loops in those dimensions. This is one reason it is difficult to apply OpenMP parallelization directives to LOOP_OVER_IVECS. (Another reason is that the loop initialization and increment statements are much more complicated than the bare-bones possibilities allowed in OpenMP-parallelized loops.)

The simplest invocation of ivec_loop_counter is to replace the above with
a single for loop:

   ivec_loop_counter ilc(gv, is, ie);
   for( ptrdiff_t idx=ilc.start(); !(ilc.complete); idx=++ilc )
    { ivec iloc = ilc.get_iloc();
       vec loc  = ilc.get_loc();
    }

The first line here invokes the ivec_loop_counter constructor, whose inputs (gv, is, ie) match the first three arguments to LOOP_OVER_IVECS. The increment of the for loop invokes the ++ operator, which is overloaded by ivec_loop_counter to represent the operation of updating the internal state to move to the next grid point in the loop. The complete flag is false while the loop is in progress and switches to true upon increment from the last grid point.

Note that idx (index into field arrays) is computed and returned automatically by the start and ++ methods of ivec_loop_counter, while the integer indices and real-space coordinates of the current grid point may optionally be obtained via the get_iloc and get_loc class methods.

Indices into PML sigma arrays

In addition to the idx index, some grid loops require, for each grid point, the values of one or more integer indices into PML sigma arrays for given directions. (These directions are called dsig, dsigu, or dsigw, and the corresponding indices are called k, ku, or kw.)

In the conventional approach, this is handled by

(1) invoking the KSTRIDE_DEF macro for the direction in question before entering the loop, and then

(2) invoking the macros DEF_k, DEF_ku, and/or DEF_kw inside the loop, i.e.

   KSTRIDE_DEF(dsig, k, corner)
   KSTRIDE_DEF(dsigu, ku, corner)
   ...
   LOOP_OVER_IVECS(...)
    { 
      DEF_k; DEF_ku; // integers k, ku now available
    }
   

The new paradigm updates this procedure as follows:

(1) the directions in question are passed as arguments to ivec_loop_counter::start() when the loop is initiated

(2) values of k indices are obtained by calling ivec_loop_counter.get_k:

   for( ptrdiff_t idx=ilc.start(dsig, dsigu); !(ilc.complete); idx=++ilc )
    { int k = ilc.get_k(0);
      int ku = ilc.get_k(1);
    }

Loops with explicit endpoints

The above calling convention is a convenient abstraction that hides details of loop indices and iteration counts. On the other hand, OpenMP loops and other applications require loops whose length is known in advance. The min_iter and max_iter class fields of ivec_loop_counter allow the above grid loop to be written as a one-dimensional loop with fixed starting and end points:

   ivec_loop_counter ilc(gv, is, ie);
   ilc.start();
   for(size_t niter=ilc.min_iter; niter<ilc.max_iter; niter++)
    { ptrdiff_t idx = ++ilc;
      ..;
    }

Separating the innermost loop from outer loops

As a performance optimization, the innermost loop may be split off from the outer loops like this:

  ivec_loop_counter ilc(gv, is, ie);
  for(ptrdiff_t idx=ilc.start(dsig); !(ilc.complete); idx=ilc.update_outer())
_Pragma(IVDEP)
   for( ; idx<ilc.idx_max; idx+=ilc.idx_step)

Here _Pragma(IVDEP) expands to #pragma GCC ivdep for GNU compilers and to #pragma ivdep for Intel compilers.

Stateful and stateless computational models

In all the examples above, instances of ivec_loop_counter maintain internal state that is updated as the loop progresses. ivec_loop_counter also offers an alternative stateless paradigm, in which the class stores only information about the geometry and maintains no internal state. This allows loop iterations to be executed in arbitrary order.

In the stateless approach, the niter-th iteration of the loop is executed by first calling the niter_to_narray() class method of ivec_loop_counter to convert the overall iteration number niter into an array of loop indices, which may then be passed to class methods like get_idx or get_iloc to fetch information on the niter-th grid point in the loop. For example, to fetch idx and iloc for grid point #54, we say

   size_t narray[3];
   size_t idx = ilc.niter_to_narray(54, narray);
   ivec iloc  = ilc.get_iloc(narray);

Here's how the entire grid loop could be executed backwards using the stateless approach:

   ivec_loop_counter ilc(gv, is, ie);
   for(size_t niter=ilc.max_iter; niter-- > ilc.min_iter; ) // decrement-before-test avoids unsigned wraparound when min_iter==0
    { size_t narray[3], idx=ilc.niter_to_narray(niter, narray);
      vec loc = ilc.get_loc(narray);
      ..;
    }

Because the stateless approach allows multiple simultaneous threads to use a single instance of ivec_loop_counter without contention, it offers one possible approach to OpenMP parallelization:

   ivec_loop_counter ilc(gv, is, ie);
#pragma omp parallel for schedule(dynamic), num_threads(NT)
   for(size_t niter=ilc.min_iter; niter<ilc.max_iter; niter++)
    { size_t narray[3]; 
      ptrdiff_t idx=ilc.niter_to_narray(niter, narray);
      ivec iloc = ilc.get_iloc(narray);
      ..
    }

However, ivec_loop_counter supports an alternative parallelization strategy that I think is likely to yield better performance, as discussed below.

Multi-threaded usage of ivec_loop_counter

ivec_loop_counter is designed to facilitate shared-memory parallelism of grid loops. To this end, in addition to the gv, is, ie arguments, the ivec_loop_counter constructor accepts optional arguments nt and NT (with 0 <= nt < NT) indicating that the counter should run over just the nt-th of NT equal-size subdivisions of the loop, each to be executed in its own thread.

Thus, here's an alternative way to execute a grid loop simultaneously on NT threads:

   std::vector<ivec_loop_counter> ilc;
   for(int nt=0; nt<NT; nt++)
    ilc.push_back(ivec_loop_counter(gv,is,ie,nt,NT));
#pragma omp parallel for schedule(static), num_threads(NT)
   for(int nt=0; nt<NT; nt++)
    for(ptrdiff_t idx=ilc[nt].start(); !(ilc[nt].complete); idx=++ilc[nt])
     {
     }

Shared-memory-parallelized grid-loop macros

The file step_generic.cpp in the existing meep codebase invokes macros like LOOP_OVER_VOL_OWNED to define custom-tuned versions of the full-grid loops needed for timestepping. step_generic_stride1.cpp is an automatically-generated version of this file with additional compiler optimizations for stride-1 loops.

This PR includes a new file step_multithreaded.cpp that replaces LOOP_OVER_VOL_OWNED with a new series of macros with names like PLOOP_OVER_IVECS (defined in multithreading.hpp) that expand to OpenMP-parallelized loop structures based on ivec_loop_counter.

Running multithreaded meep calculations

Multithreading is disabled by default. To enable it, set the environment variable MEEP_NUM_THREADS to the number of threads to use.

  • From the command line:
 % export MEEP_NUM_THREADS=6
 % python meep_script.py
  • From within Python:
 import os
 os.environ['MEEP_NUM_THREADS'] = '6'

Initial performance results

Here's a plot of execution time versus number of threads for just the full-grid loops invoked by step_db and update_eh in step.cpp.

[figure costvsthreads: execution time vs. number of threads]

@stevengj
Collaborator

step_boundaries was never really very optimized, because I think we thought that it wouldn't be performance limiting. Simple things to try:

  • Caching some of the pointers like chunks[j]->connections[ft][ip][Outgoing] outside the loops, since I'm not sure the compiler knows that these are constant?

  • Omitting the phases multiplication if the phases are trivial.

  • putting #pragma omp for or whatever on the comm_sizes[ft][ip][pair] loop, or on some of the other loops?

@stevengj
Collaborator

I just merged a PR #662 that reformats everything with clang-format, which creates lots of conflicts.

To rebase this PR onto master without having to deal manually with these formatting conflicts, follow the exact procedure given in the instructions of #662.

@HomerReid HomerReid closed this Apr 11, 2019
@stevengj stevengj mentioned this pull request May 23, 2020
@stevengj
Collaborator

It would still be nice to merge something along these general lines.

