
Feature/large grid #25

Closed
matthiaszilk wants to merge 5 commits into NanoComp:master from matthiaszilk:feature/large-grid

Conversation

@matthiaszilk

This is a fairly invasive change that enables very large grids with more than 2^31-1 grid points. The changes essentially consist of setting the type of meep::grid_volume::the_ntot and all related variables to ptrdiff_t (via a typedef) and tracking down all resulting errors, conversion warnings, and loop variables (for details see the commit message of 521cb13).
So far the changes have been tested only on x86-64 with gcc 4.8 and 4.9. I ran some simulations with very large grids (>2^31-1 grid points) for a dispersive gold grating (x and y periodic, z PMLs) and for a dipole emitter in a large box terminated with PMLs. The results obtained were consistent with smaller grid sizes and with unchanged versions of Meep. I can provide both the results and the ctl files if you are interested. If there are any further tests you would like me to perform, please let me know.
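For illustration, this is the overflow class being addressed: the total point count of a modest 3-D grid already exceeds a 32-bit int. A minimal sketch (grid_ntot is a hypothetical helper, not Meep's actual code):

```cpp
#include <cstddef>

// A 1300 x 1300 x 1300 grid has 2,197,000,000 points, which exceeds
// INT_MAX (2^31 - 1 = 2,147,483,647), so accumulating the total in a
// plain int overflows. Promoting one factor to ptrdiff_t before the
// multiply keeps the product in a 64-bit type on 64-bit systems.
std::ptrdiff_t grid_ntot(int nx, int ny, int nz) {
  return static_cast<std::ptrdiff_t>(nx) * ny * nz;
}
```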

Changed the default integer type for grid sizes from int to a 64-bit
signed integer type on 64-bit systems. This enables using large grids
with more than 2^31 grid points. For maintainability a new typedef
meep::integer is introduced in meep/meeptypes.hpp and is set to
ptrdiff_t. This ensures the proper sizes on both 32-bit and 64-bit
systems.

The changes were made according to the following procedure:
- the type of meep::grid_volume::the_ntot was updated to meep::integer
- everything was recompiled with -Wconversion under gcc 4.8.2 on a 64-bit
system
- all places with an implicit conversion from meep::integer to a smaller
integer type were updated iteratively until all warnings disappeared
- loop variables were updated from int to meep::integer when a
comparison against meep::integer was involved in the iteration
- all casts to int were checked and corrected where necessary
- additional casts were introduced to silence conversion warnings from
or to double
- print format specifiers were changed from %d to %ld, with additional
casts to long int to avoid warnings on 32-bit systems
- MPI communication wrappers were updated to accept meep::integer as the
message size. As MPI specifies only int-sized messages (and the messages
hopefully will not exceed this limit), internally the size is simply
downcast to int. However, checks of the message size were added before
downcasting, to abort when the message size does not allow a safe cast.
This at least ensures a proper error message when the supported message
size is exceeded.
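The typedef described above can be sketched as follows (the header name is taken from the commit message; this snippet is illustrative, not Meep's actual source):

```cpp
#include <cstddef>

namespace meep {
  // ptrdiff_t tracks the pointer size, so it is 32-bit on 32-bit
  // systems and 64-bit on 64-bit systems -- including LLP64 platforms,
  // where long int stays 32-bit but pointers are 64-bit.
  typedef std::ptrdiff_t integer;
}
```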
Collaborator

Why is this cast needed?

Author

Those casts were inserted to silence compiler warnings. Converting from a meep::integer to a double is a narrowing operation on a 64-bit system (the double mantissa has only 52 bits). To find all potential problems with the new 64-bit int type I had to use the gcc option -Wconversion, which warns when an implicit conversion may alter the value. The meep::integer -> double conversions gave hundreds of false positives, which made it hard to find the real problems, so I decided to insert static_cast<double> to make the conversion explicit. It turns off the compiler warning and documents that the conversion is actually intended. It looks a bit ugly, but it does not cause any overhead (the conversion has to happen anyway, whether implicit or explicit). So I would prefer to leave the casts.
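A minimal illustration of the pattern being discussed (the function and names are invented for illustration, not taken from Meep):

```cpp
#include <cstddef>

typedef std::ptrdiff_t integer;  // stand-in for meep::integer

// With -Wconversion, an implicit integer -> double conversion warns,
// because values above 2^53 cannot be represented exactly in a double
// (52-bit mantissa plus the implicit leading bit). The explicit cast
// silences the warning and documents the intent; it adds no runtime
// cost, since the conversion happens either way.
double inverse_count(integer ntot) {
  return 1.0 / static_cast<double>(ntot);
}
```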

@stevengj
Collaborator

stevengj commented Dec 2, 2015

A lot of the type conversions here seem superfluous, e.g. the calls to static_cast<double>.

@matthiaszilk
Author

The calls to static_cast<double> are required to suppress compiler warnings when -Wconversion is activated. They tell the compiler explicitly that the narrowing conversion from meep::integer (a 64-bit int on 64-bit systems) to double (which can hold only 52 bits in the mantissa) is intended. They should not impose any runtime overhead.

Some of the promotions from int to meep::integer might indeed be superfluous, but I decided to be conservative and promoted all variables that came into contact with a meep::integer. This might waste a little stack space in some places, but I don't consider that a severe problem. With some careful review one could identify the spurious promotions, but at the moment I don't see any benefit in doing so.

@stevengj
Collaborator

stevengj commented Dec 2, 2015

Fair enough.

src/mympi.cpp Outdated
Collaborator

inconsistent indentation here.

Collaborator

Since you have this check several times, it would be better to write a function:

static void check_size(meep::integer size) {
  if (size > INT32_MAX) abort("MPI message size exceeds maximum");
}

and then just call check_size.

Author

Sounds good. I'll fold the check into the cast operation by replacing static_cast<int>(size) with a separate function cast_message_size(size).
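A sketch of the cast_message_size idea (the real implementation lives in src/mympi.cpp and may differ in details such as the error-reporting mechanism):

```cpp
#include <climits>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

typedef std::ptrdiff_t integer;  // stand-in for meep::integer

// Verify that the message size fits MPI's int-typed count argument
// before downcasting, so oversized messages fail loudly instead of
// silently truncating.
static int cast_message_size(integer size) {
  if (size < 0 || size > INT_MAX) {
    std::fprintf(stderr, "MPI message size exceeds maximum\n");
    std::abort();
  }
  return static_cast<int>(size);
}
```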

src/mympi.cpp Outdated
Collaborator

Seems like there should be a check somewhere that sizeof(long int) == sizeof(meep::integer).

Author

As long as sizeof(long int) >= sizeof(meep::integer), everything is fine because the data of pt is copied. However, on systems with the LLP64 data model (are there any except MS Windows?) there will be trouble (maybe also in other places?).
One way to resolve the issue would be to use long long int instead of long int.
Another way would be to use meep::integer and define a suitable MPI data type, e.g. MPI_MEEP_INTEGER, that maps to MPI_INT, MPI_LONG, or MPI_LONG_LONG_INT, depending on the size of meep::integer. Then it would be possible to write

ivec max_to_all(const ivec &pt) {
  meep::integer in[5], out[5];
  for (int i = 0; i < 5; ++i) in[i] = out[i] = pt.in_direction(direction(i));
#ifdef HAVE_MPI
  MPI_Allreduce(in, out, 5, MPI_MEEP_INTEGER, MPI_MAX, mycomm);
#endif
  ivec ptout(pt.dim);
  for (int i = 0; i < 5; ++i) ptout.set_direction(direction(i), out[i]);
  return ptout;
}

I will try this and see how far I can get with AC_CHECK_SIZEOF and the preprocessor.
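A sketch of the preprocessor mapping this would require, assuming configure runs AC_CHECK_SIZEOF so that SIZEOF_INT, SIZEOF_LONG, and SIZEOF_PTRDIFF_T are defined (these macro names are autoconf's defaults; the actual configure.ac may differ):

```cpp
/* In a header included after <mpi.h> and the generated config.h.
   Sizes must be compared via configure-time macros, since sizeof
   cannot be evaluated in a preprocessor conditional. */
#if SIZEOF_PTRDIFF_T == SIZEOF_INT
#  define MPI_MEEP_INTEGER MPI_INT
#elif SIZEOF_PTRDIFF_T == SIZEOF_LONG
#  define MPI_MEEP_INTEGER MPI_LONG
#else
#  define MPI_MEEP_INTEGER MPI_LONG_LONG_INT
#endif
```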

- added a define MPI_MEEP_INTEGER that maps to the appropriate MPI type
- added MPI communication wrappers and auxiliary functions for long long int
  (if the type is supported by the compiler)
@matthiaszilk
Author

I added precautions for the case sizeof(long int) < sizeof(meep::integer). Please have a look.

@stevengj
Collaborator

stevengj commented Dec 4, 2015

Looking good. I assume you've tried make check? (It would be nice to set up Travis testing at some point.)

@stevengj
Collaborator

stevengj commented Dec 4, 2015

LIBS="$BLAS_LIBS $LAPACK_LIBS $BLAS_LIBS ... should be safe enough, if slightly ugly. But putting it in a separate PR would be good in any case.

@stevengj
Collaborator

stevengj commented Dec 4, 2015

Shouldn't the order be -lblas -llapack -lgsl? No, I think -lgsl -llapack -lblas is right: libraries should come in reverse dependency order (i.e. if A depends on B, then A should come before B on the link line, so that the linker can resolve A's undefined symbols from B).

Could it be that your LAPACK library is providing its own BLAS? (Sometimes liblapack bundles a BLAS, or vice versa.)

@matthiaszilk
Author

No, my LAPACK lib is not providing its own BLAS; I checked. I suspect it is some issue with libgsl that is specific to Ubuntu. I will try another OS or compile libgsl from source to see whether the bug persists.

@matthiaszilk
Author

Yes, I ran make check; I always do before pushing any changes. All tests pass (tested with MPI and MPB).
Today I also set up a 32-bit VM (Ubuntu 14.04.3, gcc 4.8.2) to make sure everything is fine there, too. Unfortunately, on the 32-bit system things do not look as good: two tests are failing, symmetry and three_d. I think this is not due to the changes of the current PR; on master the tests fail as well. Here are the logs for both branches:
test-suite_feature_large_grid.txt
test-suite_master.txt
The symmetry test passes when run serially, or when eps_compare in symmetry.cpp is increased from 1.0e-9 to 2.0e-9, so it might be just an accuracy issue. The three_d test also fails when run serially, and slightly increasing the tolerance does not help either. Do you have any suggestions on how to proceed?

@matthiaszilk
Author

Reducing the optimization level from -O3 to -O2 solved the problem; now all tests pass under 32-bit as well.
After playing with some optimization options, I'm quite sure that -ftree-vectorize causes the tests to fail.

@stevengj
Collaborator

Was there a measurable performance penalty to turning off auto-vectorization? (Probably this flag only helps in step_generic.cpp and step_generic_stride1.cpp, so we could potentially use it only there.) Perhaps this is a gcc bug? What gcc version do you have?

@matthiaszilk
Author

I tested mainly with gcc 4.8 and 4.9, and to a minor extent with gcc 5; all versions show the same behavior. The test failures occur when SSE2 or AVX is activated, either by using -msse2, -mavx, and so on, or by setting an appropriate architecture via the -march option. I tested several configurations of the -march flag and can provide the results if you are interested. I did only a little benchmarking based on the test suite (total runtime, time-stepping time in near2far). On the virtual machine there is no performance penalty when vectorization is turned off; maybe the speed is limited by the memory bandwidth. However, some testing on a real machine should be performed to be really sure. Unfortunately, I can do "real" benchmarks only on 64-bit machines at the moment.

This was referenced Feb 2, 2018
@stevengj
Collaborator

stevengj commented Feb 2, 2018

#193 is a more minimal version of this change that hopefully addresses the needs for large grids.

@stevengj stevengj closed this Feb 2, 2018