Conversation

Added threading to one of the loops in `calc_matrix_elements`. I have not checked performance yet. Tests pass. However, I tested with

I just realized the values in the output are only written out to 8 decimal places, so my race condition is not producing any observable error in the tests in the test suite. Very strange 🤔

I have tested the attached file (216 atoms, which scales happily enough to 9 MPI processes) on combinations of 1 to 8 MPI processes with 1 to 4 OpenMP threads and found the same energy to 10 decimal places in all cases. I'm not sure if this helps!

It's always awkward when the code is producing correct results and you don't understand why 😁 I wrote this bit of code to test the concept, and it does indeed produce a race condition when I don't declare
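For illustration, here is a minimal sketch of that concept in C (the Conquest code itself is Fortran, and the author's actual test snippet is not shown here; `sum_racy` and `sum_safe` are made-up names). A scratch variable left shared, combined with an unprotected accumulation, races across threads, while `private` and `reduction` make the same loop safe:

```c
/* Racy version: tmp is shared by default and sum is updated with an
 * unsynchronised read-modify-write, so threads clobber each other. */
double sum_racy(int n) {
    double sum = 0.0, tmp;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        tmp = 1.0;   /* every thread writes the same shared tmp */
        sum += tmp;
    }
    return sum;
}

/* Safe version: tmp is private per thread and sum is a reduction. */
double sum_safe(int n) {
    double sum = 0.0, tmp;
    #pragma omp parallel for private(tmp) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        tmp = 1.0;
        sum += tmp;
    }
    return sum;
}
```

With more than one thread, `sum_racy(n)` typically comes out below `n`, while `sum_safe(n)` is always exactly `n`. As observed above, a racy answer can still look correct when printed at only 8 decimals or run with few threads.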
This reverts commit 5875d9c.

I think this PR is now ready to merge to development. We can address #244 in a separate PR to avoid cluttering this one further. Here is a performance comparison to development; I ran these on young using 8 MPI ranks and the attached input files.
Even the serial performance is significantly improved! I've started work on the code for #244, and it looks like we can gain a bit more speedup in threaded performance.

Running the LiCl test (input files above) on a local cluster (AMD, Intel compiler 2020_u4, MKL, OpenMPI 4.1.3) I get a segfault when using two threads and 9 MPI processes (the fault seems to come from a reduction). Another segfault occurs with GCC 11.2.0 on the same system. On a MacBook Pro M2 with GCC 12.2 and OpenMPI 4.1.5 the same test runs without problems.

I find the same problem with your test above.
|
Interesting, is it possible to get access to the cluster? On young I'm using
gcc and openmpi are available there as well, I can see if I can build with them |

The problem comes from

We are now confident that the problem here is coming from the OpenMP stack (the errors I found were caused by a stack overflow...). Setting the environment variable OMP_STACKSIZE fixes the issue; in particular, I found that a value somewhere around 50M was helpful, though this might require a little testing. I will add this to the documentation before approving the PR.
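Concretely, the workaround might look like this in a job script. This is a sketch only: the launcher line and binary name are placeholders, and 50M is the rough figure found above, which may need tuning per system and input:

```shell
# Give each OpenMP worker thread a larger stack to avoid the
# stack-overflow segfaults seen in the threaded reductions.
export OMP_STACKSIZE=50M

# Placeholder launch line: 8 MPI ranks with 2 OpenMP threads each.
export OMP_NUM_THREADS=2
mpirun -np 8 ./Conquest
```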
Changed the output so that the number of threads is only written out when we compile with OpenMP. Added brief notes on compiling, along with advice on the OMP_STACKSIZE variable.
The threads variable is only used for output, so it does not need to be global. Removed it from main and global and consigned it to write_info.
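The arrangement might look as follows in a C sketch (the real code is Fortran, where the `_OPENMP` macro or the `!$` conditional-compilation sentinel serves the same purpose; `info_thread_count` is an illustrative name, not a real routine):

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: positive only when the build actually has OpenMP. */
static int info_thread_count(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 0;   /* serial build: no thread count to report */
#endif
}

/* Sketch of the idea behind write_info: the thread count is queried and
 * printed locally, so no global threads variable is needed, and a serial
 * build prints no thread line at all. */
void write_info(void) {
    int nthreads = info_thread_count();
    if (nthreads > 0)
        printf("Number of OpenMP threads: %d\n", nthreads);
}
```

Keeping the query local to the output routine is what lets the variable be dropped from main and the global module.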
.github/workflows/makefile.yml (outdated)

Suggested change:

    exclude:
      - np: 2
        threads: 2
src/PAO_grid_transform_module.f90 (outdated)

    ! Note: Using OpenMP in this loop requires some redesign because there is a
    ! loop-carried dependency in next_offset_position.
    blocks_loop: do iblock = 1, domain%groups_on_node ! primary set of blocks
       part_in_block: if(naba_atoms_of_blocks(atomf)%no_of_part(iblock) > 0) then ! if there are naba atoms

This check may not be necessary: if naba_atoms_of_blocks(atomf)%no_of_part(iblock) == 0, the loop below will not execute anyway?
PR to close #178
This PR adds threading over the biggest loops over blocks in `calc_matrix_elements_module` and `PAO_grid_transform_module`.

In `calc_matrix_elements_module`, threading is done in the subroutines `get_matrix_elements_new` and `act_on_vectors_new`. Both subroutines have a deep loop nest where the loop over blocks is the outermost loop. In `get_matrix_elements_new` the innermost loop accumulates data in `send_array`, so a reduction is done at the end of the parallel region. In `act_on_vectors_new`, `gridfunctions%griddata` is updated with a `gemm` call in the innermost loop; each loop iteration updates separate indices, so all threads can operate on it in parallel.
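The `send_array` pattern can be sketched in C as follows (a hedged illustration, not the actual Fortran; `NBLOCKS`, `NELEM`, and `contribution()` are made-up stand-ins for the block loop and its data). Each thread gets a private zero-initialised copy of the accumulator, and the copies are combined element-wise when the parallel region ends:

```c
#include <string.h>

#define NBLOCKS 64
#define NELEM   8

/* Made-up deterministic per-block data standing in for the real integrals. */
static double contribution(int iblock, int j) {
    return (double)(iblock % 3) + 0.5 * j;
}

/* Accumulate all blocks into send_array using an OpenMP array reduction
 * (supported since OpenMP 4.5 via array sections). */
void accumulate(double send_array[NELEM]) {
    memset(send_array, 0, NELEM * sizeof(double));
    #pragma omp parallel for reduction(+:send_array[:NELEM])
    for (int iblock = 0; iblock < NBLOCKS; iblock++)
        for (int j = 0; j < NELEM; j++)
            send_array[j] += contribution(iblock, j);
}
```

A side note on the stack issue discussed above: each thread's private copy of the accumulator lives on the OpenMP worker stack, which is plausibly why a reduction over a large array was where the OMP_STACKSIZE overflows surfaced.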
In `PAO_grid_transform_module`, the `single_PAO_to_grid` subroutine is rewritten so that threading over blocks can be done. The innermost loop updates `gridfunctions%griddata`, but the index was calculated sequentially inside the loop, which made it unsafe for threading. The rewritten subroutine first precomputes the indices and stores them in an array, then does the loop over blocks, which can now safely be threaded. The `single_PAO_to_grad` subroutine has not been refactored; it will be removed in a separate PR and its functionality merged with `single_PAO_to_grid` to address #244.
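The refactor described for `single_PAO_to_grid` can be sketched like this (again in C, with made-up helpers `block_size` and `fill_block` standing in for the real per-block work). The running offset that created the loop-carried dependency is replaced by a cheap serial prefix sum; every iteration then writes to a disjoint slice, and the heavy block loop threads safely:

```c
#include <stddef.h>

#define NBLOCKS 16

/* Made-up per-block sizes standing in for the naba-atom bookkeeping. */
static int block_size(int iblock) { return 2 + (iblock % 4); }

/* Made-up per-block work writing only into its own slice of griddata. */
static void fill_block(double *griddata, size_t offset, int iblock) {
    for (int j = 0; j < block_size(iblock); j++)
        griddata[offset + j] = (double)iblock;
}

/* Step 1: a serial prefix sum precomputes every block's starting offset
 * (this is where the old loop-carried dependency lived).
 * Step 2: each iteration of the block loop now touches a disjoint range,
 * so it can be threaded.  Returns the total number of points written. */
size_t transform(double *griddata) {
    size_t offsets[NBLOCKS + 1];
    offsets[0] = 0;
    for (int i = 0; i < NBLOCKS; i++)
        offsets[i + 1] = offsets[i] + (size_t)block_size(i);

    #pragma omp parallel for
    for (int i = 0; i < NBLOCKS; i++)
        fill_block(griddata, offsets[i], i);

    return offsets[NBLOCKS];
}
```

The prefix sum is O(number of blocks) and serial, but the per-block work dominates, so the parallel loop recovers almost all of the available speedup.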