
Thread loops over blocks#195

Merged
tkoskela merged 48 commits into develop from tk-thread-calc-matrix on Aug 29, 2023

Conversation

@tkoskela
Contributor

@tkoskela tkoskela commented Jun 9, 2023

PR to close #178

This PR adds threading over the biggest loops over blocks in calc_matrix_elements_module and PAO_grid_transform_module.

In calc_matrix_elements_module, threading is done in the subroutines get_matrix_elements_new and act_on_vectors_new. Both subroutines have a deep loop nest in which the loop over blocks is the outermost loop. In get_matrix_elements_new the innermost loop accumulates data in send_array, so a reduction is performed at the end of the parallel region. In act_on_vectors_new, gridfunctions%griddata is updated with a gemm call in the innermost loop; each iteration writes to a distinct set of indices, so all threads can operate on it in parallel.
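The reduction pattern described above can be sketched roughly as follows. This is not the actual Conquest code: nblocks, block_data and the array shapes are made-up stand-ins for the real variables, and the deep loop nest is collapsed to a single accumulation.

```fortran
program reduction_sketch
  implicit none
  integer, parameter :: nblocks = 8, n = 4
  integer :: iblock
  real(8) :: send_array(n), block_data(n, nblocks)

  block_data = 1.0d0
  send_array = 0.0d0

  ! Each thread accumulates into a private copy of send_array;
  ! OpenMP sums the copies when the parallel loop ends.
  !$omp parallel do reduction(+: send_array)
  do iblock = 1, nblocks
     ! ... stands in for the deep loop nest over atoms in this block ...
     send_array = send_array + block_data(:, iblock)
  end do
  !$omp end parallel do

  print *, send_array
end program reduction_sketch
```

Without the reduction clause, send_array would be shared and concurrent updates could be lost.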

In PAO_grid_transform_module, the single_PAO_to_grid subroutine is rewritten so that the loop over blocks can be threaded. The innermost loop updates gridfunctions%griddata, but the index into it was previously computed sequentially inside the loop, which made the loop unsafe to thread. The rewritten subroutine first precomputes the indices and stores them in an array, then runs the loop over blocks, which can now safely be threaded. The single_PAO_to_grad subroutine has not been refactored; it will be removed in a separate PR and its functionality merged into single_PAO_to_grid to address #244
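The precompute-then-thread pattern can be sketched like this (a simplified stand-in, not the actual subroutine; the block sizes and variable names are invented):

```fortran
program precompute_sketch
  implicit none
  integer, parameter :: nblocks = 6
  integer :: iblock, next, offset(nblocks), sz(nblocks)
  real(8) :: griddata(21)

  sz = (/ 3, 5, 2, 4, 1, 6 /)   ! per-block sizes (made up)

  ! Pass 1 (serial): each offset depends on the previous one, a
  ! loop-carried dependency, so compute all offsets up front.
  next = 1
  do iblock = 1, nblocks
     offset(iblock) = next
     next = next + sz(iblock)
  end do

  ! Pass 2 (parallel): every iteration now writes a disjoint slice
  ! of griddata, so the loop over blocks is safe to thread.
  griddata = 0.0d0
  !$omp parallel do
  do iblock = 1, nblocks
     griddata(offset(iblock):offset(iblock)+sz(iblock)-1) = real(iblock, 8)
  end do
  !$omp end parallel do
end program precompute_sketch
```

The serial prefix-sum pass is cheap compared with the work inside each block, so this restructuring costs little.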

@tkoskela changed the title from "Tk thread calc matrix" to "Thread loops over blocks in calc_matrix_elements" Jun 9, 2023
@tkoskela tkoskela changed the base branch from master to develop June 9, 2023 15:59
@tkoskela
Contributor Author

tkoskela commented Jun 9, 2023

Added threading to one of the loops in calc_matrix_elements. Did not check performance yet.

Tests pass. However, I tested with send_array as a shared variable, which should create a data race, and I had to go to 9 decimal places before the testsuite picked up a difference from the reduction implementation. Against the reference data currently in the tests, my tests fail at 7 decimal places (with no changes relative to develop). That is a bit concerning: we don't have a test that is accurate enough to pick up the race condition.

@tkoskela changed the title from "Thread loops over blocks in calc_matrix_elements" to "WIP: Thread loops over blocks in calc_matrix_elements" Jun 9, 2023
@tkoskela added the labels "area: main-source" (Relating to the src/ directory (main Conquest source code)), "improves: speed" (Speed-up of code), "time: days", "type: enhancement" (Planned enhancement being suggested by developers) and "project: eCSE8" Jun 9, 2023
@tkoskela
Contributor Author

tkoskela commented Jun 13, 2023

I just realized the values in the output are only written out to 8 decimal places. So my race condition is not producing any observable error in the tests in the testsuite. Very strange 🤔
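To illustrate the masking effect (a toy example, not the testsuite code): two values that differ only past the 9th decimal place print identically when written at 8 decimals.

```fortran
program precision_sketch
  implicit none
  real(8) :: a, b
  a = 1.23456789012d0
  b = a + 3.0d-10        ! differs from a only past the 9th decimal
  write(*, '(f12.8)') a  ! both lines print the same 8-decimal value
  write(*, '(f12.8)') b
end program precision_sketch
```

Any race whose error stays below the last printed digit is invisible to a comparison against this output.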

@davidbowler
Contributor

I have tested the attached file (216 atoms, which scales happily enough to 9 MPI processes) on combinations of 1 to 8 MPI processes with 1 to 4 OpenMP threads, and found the same energy to 10 decimal places in all cases. I'm not sure if this helps!

LiCltest.tgz

@tkoskela
Contributor Author

It's always awkward when the code is producing correct results and you don't understand why 😁

I wrote this bit of code to test the concept, and it does indeed produce a race condition when I don't declare array_to_reduce as a reduction variable. I don't quite understand what is different in calc_matrix_elements, but I'll keep digging.
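The concept test was along these lines (a reconstruction of the idea, not the original snippet): with the reduction clause the result is reproducible, and dropping it lets the threads race on the shared array.

```fortran
program race_sketch
  implicit none
  integer, parameter :: niter = 1000000
  integer :: i
  real(8) :: array_to_reduce(4)

  array_to_reduce = 0.0d0
  ! With reduction(+: array_to_reduce) each thread updates a private
  ! copy, combined at the end. Removing the clause makes the array
  ! shared, and concurrent read-modify-write updates can be lost.
  !$omp parallel do reduction(+: array_to_reduce)
  do i = 1, niter
     array_to_reduce = array_to_reduce + 1.0d-6
  end do
  !$omp end parallel do

  print '(4f14.9)', array_to_reduce
end program race_sketch
```

Printing at 9 decimals is deliberate: at lower precision the lost updates can round away, which is exactly the masking seen in the testsuite.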

@tkoskela changed the title from "Thread loops over blocks in calc_matrix_elements" to "Thread loops over blocks" Aug 24, 2023
@tkoskela tkoskela marked this pull request as ready for review August 24, 2023 07:27
@tkoskela
Contributor Author

I think this PR is now ready to merge into develop. We can address #244 in a separate PR to avoid cluttering this one further. Here is a performance comparison against develop. I ran these on Young using 8 MPI ranks and the attached input files:
input_data.zip

|                               | develop  | This PR  |
|-------------------------------|----------|----------|
| Without -fopenmp              | 95.331 s | 63.482 s |
| With -fopenmp, 1 OMP thread   | 95.597 s | 62.324 s |
| With -fopenmp, 4 OMP threads  |          | 36.491 s |

Even the serial performance is significantly improved! I've started work on the code for #244 and it looks like we can gain a bit more speedup in threaded performance.

@tkoskela tkoskela requested a review from ilectra August 24, 2023 08:03
@tkoskela tkoskela linked an issue Aug 24, 2023 that may be closed by this pull request
@davidbowler
Contributor

Running the LiCl test (input files above) on a local cluster (AMD, Intel compiler 2020_u4, MKL, OpenMPI 4.1.3), I get a segfault when using two threads and 9 MPI processes (the fault seems to come from a reduction). Another segfault occurs with GCC 11.2.0 on the same system. On a MacBook Pro M2 with GCC 12.2 and OpenMPI 4.1.5 the same test runs without problems.

@davidbowler
Contributor

I find the same problem with your test above.

@tkoskela
Contributor Author

Interesting. Is it possible to get access to the cluster?

On Young I'm using

  • compilers/intel/2022.2
  • mpi/intel/2021.6.0/intel

GCC and OpenMPI are available there as well; I can see if I can build with them.

@davidbowler
Contributor

The problem comes from calc_matrix_elements_new (found by trial and error, commenting out different areas), but I don't know more yet.

@davidbowler
Contributor

We are now confident that the problem here is coming from the OpenMP stack (the errors I found were caused by a stack overflow). Setting the environment variable OMP_STACKSIZE fixes the issue; in particular, I found that a value of around 50M was helpful, though this might require a little testing.

I will add to the documentation before approving the PR.
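For reference, the workaround looks like this in a job script (the 50M value comes from the discussion above and may need tuning per system; the thread count is just an example):

```shell
# Enlarge each OpenMP thread's private stack before running Conquest
export OMP_STACKSIZE=50M
export OMP_NUM_THREADS=4
echo "$OMP_STACKSIZE"
```

OMP_STACKSIZE controls the stack of the worker threads OpenMP creates; the main thread's stack is governed by the usual shell limit instead.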

Changed output so that number of threads is only written out
when we compile with OpenMP.  Brief notes on compiling along
with advice on OMP_STACKSIZE variable added.
The threads variable is only used for output, so does not need
to be in global.  Removed from main and global and consigned
to write_info.
@davidbowler davidbowler added this to the Release v1.3 milestone Aug 25, 2023
Comment on lines +36 to +38
    exclude:
      - np: 2
        threads: 2
Contributor Author

Suggested change (remove these lines):

    exclude:
      - np: 2
        threads: 2

! Note: Using OpenMP in this loop requires some redesign because there is a
! loop-carried dependency in next_offset_position.
blocks_loop: do iblock = 1, domain%groups_on_node ! primary set of blocks
part_in_block: if(naba_atoms_of_blocks(atomf)%no_of_part(iblock) > 0) then ! if there are naba atoms
Contributor Author


This may not be necessary: if naba_atoms_of_blocks(atomf)%no_of_part(iblock) == 0, the loop below will not execute anyway?

@tkoskela tkoskela merged commit a4b0378 into develop Aug 29, 2023
@tkoskela tkoskela deleted the tk-thread-calc-matrix branch August 29, 2023 15:37


Development

Successfully merging this pull request may close these issues:

- Using the OpenMP library
- Add threading over blocks in calc_matrix_elements
