Conversation

Added threading to one of the loops in `calc_matrix_elements`. I have not checked performance yet. Tests pass. However, I tested with

I just realized the values in the output are only written out to 8 decimal places, so my race condition is not producing any observable error in the tests in the test suite. Very strange 🤔

I have tested the attached file (216 atoms, which scales happily enough to 9 MPI processes) on combinations of 1 to 8 MPI processes with 1 to 4 OpenMP threads and found the same energy to 10 decimal places in all cases. I'm not sure if this helps!

It's always awkward when the code is producing correct results and you don't understand why 😁 I wrote this bit of code to test the concept, and it does indeed produce a race condition when I don't declare
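For illustration, here is a minimal sketch of that concept in C (the Conquest code itself is Fortran, and the author's actual test snippet is not shown here; `sum_racy` and `sum_safe` are made-up names). A scratch variable left shared, combined with an unprotected accumulation, races across threads, while `private` and `reduction` make the same loop safe:

```c
/* Racy version: tmp is shared by default and sum is updated with an
 * unsynchronised read-modify-write, so threads clobber each other. */
double sum_racy(int n) {
    double sum = 0.0, tmp;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        tmp = 1.0;   /* every thread writes the same shared tmp */
        sum += tmp;
    }
    return sum;
}

/* Safe version: tmp is private per thread and sum is a reduction. */
double sum_safe(int n) {
    double sum = 0.0, tmp;
    #pragma omp parallel for private(tmp) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        tmp = 1.0;
        sum += tmp;
    }
    return sum;
}
```

With more than one thread, `sum_racy(n)` typically comes out below `n`, while `sum_safe(n)` is always exactly `n`. As observed above, a racy answer can still look correct when printed at only 8 decimals or run with few threads.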
This reverts commit 5875d9c.

I think this PR is now ready to merge to development. We can address #244 in a separate PR to avoid cluttering this one further. Here is a performance comparison to development; I ran these on young using 8 MPI ranks and the attached input files.
Even the serial performance is significantly improved! I've started work on the code for #244, and it looks like we can gain a bit more speedup in threaded performance.

Running the LiCl test (input files above) on a local cluster (AMD, Intel compiler 2020_u4, MKL, OpenMPI 4.1.3) I get a segfault when using two threads and 9 MPI processes (the fault seems to come from a reduction). Another segfault occurs with GCC 11.2.0 on the same system. On a MacBook Pro M2 with GCC 12.2 and OpenMPI 4.1.5 the same test runs without problems.

I find the same problem with your test above.
|
Interesting, is it possible to get access to the cluster? On young I'm using
gcc and openmpi are available there as well, I can see if I can build with them |

The problem comes from

We are now confident that the problem here is coming from the OpenMP stack (the errors I found were caused by a stack overflow...). Setting the environment variable OMP_STACKSIZE fixes the issue; in particular, I found that a value somewhere around 50M was helpful, though this might require a little testing. I will add this to the documentation before approving the PR.
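Concretely, the workaround might look like this in a job script. This is a sketch only: the launcher line and binary name are placeholders, and 50M is the rough figure found above, which may need tuning per system and input:

```shell
# Give each OpenMP worker thread a larger stack to avoid the
# stack-overflow segfaults seen in the threaded reductions.
export OMP_STACKSIZE=50M

# Placeholder launch line: 8 MPI ranks with 2 OpenMP threads each.
export OMP_NUM_THREADS=2
mpirun -np 8 ./Conquest
```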
Changed the output so that the number of threads is only written out when we compile with OpenMP. Added brief notes on compiling, along with advice on the OMP_STACKSIZE variable.
The threads variable is only used for output, so it does not need to be global. Removed it from main and global and consigned it to write_info.
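The arrangement might look as follows in a C sketch (the real code is Fortran, where the `_OPENMP` macro or the `!$` conditional-compilation sentinel serves the same purpose; `info_thread_count` is an illustrative name, not a real routine):

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: positive only when the build actually has OpenMP. */
static int info_thread_count(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 0;   /* serial build: no thread count to report */
#endif
}

/* Sketch of the idea behind write_info: the thread count is queried and
 * printed locally, so no global threads variable is needed, and a serial
 * build prints no thread line at all. */
void write_info(void) {
    int nthreads = info_thread_count();
    if (nthreads > 0)
        printf("Number of OpenMP threads: %d\n", nthreads);
}
```

Keeping the query local to the output routine is what lets the variable be dropped from main and the global module.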
.github/workflows/makefile.yml (outdated)

Suggested change:

    exclude:
      - np: 2
        threads: 2
src/PAO_grid_transform_module.f90 (outdated)

    ! Note: Using OpenMP in this loop requires some redesign because there is a
    ! loop-carried dependency in next_offset_position.
    blocks_loop: do iblock = 1, domain%groups_on_node ! primary set of blocks
       part_in_block: if(naba_atoms_of_blocks(atomf)%no_of_part(iblock) > 0) then ! if there are naba atoms

This check may not be necessary: if naba_atoms_of_blocks(atomf)%no_of_part(iblock) == 0, the loop below will not execute anyway?
PR to close #178
This PR adds threading over the biggest loops over blocks in `calc_matrix_elements_module` and `PAO_grid_transform_module`.

In `calc_matrix_elements_module`, threading is done in the subroutines `get_matrix_elements_new` and `act_on_vectors_new`. Both subroutines have a deep loop nest where the loop over blocks is the outermost loop. In `get_matrix_elements_new` the innermost loop accumulates data in `send_array`, so a reduction is done at the end of the parallel region. In `act_on_vectors_new`, `gridfunctions%griddata` is updated with a `gemm` call in the innermost loop; each loop iteration updates separate indices, so all threads can operate on it in parallel.
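The `send_array` pattern can be sketched in C as follows (a hedged illustration, not the actual Fortran; `NBLOCKS`, `NELEM`, and `contribution()` are made-up stand-ins for the block loop and its data). Each thread gets a private zero-initialised copy of the accumulator, and the copies are combined element-wise when the parallel region ends:

```c
#include <string.h>

#define NBLOCKS 64
#define NELEM   8

/* Made-up deterministic per-block data standing in for the real integrals. */
static double contribution(int iblock, int j) {
    return (double)(iblock % 3) + 0.5 * j;
}

/* Accumulate all blocks into send_array using an OpenMP array reduction
 * (supported since OpenMP 4.5 via array sections). */
void accumulate(double send_array[NELEM]) {
    memset(send_array, 0, NELEM * sizeof(double));
    #pragma omp parallel for reduction(+:send_array[:NELEM])
    for (int iblock = 0; iblock < NBLOCKS; iblock++)
        for (int j = 0; j < NELEM; j++)
            send_array[j] += contribution(iblock, j);
}
```

A side note on the stack issue discussed above: each thread's private copy of the accumulator lives on the OpenMP worker stack, which is plausibly why a reduction over a large array was where the OMP_STACKSIZE overflows surfaced.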
In `PAO_grid_transform_module`, the `single_PAO_to_grid` subroutine is rewritten so that threading over blocks can be done. The innermost loop updates `gridfunctions%griddata`, but the index was calculated sequentially inside the loop, which made it unsafe for threading. The rewritten subroutine first precomputes the indices and stores them in an array, then does the loop over blocks, which can now safely be threaded. The `single_PAO_to_grad` subroutine has not been refactored; it will be removed in a separate PR and its functionality merged with `single_PAO_to_grid` to address #244.
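The refactor described for `single_PAO_to_grid` can be sketched like this (again in C, with made-up helpers `block_size` and `fill_block` standing in for the real per-block work). The running offset that created the loop-carried dependency is replaced by a cheap serial prefix sum; every iteration then writes to a disjoint slice, and the heavy block loop threads safely:

```c
#include <stddef.h>

#define NBLOCKS 16

/* Made-up per-block sizes standing in for the naba-atom bookkeeping. */
static int block_size(int iblock) { return 2 + (iblock % 4); }

/* Made-up per-block work writing only into its own slice of griddata. */
static void fill_block(double *griddata, size_t offset, int iblock) {
    for (int j = 0; j < block_size(iblock); j++)
        griddata[offset + j] = (double)iblock;
}

/* Step 1: a serial prefix sum precomputes every block's starting offset
 * (this is where the old loop-carried dependency lived).
 * Step 2: each iteration of the block loop now touches a disjoint range,
 * so it can be threaded.  Returns the total number of points written. */
size_t transform(double *griddata) {
    size_t offsets[NBLOCKS + 1];
    offsets[0] = 0;
    for (int i = 0; i < NBLOCKS; i++)
        offsets[i + 1] = offsets[i] + (size_t)block_size(i);

    #pragma omp parallel for
    for (int i = 0; i < NBLOCKS; i++)
        fill_block(griddata, offsets[i], i);

    return offsets[NBLOCKS];
}
```

The prefix sum is O(number of blocks) and serial, but the per-block work dominates, so the parallel loop recovers almost all of the available speedup.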