
276 thread exx phi on grid #324

Merged
connoraird merged 5 commits into 276-combine-nsf-loops from 276-thread-exx-phi-on-grid on Feb 20, 2024

Conversation

@connoraird (Contributor) commented Feb 9, 2024

Description

  • The xyz nested loop has been thread-parallelised in exx_phi_on_grid

Speedup plot

These plots show the performance of the test test_004_isol_C2H4_4proc_PBE0CRI for 1 MPI process.

[Speedup plot: 276-thread-phi-on-grid]

@tkoskela (Contributor) left a comment

This is a nice, minimal change that gets a parallel performance boost! I like it.

!print*,
xyz_offset = xyz + rst
!$omp parallel do collapse(3) schedule(runtime) default(none) &
!$omp shared(mx,my,mz,px,py,pz,grid_spacing,xyz_offset,pao,spec,phi_on_grid,i_dummy,exx_cartesian,extent) &
Contributor

Shorter lines please 🥺

Comment on lines +165 to +167
x = nx*grid_spacing + xyz_offset(1)
y = ny*grid_spacing + xyz_offset(2)
z = nz*grid_spacing + xyz_offset(3)
Contributor

This is fine. If there's spare time, I'd like to try the following:

Precompute x, y, and z into arrays outside of the loop. These are just 1D arrays, so the memory footprint should be small, and we would avoid redundant recomputation of y and z. We could use an Array of Structures data format so that the (x, y, z) triples are aligned in memory (i.e. real, dimension(3,N) :: xyz, where N = max(px-mx, py-my, pz-mz), or something like that). Then inside this loop you could just use xyz(1,nx), xyz(2,ny), xyz(3,nz). I'm not sure if this makes much performance difference. As we learned this week, "memory is expensive, flops are free".

Contributor Author

I'm not sure this would be worth it. I've just tested the concept with a short program and there seems to be no difference. When accessing xyz in the nested loop, the different values of nx, ny and nz make the memory accesses non-contiguous, so we'll get lots of cache misses.

Contributor

It should be possible to arrange the data such that the nx,ny,nz accesses are contiguous. Maybe I got it wrong in my comment. In principle I agree, if it seems like this isn't worth it, let's not spend much time on it.

@tkoskela tkoskela added the improves: speed Speed-up of code label Feb 16, 2024
@connoraird connoraird force-pushed the 276-thread-exx-phi-on-grid branch from ec90a6a to 36848b7 on February 19, 2024 15:54
@connoraird connoraird force-pushed the 276-thread-exx-phi-on-grid branch from 36848b7 to be386f1 on February 19, 2024 15:57
@connoraird connoraird changed the base branch from 276-use-blas to 276-combine-nsf-loops February 20, 2024 12:01
@connoraird connoraird marked this pull request as ready for review February 20, 2024 12:02
@connoraird connoraird merged commit 442640c into 276-combine-nsf-loops Feb 20, 2024