Triggered by the OMP work on C++ (#82) and by the build issues this causes (#83), I realised that eventually we WILL actually need to run on CPU and GPU in parallel in a "heterogeneous" Madgraph.
This is actually one of the most interesting (and challenging) issues for resource benchmarking in the HEPIX WG.
There are various issues to address here:
- build issues (what to build with which compiler)
- code structure issues (how to best structure the relevant functions and classes - the current code needs a good cleanup anyway)
- configuration issues (how to pass the appropriate command line arguments)
- optimization issues (how to share the load, knowing that the GPU is faster)
- random number issues (how to choose different sequences on CPU and GPU)
- combination issues (eg how to compute a single cross section, or output a single file of unweighted events)
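To make the load-sharing item more concrete: if each device processes events at a roughly constant rate, both finish at about the same time when the events are split in proportion to their throughputs. The sketch below only illustrates that arithmetic. `Split` and `splitEvents` are hypothetical names, not part of the prototype, and the model ignores fixed overheads such as initialization and host-device copies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type (not in the prototype): events assigned to each device.
struct Split
{
  std::size_t nGpu;
  std::size_t nCpu;
};

// Split nTotal events in proportion to the measured throughputs (events per second),
// so that both devices would finish at about the same time if their rates were constant.
inline Split splitEvents( std::size_t nTotal, double gpuEvtsPerSec, double cpuEvtsPerSec )
{
  assert( gpuEvtsPerSec > 0 && cpuEvtsPerSec > 0 );
  const double gpuFraction = gpuEvtsPerSec / ( gpuEvtsPerSec + cpuEvtsPerSec );
  const std::size_t nGpu =
    static_cast<std::size_t>( std::llround( gpuFraction * static_cast<double>( nTotal ) ) );
  return Split{ nGpu, nTotal - nGpu }; // the CPU takes whatever is left
}
```

Calling `splitEvents( 1048576, 7.169321e+07, 9.157308e+05 )` with the `EvtsPerSec[Rnd+Rmb+ME]` values from the log in this issue would send just under 99% of the events to the GPU. In practice the split would probably also need to respect the GPU grid size, which this sketch does not attempt.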
I created a simple prototype in https://github.com/valassi/madgraph4gpu/tree/het. The point here was mainly a proof of concept, also trying to sort out the build (which may be useful for addressing #83).
The current prototype runs exactly the same number of events with exactly the same random numbers in parallel on CPU (with OMP threads) and on GPU. Both computations give the same sets of events, which are not yet combined. As the GPU is much faster, the net effect is essentially a computation that lasts as long as the CPU version but processes twice as many events (because the same events are also computed on the GPU), so the throughput doubles.
This clearly needs a lot more work (the optimization is especially tricky), but it's a useful proof of concept.
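For the combination step, which the prototype does not do yet, one option would be to let each backend report the count, mean and standard deviation of its matrix elements and to merge these summaries on the host. The following is a minimal sketch of the standard pairwise merge of means and variances. `Stats`, `makeStats` and `combine` are hypothetical names, and the use of the population form of the standard deviation is an assumption. The merged error is only meaningful if CPU and GPU process statistically independent events (different random sequences), which is not the case in the current prototype.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical per-backend summary (not a type from the prototype).
struct Stats
{
  std::size_t n; // number of events
  double mean;   // mean value
  double m2;     // sum of squared deviations from the mean
  // Standard deviation, population form (assumption: divide by n, not n-1).
  double stddev() const { return n > 0 ? std::sqrt( m2 / static_cast<double>( n ) ) : 0.; }
  // Statistical error on the mean.
  double meanError() const { return n > 0 ? stddev() / std::sqrt( static_cast<double>( n ) ) : 0.; }
};

// Build a summary from a count, a mean and a (population) standard deviation.
inline Stats makeStats( std::size_t n, double mean, double stddev )
{
  return Stats{ n, mean, stddev * stddev * static_cast<double>( n ) };
}

// Merge two summaries without revisiting the individual events.
inline Stats combine( const Stats& a, const Stats& b )
{
  const std::size_t n = a.n + b.n;
  if( n == 0 ) return Stats{ 0, 0., 0. };
  const double wa = static_cast<double>( a.n );
  const double wb = static_cast<double>( b.n );
  const double delta = b.mean - a.mean;
  const double mean = a.mean + delta * wb / ( wa + wb );
  const double m2 = a.m2 + b.m2 + delta * delta * wa * wb / ( wa + wb );
  return Stats{ n, mean, m2 };
}
```

As a sanity check, `makeStats( 524288, 1.371958e-02, 8.197419e-03 ).meanError()` comes out at approximately the `+- 1.132119e-05` printed next to `MeanMatrixElemValue` in the log. With truly independent CPU and GPU samples of equal size, the merged error would shrink by a factor of about the square root of two.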
./hcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND DEVICE (CUDA code)
Wavefunction GPU memory = LOCAL
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.312938e-03 ) sec
TotalTime[Rambo+ME] (23)= ( 6.714818e-03 ) sec
TotalTime[RndNumGen] (1)= ( 5.981200e-04 ) sec
TotalTime[Rambo] (2)= ( 5.945168e-03 ) sec
TotalTime[MatrixElems] (3)= ( 7.696500e-04 ) sec
MeanTimeInMatrixElems = ( 7.696500e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.696500e-04 , 7.696500e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.169321e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 7.807926e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.812032e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(GPU) 00 CudaFree : 0.872603 sec
(GPU) 0a ProcInit : 0.000233 sec
(GPU) 0b MemAlloc : 0.035793 sec
(GPU) 0c GenCreat : 0.009814 sec
(GPU) 0d SGoodHel : 0.001759 sec
(GPU) 1a GenSeed : 0.000009 sec
(GPU) 1b GenRnGen : 0.000589 sec
(GPU) 2a RamboIni : 0.000022 sec
(GPU) 2b RamboFin : 0.000014 sec
(GPU) 2c CpDTHwgt : 0.000502 sec
(GPU) 2d CpDTHmom : 0.005407 sec
(GPU) 3a SigmaKin : 0.000014 sec
(GPU) 3b CpDTHmes : 0.000756 sec
(GPU) 4a DumpLoop : 0.004765 sec
(GPU) 8a CompStat : 0.003540 sec
(GPU) 9a GenDestr : 0.000048 sec
(GPU) 9b DumpScrn : 0.000044 sec
(GPU) 9c DumpJson : 0.000007 sec
(GPU) TOTAL : 0.935918 sec
(GPU) TOTAL (123) : 0.007313 sec
(GPU) TOTAL (23) : 0.006715 sec
(GPU) TOTAL (1) : 0.000598 sec
(GPU) TOTAL (2) : 0.005945 sec
(GPU) TOTAL (3) : 0.000770 sec
***********************************************************************
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
OMP threads / maxthreads = 4 / 4
-----------------------------------------------------------------------
NumIterations = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 5.725351e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.449318e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.760323e-02 ) sec
TotalTime[Rambo] (2)= ( 9.914417e-02 ) sec
TotalTime[MatrixElems] (3)= ( 4.457877e-01 ) sec
MeanTimeInMatrixElems = ( 4.457877e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 4.457877e-01 , 4.457877e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288 (nan=0)
EvtsPerSec[Rnd+Rmb+ME](123)= ( 9.157308e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 9.621167e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.176094e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
(CPU) 0a ProcInit : 0.000331 sec
(CPU) 0b MemAlloc : 0.025358 sec
(CPU) 0c GenCreat : 0.000915 sec
(CPU) 1a GenSeed : 0.000009 sec
(CPU) 1b GenRnGen : 0.027595 sec
(CPU) 2a RamboIni : 0.006872 sec
(CPU) 2b RamboFin : 0.092273 sec
(CPU) 3a SigmaKin : 0.445788 sec
(CPU) 4a DumpLoop : 0.004605 sec
(CPU) 8a CompStat : 0.003633 sec
(CPU) 9a GenDestr : 0.000094 sec
(CPU) 9b DumpScrn : 0.004946 sec
(CPU) 9c DumpJson : 0.000008 sec
(CPU) TOTAL : 0.612425 sec
(CPU) TOTAL (123) : 0.572535 sec
(CPU) TOTAL (23) : 0.544932 sec
(CPU) TOTAL (1) : 0.027603 sec
(CPU) TOTAL (2) : 0.099144 sec
(CPU) TOTAL (3) : 0.445788 sec
***********************************************************************
-----------------------------------------------------------------------
TotalTime[Rnd+Rmb+ME] (123)= ( 5.798480e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 5.516467e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.820135e-02 ) sec
TotalTime[Rambo] (2)= ( 1.050893e-01 ) sec
TotalTime[MatrixElems] (3)= ( 4.465573e-01 ) sec
-----------------------------------------------------------------------
TotalEventsComputed = 1048576
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.808364e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.900811e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.348133e+06 ) sec^-1
-----------------------------------------------------------------------
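The final summary block can be cross-checked by hand. In this log, each combined time equals the sum of the corresponding GPU and CPU timers (for example 7.312938e-03 + 5.725351e-01 = 5.798480e-01 sec), and each combined throughput is the total number of events divided by that sum. The sketch below reproduces this arithmetic; `combinedThroughput` is a hypothetical helper, not a function from the prototype.

```cpp
#include <cassert>

// Hypothetical helper: events per second when the reported time is the SUM of both timers.
inline double combinedThroughput( double nEvtsGpu, double secGpu, double nEvtsCpu, double secCpu )
{
  assert( secGpu + secCpu > 0 );
  return ( nEvtsGpu + nEvtsCpu ) / ( secGpu + secCpu );
}
```

For instance, `combinedThroughput( 524288, 7.312938e-03, 524288, 5.725351e-01 )` gives about 1.808e+06 events per second, matching the `(123)` line. If the two devices really overlap in wall-clock time, the elapsed time is closer to the CPU time alone, so dividing by the sum slightly underestimates the achievable throughput.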