AlphaS against latest master, including code generation and all processes generated#434

Merged
valassi merged 310 commits into madgraph5:master from valassi:alphas
Apr 28, 2022

Conversation

@valassi
Member

@valassi valassi commented Apr 21, 2022

Hi @roiser I am creating here a PR superseding your #428 (to address the running of alphas issue #373). The main differences will be

  • porting to the current latest upstream/master
  • backporting your changes (in ggtt only for now) to code generation, and generating all five processes I use

@valassi valassi marked this pull request as draft April 21, 2022 14:19
@valassi valassi changed the title AlphaS against latest master, including code generation and all processes generated WIP AlphaS against latest master, including code generation and all processes generated Apr 21, 2022
@valassi
Member Author

valassi commented Apr 22, 2022

I am making progress on integrating this, but there is work to be done.

I would say the main problems were:

  • while check.exe and gcheck.exe were ok, both runTest and the fcheck test were failing 78c4ed3
  • code generation was not there, and the design of the interface was using gc10/gc11 which are process-dependent parameters that should be removed from the interface (in eemumu they are not needed, for instance)

What I have done so far:

  • I have removed gc10/gc11 from the bridge interfaces, and only kept Gs (NB: Gs is actually also not needed in some processes like eemumu, as it is a QCD coupling, but the whole Madgraph machinery distinguishes between couplings that do or do not depend on alphas, so for the moment I would keep Gs as-is in the interfaces)
  • gc10 and gc11 are now allocated within MEK, no longer within check_sa: they are an internal detail of MEK, they do not need to be exposed outside it
  • there was no way to pass gs through fbridge.inc from fortran code, I have now added that
  • partly thanks to the above, the fcheck tests are now succeeding (amongst other things, the hardcoded value of gs that is set in check_sa.cc also needed to be set in fcheck_sa.f, and is now passed through fbridge.inc)
  • I have removed the computeDependentCouplings method from the MEK and from the Bridge: given that only Gs (not gc10, gc11) are passed through the bridge interface, I am inclined to treat the computation of dependent couplings as one instance of 'splitting kernels'... I will cook up something for now (possibly moving it to calculate_wavefunction); eventually this should be reviewed together with splitting kernels
  • I have started to rename and move the actual computation of dependent couplings: I moved it to HelAmps for now, as it is a process-dependent computation (essentially it should only appear in HelAmps.h, or CPPProcess, or Parameters_sm: these are the only three files with process-dependent stuff; all the rest I am trying to keep process-independent)
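To make the interface idea above concrete, here is a minimal sketch: only Gs crosses the bridge, and the process-dependent couplings are derived from it inside the process code. The function name G2COUP follows the commit messages in this PR, but the GC_10/GC_11 formulas below are purely illustrative (an SM-like convention GC_10 = -G, GC_11 = iG); each process would substitute its own expressions.

```cpp
#include <complex>

// Sketch only: G2COUP is the name used in this PR's commit messages;
// the GC_10/GC_11 expressions are illustrative (SM-like: GC_10 = -G,
// GC_11 = i*G) and any real process would substitute its own formulas.
using cxtype = std::complex<double>;

struct DependentCouplings
{
  cxtype GC_10;
  cxtype GC_11;
};

// Only Gs crosses the bridge interface: the alphas-dependent couplings
// are recomputed from it inside process-dependent code (HelAmps/CPPProcess).
inline DependentCouplings G2COUP( const double gs )
{
  return DependentCouplings{ cxtype( -gs, 0. ), cxtype( 0., gs ) };
}
```

This keeps the bridge interface process-independent: adding or removing couplings changes only the process code, never the bridge signature.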

Amongst the things that are not yet ok:

  • runTest still fails
  • remove gc10 and gc11 from sigmakin and similar interfaces, just keep g
  • (minor reminder: copying Gs inside the bridge must be fixed if fptype is not FORTRANFPTYPE)
  • (minor reminder: check timers in check_sa, find the right spot for copying gs)

valassi added 6 commits April 22, 2022 11:19
Was:  computeDependentCouplings      -> calls dependentCouplings kernel -> calls dependent_coupling
Then: computeDependentCouplings      -> calls computeCouplings kernel -> calls G2COUP
Now:  (in MEK merged with ComputeMe) -> calls computeDependentCouplings kernel -> calls G2COUP
…200 registers instead of 170 upstream/master
@valassi
Member Author

valassi commented Apr 22, 2022

Ok I have now

  • fixed runTest (it was missing the hardcoded gs)
  • fixed fptype!=FORTRANFPTYPE (and the fgcheck test in float mode now succeeds, it was failing)
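The fptype != FORTRANFPTYPE fix boils down to this: Gs values crossing the Fortran/C++ bridge cannot simply be byte-copied when the two sides use different floating-point precisions; they must be converted element by element. A hypothetical sketch (the names copyGs and nevt are illustrative, not the actual bridge code):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical sketch: FORTRANFPTYPE is double on the Fortran side, while
// fptype may be float in the C++/CUDA build; in the mixed case each Gs
// value must be converted per event, not memcpy'd.
template<typename Tout, typename Tin>
void copyGs( Tout* out, const Tin* in, std::size_t nevt )
{
  if constexpr( std::is_same_v<Tout, Tin> )
    std::memcpy( out, in, nevt * sizeof( Tout ) ); // same precision: plain copy
  else
    for( std::size_t i = 0; i < nevt; ++i )
      out[i] = static_cast<Tout>( in[i] ); // precision conversion per event
}
```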

Still to do, amongst other things:

  • remove gc10 and gc11 from sigmakin and similar interfaces, just keep g (this is necessary for code generation)
  • (minor reminder: check timers in check_sa, find the right spot for copying gs)

In addition

  • investigate performance? I saw the number of registers has increased from 170 to 200 in ggtt double

@valassi
Member Author

valassi commented Apr 22, 2022

I need to understand how to handle gc10/gc11 with respect to code generation. I am dumping my main ideas in #356, as this is mainly about how to split the roles of MEK and CPPProcess (very much related to splitting kernels #310)

valassi added 6 commits April 22, 2022 13:40
(NB: now the naming convention is consistent, BufferCouplings and MemoryAccessCouplings)
…ncapsulate ndcoup in BufferCouplings

(still to do: gc10 and gc11 are now dereferenced in MEK, one can do better)
@valassi
Member Author

valassi commented Apr 22, 2022

Some progress: I transformed BufferCouplings into a buffer that holds an arbitrary number ndcoup of coupling arrays. It works so far. Now I need to change the access functions.
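A minimal sketch of that buffer idea (illustrative only: the real couplings are complex-valued and the actual buffer classes are more elaborate, but the point is that ndcoup coupling arrays of nevt values each live in one contiguous allocation, so MEK can dereference any of them without check_sa allocating process-specific arrays like gc10/gc11):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of a buffer holding an arbitrary number ndcoup of
// coupling arrays, one value per event (real couplings are complex; a
// plain double keeps the sketch short).
class BufferCouplings
{
public:
  BufferCouplings( std::size_t ndcoup, std::size_t nevt )
    : m_ndcoup( ndcoup ), m_nevt( nevt ), m_data( ndcoup * nevt, 0. ) {}
  // Value of dependent coupling idcoup for event ievt
  // (SOA layout: all events of one coupling are contiguous).
  double& at( std::size_t idcoup, std::size_t ievt )
  {
    assert( idcoup < m_ndcoup && ievt < m_nevt );
    return m_data[ idcoup * m_nevt + ievt ];
  }
  std::size_t ndcoup() const { return m_ndcoup; }
private:
  std::size_t m_ndcoup;       // number of dependent couplings
  std::size_t m_nevt;         // number of events
  std::vector<double> m_data; // one contiguous allocation
};
```

The SOA layout is chosen so that the access functions can hand a kernel a contiguous per-coupling slice, which is what memory coalescing on the GPU wants.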

@roiser
Member

roiser commented Apr 22, 2022

My idea here was to let the code generation do the work, but if you want to change this it's also fine with me.

…ray MEs - the previous one was always using one value in an array ("trivial"), strange that it worked?!
valassi added 16 commits April 28, 2022 11:15
./tput/teeThroughputX.sh -flt -hrd -makej -makeclean -eemumu -ggtt -ggttg -ggttgg -ggttggg

Note that each of the four ggttggg tests takes ~30 minutes to build (no inlining).
Without inlining, all other processes build quite fast:

ls -ltr ee_mumu/lib/build.none_*_inl0_hrd* gg_tt/lib/build.none_*_inl0_hrd* gg_tt*g/lib/build.none_*_inl0_hrd* | egrep -v '(total|\./|\.build|_common|^$)'
ee_mumu/lib/build.none_d_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg   95296 Apr 28 08:29 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1074160 Apr 28 08:29 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_d_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg   94488 Apr 28 08:29 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1069192 Apr 28 08:29 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_f_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg   99712 Apr 28 08:29 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1071848 Apr 28 08:29 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_f_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg   94752 Apr 28 08:29 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1062704 Apr 28 08:29 libmg5amc_epem_mupmum_cuda.so*
gg_tt/lib/build.none_d_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  103904 Apr 28 08:29 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1238000 Apr 28 08:29 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_d_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg   98840 Apr 28 08:29 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1212608 Apr 28 08:29 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_f_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  104224 Apr 28 08:29 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1211112 Apr 28 08:29 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_f_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg  103152 Apr 28 08:30 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1197872 Apr 28 08:31 libmg5amc_gg_ttx_cuda.so*
gg_ttg/lib/build.none_d_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  112672 Apr 28 08:31 libmg5amc_gg_ttxg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1868784 Apr 28 08:31 libmg5amc_gg_ttxg_cuda.so*
gg_ttg/lib/build.none_d_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg  111704 Apr 28 08:31 libmg5amc_gg_ttxg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1847488 Apr 28 08:31 libmg5amc_gg_ttxg_cuda.so*
gg_ttg/lib/build.none_f_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  117088 Apr 28 08:32 libmg5amc_gg_ttxg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1813224 Apr 28 08:33 libmg5amc_gg_ttxg_cuda.so*
gg_ttg/lib/build.none_f_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg  111920 Apr 28 08:34 libmg5amc_gg_ttxg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1799984 Apr 28 08:35 libmg5amc_gg_ttxg_cuda.so*
gg_ttgg/lib/build.none_d_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  154432 Apr 28 08:37 libmg5amc_gg_ttxgg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 4281328 Apr 28 08:38 libmg5amc_gg_ttxgg_cuda.so*
gg_ttgg/lib/build.none_d_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg  149368 Apr 28 08:39 libmg5amc_gg_ttxgg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 4247744 Apr 28 08:40 libmg5amc_gg_ttxgg_cuda.so*
gg_ttgg/lib/build.none_f_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg  154752 Apr 28 08:42 libmg5amc_gg_ttxgg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 4119272 Apr 28 08:42 libmg5amc_gg_ttxgg_cuda.so*
gg_ttgg/lib/build.none_f_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg  145488 Apr 28 08:44 libmg5amc_gg_ttxgg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 4106032 Apr 28 08:45 libmg5amc_gg_ttxgg_cuda.so*
gg_ttggg/lib/build.none_d_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg   768848 Apr 28 08:48 libmg5amc_gg_ttxggg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 15758320 Apr 28 09:16 libmg5amc_gg_ttxggg_cuda.so*
gg_ttggg/lib/build.none_d_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg   759696 Apr 28 09:20 libmg5amc_gg_ttxggg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 15769792 Apr 28 09:46 libmg5amc_gg_ttxggg_cuda.so*
gg_ttggg/lib/build.none_f_inl0_hrd0:
-rwxr-xr-x.  1 avalassi zg   707728 Apr 28 09:51 libmg5amc_gg_ttxggg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 14711528 Apr 28 10:18 libmg5amc_gg_ttxggg_cuda.so*
gg_ttggg/lib/build.none_f_inl0_hrd1:
-rwxr-xr-x.  1 avalassi zg   702568 Apr 28 10:23 libmg5amc_gg_ttxggg_cpp.so*
-rwxr-xr-x.  1 avalassi zg 14718768 Apr 28 10:49 libmg5amc_gg_ttxggg_cuda.so*
./tput/teeThroughputX.sh -flt -hrd -makej -makeclean -eemumu -ggtt -ggttgg -inlonly

Note that inlined builds take longer - and the C++ builds are now slower than the CUDA ones!
(For non-inlined builds, the CUDA builds are much slower than the C++ ones.)

ls -ltr ee_mumu/lib/build.none_*_inl1_hrd* gg_tt/lib/build.none_*_inl1_hrd* gg_tt*g/lib/build.none_*_inl1_hrd* | egrep -v '(total|\./|\.build|_common|^$)'
ee_mumu/lib/build.none_d_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg  107104 Apr 28 11:02 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1074064 Apr 28 11:02 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_d_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg   98056 Apr 28 11:02 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1069064 Apr 28 11:02 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_f_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg   99224 Apr 28 11:02 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1071720 Apr 28 11:02 libmg5amc_epem_mupmum_cuda.so*
ee_mumu/lib/build.none_f_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg   94224 Apr 28 11:02 libmg5amc_epem_mupmum_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1062608 Apr 28 11:02 libmg5amc_epem_mupmum_cuda.so*
gg_tt/lib/build.none_d_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg  111152 Apr 28 11:02 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1237904 Apr 28 11:02 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_d_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg   98152 Apr 28 11:02 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1212480 Apr 28 11:02 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_f_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg  111472 Apr 28 11:03 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1210984 Apr 28 11:03 libmg5amc_gg_ttx_cuda.so*
gg_tt/lib/build.none_f_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg  102464 Apr 28 11:04 libmg5amc_gg_ttx_cpp.so*
-rwxr-xr-x.  1 avalassi zg 1197776 Apr 28 11:04 libmg5amc_gg_ttx_cuda.so*
gg_ttgg/lib/build.none_d_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg 4416400 Apr 28 11:06 libmg5amc_gg_ttxgg_cuda.so*
-rwxr-xr-x.  1 avalassi zg  607136 Apr 28 11:09 libmg5amc_gg_ttxgg_cpp.so*
gg_ttgg/lib/build.none_d_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg 4378688 Apr 28 11:11 libmg5amc_gg_ttxgg_cuda.so*
-rwxr-xr-x.  1 avalassi zg  598160 Apr 28 11:14 libmg5amc_gg_ttxgg_cpp.so*
gg_ttgg/lib/build.none_f_inl1_hrd0:
-rwxr-xr-x.  1 avalassi zg 4201064 Apr 28 11:16 libmg5amc_gg_ttxgg_cuda.so*
-rwxr-xr-x.  1 avalassi zg  627936 Apr 28 11:19 libmg5amc_gg_ttxgg_cpp.so*
gg_ttgg/lib/build.none_f_inl1_hrd1:
-rwxr-xr-x.  1 avalassi zg 4191952 Apr 28 11:21 libmg5amc_gg_ttxgg_cuda.so*
-rwxr-xr-x.  1 avalassi zg  622952 Apr 28 11:24 libmg5amc_gg_ttxgg_cpp.so*
STARTED AT Thu Apr 28 08:28:54 CEST 2022
ENDED(1) AT Thu Apr 28 11:02:01 CEST 2022
ENDED(2) AT Thu Apr 28 11:32:47 CEST 2022
ENDED(3) AT Thu Apr 28 11:37:09 CEST 2022
ENDED(4) AT Thu Apr 28 11:40:19 CEST 2022
ENDED(5) AT Thu Apr 28 11:43:25 CEST 2022
…EFORE THE ALPHAS PR

The typical build times were as for alphas: 30 minutes for each of the 4 ggttggg tests (but some builds were cached)
STARTED AT Thu Apr 28 12:13:58 CEST 2022
ENDED(1) AT Thu Apr 28 13:47:08 CEST 2022
ENDED(2) AT Thu Apr 28 14:24:29 CEST 2022
ENDED(3) AT Thu Apr 28 14:28:54 CEST 2022
ENDED(4) AT Thu Apr 28 14:32:07 CEST 2022
ENDED(5) AT Thu Apr 28 14:35:20 CEST 2022
Revert "[alphas] rerun all tests with allTees.sh USING UPSTREAM/MASTER CODE BEFORE THE ALPHAS PR"
This reverts commit 08349369308297560e19a97277049f315cbba078.
…remove inl1 data, confusing and irrelevant)

Introducing the alphas memory access does not seem to have degraded performance significantly, good
@valassi
Member Author

valassi commented Apr 28, 2022

This is FINALLY completed

… does not build yet, see madgraph5#439

I will merge this anyway as standalone cudacpp for SM physics works fine
(there is one exception, uudd fails, see madgraph5#440 - I will fix that a posteriori).

Note also that alphas from madevent are still not integrated, and so ggtt.mad fails to build for instance, madgraph5#441.
This will be the next big thing.
@valassi
Member Author

valassi commented Apr 28, 2022

Note the performance difference between pre and post alphas:
https://github.com/madgraph5/madgraph4gpu/blob/7d1a6d9b604578c0957ebbf566539d4fb6121aac/epochX/cudacpp/tput/summaryTable_alphas.txt

Example

-------------------------------------------------------------------------------

+++ cudacpp REVISION 88fe36d1 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:

[nvcc 11.6.124 (gcc 10.2.0)] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]  [2048/256/1]  [2048/256/1]    [64/256/1]
CUD/none      1.32e+09      1.42e+08      1.44e+07      5.13e+05      1.21e+04
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      1.66e+06      2.00e+05      2.48e+04      1.80e+03      7.15e+01
CPP/sse4      3.09e+06      3.15e+05      4.50e+04      3.34e+03      1.30e+02
CPP/avx2      5.47e+06      5.69e+05      8.82e+04      6.79e+03      2.60e+02
CPP/512y      5.87e+06      6.12e+05      9.81e+04      7.46e+03      2.83e+02
CPP/512z      4.67e+06      3.85e+05      7.22e+04      6.56e+03      2.95e+02

+++ cudacpp REVISION bae5c248 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:

[nvcc 11.6.124 (gcc 10.2.0)] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]  [2048/256/1]  [2048/256/1]    [64/256/1]
CUD/none      1.10e+09      1.36e+08      1.17e+07      4.91e+05      1.11e+04
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      1.65e+06      1.99e+05      2.46e+04      1.79e+03      7.21e+01
CPP/sse4      3.10e+06      3.13e+05      4.43e+04      3.33e+03      1.32e+02
CPP/avx2      5.35e+06      5.61e+05      8.80e+04      6.76e+03      2.61e+02
CPP/512y      5.96e+06      6.11e+05      9.91e+04      7.58e+03      2.91e+02
CPP/512z      4.60e+06      3.80e+05      7.15e+04      6.48e+03      2.93e+02

-------------------------------------------------------------------------------

There is almost no difference in C++ (as expected?). CUDA on the GPU seems consistently around 10% slower. But I guess we need to live with that...

@valassi valassi changed the title WIP AlphaS against latest master, including code generation and all processes generated AlphaS against latest master, including code generation and all processes generated Apr 28, 2022
@valassi valassi marked this pull request as ready for review April 28, 2022 15:22
@valassi
Member Author

valassi commented Apr 28, 2022

(Well, about performance: not sure at all why eemumu would be slower... I opened #442 to investigate further if someone has the courage).

This is now complete. I am self merging.

@valassi valassi merged commit 605291c into madgraph5:master Apr 28, 2022
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this pull request Aug 16, 2023
…CCESS for independent couplings and CD_ACCESS for dependent couplings
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this pull request Aug 16, 2023
…gpu#434 codegen, fails

"output standalone_cudacpp CODEGEN_cudacpp_ee_mumu" with error:
AttributeError : 'PLUGIN_GPUFOHelasCallWriter' object has no attribute 'model'