Merge of master into master_june24 and channelid fixes/reimplementation by valassi · Pull Request #882 · madgraph5/madgraph4gpu

valassi · 2024-07-04T14:18:30Z

Hi @oliviermattelaer this is a WIP PR to start working towards the resync of master and master_june24. From what I understand this is one of the things you want to push with high priority.

(PS IMPORTANT a summary for review is in #882 (comment) below)

This is constructed as a merge into master_june24.
That is to say, I start from what you and Stefan had in master_june24 (as a result of Stefan's channelid PR #830, related to the warp issue #765), and I port pieces of master into it, rather than going the opposite way. This allows me to proceed in steps through things I know (the various steps in master).

For the moment, here I am just merging the latest master CI (with tmad tests) into master_june24. Since the CI is enabled also for master_june24, I expect that the new tests should run, and the results may be interesting.

Speaking of which, @roiser @oliviermattelaer, how did you test the code in master_june24?

Am I supposed to use a different input.txt file to pipe to madevent to specify a range of iconfig values, or will the current one with a single iconfig value be enough?
If I am supposed to use the same input.txt with a single iconfig (driver.f has not changed, so I would guess this is the case), can you confirm that the code will still test the new functionality you have created and have a channelid array with different values? Or will this result in a channelid array whose entries all have the same value?
(@oliviermattelaer for my information, not directly or immediately relevant for tests: is the madevent fortran/python/bash infrastructure to orchestrate fewer G* jobs with many channels per job complete, or is this still under development?)
(And also for my information, in case I have issues in the code: do I remember correctly that a channelID array of e.g. 32k channels will be segmented such that inside each 32-channel warp the channelid is the same, but different warps can have different channelids? Or did you eventually modify this logic?)

Thanks,
Andrea

PS For the context: master_june24 mainly differs from master because of the addition of Stefan's channelid MR #830 which is connected to Olivier's warp work in #765
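
For illustration only, the warp layout asked about above (one channelid per 32-entry warp, possibly different across warps) could be checked with a sketch like the following. This assumes the layout as I remember it, which is exactly what is still to be confirmed; the function name and the example arrays are hypothetical and are not part of the madgraph4gpu code.

```python
from typing import List, Sequence

WARP_SIZE = 32  # assumed warp size, taken from the question above


def non_uniform_warps(channel_ids: Sequence[int], warp_size: int = WARP_SIZE) -> List[int]:
    """Return the indices of the warps whose entries do not all share one channel id."""
    if len(channel_ids) % warp_size != 0:
        raise ValueError("array length must be a multiple of the warp size")
    bad = []
    for iwarp in range(len(channel_ids) // warp_size):
        warp = channel_ids[iwarp * warp_size:(iwarp + 1) * warp_size]
        if len(set(warp)) != 1:  # more than one distinct channel id inside this warp
            bad.append(iwarp)
    return bad


# Hypothetical example: 4 warps, each with a single channel id
ids = [1] * 32 + [2] * 32 + [1] * 32 + [5] * 32
print(non_uniform_warps(ids))  # []

# Break the assumption inside the second warp (entries 32..63)
ids[40] = 7
print(non_uniform_warps(ids))  # [1]
```

If the logic was modified so that channelids can vary inside a warp, a check like this would simply report the affected warps.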

valassi · 2024-07-04T14:56:09Z

There are 49 errors in the CI. I opened #883 #884 #885

valassi · 2024-07-04T15:46:57Z

I am trying to fix issues in MG5AMC. Will do a force push and file an issue

valassi · 2024-07-04T15:52:36Z

I have tried to upgrade MG5AMC from the current eef200f94 to gpucpp_june24, but this fails codegen #886. I have reverted.

I will instead create a branch where I merge gpucpp on top of the eef200f94 which is currently in master_june24.

valassi · 2024-07-04T16:26:59Z

This is annoying. I upgraded MG5AMC including the rotxxx fix #857 that I used for the crash #855. This has NOT fixed the CI crash of madevent in all CI tests #885. I will need to use a debug build with gdb.

There are still 49 failing tests.

valassi · 2024-07-04T16:45:41Z

I have fixed a minor typo in unit_v for MAC #883. Marking it as fixed.

Now there are only 45 CI errors instead of 49

valassi · 2024-07-04T17:33:04Z

I have fixed another minor issue #884 that was failing the builds for FPTYPE=m. One line left over from a previous implementation should have been removed for FPTYPE=m and was not.

Now down from 45 to 39 errors, all related to #885 crashes in the new CI tests I think. The old CI tests are now all succeeding.

valassi · 2024-07-05T11:27:43Z

I investigated #885 and found that the crash only happens when setting VECSIZE_USED different from VECSIZE_MEMMAX. In the CI in my initial tests VECSIZE_MEMMAX was 16384 and VECSIZE_USED was 32, so this crashed.

Looking more into that, I realised that I was not sure which parameters I should use for NB_WARP and WARP_SIZE. This is discussed in #887. I tried NB_WARP=512 with WARP_SIZE=32, i.e. VECSIZE_MEMMAX=16384. With VECSIZE_USED=32, this still crashes in #885. But in addition I also get a Fortran runtime error in symconf #888.
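
As a minimal sketch of the arithmetic in this comment: it assumes VECSIZE_MEMMAX is simply NB_WARP times WARP_SIZE (consistent with the values quoted above) and uses the observation that #885 appears when VECSIZE_USED differs from VECSIZE_MEMMAX. The helper names are hypothetical; nothing here is taken from the actual Fortran or C++ code.

```python
def vecsize_memmax(nb_warp: int, warp_size: int) -> int:
    """VECSIZE_MEMMAX, assumed here to be the product NB_WARP * WARP_SIZE."""
    return nb_warp * warp_size


def hits_crash_condition(nb_warp: int, warp_size: int, vecsize_used: int) -> bool:
    """True when VECSIZE_USED differs from VECSIZE_MEMMAX (the condition under which #885 was seen)."""
    return vecsize_used != vecsize_memmax(nb_warp, warp_size)


print(vecsize_memmax(512, 32))            # 16384
print(hits_crash_condition(512, 32, 32))  # True: 32 != 16384
print(hits_crash_condition(1, 32, 32))    # False: 32 == 32
```

The last call shows the only kind of configuration in which the two sizes coincide for VECSIZE_USED=32, namely a single warp.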

valassi · 2024-07-05T11:53:21Z

I added a workaround (NOT a fix) for crash #885 just to allow the CI tests to proceed further. Essentially I set NB_WARP=1 and WARP_SIZE=32 so that VECSIZE_MEMMAX=32 is the same as VECSIZE_USED=32. This avoids the crash (but avoids testing anything interesting in the new warp infrastructure, making it pointless). A proper fix for #885 (and for #888) is needed.

Apart from other 'expected' failures, there is an xsec mismatch for ggttggg #889. This can be fixed by increasing the tolerance.

There are ten failures overall. The other 9 are the usual #826 and #872 (pending in master) and #856 (fixed in master, to be merged here).

valassi · 2024-07-05T12:26:47Z

Ok I added a workaround for the tolerances #889, now down to 9 errors.

The crash #885 becomes high priority here, otherwise we are not testing anything interesting, only NB_WARP=1...
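
To make the tolerance workaround concrete, here is a hedged sketch of a relative cross-section comparison. It is not the actual tmad test code: the function name, the reference and test values, and both tolerances are hypothetical and only show why loosening the tolerance turns a mismatch into a pass.

```python
def xsec_matches(xsec_ref: float, xsec_new: float, rel_tol: float) -> bool:
    """Compare two cross sections using a relative tolerance (hypothetical helper)."""
    if xsec_ref == 0.0:
        return xsec_new == 0.0
    return abs(xsec_new / xsec_ref - 1.0) <= rel_tol


# Hypothetical numbers: a relative difference of about 3e-4
xsec_ref, xsec_new = 1.0, 1.0003
print(xsec_matches(xsec_ref, xsec_new, rel_tol=2e-4))  # False: reported as a mismatch
print(xsec_matches(xsec_ref, xsec_new, rel_tol=4e-4))  # True: passes with a looser tolerance
```

A looser tolerance only hides the difference, which is why this is described as a workaround rather than a fix.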

valassi · 2024-07-05T16:26:29Z

Hi @oliviermattelaer as you see I made some progress here, but I am putting this work on hold.

I need some answers on #887 (and it is possible that without VECSIZE_USED I go nowhere and I need to wait for that).

(Or at least: probably I will continue merging bits of master into master_june24 so that we avoid a complete divergence, but I would argue against merging back master_june24 into master until many of these issues are fixed... there are just too many things that seem to not have been tested)

…, madgraph5#980, madgraph5#984 patches for the new CI and Source/makefile) into june24
Fix conflicts:
- MG5aMC/mg5amcnlo (keep the current june24 version 4ef15cab1 i.e. current valassi_gpucpp_june24)
- epochX/cudacpp/gg_tt.mad/bin/internal/banner.py (keep a debug printout)

…g changes: I just want to mark that Source/makefile is no longer there)
The only files that still need to be patched are
- 2 in patch.common: Source/genps.inc, SubProcesses/makefile
- 2 in patch.P1: driver.f, matrix1.f
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

valassi · 2024-09-01T21:09:51Z

Hi @oliviermattelaer I have just merged the latest master (including fixci and the Source/makefile stuff) into my june24 here.

I am running some manual tests tonight. Then tomorrow morning I think we are good to go if those tests are ok and the CI is also ok.

I would instead close your #981, which is essentially a duplicate (and may be missing some pieces).

valassi · 2024-09-01T21:40:26Z

Ok the CI is good with three expected failures.

Running some manual tests tonight

STARTED AT Sun Sep 1 11:07:02 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Sep 1 11:30:56 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Sep 1 11:39:10 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sun Sep 1 11:48:10 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sun Sep 1 11:50:55 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sun Sep 1 11:53:39 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Sun Sep 1 11:56:27 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Mon Sep 2 12:05:56 AM CEST 2024 [Status=0]

…0 on june24 branch - everything ok
STARTED AT Mon Sep 2 06:58:36 AM CEST 2024
(SM tests)
ENDED(1) AT Mon Sep 2 11:07:02 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Sep 2 11:17:21 AM CEST 2024 [Status=0]
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

valassi · 2024-09-02T09:34:56Z

Hi @oliviermattelaer I completed all my tests. I am good to go.

As discussed privately, I would suggest the following:

Merge mg5amcnlo/mg5amcnlo#121 ("master_june24 completion: nb_warp_used (and a SUBset of gpucpp_june24) into gpucpp") into gpucpp
Merge #882 ("Merge of master into master_june24 and channelid fixes/reimplementation") of june24 into master_june24
Close your #981 ("Val june24 fixed ci") because it is a duplicate of #882 and not as up to date
Merge master_june24 into master in the PR I just created, #985 ("Merge master_june24 into master (including channelid fixes/reimplementation)")
(I have closed june24 into master #930, "WIP (into master) channelid fixes/reimplementation and merge of master_june24 into master")

Then you will be able to work on 360, while I will focus on goodhel

Should I go ahead?
Thanks
Andrea

valassi · 2024-09-02T11:43:28Z

Thanks @oliviermattelaer for your comments mg5amcnlo/mg5amcnlo#121 (comment)

I put this in WIP again. I need to check if nb_warp_used is used correctly in dsample.f. This is #983

valassi · 2024-09-02T15:08:43Z

I checked #983 and I think that there is nothing else to do for nb_warp_used in dsample.f. But can you cross check please? Thanks Andrea

valassi · 2024-09-03T10:02:30Z

Ok here is the update

Merge mg5amcnlo/mg5amcnlo#121 into gpucpp

This is DONE. I have also upgraded the mg5amc link in this june24 branch to point to that gpucpp. The CI is as expected, 3 failing tests.

Merge #882 of june24 into master_june24

This I will do now. MERGING!

Close your #981 because it is a duplicate of 882 and not as up to date

This I have done, closed.

Merge master_june24 into master in the PR I just created #985

This will be the next step.

SO FINALLY MERGING!

…annelid fixes/reimplementation madgraph5#882 and madgraph5#985) into runcard Fix conflicts (use ~runcard version): epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common

valassi self-assigned this Jul 4, 2024

valassi requested a review from a team as a code owner July 4, 2024 14:18

valassi marked this pull request as draft July 4, 2024 14:18

This was referenced Jul 4, 2024

master_june24 builds fail for MAC SIMD #883

Closed

master_june24 builds fail for FPTYPE=m #884

Closed

master_june24 tmad tests crashes in dsig1_vec (CUDACPP_RUNTIME_VECSIZEUSED not correctly propagated?) #885

Closed

valassi force-pushed the june24 branch from 8b9962d to eba31fd Compare July 4, 2024 15:48

valassi mentioned this pull request Jul 4, 2024

URGENT - master_june24 does not use gpucpp_june24 (and upgrading fails codegen) #886

Closed

This was linked to issues Jul 4, 2024

master_june24 builds fail for MAC SIMD #883

Closed

master_june24 builds fail for FPTYPE=m #884

Closed

valassi changed the title ~~WIP: merge of master into master_june24 (for the moment: add master CI to master_june24)~~ WIP: merge of master into master_june24 (for the moment: add master CI to master_june24, identify/fix some issues) Jul 4, 2024

valassi linked an issue Jul 5, 2024 that may be closed by this pull request

master_june24: xsec mismatch in ggttggg for FPTYPE=f (must increase tolerance) #889

Closed

valassi mentioned this pull request Jul 5, 2024

Port to master a CI fix (tmad xsec tolerance increase for 889) from master_june24 work #890

Merged

This was referenced Jul 8, 2024

master_june24: hack in counters.cc hides a real issue in auto_dsig1.f #891

Closed

master_june24: nomultichannel should be an array nullptr, not an array full of 0s #892

Closed

master_june24: __CUDACC__ macros prevent HIP support #893

Closed

valassi and others added 14 commits September 1, 2024 18:32

[gpucpp] in gq_ttq.mad and CODEGEN for Source/makefile, add back the …

bc12fdc

…'make cleanavxs' target as suggested by Olivier

[gpucpp] regenerate gq_ttq.mad, check that all is ok

f1db0b7

[gpucpp] ** COMPLETE GPUCPP ** regenerate all processes

4776d29

Merge pull request madgraph5#984 from valassi/gpucpp

c153482

Improvements to complete PRs 979 and 980

Merge pull request madgraph5#979 from madgraph5/testsuite_only

2d95c01

PR where I only add CI but not trying to fix them

Merge pull request madgraph5#980 from madgraph5/testsuite_only_fixed

c4c6e13

implement new CI and fixed the one related to the compilation issue.

[june24] upgrade mg5amcnlo from f0b429915 (previous valassi_gpucpp_ju…

84d6eab

…ne24) to 4ef15cab1 (current valassi_gpucpp_june24 including merge of mg5amcnlo#132 with Source makefile changes)

[june24] move to codegen logs from the latest upstream/master for eas…

b15033d

…ier merging git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)

[june24] move to banner.py from the latest upstream/master for easier…

289efb9

… merging (except gg_tt.mad, to inspect it) git checkout upstream/master $(git ls-tree --name-only upstream/master */bin/internal/banner.py | grep -v ^gg_tt.mad/)

[gpucpp] in CODEGEN dummy change to patch.common to allow a meaningfu…

4063621

…l commit message later (regenerating the patch changes nothing)

[june24] regenerate gg_tt.mad to check that all is ok

942f5b2

[june24] regenerate all processes - only banner.py has changed

e9bf146

This was referenced Sep 1, 2024

Val june24 fixed ci #981

Closed

Merge master_june24 into master (including channelid fixes/reimplementation) #985

Merged

valassi added 2 commits September 2, 2024 06:56

valassi changed the title ~~(into master_june24) merge of master into master_june24 and channelid fixes/reimplementation~~ Merge of master into master_june24 and channelid fixes/reimplementation Sep 2, 2024

valassi changed the title ~~Merge of master into master_june24 and channelid fixes/reimplementation~~ WIP Merge of master into master_june24 and channelid fixes/reimplementation Sep 2, 2024

valassi changed the title ~~WIP Merge of master into master_june24 and channelid fixes/reimplementation~~ Merge of master into master_june24 and channelid fixes/reimplementation Sep 2, 2024

[june24] update MG5AMC from 4ef15cab1 (valassi_gpucpp_june24) to a696…

b0954d3

…c9350 (latest gpucpp, including the merge of the former via mg5amcnlo#121) NB: the contents of 4ef15cab1 and a696c9350 are identical - but a696c9350 is pointing again to the HEAD of gpucpp

valassi merged commit 2cb6c41 into madgraph5:master_june24 Sep 3, 2024
