Fix segfault from constexpr_math.h and adapt testmisc.cc to higher tolerances when running through valgrind#908
Merged
valassi merged 61 commits intomadgraph5:masterfrom Jul 16, 2024
Merged
Conversation
…n both CPU and GPU (prepare for madgraph5#896) - the C++ tests succeed but the CUDA tests segfaults madgraph5#903
…from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert
…ase-1.11.0 Revert "[gtest/june24] in CODEGEN cudacpp_test.mk, try to upgrade googletest from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert" This reverts commit 34cd623.
…cc build in CUDA while debugging madgraph5#903 With testmisc.cc, valgrind gives a confusing error ==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000 ==2887713== ==2887713== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==2887713== Access not within mapped region at address 0x1FFE801FF8 ==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000 ==2887713== at 0x449C06: mg5amcGpu::constexpr_sin_quad(long double, bool) (constexpr_math.h:156) ==2887713== If you believe this happened as a result of a stack ==2887713== overflow in your program's main thread (unlikely but ==2887713== possible), you can try to increase the size of the ==2887713== main thread stack using the --main-stacksize= flag. ==2887713== The main thread stack size used in this run was 8388608. ==2887713== ==2887713== HEAP SUMMARY: ==2887713== in use at exit: 21,309,363 bytes in 13,995 blocks ==2887713== total heap usage: 18,083 allocs, 4,088 frees, 51,971,780 bytes allocated ==2887713== ==2887713== LEAK SUMMARY: ==2887713== definitely lost: 0 bytes in 0 blocks ==2887713== indirectly lost: 0 bytes in 0 blocks ==2887713== possibly lost: 2,599,608 bytes in 825 blocks ==2887713== still reachable: 18,709,755 bytes in 13,170 blocks ==2887713== suppressed: 0 bytes in 0 blocks ==2887713== Rerun with --leak-check=full to see details of leaked memory ==2887713== ==2887713== For lists of detected and suppressed errors, rerun with: -s ==2887713== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Segmentation fault (core dumped) Without testmisc.cc instead [ RUN ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0 INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt ==2889432== Invalid write of size 8 ==2889432== at 0x484E2DB: memmove (vg_replace_strmem.c:1385) ==2889432== by 0x41A6EA: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431) ==2889432== by 0x41A49B: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494) ==2889432== by 0x41A1A5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522) ==2889432== by 0x419F4D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529) ==2889432== by 0x419D0C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619) ==2889432== by 0x419950: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34) ==2889432== by 0x44443D: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202) ==2889432== by 0x440D98: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253) ==2889432== by 0x48790F: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607) ==2889432== by 0x480EF8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643) ==2889432== by 0x459587: testing::Test::Run() (gtest.cc:2682) ==2889432== Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd ==2889432== ==2889432== ==2889432== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==2889432== Access not within mapped region at address 0x2FC0F200 ==2889432== at 0x484E2DB: memmove (vg_replace_strmem.c:1385) ... Segmentation fault (core dumped)
…cc build while debugging madgraph5#903 also for C++ The test does not segfault without valgrind, but it does segfault in valgrind! (NB this all realted to debug builds, in C++ and in CUDA) And with testmisc.cc, valgrind gives a confusing error for C++ (cppnone here) as in CUDA: ==2893804== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==2893804== Access not within mapped region at address 0x1FFE801FF8 ==2893804== Stack overflow in thread #1: can't grow stack to 0x1ffe801000 ==2893804== at 0x431835: mg5amcCpu::constexpr_sin_quad(long double, bool) (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/runTest_cpp.exe) So I disable testmisc but now the C++ test (cppnone here) no longer segfaults...?!
…pp.exe by adding -no-pie madgraph5#904
…ng OMP only for clang16 madgraph5#904
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
…pp.exe by adding -no-pie madgraph5#904
…ng OMP only for clang16 madgraph5#904
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
Revert "[gtest/june24] in gg_tt.mad cudacpp.mk, TEMPORARELY disable testmisc.cc build while debugging madgraph5#903 also for C++" This reverts commit 944caab. Will now test with clang16 (after recent fixes) and valgrind (after upgrading to 3.23)
…ster for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ng OMP only for clang17 madgraph5#904
…7 builds madgraph5#904 (disable OMP also for clang17)
…ng OMP only for clang17 madgraph5#904
…7 builds madgraph5#904 (disable OMP also for clang17)
…ster for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…adgraph5#904, adding -fPIC to fortran compilation
…ODEGEN logs from the latest upstream/master for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…g constexpr_sin: now valgrind on c++ runTest succeds again?! However cuda still fails (even without valgrind) madgraph5#903
… now valgrind runTest_cpp.exe will fail Revert "[gtest/june24] in gg_tt.mad testmisc.cc, comment out the section using constexpr_sin: now valgrind on c++ runTest succeds again?!" This reverts commit 975f7aacb8661807a329ec1f51b2d7d8dba45167.
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…lerances when running on valgrind madgraph5#906 Also allow tan(x)=-inf if ctan(x)=+inf and viceversa when running on valgrind madgraph5#906
…h.h madgraph5#903 and test_misc.cc/valgrind.h madgraph5#906 Add valgrind.h for all processes for d in $(git ls-tree --name-only HEAD */SubProcesses); do git add $d/valgrind.h $d/*/valgrind.h; done
…ster for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
Member
Author
|
Hi @oliviermattelaer can you please review? Hopefully not controversial. Note: this includes also both PR #900 on omp and PR #905 on clang. So it might be better to merge those two independently first Thanks |
… (I had forgotten it enabled)
…math.h with debugging mode switched off for f in $(git ls-tree --name-only HEAD */src/constexpr_math.h); do \cp CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/constexpr_math.h $f; done
This was
linked to
issues
Jul 12, 2024
This was referenced Jul 12, 2024
valassi
added a commit
to valassi/madgraph4gpu
that referenced
this pull request
Jul 12, 2024
…tempts to duplicate tests using the current infrastructure The code is now identical to the current gtest branch for PR madgraph5#908. Will instead solve madgraph5#907 by using a much simpler test infrastructure with fewer template levels. Revert "[gtest2/june24] in gg_tt.mad runTest.cc, try a different way to duplicate the tests, it still segfaults - will revert" This reverts commit 24e00a2. Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add debug printout for test ctor/dtor" This reverts commit 0700a85. Revert "[gtest2/june24] in gg_tt.mad runTest.cc, temporarely add back the duplicate test" This reverts commit 0cce7fb.
oliviermattelaer
approved these changes
Jul 16, 2024
…h5#900 and submod madgraph5#897) into clang
…r if OpenMP builds are attempted on clang16/17 (as discussed with Olivier in madgraph5#905)
…s from the latest upstream/master for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with OMP madgraph5#900 and submod madgraph5#897) into gtest Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt Note: MG5AMC has been updated including mg5amcnlo#107
…s from the latest upstream/master for easier merging git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
Member
Author
Thanks Olivier :-) The clang #905 is now modified as agreed, and merged. I have updated this, I will merge it when the CI has been executed |
Member
Author
|
The CI completed with Merging now |
valassi
added a commit
to valassi/madgraph4gpu
that referenced
this pull request
Jul 16, 2024
…aster with clang madgraph5#905, OMP madgraph5#900 and submod madgraph5#897) into gtest Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt Note: MG5AMC has been updated including mg5amcnlo#107
valassi
added a commit
to valassi/madgraph4gpu
that referenced
this pull request
Jul 17, 2024
…ion, removing the attempts to add two tests madgraph5#896 My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908). I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again. git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
valassi
added a commit
to valassi/madgraph4gpu
that referenced
this pull request
Jul 17, 2024
…ng PR madgraph5#905, constexpr_math.h PR madgraph5#908 and runTest/cudaDeviceReset PR madgraph5#909 Add valgrind.h and its symlink in the repo for gg_tt.mad The new runTest.cc template now has a (commented out) proof of concept for including two tests (with/without multichannel) madgraph5#896, I will resume from there After building bldall, the following succeeds for bck in none sse4 avx2 512y 512z cuda; do echo $bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done This instead is crashing (again?) for some AVX values for bck in none sse4 avx2 512y 512z cuda; do echo $bck; valgrind ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done On closer inspection, this is because valgrind does not support AVX512, so this is ok
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a PR to
This was motivated by the WIP on adding testst for channelid #896 within the WIP on master_june24 #882
Note: the latter is still blocked by another segfault #907