Skip to content

Fix segfault from constexpr_math.h and adapt testmisc.cc to higher tolerances when running through valgrind#908

Merged
valassi merged 61 commits intomadgraph5:masterfrom
valassi:gtest
Jul 16, 2024
Merged

Fix segfault from constexpr_math.h and adapt testmisc.cc to higher tolerances when running through valgrind#908
valassi merged 61 commits intomadgraph5:masterfrom
valassi:gtest

Conversation

@valassi
Copy link
Member

@valassi valassi commented Jul 12, 2024

This is a PR to

This was motivated by the WIP on adding testst for channelid #896 within the WIP on master_june24 #882

Note: the latter is still blocked by another segfault #907

valassi added 30 commits July 11, 2024 13:31
…n both CPU and GPU (prepare for madgraph5#896) - the C++ tests succeed but the CUDA tests segfaults madgraph5#903
…from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert
…ase-1.11.0

Revert "[gtest/june24] in CODEGEN cudacpp_test.mk, try to upgrade googletest from release-1.11.0 to v1.14.0 to solve madgraph5#903, but the segfault remains - will revert"
This reverts commit 34cd623.
…cc build in CUDA while debugging madgraph5#903

With testmisc.cc, valgrind gives a confusing error

==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==
==2887713== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2887713==  Access not within mapped region at address 0x1FFE801FF8
==2887713== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2887713==    at 0x449C06: mg5amcGpu::constexpr_sin_quad(long double, bool) (constexpr_math.h:156)
==2887713==  If you believe this happened as a result of a stack
==2887713==  overflow in your program's main thread (unlikely but
==2887713==  possible), you can try to increase the size of the
==2887713==  main thread stack using the --main-stacksize= flag.
==2887713==  The main thread stack size used in this run was 8388608.
==2887713==
==2887713== HEAP SUMMARY:
==2887713==     in use at exit: 21,309,363 bytes in 13,995 blocks
==2887713==   total heap usage: 18,083 allocs, 4,088 frees, 51,971,780 bytes allocated
==2887713==
==2887713== LEAK SUMMARY:
==2887713==    definitely lost: 0 bytes in 0 blocks
==2887713==    indirectly lost: 0 bytes in 0 blocks
==2887713==      possibly lost: 2,599,608 bytes in 825 blocks
==2887713==    still reachable: 18,709,755 bytes in 13,170 blocks
==2887713==         suppressed: 0 bytes in 0 blocks
==2887713== Rerun with --leak-check=full to see details of leaked memory
==2887713==
==2887713== For lists of detected and suppressed errors, rerun with: -s
==2887713== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

Without testmisc.cc instead

[ RUN      ] SIGMA_SM_GG_TTX_GPU2/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
==2889432== Invalid write of size 8
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
==2889432==    by 0x41A6EA: double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) (stl_algobase.h:431)
==2889432==    by 0x41A49B: double* std::__copy_move_a2<false, double*, double*>(double*, double*, double*) (stl_algobase.h:494)
==2889432==    by 0x41A1A5: double* std::__copy_move_a1<false, double*, double*>(double*, double*, double*) (stl_algobase.h:522)
==2889432==    by 0x419F4D: double* std::__copy_move_a<false, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:529)
==2889432==    by 0x419D0C: double* std::copy<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double*) (stl_algobase.h:619)
==2889432==    by 0x419950: mg5amcGpu::CommonRandomNumberKernel::generateRnarray() (CommonRandomNumberKernel.cc:34)
==2889432==    by 0x44443D: CUDATest::prepareRandomNumbers(unsigned int) (runTest.cc:202)
==2889432==    by 0x440D98: MadgraphTest_CompareMomentaAndME_Test::TestBody() (MadgraphTest.h:253)
==2889432==    by 0x48790F: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==2889432==    by 0x480EF8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==2889432==    by 0x459587: testing::Test::Run() (gtest.cc:2682)
==2889432==  Address 0x2fc0f200 is not stack'd, malloc'd or (recently) free'd
==2889432==
==2889432==
==2889432== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2889432==  Access not within mapped region at address 0x2FC0F200
==2889432==    at 0x484E2DB: memmove (vg_replace_strmem.c:1385)
...
Segmentation fault (core dumped)
…cc build while debugging madgraph5#903 also for C++

The test does not segfault without valgrind, but it does segfault in valgrind!
(NB this all realted to debug builds, in C++ and in CUDA)

And with testmisc.cc, valgrind gives a confusing error for C++ (cppnone here) as in CUDA:

==2893804== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2893804==  Access not within mapped region at address 0x1FFE801FF8
==2893804== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==2893804==    at 0x431835: mg5amcCpu::constexpr_sin_quad(long double, bool) (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/runTest_cpp.exe)

So I disable testmisc but now the C++ test (cppnone here) no longer segfaults...?!
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
…6 builds madgraph5#904 (disabling OMP only for clang16; add -no-pie for fcheck_cpp.exe)
Revert "[gtest/june24] in gg_tt.mad cudacpp.mk, TEMPORARELY disable testmisc.cc build while debugging madgraph5#903 also for C++"
This reverts commit 944caab.

Will now test with clang16 (after recent fixes) and valgrind (after upgrading to 3.23)
…ster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…ODEGEN logs from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…g constexpr_sin: now valgrind on c++ runTest succeds again?!

However cuda still fails (even without valgrind) madgraph5#903
… now valgrind runTest_cpp.exe will fail

Revert "[gtest/june24] in gg_tt.mad testmisc.cc, comment out the section using constexpr_sin: now valgrind on c++ runTest succeds again?!"
This reverts commit 975f7aacb8661807a329ec1f51b2d7d8dba45167.
…ph5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
…5#904: remove link-time -no-pie, add compiler-time -fPIC to fortran
valassi added 4 commits July 12, 2024 14:18
…lerances when running on valgrind madgraph5#906

Also allow tan(x)=-inf if ctan(x)=+inf and viceversa when running on valgrind madgraph5#906
…h.h madgraph5#903 and test_misc.cc/valgrind.h madgraph5#906

Add valgrind.h for all processes

for d in $(git ls-tree --name-only HEAD */SubProcesses); do git add $d/valgrind.h $d/*/valgrind.h; done
…ster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
@valassi valassi requested a review from oliviermattelaer July 12, 2024 12:35
@valassi valassi self-assigned this Jul 12, 2024
@valassi
Copy link
Member Author

valassi commented Jul 12, 2024

Hi @oliviermattelaer can you please review?

Hopefully not controversial.

Note: this includes also both PR #900 on omp and PR #905 on clang. So it might be better to merge those two independently first

Thanks

valassi added 2 commits July 12, 2024 14:41
…math.h with debugging mode switched off

for f in $(git ls-tree --name-only HEAD */src/constexpr_math.h); do \cp CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/constexpr_math.h $f; done
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 12, 2024
…tempts to duplicate tests using the current infrastructure

The code is now identical to the current gtest branch for PR madgraph5#908.
Will instead solve madgraph5#907 by using a much simpler test infrastructure with fewer template levels.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, try a different way to duplicate the tests, it still segfaults - will revert"
This reverts commit 24e00a2.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, add debug printout for test ctor/dtor"
This reverts commit 0700a85.

Revert "[gtest2/june24] in gg_tt.mad runTest.cc, temporarely add back the duplicate test"
This reverts commit 0cce7fb.
Copy link
Member

@oliviermattelaer oliviermattelaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this sounds good.

Just like I said in #905, I'm confused why you forbid any activation the openmp mode.
Maybe the best is to "remove that particular" commit and then merge the rest.
Or wait that #905 is accepted before merging this... (or any other solution)

I put this as approve anyway,

Olivier

valassi added 7 commits July 16, 2024 13:44
…r if OpenMP builds are attempted on clang16/17 (as discussed with Olivier in madgraph5#905)
…s from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster with OMP madgraph5#900 and submod madgraph5#897) into gtest

Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
  git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt

Note: MG5AMC has been updated including mg5amcnlo#107
…s from the latest upstream/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
@valassi
Copy link
Member Author

valassi commented Jul 16, 2024

Ok this sounds good.

Just like I said in #905, I'm confused why you forbid any activation the openmp mode. Maybe the best is to "remove that particular" commit and then merge the rest. Or wait that #905 is accepted before merging this... (or any other solution)

I put this as approve anyway,

Olivier

Thanks Olivier :-)

The clang #905 is now modified as agreed, and merged.

I have updated this, I will merge it when the CI has been executed

@valassi
Copy link
Member Author

valassi commented Jul 16, 2024

The CI completed with
163 successful and 6 failing checks
This is as expected

Merging now

@valassi valassi merged commit 328581c into madgraph5:master Jul 16, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 16, 2024
…aster with clang madgraph5#905, OMP madgraph5#900 and submod madgraph5#897) into gtest

Fix conflicts in epochX/cudacpp/gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
  git checkout clang gg_tt.mad/CODEGEN_mad_gg_tt_log.txt

Note: MG5AMC has been updated including mg5amcnlo#107
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…ion, removing the attempts to add two tests madgraph5#896

My last commit was showing the segfault issue madgraph5#907 solved in upcoming PR madgraph5#909 (and bits of madgraph5#908).
I will cherry pick the CODEGEN from madgraph5#909 (and madgraph5#908) first and try again.

git checkout 3eb4c29 gg_tt.mad/SubProcesses/runTest.cc
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jul 17, 2024
…ng PR madgraph5#905, constexpr_math.h PR madgraph5#908 and runTest/cudaDeviceReset PR madgraph5#909

Add valgrind.h and its symlink in the repo for gg_tt.mad

The new runTest.cc template now has a (commented out) proof of concept for including two tests (with/without multichannel) madgraph5#896, I will resume from there

After building bldall, the following succeeds
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done

This instead is crashing (again?) for some AVX values
for bck in none sse4 avx2 512y 512z cuda; do echo $bck; valgrind ./build.${bck}_d_inl0_hrd0/runTest_*.exe; done
On closer inspection, this is because valgrind does not support AVX512, so this is ok
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

testmisc.cc tests require different tolerances under valgrind segfault in constexpr_sin_quad when running runTest.exe through valgrind

2 participants