@peressounko
Collaborator

Empty TF protection; CalibDiff combined to one vector. Fill out-of-range fixed

@peressounko peressounko requested a review from kharlov as a code owner May 26, 2021 19:09
@peressounko peressounko requested a review from shahor02 May 26, 2021 19:09
Collaborator

why? CellID is deterministic and using at() has overhead.

Collaborator

idem

@peressounko peressounko force-pushed the dev branch 2 times, most recently from 69953c8 to 7a8bb12 Compare May 27, 2021 10:59
@peressounko
Collaborator Author

Dear @shahor02, tests failed for PHOS-unrelated reasons. Could you review?

@shahor02
Collaborator

Hi @peressounko
Actually, I already did and had some questions, see above. For the CI: I see that the fullCI fails on the PHOS digit->raw conversion. I don't see what in your PR could affect it, but to be on the safe side, could you run the FST locally? Just run the following from some temporary directory:

alienv load O2/latest AEGIS/latest
ulimit -n 3000
$O2_ROOT/prodtests/full_system_test.sh

@peressounko
Collaborator Author

Dear @shahor02, I cannot run simulations locally any more. Any simulation, QED or Pythia, ends with:
[INFO] DISTRIBUTING EVENT : 1
[INFO] Process 673 EXITED WITH CODE 0 SIGNALED 1 SIGNAL 6
[INFO] Problem detected
[INFO] KILLING 673
[INFO] KILLING 675
[INFO] KILLING 676
[FATAL] ABORTING DUE TO ABORT IN COMPONENT
For later analysis we write a core dump to core_dump_669
Aborted (core dumped)
and the following stack trace:
Thread 3 (Thread 0x7f2df1489700 (LWP 672) "ZMQbg/IO/0"):
#0 0x00007f2df45eca47 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2df32b0291 in zmq::epoll_t::loop() () from /home/prsnko/alice/sw/ubuntu1804_x86-64/ZeroMQ/v4.3.3-3/lib/libzmq.so.5
#2 0x00007f2df32e7c48 in thread_routine () from /home/prsnko/alice/sw/ubuntu1804_x86-64/ZeroMQ/v4.3.3-3/lib/libzmq.so.5
#3 0x00007f2dfc1db6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f2df45ec71f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f2df1c8a700 (LWP 671) "ZMQbg/Reaper"):
#0 0x00007f2df45eca47 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2df32b0291 in zmq::epoll_t::loop() () from /home/prsnko/alice/sw/ubuntu1804_x86-64/ZeroMQ/v4.3.3-3/lib/libzmq.so.5
#2 0x00007f2df32e7c48 in thread_routine () from /home/prsnko/alice/sw/ubuntu1804_x86-64/ZeroMQ/v4.3.3-3/lib/libzmq.so.5
#3 0x00007f2dfc1db6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f2df45ec71f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f2df2559080 (LWP 669) "o2-sim"):
#0 0x00007f2df45af492 in waitpid () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2df451a177 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f2df5b8b94c in TUnixSystem::Exec (shellcmd=, this=0x480880) at /home/prsnko/alice/sw/SOURCES/ROOT/v6-20-08-alice1/v6-20-08-alice1/core/unix/src/TUnixSystem.cxx:2107
#3 TUnixSystem::StackTrace (this=0x480880) at /home/prsnko/alice/sw/SOURCES/ROOT/v6-20-08-alice1/v6-20-08-alice1/core/unix/src/TUnixSystem.cxx:2397
#4 0x00007f2df4eee3f9 in FairLogger::LogFatalMessage () at /home/prsnko/alice/sw/SOURCES/FairRoot/v18.4.1/v18.4.1/fairtools/FairLogger.cxx:195
#5 0x00007f2df4f18da5 in std::function<void ()>::operator()() const (this=0x7f2df4f40700 fair::Logger::fFatalCallback) at /home/prsnko/alice/sw/ubuntu1804_x86-64/GCC-Toolchain/v10.2.0-alice2-1/include/c++/10.2.0/bits/std_function.h:248
#6 fair::Logger::~Logger (this=0x7fffb63f26f0, __in_chrg=) at /home/prsnko/alice/sw/SOURCES/FairLogger/v1.9.1/v1.9.1/logger/Logger.cxx:267
#7 0x000000000040b40a in main (argc=, argv=) at /home/prsnko/alice/sw/ubuntu1804_x86-64/FairLogger/v1.9.1-3/include/fairlogger/Logger.h:305

Probably this is related to my switch to gcc20 because of missing in my installation. Can this be the reason:
Possible C++ standard library mismatch, compiled with GLIBCXX '20191114'
Extraction of runtime standard library version was: '20200723'
Dmitri

@shahor02
Collaborator

Hi @peressounko
Yes, a library mismatch can cause all sorts of weird behavior. It would be best to update alibuild and rebuild O2.
BTW, you can now install alibuild as an auto-updatable package; on Ubuntu:

sudo add-apt-repository ppa:alisw/ppa  
sudo apt update
sudo apt install python3-alibuild

@peressounko peressounko force-pushed the dev branch 2 times, most recently from d2d0ba1 to 0fd4600 Compare June 3, 2021 19:42
@peressounko
Collaborator Author

Dear @shahor02,
I recompiled O2 from scratch and the library mismatch is gone, but the error remains the same as above. The QED part crashes:
o2-sim --seed -1 -j 2 -n1000 -m PIPE ITS MFT FT0 FV0 FDD -g extgen --configKeyValues "GeneratorExternal.fileName=$O2_ROOT/share/Generators/external/QEDLoader.C;QEDGenParam.yMin=-7;QEDGenParam.yMax=7;QEDGenParam.ptMin=0.001;QEDGenParam.ptMax=1.;Diamond.width[2]=6."
while the Pythia simulation with PHOS is OK.
I do not think the problem is related to PHOS. The checks also break in PHOS-unrelated places.

@shahor02
Collaborator

shahor02 commented Jun 4, 2021

Hi @peressounko
Do you have AEGIS installed? It has to be loaded for the QED simulation, so you should build it either separately or via the o2sim package.
If this is not the problem, could you post o2sim_mergerlog, o2sim_serverlog and o2sim_workerlog0 from the qed directory?

@peressounko
Collaborator Author

peressounko commented Jun 4, 2021

Thanks @shahor02, indeed, installing AEGIS fixed the problem.

@shahor02
Collaborator

shahor02 commented Jun 4, 2021

so, does it pass the test?

@peressounko
Collaborator Author

It passed the QED test and the second simulation. But then it used 16 GB of RAM + 16 GB of swap and completely hung the laptop.

@shahor02
Collaborator

shahor02 commented Jun 4, 2021

OK, thanks, but did the PHOS digit->raw step pass? BTW, you did not address the comments I left before.

@peressounko
Collaborator Author

It passed Digitization with o2-sim-digitizer-workflow --onlyDet PHS

But in o2-phos-digi2raw --file-for link -o raw/PHS I see:
#0 0x00007ffff084226d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff08441cc in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff1003fd5 in operator new (sz=sz@entry=1048576) at ../../../../gcc/libstdc++-v3/libsupc++/new_op.cc:50
#3 0x00007ffff7c46213 in __gnu_cxx::new_allocator<char>::allocate (__n=1048576, this=)
at /home/prsnko/alice/sw/ubuntu1804_x86-64/GCC-Toolchain/v10.2.0-alice2-local1/include/c++/10.2.0/ext/new_allocator.h:103
#4 std::allocator_traits<std::allocator<char> >::allocate (__n=1048576, __a=...)
at /home/prsnko/alice/sw/ubuntu1804_x86-64/GCC-Toolchain/v10.2.0-alice2-local1/include/c++/10.2.0/bits/alloc_traits.h:460
#5 std::_Vector_base<char, std::allocator<char> >::_M_allocate (this=0x3db0250, __n=1048576)
at /home/prsnko/alice/sw/ubuntu1804_x86-64/GCC-Toolchain/v10.2.0-alice2-local1/include/c++/10.2.0/bits/stl_vector.h:346
#6 std::vector<char, std::allocator<char> >::reserve (__n=1048576, this=0x3db0250)
at /home/prsnko/alice/sw/ubuntu1804_x86-64/GCC-Toolchain/v10.2.0-alice2-local1/include/c++/10.2.0/bits/vector.tcc:78
#7 o2::raw::RawFileWriter::registerLink (this=this@entry=0x39354b0, fee=fee@entry=0, cru=cru@entry=0, link=link@entry=0 '\000', endpoint=endpoint@entry=0 '\000',
outFileNameV=...) at /home/prsnko/alice/sw/SOURCES/O2/dev/0/Detectors/Raw/src/RawFileWriter.cxx:168
#8 0x00007ffff7fd66b1 in o2::phos::RawWriter::init (this=this@entry=0x7fffffff6460)
at /home/prsnko/alice/sw/SOURCES/O2/dev/0/Detectors/PHOS/simulation/src/RawWriter.cxx:45
#9 0x000000000040d96a in main (argc=, argv=) at /home/prsnko/alice/sw/SOURCES/O2/dev/0/Detectors/PHOS/simulation/src/RawCreator.cxx:106

Concerning your comments: I replied there. I propose to keep this small overhead while we test the system, to avoid repeating the situation where I cannot find the cause of errors.
Dmitri

@shahor02
Collaborator

shahor02 commented Jun 4, 2021

Hi @peressounko

But in o2-phos-digi2raw --file-for link -o raw/PHS I see:
#0 0x00007ffff084226d in ?? () from /lib/x86_64-linux-gnu/libc.so.6

This is probably because changing static constexpr short NCHANNELS from 14337 to 14336 leads to memory corruption in Mapping.cxx. With this debug printout added:

diff --git a/Detectors/PHOS/base/src/Mapping.cxx b/Detectors/PHOS/base/src/Mapping.cxx
index d9d142062e4..a41ed766c16 100644
--- a/Detectors/PHOS/base/src/Mapping.cxx
+++ b/Detectors/PHOS/base/src/Mapping.cxx
@@ -205,6 +205,7 @@ Mapping::ErrorStatus Mapping::setMapping()
 
         mAbsId[ddl][hwAddress] = absId;
         mCaloFlag[ddl][hwAddress] = (CaloFlag)caloFlag;
+       LOG(INFO) << "MM " << absId << " " << caloFlag << " " << ddl << " " << hwAddress ;
         mAbsToHW[absId][caloFlag][0] = ddl;
         mAbsToHW[absId][caloFlag][1] = hwAddress;
       }

running valgrind o2-phos-digi2raw --file-for link -o raw/PHS shows:

...
[INFO] MM 14335 1 13 1864
[INFO] MM 14335 0 13 1865
[INFO] MM 14336 1 13 1866
==550467== Invalid write of size 2
==550467==    at 0x49458E1: o2::phos::Mapping::setMapping() (Mapping.cxx:209)
==550467==    by 0x494758A: o2::phos::Mapping::Instance() (Mapping.cxx:34)
==550467==    by 0x487A6CC: o2::phos::RawWriter::init() (RawWriter.cxx:31)

Either the channelID calculation is wrong or the constant....

Concerning your comments: I replied there. I propose to keep this small overhead while we test the system, to avoid repeating the situation where I cannot find the cause of errors.

Strange, I did not find any answer. For at() vs []: why don't you use assert if you suspect the index is calculated incorrectly?
Then you can always enable the check by placing #undef NDEBUG, while the check will not be done in the standard build.

@shahor02
Collaborator

shahor02 commented Jun 7, 2021

Hi @peressounko
any update on this?

@peressounko
Collaborator Author

Hi @shahor02, sorry, I was distracted by other things. I am now returning to this issue.

@peressounko
Collaborator Author

Dear @shahor02, the build/O2/fullCI test has been running for ~10 hours; maybe it should be restarted?

@peressounko
Collaborator Author

Dear @shahor02, I think we can merge this PR. build/O2/fullCI failed at the QED test, in which PHOS does not participate. Do we have similar problems in other PRs?

@shahor02
Collaborator

FST passed locally, merging

@shahor02 shahor02 merged commit 5057ba8 into AliceO2Group:dev Jun 14, 2021