@martenole
Contributor

Hi @davidrohr
thanks a lot for providing the fix for the standalone build! The floating point exception was caused by a function I had implemented to check the chi 2 calculation. This function is not needed, so I simply removed it.
Unfortunately, currently I can only run the tracking with the standalone framework. When I try to execute the macro GPU/GPUTracking/TRDTracking/macros/run_trd_tracker.C in O2 I get a segfault and an error from o2::gpu::GPUChain::AllocateIOMemoryHelper. Did the memory allocation mechanism for the reconstruction chain change? I paste the full error log below. In AliRoot I see the same error.
Cheers,
Ole

#6 0x00007ff99d0bf666 in o2::gpu::GPUChain::AllocateIOMemoryHelper<o2::tpc::ClusterNative> (this=0x55dadb078220, u=std::unique_ptr<struct o2::tpc::ClusterNative []> = {...}, ptr=, n=<error reading variable: Cannot access memory at address 0x156a0>) at /home/oschmidt/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/Global/GPUChain.h:125
#7 o2::gpu::GPUChainTracking::AllocateIOMemory (this=0x55dadb078220) at /home/oschmidt/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/Global/GPUChainTracking.cxx:462
#8 0x00007ff9926f045c in run_trd_tracker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /home/oschmidt/alice/O2/GPU/GPUTracking/TRDTracking/macros/run_trd_tracker_C.so

sawenzel previously approved these changes Sep 16, 2020
@davidrohr
Collaborator

@martenole : Continuing here: with the latest commit I can run it but I cannot reproduce the segfault. Could you perhaps provide the input data that trigger the segfault somewhere? Otherwise I can try tomorrow with address sanitizers.

@davidrohr
Collaborator

OK, using root.exe instead of root I get the segfault; no idea what the difference is. Investigating...

@martenole
Contributor Author

That is strange. I am not using root.exe, but simply root..
In case you still need it, the input data I am using is here: https://cernbox.cern.ch/index.php/s/808EB1mDbKfp7R0
Cheers,
Ole

@davidrohr
Collaborator

@martenole : Sorry, that was quite stupid, fixed in #4359 ....

Anyhow, it still segfaults for me while traversing the TRD geometry when querying the material budget:

#0  0x00007fffeaec60f8 in TObjArray::UncheckedAt (this=0x0, this=0x0, i=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/core/cont/inc/TObjArray.h:90
#1  TGeoVolume::GetNode (this=<optimized out>, i=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/inc/TGeoVolume.h:178
#2  TGeoNode::GetDaughter (this=<optimized out>, ind=ind@entry=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/inc/TGeoNode.h:85
#3  TGeoNodeCache::CdDown (this=0x55555787fbb0, index=index@entry=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoCache.cxx:204
#4  0x00007fffeaf0d19c in TGeoNavigator::CdDown (this=this@entry=0x555562cbf100, index=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoNavigator.cxx:410
#5  0x00007fffeaf0ed54 in TGeoNavigator::SearchNode (this=0x555562cbf100, downwards=<optimized out>, skipnode=0x0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoNavigator.cxx:2064
#6  0x00007fffeaf0ed6e in TGeoNavigator::SearchNode (this=0x555562cbf100, downwards=<optimized out>, skipnode=0x0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoNavigator.cxx:2066
#7  0x00007fffeaf0efc8 in TGeoNavigator::FindNode (this=0x555562cbf100, safe_start=safe_start@entry=true) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoNavigator.cxx:1593
#8  0x00007fffeaf0f035 in TGeoNavigator::InitTrack (this=<optimized out>, point=point@entry=0x7fffdac89720, dir=dir@entry=0x7fffdac89740) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoNavigator.cxx:1663
#9  0x00007fffeaeeca17 in TGeoManager::InitTrack (this=<optimized out>, point=point@entry=0x7fffdac89720, dir=dir@entry=0x7fffdac89740) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/geom/geom/src/TGeoManager.cxx:2633
#10 0x00007fffe62b247b in o2::base::GeometryManager::meanMaterialBudget (x0=<optimized out>, y0=<optimized out>, z0=<optimized out>, x1=<optimized out>, y1=<optimized out>, z1=<optimized out>) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Base/src/GeometryManager.cxx:385
#11 0x00007fffe62bd96e in o2::base::GeometryManager::meanMaterialBudget (end=..., start=...) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Base/include/DetectorsBase/GeometryManager.h:97
#12 o2::base::Propagator::getMatBudget (this=<optimized out>, corrType=o2::base::Propagator::MatCorrType::USEMatCorrTGeo, p1=..., p0=...) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Base/src/Propagator.cxx:512
#13 o2::base::Propagator::getMatBudget (this=<optimized out>, corrType=<optimized out>, p0=..., p1=...) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Base/src/Propagator.cxx:510
#14 0x00007fffe62bdc72 in o2::base::Propagator::PropagateToXBxByBz (this=0x7fffe15eef60 <o2::base::Propagator::Instance()::instance>, track=..., xToGo=299.975372, mass=mass@entry=0.139569998, maxSnp=maxSnp@entry=0.800000012, maxStep=maxStep@entry=2, matCorr=matCorr@entry=o2::base::Propagator::MatCorrType::USEMatCorrTGeo, tofInfo=tofInfo@entry=0x0, signCorr=-1, signCorr@entry=0) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Base/src/Propagator.cxx:91
#15 0x00007fffe1497776 in o2::gpu::propagatorInterface<o2::base::Propagator>::propagateToX (this=0x7fffdac8aa80, this=0x7fffdac8aa80, maxStep=2, maxSnp=0.800000012, x=<optimized out>) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/TRDTracking/GPUTRDInterfaces.h:203
#16 o2::gpu::GPUTRDTracker_t<o2::gpu::GPUTRDTrack_t<o2::gpu::trackInterface<o2::dataformats::TrackTPCITS> >, o2::gpu::propagatorInterface<o2::base::Propagator> >::FollowProlongation (this=this@entry=0x555564525170, prop=prop@entry=0x7fffdac8aa80, t=t@entry=0x7fffdac8ab20, threadId=threadId@entry=6, collisionId=collisionId@entry=8) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/TRDTracking/GPUTRDTracker.cxx:639
#17 0x00007fffe14990e2 in o2::gpu::GPUTRDTracker_t<o2::gpu::GPUTRDTrack_t<o2::gpu::trackInterface<o2::dataformats::TrackTPCITS> >, o2::gpu::propagatorInterface<o2::base::Propagator> >::DoTrackingThread (this=this@entry=0x555564525170, iTrk=iTrk@entry=6, threadId=threadId@entry=6) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/TRDTracking/GPUTRDInterfaces.h:206
#18 0x00007fffe14994c0 in o2::gpu::GPUTRDTracker_t<o2::gpu::GPUTRDTrack_t<o2::gpu::trackInterface<o2::dataformats::TrackTPCITS> >, o2::gpu::propagatorInterface<o2::base::Propagator> >::_ZN2o23gpu15GPUTRDTracker_tINS0_13GPUTRDTrack_tINS0_14trackInterfaceINS_11dataformats11TrackTPCITSEEEEENS0_19propagatorInterfaceINS_4base10PropagatorEEEE10DoTrackingEPNS0_16GPUChainTrackingE._omp_fn.0(void) () at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/GPU/GPUTracking/TRDTracking/GPUTRDTracker.cxx:251
#19 0x00007fffe0ed85fe in gomp_thread_start (xdata=<optimized out>) at /var/tmp/portage/sys-devel/gcc-10.2.0/work/gcc-10.2.0/libgomp/team.c:123
#20 0x00007ffff7448ef7 in start_thread () from /lib64/libpthread.so.0
#21 0x00007ffff775fbaf in clone () from /lib64/libc.so.6
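
For illustration only: frame #0 shows an unchecked array access with this=0x0, reached from a "get daughter node" call, which suggests the navigator descended into a volume whose daughter array does not exist. The sketch below reproduces that failure mode with hypothetical stand-in types (VolumeSketch, NodeSketch, daughterOrNull and firstBranchDepth are invented for this example and are not ROOT classes or functions) and shows a checked accessor that returns nullptr instead of dereferencing a missing array. It does not explain why the real navigator ended up in that state.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for a geometry tree; these are NOT the ROOT classes.
struct NodeSketch;

struct VolumeSketch {
  // Stays null when the volume has no daughters (assumption made for this sketch).
  std::unique_ptr<std::vector<NodeSketch*>> daughters;

  // Checked access: returns nullptr if the array is missing or the index is out of range.
  NodeSketch* daughterOrNull(std::size_t i) const {
    if (!daughters || i >= daughters->size()) {
      return nullptr;
    }
    return (*daughters)[i];
  }
};

struct NodeSketch {
  VolumeSketch volume;
};

// Walk down the first-daughter chain and count how many levels lie below 'top'.
inline int firstBranchDepth(const NodeSketch& top) {
  int depth = 0;
  const NodeSketch* current = &top;
  while (const NodeSketch* next = current->volume.daughterOrNull(0)) {
    current = next;
    ++depth;
  }
  return depth;
}
```

With the checked accessor, descending into a leaf simply ends the walk; an unchecked access on the same leaf would dereference a null pointer, as in the backtrace.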

On another note: about 6 weeks ago we discussed two issues when running with Run 2 data in standalone mode: one FPE and one segfault. Do you have any update on these? (Sorry if I missed it.)

@martenole
Contributor Author

@davidrohr Thanks a lot for the fix! I am compiling it at the moment. I was also looking at these lines, but somehow I missed the obvious solution.

The FPE should be fixed with this commit. I was accidentally using a function that I had added as a test and that was not supposed to be used yet. Since it is not needed at all, I removed it completely.
I am not sure whether this also fixes the segfault, but I will try it once the compilation is done.

About the geometry I am not sure what the status is there, I need to check with Tom and Sean why the material lookup fails.

Cheers,
Ole

@martenole
Contributor Author

Hi @shahor02
above David reported the error when querying the material budget. Probably the way I am initializing the geometry is wrong.

//-------- init geometry and field --------//
  o2::base::GeometryManager::loadGeometry(path);
  o2::base::Propagator::initFieldFromGRP(path + inputGRP);

  auto geo = o2::trd::TRDGeometry::instance();
  geo->createPadPlaneArray();
  geo->createClusterMatrixArray();
  const o2::trd::TRDGeometryFlat geoFlat(*geo);

But I am not sure how to do it right. You can reproduce the error by running the macro
https://github.com/martenole/AliceO2/blob/trdtracker/GPU/GPUTracking/TRDTracking/macros/run_trd_tracker.C
You need to execute the macro in a folder with input ITS-TPC tracks and TRD tracklets. I have uploaded my simulation files, which you can use, here:
https://cernbox.cern.ch/index.php/s/808EB1mDbKfp7R0
Cheers,
Ole

@martenole
Contributor Author

Hi,
the segfault in O2 due to the material correction is fixed (for now simply by disabling the material correction in the O2 propagator). The TRD tracking is now ready for time frames. I still cannot test this, since the tracklets in O2 need more work. That will probably take a while, so I would like to have this merged now. Of course I am open to comments/suggestions!

@shahor02
Collaborator

@martenole I just saw your week-old question, sorry (please ping me when I don't respond). The general geometry / field init is OK.
I can merge it once the tests have passed; I will check the problem with the mat. correction later.

@martenole
Contributor Author

Hi @shahor02 ,
thanks a lot for taking a look. Besides the field and geometry init, I saw in the ITS-TPC matching spec that the mat. LUT is provided from a file (TPCITSMatchingSpec.cxx lines 79-88). Maybe I am missing this? I could not find a matbud file in my simulation folder though.
As I said above, currently I have simply disabled the material correction (a176447).
Cheers,
Ole

@martenole
Contributor Author

@davidrohr I have also seen the build/O2/gpu error "g++: fatal error: Killed signal terminated program cc1plus" on my laptop when I ran out of memory during compilation. It happens to me quite frequently when I try to use all threads.
Could it be that the test server, VM, or whatever exactly runs this test does not have enough memory?
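
As a rough illustration of why using all threads can exhaust memory, the sketch below caps the number of parallel compile jobs by the memory available per job. The function name memoryBoundJobs and all numbers passed to it are assumptions for this example; the real memory footprint of a cc1plus process depends on the translation unit and is not known from this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: limit parallel jobs so that jobs * perJobMiB fits into availableMiB.
// At least one job is always allowed; perJobMiB == 0 means "no memory limit known".
inline unsigned memoryBoundJobs(unsigned hwThreads, std::uint64_t availableMiB, std::uint64_t perJobMiB) {
  if (hwThreads == 0) {
    hwThreads = 1; // the thread count could not be determined
  }
  if (perJobMiB == 0) {
    return hwThreads;
  }
  const std::uint64_t byMemory = std::max<std::uint64_t>(1, availableMiB / perJobMiB);
  return static_cast<unsigned>(std::min<std::uint64_t>(hwThreads, byMemory));
}
```

For example, with 8 threads, 8192 MiB of free memory and an assumed 2048 MiB per job, the helper returns 4, so only half of the threads would be used.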

@davidrohr
Collaborator

@martenole : Yes, the problem with the GPU CI is that it runs out of memory during CMake compilation (although it is not clear to me why it tries to recompile CMake in the first place, since CMake hasn't changed in a while). I have already asked @ktf about it.

@davidrohr
Collaborator

LUT is provided from a file (TPCITSMatchingSpec.cxx lines 79-88). Maybe I am missing this? I could not find a matbud file in my simulation folder though.

The LUT is not created in the simulation by default. It should eventually be provided via the CCDB, but for now you have to create one yourself or just disable it. For creating it, see https://github.com/AliceO2Group/AliceO2/blob/dev/Detectors/Base/test/README.md.

@shahor02
Collaborator

@martenole since the mat. LUT generation is quite time consuming, it has to be pregenerated. I put one in
https://cernbox.cern.ch/index.php/s/89gtHpGJ2jZOz5s

But its absence should not be a problem, the Propagator will use it only if the LUT is attached to it (see e.g.

  std::string matLUTPath = ic.options().get<std::string>("material-lut-path");
  std::string matLUTFile = o2::base::NameConf::getMatLUTFileName(matLUTPath);
  if (o2::base::NameConf::pathExists(matLUTFile)) {
    auto* lut = o2::base::MatLayerCylSet::loadFromFile(matLUTFile);
    o2::base::Propagator::Instance()->setMatLUT(lut);
    LOG(INFO) << "Loaded material LUT from " << matLUTFile;
  } else {
    LOG(INFO) << "Material LUT " << matLUTFile << " file is absent, only TGeo can be used";
  }
) otherwise it will automatically use TGeo query.
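
The selection logic described above (use the LUT only if one is attached, otherwise fall back to the geometry query, and use nothing if the correction is switched off) can be sketched in isolation as follows. This is a minimal model with hypothetical stand-in types: MaterialLUTSketch, PropagatorSketch, BudgetSource and budgetSource are invented for this example and are not the O2 classes or their API.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a pregenerated material lookup table; NOT o2::base::MatLayerCylSet.
struct MaterialLUTSketch {
};

// Where the material budget would be taken from for one propagation step.
enum class BudgetSource { None,
                          LUT,
                          Geometry };

// Hypothetical stand-in for a propagator; NOT o2::base::Propagator.
class PropagatorSketch
{
 public:
  // Attaching a LUT is optional; without it the geometry query is the fallback.
  void setMatLUT(std::shared_ptr<const MaterialLUTSketch> lut) { mLUT = std::move(lut); }

  BudgetSource budgetSource(bool correctionEnabled) const
  {
    if (!correctionEnabled) {
      return BudgetSource::None; // correction disabled: no material is queried at all
    }
    return mLUT ? BudgetSource::LUT : BudgetSource::Geometry;
  }

 private:
  std::shared_ptr<const MaterialLUTSketch> mLUT;
};
```

In this model a missing LUT never causes a failure by itself; it only changes which source is queried, which matches the statement that the absence of the file should not be a problem.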

@davidrohr
Collaborator

Indeed, the absence of the mat. LUT is no problem. The crash I reported above actually happens during the TGeo query.

@martenole
Contributor Author

Ok, but then I don't understand why the segfault disappeared :/

Anyhow, it still segfaults for me while traversing the TRD geometry when querying the material budget:

#0  0x00007fffeaec60f8 in TObjArray::UncheckedAt (this=0x0, this=0x0, i=0) at /usr/src/debug/sci-physics/root-6.22.02/root-6.22.02/core/cont/inc/TObjArray.h:90

@shahor02
Collaborator

You have set the mat. correction to none, so no material is queried at all.

Collaborator

@davidrohr davidrohr left a comment

Verified GPU compilation locally, might need some follow-up PR, but this fixes some things and has no regression, merging.
