This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Segfault of test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #17341
Closed
Description
Maybe not that flaky. I hit this crash in my MKL-DNN upgrade PR (#17313), which does not appear to be related to this test.
I'm filing this issue to see whether anyone else has hit the same problem, and I hope someone familiar with the threaded engine can take a look.
Occurrences
What have you tried to solve it?
Back trace:
(gdb) bt
#0 0x00007f68610b898d in pthread_join (threadid=140079671015168, thread_return=0x0) at pthread_join.c:90
#1 0x00007f68575f4793 in std::thread::join() () from target:/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f6853460407 in mxnet::engine::ThreadPool::~ThreadPool (this=0x20b8ce0, __in_chrg=<optimized out>) at src/engine/./thread_pool.h:84
#3 std::default_delete<mxnet::engine::ThreadPool>::operator() (this=<optimized out>, __ptr=0x20b8ce0) at /usr/include/c++/5/bits/unique_ptr.h:76
#4 std::unique_ptr<mxnet::engine::ThreadPool, std::default_delete<mxnet::engine::ThreadPool> >::~unique_ptr (this=0x2c20bf8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/unique_ptr.h:236
#5 mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>::~ThreadWorkerBlock (this=0x2c20b30, __in_chrg=<optimized out>) at src/engine/threaded_engine_perdevice.cc:214
#6 std::_Sp_counted_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
at /usr/include/c++/5/bits/shared_ptr_base.h:374
#7 0x00007f684f65a3ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x21e2ce0) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#8 0x00007f685345c50b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#9 std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/include/c++/5/bits/shared_ptr_base.h:925
#10 std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:1000
#11 std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::operator=(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr.h:294
#12 mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Clear (this=this@entry=0x1f230f8) at src/engine/../common/lazy_alloc_array.h:149
#13 0x00007f685345fb2c in mxnet::engine::ThreadedEnginePerDevice::StopNoWait (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:67
#14 mxnet::engine::ThreadedEnginePerDevice::Stop (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:74
#15 0x00007f685357dfb6 in mxnet::LibraryInitializer::atfork_prepare (this=<optimized out>) at src/initialize.cc:196
- Adding `DEBUG=1` to the make line gets rid of the problem.
- The problem did not occur when running this test on its own, or when running only the test file test_gluon_data.py.
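For readers unfamiliar with the code path in the back trace: frames #15 down to #0 show a fork "prepare" handler (`LibraryInitializer::atfork_prepare`) stopping the engine, which destroys the worker thread pool and joins its threads. The Python sketch below illustrates that general pattern only; it is not MXNet code. `WorkerPool` and `fork_with_pool_restart` are hypothetical names, and restarting the pool after the fork in both parent and child is an assumption made for the illustration, not something the trace shows. It uses `os.register_at_fork` (Python 3.7+, POSIX only).

```python
import os
import queue
import threading


class WorkerPool:
    """Hypothetical stand-in for a pool of engine worker threads."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: shut this worker down
                return
            task()

    def submit(self, task):
        self._tasks.put(task)

    def stop(self):
        # One sentinel per worker, then join every worker thread.
        # This join is the step that frames #0-#2 of the trace are in.
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()
        self._threads = []

    def alive_workers(self):
        return sum(t.is_alive() for t in self._threads)


def fork_with_pool_restart(num_workers=2):
    """Fork once with the pool stopped; return the child's exit code."""
    state = {"pool": WorkerPool(num_workers)}

    def prepare():
        # Runs in the parent just before fork(): no worker may survive.
        state["pool"].stop()

    def restart():
        # Assumption for this sketch: both processes want a fresh pool.
        state["pool"] = WorkerPool(num_workers)

    # Note: handlers registered here stay registered for the process lifetime.
    os.register_at_fork(
        before=prepare, after_in_parent=restart, after_in_child=restart
    )

    pid = os.fork()
    if pid == 0:
        # Child: check that it got its own working pool, then exit.
        ok = state["pool"].alive_workers() == num_workers
        state["pool"].stop()
        os._exit(0 if ok else 1)

    _, status = os.waitpid(pid, 0)
    state["pool"].stop()
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print("child exit code:", fork_with_pool_restart())
```

The reason for stopping threads before forking is that only the forking thread exists in the child, so any worker left running would simply vanish there, possibly while holding a lock. The sketch does not reproduce the reported crash; it only shows where a join during the prepare step sits in the overall flow.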