This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Segfault of test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #17341

@TaoLv

Description

This may not be ordinary flakiness. I hit the crash in my MKL-DNN upgrade PR (#17313), which appears to be unrelated to this test.
I am filing this issue to see whether anyone else has run into the same problem, and I hope someone familiar with the threaded engine can take a look.

Occurrences

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17313/2/pipeline/299

What have you tried to solve it?

Backtrace:

(gdb) bt
#0  0x00007f68610b898d in pthread_join (threadid=140079671015168, thread_return=0x0) at pthread_join.c:90
#1  0x00007f68575f4793 in std::thread::join() () from target:/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f6853460407 in mxnet::engine::ThreadPool::~ThreadPool (this=0x20b8ce0, __in_chrg=<optimized out>) at src/engine/./thread_pool.h:84
#3  std::default_delete<mxnet::engine::ThreadPool>::operator() (this=<optimized out>, __ptr=0x20b8ce0) at /usr/include/c++/5/bits/unique_ptr.h:76
#4  std::unique_ptr<mxnet::engine::ThreadPool, std::default_delete<mxnet::engine::ThreadPool> >::~unique_ptr (this=0x2c20bf8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/unique_ptr.h:236
#5  mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>::~ThreadWorkerBlock (this=0x2c20b30, __in_chrg=<optimized out>) at src/engine/threaded_engine_perdevice.cc:214
#6  std::_Sp_counted_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
    at /usr/include/c++/5/bits/shared_ptr_base.h:374
#7  0x00007f684f65a3ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x21e2ce0) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#8  0x00007f685345c50b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#9  std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/include/c++/5/bits/shared_ptr_base.h:925
#10 std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:1000
#11 std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::operator=(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr.h:294
#12 mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Clear (this=this@entry=0x1f230f8) at src/engine/../common/lazy_alloc_array.h:149
#13 0x00007f685345fb2c in mxnet::engine::ThreadedEnginePerDevice::StopNoWait (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:67
#14 mxnet::engine::ThreadedEnginePerDevice::Stop (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:74
#15 0x00007f685357dfb6 in mxnet::LibraryInitializer::atfork_prepare (this=<optimized out>) at src/initialize.cc:196
  1. Adding DEBUG=1 to the make line makes the problem go away.
  2. I did not observe the problem when running this test alone or when running only the test file test_gluon_data.py.
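
For readers unfamiliar with the call path in the backtrace (a fork "prepare" hook that stops the engine, which in turn joins its worker threads), here is a minimal Python sketch of the same pattern. It is not MXNet code: `ToyThreadPool`, `atfork_prepare`, `atfork_parent`, `atfork_child`, and `fork_and_wait` are hypothetical names chosen for illustration, and restarting the pool after the fork is an assumption, since the backtrace only shows the stop path. The only real API relied on is Python's `os.register_at_fork` (Unix only).

```python
import os
import queue
import threading


class ToyThreadPool:
    """Hypothetical stand-in for a worker pool; stop() joins every thread."""

    def __init__(self, size):
        self._size = size
        self._tasks = queue.Queue()
        self._threads = []

    def start(self):
        for _ in range(self._size):
            worker = threading.Thread(target=self._run, daemon=True)
            worker.start()
            self._threads.append(worker)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: leave the loop so join() can return
                return
            task()

    def stop(self):
        # One sentinel per worker, then join each one. The join is the step
        # that corresponds to std::thread::join() at the top of the backtrace.
        for _ in self._threads:
            self._tasks.put(None)
        for worker in self._threads:
            worker.join()
        self._threads.clear()

    def alive_count(self):
        return sum(worker.is_alive() for worker in self._threads)


pool = ToyThreadPool(2)
pool.start()
events = []  # records which hooks ran in this process


def atfork_prepare():
    # Runs in the forking thread just before fork(): quiesce all workers.
    events.append("prepare")
    pool.stop()


def atfork_parent():
    # Assumption: the parent restarts its workers once fork() has returned.
    pool.start()
    events.append("parent")


def atfork_child():
    # Assumption: the child starts fresh workers of its own.
    pool.start()


os.register_at_fork(
    before=atfork_prepare,
    after_in_parent=atfork_parent,
    after_in_child=atfork_child,
)


def fork_and_wait():
    """Fork once, let the child exit immediately, return its wait status."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return status
```

Calling `fork_and_wait()` runs the prepare hook first, so no worker thread exists at the moment of the fork. If any worker never returned from its loop, the join in `stop()` would block in the same place the backtrace shows the process sitting in `pthread_join`.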
