Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[Python2] Occasional and random Illegal Memory Access Error | Cached Op Input Context Error #10240

@chengdazhi

Description

@chengdazhi

Hi all,

I am with visual computing group of microsoft research asia. We are currently working on GLUON based Faster RCNN and are dealing with a really trick problem:

Description

System randomly reports Illegal Memory Access error(np and mxnet random seeds are fixed and data is not shuffled). In hybridize mode the error becomes "Inputs not on the same context" reported by CachedOp. It is strange that one of the two inputs of this CachedOp reported in log is always xxx_weight or xxx_moving_mean and should live on all contexts. Please refer to error section below for the details.

The error typically happens after tens or hundreds of batchs, with a batchsize of four images.

Environment info (Required)

----------Python Info----------
('Version :', '2.7.14')
('Compiler :', 'GCC 7.2.0')
('Build :', ('default', 'Dec 7 2017 17:05:42'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/data1/home/v-dachen/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
('Platform :', 'Linux-4.4.0-104-generic-x86_64-with-debian-stretch-sid')
('system :', 'Linux')
('node :', 'MSRAVCG04')
('release :', '4.4.0-104-generic')
('version :', '#127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 2600.000
CPU max MHz: 3400.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0012 sec, LOAD: 1.1022 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0673 sec, LOAD: 0.3463 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.4961 sec, LOAD: 0.7315 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0689 sec, LOAD: 0.3608 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1838 sec, LOAD: 0.3668 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1662 sec, LOAD: 0.9060 sec.

Package used (Python/R/Scala/Julia):
Python 2.7

Build info (Required if built from source)

build from mxnet repo v1.1.0 branch, commit #9999

Build config:
(Paste the content of config.mk, or the build command.)

Error Message:

NOTE: this Illegal Memory Access problem is only triggered when hybridize() is not called, it is not always triggered by stream_gpu-inl.h, sometimes it's pooled_storage_manager.h:107

terminate called after throwing an instance of 'dmlc::Error'
what(): [20:04:58] src/engine/./threaded_engine.h:359: [20:04:58] /data1/home/v-dachen/external/mxnet/mxnet_1.0.0/incubator-mxnet/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5a) [0x7f7cb0c6aeaa]
[bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f7cb0c6ba48]
[bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mshadow::Streammshadow::gpu::Wait()+0xd8) [0x7f7cb32358d8]
[bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocatormxnet::engine::Var* > const&, std::vector<mxnet::engine::Var*, std::allocatormxnet::engine::Var* > const&, std::vector<mxnet::Resource, std::allocatormxnet::Resource > const&, std::vector<mxnet::NDArray*, std::allocatormxnet::NDArray* > const&, std::vector<mxnet::NDArray*, std::allocatormxnet::NDArray* > const&, std::vector<unsigned int, std::allocator > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x39c) [0x7f7cb3244e5c]
[bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(+0x3348c6b) [0x7f7cb36ccc6b]
[bt] (5) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x93) [0x7f7cb36d9db3]
[bt] (6) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)+0xcb) [0x7f7cb36e20db]
[bt] (7) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock
, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent&&)+0x63) [0x7f7cb36e22d3]
[bt] (8) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> >::_M_run()+0x4a) [0x7f7cb36dc4aa]
[bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f7cc4776c5c]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5a) [0x7f7cb0c6aeaa]
[bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f7cb0c6ba48]
[bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x332) [0x7f7cb36da052]
[bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)+0xcb) [0x7f7cb36e20db]
[bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock
, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent&&)+0x63) [0x7f7cb36e22d3]
[bt] (5) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> (std::shared_ptrmxnet::engine::ThreadPool::SimpleEvent)> >::_M_run()+0x4a) [0x7f7cb36dc4aa]
[bt] (6) /data1/home/v-dachen/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f7cc4776c5c]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f7ccad766ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7cca39c3dd]

Aborted (core dumped)

And the following CachedOp related problem is triggered when hybridize() is called, note that the Block that triggered this error changes each time.

Traceback (most recent call last):
File "experiments/faster_rcnn/rcnn_end2end_train_test_gluon.py", line 27, in
train_end2end.main()
File "experiments/faster_rcnn/../../faster_rcnn/train_end2end_gluon.py", line 187, in main
config.TRAIN.lr, config.TRAIN.lr_step)
File "experiments/faster_rcnn/../../faster_rcnn/train_end2end_gluon.py", line 180, in train_net
batch_end_callback, epoch_end_callback)
File "experiments/faster_rcnn/../../faster_rcnn/core/module_gluon.py", line 71, in train
conv4_feat = net.resnet_conv4(data) (NOTE: this varies each time)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 360, in call
return self.forward(*args)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 664, in forward
return self._call_cached_op(x, args)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 477, in _call_cached_op
out = self._cached_op(cargs)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/_ctypes/ndarray.py", line 149, in call
ctypes.byref(out_stypes)))
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [15:53:24] src/imperative/cached_op.cc:351: Check failed: inputs[i]->ctx() == default_ctx (gpu(4) vs. gpu(7)) CachedOp requires all inputs to live on the same context. But data is on gpu(7) while bn3a_branch2c_moving_mean is on gpu (This is very strange since moving_mean should live on all contexts)
Stack trace returned 10 entries:
[bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5a) [0x7f24aba98eaa]
[bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f24aba99a48]
[bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::Imperative::CachedOp::Forward(std::shared_ptr<mxnet::Imperative::CachedO
p> const&, std::vector<mxnet::NDArray
, std::allocatormxnet::NDArray* > const&, std::vector<mxnet::NDArray
, std::allocatormxnet::NDArray* > const&)+0x620) [0x7f24ae03a1f0]
[bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXInvokeCachedOp+0xd94) [0x7f24ae5355c4]
[bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXInvokeCachedOpEx+0x3ee) [0x7f24ae5365de]
[bt] (5) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f24bc19aec0]
[bt] (6) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f24bc19a87d]
[bt] (7) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4d6) [0x7f24bc3b0736]
[bt] (8) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x8a61) [0x7f24bc3a6a61]
[bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f24c5e02773]

When I turn on NaïveEngine as suggested, it say that naïve engine only supports synchronize push, and it happens in a random manner, no matter hybridized or not. We do not use multi-threading explicitly, since the problem seems to be threading related.

Exception in thread Thread-2:
Traceback (most recent call last):
File "/data1/home/v-dachen/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/data1/home/v-dachen/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(self.__args, **self.__kwargs)
File "experiments/faster_rcnn/../../faster_rcnn/../lib/utils/PrefetchingIter.py", line 194, in prefetch_func
self.next_batch[i][0] = self.iters[0].next()
File "experiments/faster_rcnn/../../faster_rcnn/core/loader.py", line 924, in next
data, label = self.get_batch_individual(cur_from)
File "experiments/faster_rcnn/../../faster_rcnn/core/loader.py", line 1064, in get_batch_individual
data = [[mx.nd.array(data[key]) for key in self.data_name] for data in all_data]
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/utils.py", line 146, in array
return _array(source_array, ctx=ctx, dtype=dtype)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 2245, in array
arr[:] = source_array
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 437, in setitem
self._set_nd_basic_indexing(key, value)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 698, in _set_nd_basic_indexing
self._sync_copyfrom(value)
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 863, in _sync_copyfrom
ctypes.c_size_t(source_array.size)))
File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(LIB.MXGetLastError()))
**MXNetError: [15:54:16] src/engine/naive_engine.cc:168: Check failed: this->req_completed
NaiveEngine only support synchronize Push so far **
Stack trace returned 10 entries:
[bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5a) [0x7fd15f9b2eaa]
[bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::NaiveEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var
, std::allocatormxnet::engine::Var* > const&, std::vector<mxnet::engine::Var*, std::allocatormxnet::engine::Var* > const&, mxnet::FnProperty, int, char const*)+0x393) [0x7fd162407a03]
[bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x6c8) [0x7fd161fefea8]
[bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x6) [0x7fd16247b446]
[bt] (4) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fd1700b4ec0]
[bt] (5) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fd1700b487d]
[bt] (6) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4d6) [0x7fd1702ca736]
[bt] (7) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x8a61) [0x7fd1702c0a61]
[bt] (8) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7fd179d1c773]
[bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x190b) [0x7fd179db053b]

Minimum reproducible example

Our code is a simple FasterRCNN, providing a reproducible example would be very demanding, so I listed some key designs as follow:

  • Our network is implemented using HybridBlocks and Blocks
  • DataLoader is implemented based on DataIter.
  • We pad input images to a fixed size, so it doesn't change across GPU.

If you have any thoughts on debugging, please comment and I'll report the results.

What have you tried to solve it?

  1. update nvidia driver to 384, cuda to 9.0 and cudnn to 7.1.2
  2. gradually reducing the building blocks of the architecture, and found that the time taken for the error to show gradually increases. So it is unlikely that the behavior is triggered by one operator mis-usage.
  3. changing DataLoader to single-threaded version.
  4. removing callbacks(like speedometer) and metrics.
  5. building mxnet with flag DMSHADOW_USE_PASCAL turned on, as instructed by this issue: CUDA illegal memory access when training SSD #6592, which didn't work.
  6. turning python gc on and off in each epoch, as instructed by this issue: CTC ERROR WITH CUDA ILLEGAL MEMORY ACCESS ERROR #7710 and didn't work.
  7. some other similar issues exist, but failed to provide any solutions: Unable to use multi-gpu for training #7000, Fix cmake build #6799, !!!GPU performance of Cmake built mxnet is worse than Make built one #6685
  8. changing to different commits: update navbar model zoo link #9749(07a83a), [MXNET-16] Move mshadow/ps-lite/dlpack to 3rdparty #10132 #10138 (3e2fb6), remove MKL_EXPERIMENTAL and update make files for MKL-DNN #9810 (5c5a90). The problem still stands, but happens less often on remove MKL_EXPERIMENTAL and update make files for MKL-DNN #9810 (still happens, just takes longer)
  9. posting a issue here...

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions