This repository was archived by the owner on Nov 17, 2023. It is now read-only.
[v1.x][CI]Flaky tests on Python3:GPU and cpp package GPU Makefile test suites #20011
Open
Description
The unix-gpu pipeline has flaky tests in the Python3:GPU and cpp package GPU Makefile test suites. They fail quite frequently, even on PRs that do not touch the related code.
Occurrences
Python3:GPU failing test:
[2021-03-11T18:04:29.187Z] test_operator_gpu.test_kernel_error_checking ... [18:04:24] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[2021-03-11T18:04:32.459Z] Process SpawnProcess-1:
[2021-03-11T18:04:32.460Z] Traceback (most recent call last):
[2021-03-11T18:04:32.460Z] File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
[2021-03-11T18:04:32.460Z] self.run()
[2021-03-11T18:04:32.460Z] File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
[2021-03-11T18:04:32.460Z] self._target(*self._args, **self._kwargs)
[2021-03-11T18:04:32.460Z] File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2238, in kernel_error_check_imperative
[2021-03-11T18:04:32.460Z] c = (a / b).asnumpy()
[2021-03-11T18:04:32.460Z] File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 354, in __truediv__
[2021-03-11T18:04:32.460Z] return divide(self, other)
[2021-03-11T18:04:32.460Z] File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3820, in divide
[2021-03-11T18:04:32.460Z] _internal._rdiv_scalar)
[2021-03-11T18:04:32.460Z] File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3576, in _ufunc_helper
[2021-03-11T18:04:32.460Z] return fn_array(lhs, rhs)
[2021-03-11T18:04:32.460Z] File "<string>", line 52, in broadcast_div
[2021-03-11T18:04:32.460Z] File "mxnet/cython/ndarray.pyx", line 219, in mxnet._cy3.ndarray._imperative_invoke
[2021-03-11T18:04:32.460Z] File "mxnet/cython/./base.pyi", line 58, in mxnet._cy3.ndarray.CALL
[2021-03-11T18:04:32.460Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-03-11T18:04:32.460Z] [bt] (9) /usr/local/bin/python3(_PyEval_EvalFrameDefault+0x44b2) [0x561b1fe37ac2]
[2021-03-11T18:04:32.460Z] [bt] (8) /usr/local/bin/python3(_PyCFunction_FastCallKeywords+0x20) [0x561b1fdc3de0]
[2021-03-11T18:04:32.460Z] [bt] (7) /usr/local/bin/python3(_PyMethodDef_RawFastCallKeywords+0x250) [0x561b1fdc4050]
[2021-03-11T18:04:32.460Z] [bt] (6) /work/mxnet/tests/python/unittest/../../../python/mxnet/_cy3/ndarray.cpython-37m-x86_64-linux-gnu.so(+0x14699) [0x7eff14049699]
[2021-03-11T18:04:32.460Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8b) [0x7eff8be0653b]
[2021-03-11T18:04:32.460Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x543) [0x7eff8be04c73]
[2021-03-11T18:04:32.460Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0xe6) [0x7eff8b566836]
[2021-03-11T18:04:32.460Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x140e) [0x7eff8b560b6e]
[2021-03-11T18:04:32.460Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::BinaryBroadcastShape(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x38e) [0x7eff86af62ae]
[2021-03-11T18:04:32.460Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7eff8682df82]
[2021-03-11T18:04:32.460Z] File "src/operator/numpy/linalg/./../../tensor/elemwise_binary_broadcast_op.h", line 68
[2021-03-11T18:04:32.460Z] MXNetError: Check failed: l == 1 || r == 1: operands could not be broadcast together with shapes [3] [0]
[2021-03-11T18:04:32.460Z] [18:04:28] src/engine/naive_engine.cc:74: Engine shutdown
[2021-03-11T18:04:34.985Z] [18:04:30] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[2021-03-11T18:04:38.257Z] Process SpawnProcess-2:
[2021-03-11T18:04:38.257Z] Traceback (most recent call last):
[2021-03-11T18:04:38.257Z] File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
[2021-03-11T18:04:38.257Z] self.run()
[2021-03-11T18:04:38.257Z] File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
[2021-03-11T18:04:38.257Z] self._target(*self._args, **self._kwargs)
[2021-03-11T18:04:38.257Z] File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2247, in kernel_error_check_symbolic
[2021-03-11T18:04:38.257Z] 'b':mx.nd.array([],ctx=mx.gpu(0))})
[2021-03-11T18:04:38.257Z] File "/work/mxnet/tests/python/unittest/../../../python/mxnet/symbol/symbol.py", line 2119, in bind
[2021-03-11T18:04:38.257Z] ctypes.byref(handle)))
[2021-03-11T18:04:38.257Z] File "/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 246, in check_call
[2021-03-11T18:04:38.257Z] raise get_last_ffi_error()
[2021-03-11T18:04:38.257Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-03-11T18:04:38.257Z] [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x8f5) [0x7f1e070e99f5]
[2021-03-11T18:04:38.257Z] [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*)+0x219) [0x7f1e071f1139]
[2021-03-11T18:04:38.257Z] [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x120c) [0x7f1e071e4a0c]
[2021-03-11T18:04:38.257Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InferShape(nnvm::Graph&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x69) [0x7f1e071c08a9]
[2021-03-11T18:04:38.257Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f05d99) [0x7f1e071bdd99]
[2021-03-11T18:04:38.257Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f0242b) [0x7f1e071ba42b]
[2021-03-11T18:04:38.257Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<2, 1>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x5ab) [0x7f1e0266305b]
[2021-03-11T18:04:38.257Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::ElemwiseAttrHelper<mxnet::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*) const+0x1276) [0x7f1e01bc6126]
[2021-03-11T18:04:38.257Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f1e01b59f82]
[2021-03-11T18:04:38.257Z] MXNetError: Error in operator _div0: [18:04:33] src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node _div0 at 1-th input: expected [3], got [0]
[2021-03-11T18:04:38.257Z] ok (11.0016s)
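For readers of the log above: both tracebacks report a shape mismatch between an operand of shape [3] and an empty operand of shape [0]. The check `l == 1 || r == 1` in the first error matches the usual broadcasting rule, where two dimensions are compatible only if they are equal or one of them is 1. The following is a minimal pure-Python sketch of that rule, not MXNet's implementation; the function name `broadcast_shape` is hypothetical.

```python
from itertools import zip_longest


def broadcast_shape(lhs, rhs):
    """Return the broadcast result shape of two shapes, or raise ValueError.

    Hypothetical helper for illustration only. Dimensions are aligned from
    the right; a missing leading dimension is treated as 1.
    """
    out = []
    for l, r in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        # Same rule as the check in the log: unequal dims need a 1 on one side.
        if l != r and l != 1 and r != 1:
            raise ValueError(
                "operands could not be broadcast together with shapes "
                f"{list(lhs)} {list(rhs)}"
            )
        out.append(r if l == 1 else l)
    return tuple(reversed(out))


if __name__ == "__main__":
    print(broadcast_shape((2, 3), (3,)))
    try:
        broadcast_shape((3,), (0,))  # the shapes reported in the log
    except ValueError as err:
        print(err)
```

Under this rule, shapes [3] and [0] cannot be combined because 3 and 0 differ and neither is 1, which is the condition both error messages describe.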
cpp package GPU Makefile failing test:
[2021-03-11T18:29:20.262Z] [18:29:15] cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
Next Steps
These failures are blocking PRs and making CI unstable, so the immediate action is to disable the affected tests and then investigate the root cause.
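One way to disable a Python test temporarily is the standard library's `unittest.skip` decorator, with a reason that points back to this issue. The sketch below is an assumption about how that could look, not the actual change: the class `OperatorGpuTests`, the placeholder test bodies, and the helper `run_suite` are hypothetical, and the real `test_kernel_error_checking` in `test_operator_gpu.py` may be structured differently. It does not cover the cpp package test, which would need to be disabled in its own runner.

```python
import unittest

# Reason string shown in test reports; references this issue's number.
FLAKY_REASON = "Flaky on unix-gpu CI, tracked in #20011"


class OperatorGpuTests(unittest.TestCase):  # hypothetical container class
    @unittest.skip(FLAKY_REASON)
    def test_kernel_error_checking(self):
        # Placeholder body: never executed while the skip decorator is present.
        raise AssertionError("should not run while skipped")

    def test_sanity(self):
        # Placeholder for the tests that stay enabled.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the hypothetical test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OperatorGpuTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    res = run_suite()
    for test, reason in res.skipped:
        print(f"skipped {test.id()}: {reason}")
```

Keeping the issue number in the skip reason makes it easier to find and re-enable the test once the root cause is understood.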