This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x][CI]Flaky tests on Python3:GPU and cpp package GPU Makefile test suites #20011

@access2rohit

Description

The unix-gpu pipeline has flaky tests in the Python3: GPU and cpp package GPU Makefile test suites. They fail quite frequently, even when no change touches the code they cover.

Occurrences

Python3:GPU failing test:

[2021-03-11T18:04:29.187Z] test_operator_gpu.test_kernel_error_checking ... [18:04:24] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine

[2021-03-11T18:04:32.459Z] Process SpawnProcess-1:

[2021-03-11T18:04:32.460Z] Traceback (most recent call last):

[2021-03-11T18:04:32.460Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap

[2021-03-11T18:04:32.460Z]     self.run()

[2021-03-11T18:04:32.460Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run

[2021-03-11T18:04:32.460Z]     self._target(*self._args, **self._kwargs)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2238, in kernel_error_check_imperative

[2021-03-11T18:04:32.460Z]     c = (a / b).asnumpy()

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 354, in __truediv__

[2021-03-11T18:04:32.460Z]     return divide(self, other)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3820, in divide

[2021-03-11T18:04:32.460Z]     _internal._rdiv_scalar)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3576, in _ufunc_helper

[2021-03-11T18:04:32.460Z]     return fn_array(lhs, rhs)

[2021-03-11T18:04:32.460Z]   File "<string>", line 52, in broadcast_div

[2021-03-11T18:04:32.460Z]   File "mxnet/cython/ndarray.pyx", line 219, in mxnet._cy3.ndarray._imperative_invoke

[2021-03-11T18:04:32.460Z]   File "mxnet/cython/./base.pyi", line 58, in mxnet._cy3.ndarray.CALL

[2021-03-11T18:04:32.460Z] mxnet.base.MXNetError: Traceback (most recent call last):

[2021-03-11T18:04:32.460Z]   [bt] (9) /usr/local/bin/python3(_PyEval_EvalFrameDefault+0x44b2) [0x561b1fe37ac2]

[2021-03-11T18:04:32.460Z]   [bt] (8) /usr/local/bin/python3(_PyCFunction_FastCallKeywords+0x20) [0x561b1fdc3de0]

[2021-03-11T18:04:32.460Z]   [bt] (7) /usr/local/bin/python3(_PyMethodDef_RawFastCallKeywords+0x250) [0x561b1fdc4050]

[2021-03-11T18:04:32.460Z]   [bt] (6) /work/mxnet/tests/python/unittest/../../../python/mxnet/_cy3/ndarray.cpython-37m-x86_64-linux-gnu.so(+0x14699) [0x7eff14049699]

[2021-03-11T18:04:32.460Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8b) [0x7eff8be0653b]

[2021-03-11T18:04:32.460Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x543) [0x7eff8be04c73]

[2021-03-11T18:04:32.460Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0xe6) [0x7eff8b566836]

[2021-03-11T18:04:32.460Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x140e) [0x7eff8b560b6e]

[2021-03-11T18:04:32.460Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::BinaryBroadcastShape(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x38e) [0x7eff86af62ae]

[2021-03-11T18:04:32.460Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7eff8682df82]

[2021-03-11T18:04:32.460Z]   File "src/operator/numpy/linalg/./../../tensor/elemwise_binary_broadcast_op.h", line 68

[2021-03-11T18:04:32.460Z] MXNetError: Check failed: l == 1 || r == 1: operands could not be broadcast together with shapes [3] [0]

[2021-03-11T18:04:32.460Z] [18:04:28] src/engine/naive_engine.cc:74: Engine shutdown

[2021-03-11T18:04:34.985Z] [18:04:30] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine

[2021-03-11T18:04:38.257Z] Process SpawnProcess-2:

[2021-03-11T18:04:38.257Z] Traceback (most recent call last):

[2021-03-11T18:04:38.257Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap

[2021-03-11T18:04:38.257Z]     self.run()

[2021-03-11T18:04:38.257Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run

[2021-03-11T18:04:38.257Z]     self._target(*self._args, **self._kwargs)

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2247, in kernel_error_check_symbolic

[2021-03-11T18:04:38.257Z]     'b':mx.nd.array([],ctx=mx.gpu(0))})

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/symbol/symbol.py", line 2119, in bind

[2021-03-11T18:04:38.257Z]     ctypes.byref(handle)))

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 246, in check_call

[2021-03-11T18:04:38.257Z]     raise get_last_ffi_error()

[2021-03-11T18:04:38.257Z] mxnet.base.MXNetError: Traceback (most recent call last):

[2021-03-11T18:04:38.257Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x8f5) [0x7f1e070e99f5]

[2021-03-11T18:04:38.257Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*)+0x219) [0x7f1e071f1139]

[2021-03-11T18:04:38.257Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x120c) [0x7f1e071e4a0c]

[2021-03-11T18:04:38.257Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InferShape(nnvm::Graph&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x69) [0x7f1e071c08a9]

[2021-03-11T18:04:38.257Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f05d99) [0x7f1e071bdd99]

[2021-03-11T18:04:38.257Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f0242b) [0x7f1e071ba42b]

[2021-03-11T18:04:38.257Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<2, 1>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x5ab) [0x7f1e0266305b]

[2021-03-11T18:04:38.257Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::ElemwiseAttrHelper<mxnet::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*) const+0x1276) [0x7f1e01bc6126]

[2021-03-11T18:04:38.257Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f1e01b59f82]

[2021-03-11T18:04:38.257Z] MXNetError: Error in operator _div0: [18:04:33] src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node _div0 at 1-th input: expected [3], got [0]
[2021-03-11T18:04:38.257Z] ok (11.0016s)
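For readers unfamiliar with the message in the first trace, the quoted check `l == 1 || r == 1` looks like the usual broadcasting rule: two dimensions are compatible when they are equal or when one of them is 1. The sketch below is a plain-Python illustration of that rule, not MXNet's implementation; the function name `broadcast_shape` is hypothetical. Under that assumption, shapes `[3]` and `[0]` are rejected, which matches the error text in the log.

```python
def broadcast_shape(lhs, rhs):
    """Return the broadcast result shape, or raise ValueError if incompatible.

    Illustrative only: a hypothetical helper, not part of MXNet.
    """
    # Right-align the shapes by padding the shorter one with leading 1s.
    ndim = max(len(lhs), len(rhs))
    padded_lhs = (1,) * (ndim - len(lhs)) + tuple(lhs)
    padded_rhs = (1,) * (ndim - len(rhs)) + tuple(rhs)

    out = []
    for l, r in zip(padded_lhs, padded_rhs):
        # Dimensions must match, or one of them must be 1.
        if l != r and not (l == 1 or r == 1):
            raise ValueError(
                "operands could not be broadcast together with shapes "
                f"{list(lhs)} {list(rhs)}"
            )
        # When one side is 1, the other side decides the output size.
        out.append(r if l == 1 else l)
    return tuple(out)


# A length-3 array against a length-1 array is fine.
ok_shape = broadcast_shape((3,), (1,))

# A length-3 array against an empty array is not, as in the log above.
try:
    broadcast_shape((3,), (0,))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```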

cpp package GPU Makefile failing test:

[2021-03-11T18:29:20.262Z] [18:29:15] cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

Next Steps

These failures are blocking PRs and making CI unstable, so the immediate action is to disable the tests and then investigate.
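One way to disable a Python test temporarily is the standard library's `unittest.skip` decorator with a reason that points back to this issue. The following is a minimal sketch under stated assumptions: the class `GpuOperatorTests` and the test bodies are hypothetical stand-ins, not the real contents of `test_operator_gpu.py`, and the cpp package test would need a separate change in its own build or CI scripts.

```python
import unittest

# Reason string recorded in the test report; the wording is an assumption.
FLAKY_REASON = "Flaky on unix-gpu CI; temporarily disabled, tracked in issue #20011"


class GpuOperatorTests(unittest.TestCase):  # hypothetical container class
    @unittest.skip(FLAKY_REASON)
    def test_kernel_error_checking(self):
        # Placeholder body: never executed while the skip decorator is present.
        raise AssertionError("should not run while skipped")

    def test_still_enabled(self):
        # Other tests in the same class keep running as usual.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the hypothetical suite and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GpuOperatorTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
# Each entry in result.skipped is a (test, reason) pair.
skipped_reasons = [reason for _test, reason in result.skipped]
```

Keeping the issue number in the skip reason makes it easier to find and re-enable the test once the root cause is understood.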
