This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Same Network can hybridize on CPU but can not hybridize on GPU. #19264
Closed
Description
Hello, I wrote a network that takes a list as input. It works fine if I hybridize it on CPU, or if I run it on GPU without hybridizing.
But once I hybridize it on GPU, it fails with an error like Check failed: it != node2index_.end() && it->first == e.node.get():. I tried setting MXNET_ENGINE_TYPE to NaiveEngine, but that did not give me any more useful information.
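For reference, the two debugging variables mentioned in this report can be set from Python before MXNet is imported. The following is only a minimal sketch: the helper name set_debug_env and its parameters are hypothetical, and only the variable names DMLC_LOG_STACK_TRACE_DEPTH and MXNET_ENGINE_TYPE come from this report.

```python
import os


def set_debug_env(stack_depth=10, naive_engine=True, env=None):
    """Set the debugging variables used in this report.

    Hypothetical helper: call it before `import mxnet`, because the
    library is assumed to read these variables when it is loaded.
    """
    if env is None:
        env = os.environ
    # Ask for a deeper C++ backtrace in error messages.
    env["DMLC_LOG_STACK_TRACE_DEPTH"] = str(stack_depth)
    if naive_engine:
        # Request the synchronous engine so errors surface at the call site.
        env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env
```

Typical use would be to call set_debug_env() as the first statement of the script and only then run import mxnet as mx.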
Error Message
(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)
libluajit.so
Traceback (most recent call last):
File "/data2/kohill/jye_sanka/mx-detection/models/backbones/hrnet/cls_hrnet_mx_seg_fault.py", line 76, in <module>
y_hat = model([mx.nd.random.randn(1, 32, 56, 56, ctx=ctx), mx.nd.random.randn(1, 64, 28, 28, ctx=ctx)])
File "/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 682, in __call__
out = self.forward(*args)
File "/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1244, in forward
return self._call_cached_op(x, *args)
File "/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1028, in _call_cached_op
out = self._cached_op(*cargs)
File "/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 154, in __call__
ctypes.byref(out_stypes)))
File "/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
[bt] (9) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXInvokeCachedOpEx+0x3e) [0x7f067c064b3e]
[bt] (8) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXInvokeCachedOp+0x601) [0x7f067c064571]
[bt] (7) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::CachedOp::Forward(std::shared_ptr<mxnet::CachedOp> const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x16b) [0x7f067b80d21b]
[bt] (6) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::CachedOp::GetCachedOpState(mxnet::Context const&)+0x179) [0x7f067b809899]
[bt] (5) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::CachedOp::CachedOpState::CachedOpState(mxnet::Context const&, nnvm::Graph const&, nnvm::Graph const&, bool)+0x1c6f) [0x7f067b808e6f]
[bt] (4) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::exec::FusePointwiseBackward(nnvm::Graph&&)+0xca) [0x7f067c0d90ba]
[bt] (3) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(nnvm::Graph::indexed_graph() const+0x30) [0x7f0683705480]
[bt] (2) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(nnvm::IndexedGraph::IndexedGraph(nnvm::Graph const&)+0xaf8) [0x7f0683704918]
[bt] (1) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0xf4e4598) [0x7f0683703598]
[bt] (0) /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2723218) [0x7f0676942218]
File "src/core/graph.cc", line 101
MXNetError: Check failed: it != node2index_.end() && it->first == e.node.get():
To Reproduce
import os
os.environ["DMLC_LOG_STACK_TRACE_DEPTH"] = "10"
import mxnet as mx
import mxnet.gluon as gluon


class nn(object):
    @staticmethod
    def Sequential(*args):
        bl = gluon.nn.HybridSequential()
        for a in args:
            bl.add(a)
        return bl

    @staticmethod
    def Upsample(scale_factor, mode):
        # return BilinearResize2D(scale_factor=scale_factor)
        return mx.gluon.nn.HybridLambda(lambda F, x: F.contrib.BilinearResize2D(x, scale_width=scale_factor,
                                                                                 scale_height=scale_factor, name="fwd"))


class HighResolutionModule(gluon.nn.HybridBlock):
    def __init__(self):
        super(HighResolutionModule, self).__init__()
        self.relu = mx.gluon.nn.Activation("relu")
        self.fff = nn.Sequential(
            mx.gluon.nn.Conv2D(in_channels=64, channels=32, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=2, mode="nearest")
        )
        self.fff1 = nn.Sequential(
            mx.gluon.nn.Conv2D(in_channels=32, channels=64, kernel_size=3, padding=1, strides=2),
            mx.gluon.nn.BatchNorm(axis=1, momentum=.9, in_channels=32)
        )

    def hybrid_forward(self, F, x, *args, **kwargs):
        y0 = self.relu(x[0] + self.fff(x[1]))
        y1 = self.relu(self.fff1(x[0]) + x[1])
        return [y0, y1]


class HighResolutionNet(gluon.nn.HybridBlock):
    def __init__(self):
        super(HighResolutionNet, self).__init__()
        self.stage2 = self._make_stage()

    def _make_stage(self):
        modules = []
        for i in range(2):
            modules.append(
                HighResolutionModule()
            )
        return nn.Sequential(*modules)

    def hybrid_forward(self, F, x_list):
        y_list = self.stage2(x_list)
        return y_list


def get_cls_net():
    model = HighResolutionNet()
    return model


if __name__ == '__main__':
    import easydict
    ctx = mx.gpu()
    args = easydict.EasyDict()
    model = get_cls_net()
    model.initialize()
    model.collect_params().reset_ctx(ctx)
    model.hybridize()
    y_hat = model([mx.nd.random.randn(1, 32, 56, 56, ctx=ctx), mx.nd.random.randn(1, 64, 28, 28, ctx=ctx)])
Steps to reproduce
Just run the script above. Note that everything works if ctx is set to mx.cpu().
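The backtrace in the error message passes through mxnet::exec::FusePointwiseBackward, so one thing that might be worth checking is whether the failure disappears when pointwise fusion is turned off. The sketch below runs a script in a fresh interpreter with extra environment variables. It assumes the reproduction script is saved as repro.py (a hypothetical file name) and that MXNET_USE_FUSION=0 disables that pass in this build, which has not been verified here. The helper name run_with_env is also hypothetical.

```python
import os
import subprocess
import sys


def run_with_env(script_path, extra_env):
    """Run a Python script in a new interpreter with extra environment variables.

    Returns the CompletedProcess so the caller can inspect returncode,
    stdout and stderr.
    """
    env = dict(os.environ)   # start from the current environment
    env.update(extra_env)    # overlay the variables under test
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )


if __name__ == "__main__":
    # Assumption: "repro.py" holds the script from this report.
    result = run_with_env("repro.py", {"MXNET_USE_FUSION": "0"})
    print("exit code:", result.returncode)
    print(result.stderr[-2000:])  # tail of the error output, if any
```

If the script exits cleanly with fusion disabled but still fails with the default settings, that would point at the fusion pass rather than at the network definition. It would not be conclusive on its own.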
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python
# paste outputs here
----------Python Info----------
Version : 3.6.5
Compiler : GCC 7.2.0
Build : ('default', 'Apr 29 2018 16:14:56')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 20.2.2
Directory : /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
None
/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
libuuid.so.1
libluajit.so
Version : 1.7.0
Directory : /data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash : 64f737cdd59fe88d2c5b479f25d011c5156b6a8a
Library : ['/data2/kohill/jye_sanka/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform : Linux-4.13.0-36-generic-x86_64-with-debian-buster-sid
system : Linux
node : a76c618855c0
release : 4.13.0-36-generic
version : #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 2494.534
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4989.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti retpoline intel_ppin spec_ctrl tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0307 sec, LOAD: 3.8286 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 6.2298 sec, LOAD: 1.5923 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>, DNS finished in 0.396883487701416 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.7994 sec, LOAD: 10.9164 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0293 sec, LOAD: 2.1483 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.19745945930480957 sec.