This repository was archived by the owner on Nov 17, 2023. It is now read-only.
[v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch #19929
Closed
[2021-02-18T21:59:21.985Z] what(): [21:59:18] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
[2021-02-18T21:59:21.985Z] Stack trace:
[2021-02-18T21:59:21.985Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27ed308) [0x7fb4b07e3308]
[2021-02-18T21:59:21.985Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1879) [0x7fb4b57d7879]
[2021-02-18T21:59:21.985Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1e36) [0x7fb4b57d7e36]
[2021-02-18T21:59:21.985Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)1>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x1c7) [0x7fb4b57f7097]
[2021-02-18T21:59:21.985Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7fb4b57f7346]
[2021-02-18T21:59:21.985Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77f92b4) [0x7fb4b57ef2b4]
[2021-02-18T21:59:21.985Z] [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fb555711c80]
[2021-02-18T21:59:21.985Z] [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fb55d8886ba]
[2021-02-18T21:59:21.985Z] [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb55ca6b4dd]
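For reference, the `(7 vs. 0)` in the failed check is the numeric `cublasStatus_t` that was returned, compared against `CUBLAS_STATUS_SUCCESS`. The small sketch below decodes that pair from a log line. The enum values are taken from the cuBLAS header as I remember them, and `decode_check_failed` is a hypothetical helper written for this issue, not part of MXNet or mshadow.

```python
import re
from enum import IntEnum


class CublasStatus(IntEnum):
    """cublasStatus_t values (assumed to match cublas_api.h)."""
    SUCCESS = 0
    NOT_INITIALIZED = 1
    ALLOC_FAILED = 3
    INVALID_VALUE = 7
    ARCH_MISMATCH = 8
    MAPPING_ERROR = 11
    EXECUTION_FAILED = 13
    INTERNAL_ERROR = 14
    NOT_SUPPORTED = 15
    LICENSE_ERROR = 16


def decode_check_failed(line):
    """Return (actual, expected) statuses from a '(A vs. B)' check message.

    Returns None when the line has no such pair.
    """
    match = re.search(r"\((\d+) vs\. (\d+)\)", line)
    if match is None:
        return None
    return CublasStatus(int(match.group(1))), CublasStatus(int(match.group(2)))


if __name__ == "__main__":
    msg = ("Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : "
           "Destory cublas handle failed")
    actual, expected = decode_check_failed(msg)
    print(actual.name, "vs", expected.name)
```

Under that assumption, status 7 is `INVALID_VALUE`, i.e. the handle passed at teardown was not one the loaded cuBLAS recognised as valid, which would be consistent with a library mismatch but does not prove it.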
This has been happening for a while now. #19506 attempted to fix it, but the error either persisted or came back. I think this is most likely a CUDA/cuDNN/cuBLAS version mismatch. I have created a branch with
ENV CUDA_VERSION=10.2.89
ENV CUDNN_VERSION=8.0.4.30
COPY install/ubuntu_cudnn.sh /work/
RUN /work/ubuntu_cudnn.sh
this section of the file (https://github.com/apache/incubator-mxnet/blob/v1.x/ci/docker/Dockerfile.build.ubuntu_gpu_cu102) removed altogether, and kicked off a run on that branch to see whether this resolves the issue.
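One way to test the mismatch hypothesis directly inside the container is to compare the CUDA runtime that is actually loaded with the CUDA runtime that the installed cuDNN was built against. The following is a rough sketch, not something from the CI scripts: it assumes `libcudart` and `libcudnn` can be found by the dynamic loader, that `cudnnGetCudartVersion` is reachable through `libcudnn`, and that cuDNN uses the pre-9.x version encoding. The helper names (`decode_cuda_version`, `decode_cudnn_version`, `same_cuda_release`, `report`) are hypothetical.

```python
import ctypes
import ctypes.util


def decode_cuda_version(v):
    """CUDA encodes versions as 1000*major + 10*minor, e.g. 10020 -> (10, 2)."""
    return v // 1000, (v % 1000) // 10


def decode_cudnn_version(v):
    """cuDNN 8.x and earlier: 1000*major + 100*minor + patch, e.g. 8004 -> (8, 0, 4)."""
    return v // 1000, (v % 1000) // 100, v % 100


def same_cuda_release(runtime_version, cudnn_built_against):
    """True when both encoded CUDA versions share the same major.minor."""
    return decode_cuda_version(runtime_version) == decode_cuda_version(cudnn_built_against)


def _load(name):
    """Load a shared library by short name, or return None if unavailable."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def report():
    """Print the loaded CUDA runtime and cuDNN versions and whether they agree."""
    cudart, cudnn = _load("cudart"), _load("cudnn")
    if cudart is None or cudnn is None:
        print("libcudart or libcudnn not found; nothing to compare")
        return None
    runtime = ctypes.c_int(0)
    # cudaRuntimeGetVersion returns a cudaError_t; 0 means success.
    if cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) != 0:
        print("cudaRuntimeGetVersion failed")
        return None
    try:
        cudnn.cudnnGetVersion.restype = ctypes.c_size_t
        cudnn.cudnnGetCudartVersion.restype = ctypes.c_size_t
        cudnn_version = cudnn.cudnnGetVersion()
        built_against = cudnn.cudnnGetCudartVersion()
    except AttributeError:
        print("cuDNN version symbols not exported by this libcudnn")
        return None
    print("CUDA runtime:", decode_cuda_version(runtime.value))
    print("cuDNN:", decode_cudnn_version(cudnn_version))
    print("cuDNN built against CUDA:", decode_cuda_version(built_against))
    return same_cuda_release(runtime.value, built_against)


if __name__ == "__main__":
    report()
```

If `report()` returns `False` in the cu102 image, that would support the mismatch theory. A `True` result would not rule out a cuBLAS-specific problem, since this sketch does not query the cuBLAS version.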