
Problem when using CUDA to accelerate dp #650

@343333333

When I use the CPU version, it goes well. But a problem occurs when I switch to the GPU version after updating the dependencies.
It seems the CUDA version is too old to start, but I wrote "module load cuda/10.2" in the submission script, and CUDA 10.2 is available on the server.
The log file:

# DEEPMD: installed to:         /tmp/pip-req-build-8l1_0ns9/_skbuild/linux-x86_64-3.7/cmake-install
# DEEPMD: source :              v1.3.3
# DEEPMD: source brach:         HEAD
# DEEPMD: source commit:        3a59596
# DEEPMD: source commit at:     2021-03-20 00:53:44 +0800
# DEEPMD: build float prec:     double
# DEEPMD: build with tf inc:    /work/Software/miniconda3/lib/python3.7/site-packages/tensorflow/include;/work/Software/miniconda3/lib/python3.7/site-packages/tensorflow/include
# DEEPMD: build with tf lib:    
# DEEPMD: running on:           gpu03
# DEEPMD: CUDA_VISIBLE_DEVICES: unset
# DEEPMD: num_intra_threads:    0
# DEEPMD: num_inter_threads:    0
# DEEPMD: -----------------------------------------------------------------
# DEEPMD: 
2021-05-21 10:25:15.310984: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-21 10:25:15.343416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-21 10:25:16.227336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-05-21 10:25:16.228063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:86:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-05-21 10:25:16.228101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-05-21 10:25:16.407775: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-05-21 10:25:16.407922: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-05-21 10:25:16.518689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-21 10:25:16.718146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-21 10:25:16.877558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-21 10:25:17.027248: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-05-21 10:25:17.272541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-05-21 10:25:17.275906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-05-21 10:25:17.275974: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-05-21 10:25:17.276270: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
2021-05-21 10:25:17.276289: E tensorflow/c/c_api.cc:2184] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/work/chem-wangyg/Software/miniconda3/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/main.py", line 73, in main
    train(args)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/train.py", line 87, in train
    _do_work(jdata, run_opt)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/train.py", line 91, in _do_work
    model = NNPTrainer (jdata, run_opt = run_opt)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/Trainer.py", line 49, in __init__
    self._init_param(jdata)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/Trainer.py", line 62, in _init_param
    self.descrpt = DescrptSeA(descrpt_param)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/DescrptSeA.py", line 87, in __init__
    self.sub_sess = tf.Session(graph = sub_graph, config=default_tf_session_config)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
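
The log above names the CUDA runtime library that TensorFlow actually opened (libcudart.so.10.1 in this run). As a minimal sketch, assuming a grep that supports `-o` and that the training output has been captured to a file, a helper like the following could list the runtime libraries mentioned in such a log. The function name `loaded_cudart` is hypothetical and not part of DeePMD-kit or TensorFlow.

```shell
#!/bin/bash
# Hypothetical helper: print each distinct libcudart name that appears in
# the log file given as the first argument, one per line.
loaded_cudart() {
  grep -o 'libcudart\.so\.[0-9][0-9.]*' "$1" | sort -u
}

# Example call (the file name is an assumption):
#   loaded_cudart train.log
```

Comparing this output with the toolkit version requested in the job script shows whether the runtime in use is the one that was intended.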

This is my submission script:

#!/bin/bash
#BSUB -J dpmd
#BSUB -q gpu
#BSUB -n 12
#BSUB -e %J.err
#BSUB -o %J.out
#BSUB -R "span[ptile=24]"

module load cuda/10.2

cd 000
test $? -ne 0 && exit 1

if [ ! -f tag_0_finished ] ;then
  { if [ ! -f model.ckpt.index ]; then ~/Software/miniconda3/bin/dp train input.json; else ~/Software/miniconda3/bin/dp train input.json --restart model.ckpt; fi }  1>> train.log 2>> train.log 
  if test $? -ne 0; then exit 1; else touch tag_0_finished; fi 
fi

cd /work/dpgen/test/temp/160837f9-78bd-426e-8d22-3af727ea0ca4
test $? -ne 0 && exit 1

wait

cd 000
test $? -ne 0 && exit 1

if [ ! -f tag_1_finished ] ;then
  ~/Software/miniconda3/bin/dp freeze  1>> train.log 2>> train.log 
  if test $? -ne 0; then exit 1; else touch tag_1_finished; fi 
fi

cd /work/dpgen/test/temp/160837f9-78bd-426e-8d22-3af727ea0ca4
test $? -ne 0 && exit 1

wait


touch 160837f9-78bd-426e-8d22-3af727ea0ca4_tag_finished
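
The error in the log concerns the NVIDIA driver on the compute node rather than the toolkit. An environment module such as cuda/10.2 typically only adjusts paths to toolkit libraries and does not change the installed driver. As a rough sketch, a pre-flight check along these lines could be placed near the top of a job script like the one above. The function names are hypothetical, the minimum driver versions are the Linux values from NVIDIA's CUDA release notes for the 10.x toolkits only, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/bash
# Hypothetical helper: minimum Linux driver version for a CUDA 10.x runtime
# (values taken from NVIDIA's CUDA Toolkit release notes).
min_driver_for_runtime() {
  case "$1" in
    10.0) echo 410.48 ;;
    10.1) echo 418.39 ;;
    10.2) echo 440.33 ;;
    *) return 1 ;;   # runtime version not covered by this sketch
  esac
}

# Succeeds when driver version $1 is at least the required version $2.
# Uses version-aware sorting: the smaller of the two must be the requirement.
driver_is_sufficient() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a GPU node (requires nvidia-smi; the runtime version 10.1
# is the one reported in the log above):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_is_sufficient "$drv" "$(min_driver_for_runtime 10.1)" \
#     || echo "driver $drv is too old for CUDA runtime 10.1" >&2
```

If the check fails, the job can stop early with a clear message instead of failing inside TensorFlow session creation.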
