
Replace PS-Worker mode with multi-worker one.#892

Merged
amcadmus merged 11 commits into deepmodeling:devel from shishaochen:devel
Jul 30, 2021

Conversation

@shishaochen
Collaborator

@shishaochen shishaochen commented Jul 26, 2021

Before this pull request, DeePMD-Kit 2.0 Preview offered distributed training in PS-Worker mode based on tf.train.SyncReplicasOptimizer. The old implementation:

  • lacks throughput and efficiency when the cluster scales up.
  • introduces complexity in setting up a TensorFlow cluster.
  • loses flexibility compared to single-worker mode, e.g. evaluation and warm start.

Thus, here comes a simpler but faster implementation based on Horovod. As the table shows, the sample throughput of examples/water/se_e2_a scales linearly when running on an 8-GPU host:

| Num of GPU cards | Seconds per 100 samples | Samples per second | Speedup |
|------------------|-------------------------|--------------------|---------|
| 1                | 1.6116                  | 62.05              | 1.00    |
| 2                | 1.6310                  | 61.31              | 1.98    |
| 4                | 1.6168                  | 61.85              | 3.99    |
| 8                | 1.6212                  | 61.68              | 7.95    |

There is no breaking change to the user interface after this pull request. Key changes behind the scenes are:

  • learning_rate is scaled by the number of workers for better convergence, as the global batch size is larger.
  • To avoid different GPU device mappings across training, inference, and sub-sessions in one process, all sub-sessions in descriptor, neighbor_stat, type_embedding and infer.* are limited to the CPU device only.
  • The with_distrib argument is deleted from the input config. Instead, we decide whether to run multi-worker training according to the MPI context.
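The first and last points can be sketched roughly as follows. This is hypothetical illustration code, not the actual patch; the helper names `detect_world_size` and `scale_learning_rate` are invented here, while `horovod.tensorflow` is Horovod's real TensorFlow binding:

```python
# Hypothetical sketch of the behavior described above,
# not the actual deepmd-kit patch.

def detect_world_size():
    """Return the number of workers, falling back to 1 when Horovod
    (and hence an MPI context) is unavailable."""
    try:
        import horovod.tensorflow as hvd  # optional dependency
        hvd.init()
        return hvd.size()
    except ImportError:
        return 1


def scale_learning_rate(base_lr, world_size):
    """Scale the starting learning rate linearly with the number of
    workers to compensate for the larger global batch size."""
    return base_lr * world_size
```

Under `horovodrun -np 8`, for example, `detect_world_size()` would report 8, so the configured starting learning rate would be multiplied by 8 before training begins.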

@codecov-commenter

codecov-commenter commented Jul 27, 2021

Codecov Report

Merging #892 (7bc10d7) into devel (4985932) will decrease coverage by 9.60%.
The diff coverage is n/a.

❗ Current head 7bc10d7 differs from pull request most recent head cf99b98. Consider uploading reports for the commit cf99b98 to get more accurate results

@@            Coverage Diff             @@
##            devel     #892      +/-   ##
==========================================
- Coverage   73.88%   64.28%   -9.61%     
==========================================
  Files          85        5      -80     
  Lines        6805       14    -6791     
==========================================
- Hits         5028        9    -5019     
+ Misses       1777        5    -1772     
Impacted Files Coverage Δ
deepmd/entrypoints/test.py
source/op/_prod_virial_se_r_grad.py
deepmd/cluster/local.py
deepmd/utils/argcheck.py
source/op/_prod_force_grad.py
deepmd/op/__init__.py
deepmd/entrypoints/doc.py
deepmd/descriptor/loc_frame.py
deepmd/__main__.py
deepmd/fit/ener.py
... and 63 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4985932...cf99b98. Read the comment docs.

@shishaochen
Collaborator Author

shishaochen commented Jul 27, 2021

@njzjz The only failure in CI was caused by the environment. Exit code 2 usually means "file not found". In this case, maybe pip is not available in the Docker container?

Besides, as discussed with @denghuilu yesterday, unit tests for distributed training will be added in the future once the CI environment is ready.

@njzjz
Member

njzjz commented Jul 27, 2021

This failure is fixed in #889 - you can ignore it.

@njzjz njzjz requested a review from denghuilu July 29, 2021 01:59
Member

@amcadmus amcadmus left a comment


Could you please write a subsection like "distributed training" in the "train a model" section of "getting started" to introduce users to distributed training? Thanks!

@shishaochen
Collaborator Author

@amcadmus Document is added now.

@shishaochen shishaochen requested a review from njzjz July 29, 2021 09:09
To experience this powerful feature, please install Horovod first. For better performance on GPU, please follow the tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
```bash
# By default, MPI is used as the communicator.
HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod
```
Member


This extra requirement can be added to setup.py instead, and it may also be mentioned in the installation section.

Collaborator Author

@shishaochen shishaochen Jul 29, 2021


I'm not sure whether horovod should be installed together with deepmd-kit by default. One reason is that optimal build options can differ across cluster/host environments.
@amcadmus @denghuilu What are your opinions?

Member


I don't mean installing it by default. You can add "horovod": ["horovod", "mpi4py"] here:

deepmd-kit/setup.py

Lines 136 to 140 in 953621f

extras_require={
"test": ["dpdata>=0.1.9", "ase", "pytest", "pytest-cov", "pytest-sugar"],
"docs": ["sphinx<4.1.0", "recommonmark", "sphinx_rtd_theme", "sphinx_markdown_tables", "myst-parser", "breathe", "exhale"],
**extras_require,
},

Then users can install with

HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install .[horovod]

@shishaochen shishaochen requested a review from njzjz July 29, 2021 10:09
Member

@denghuilu denghuilu left a comment


In the default serial training mode, the training speed indicates that the GPU device is not enabled for training, although the GPU device has been detected by TensorFlow:

root se_e2_a $ dp train input.json
2021-07-29 22:08:21.243635: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /tmp/pip-req-build-e2bfdasy/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD INFO    source :              v2.0.0.b2-43-gad444c3
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        ad444c3
DEEPMD INFO    source commit at:     2021-07-29 21:28:37 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include;/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           iZ2zeedzsx4jorjze9gyq7Z
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
2021-07-29 22:08:22.873108: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-29 22:08:22.874062: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:08:22.875101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-29 22:08:23.548699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.549803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:08:23.549837: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:08:23.553827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:08:23.553948: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:08:23.555190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:08:23.555509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:08:23.557718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:08:23.558629: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:08:23.558829: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:08:23.558951: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.560063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.561086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:08:23.561123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:08:24.224048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:08:24.224091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:08:24.224101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:08:24.224335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.225467: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.226541: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.227571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
2021-07-29 22:08:24.263356: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-07-29 22:08:24.264028: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
2021-07-29 22:08:28.109074: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:08:28.109125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:08:28.109134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO    batch     100 training time 10.11 s, testing time 0.17 s
DEEPMD INFO    batch     200 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     300 training time 8.92 s, testing time 0.17 s
DEEPMD INFO    batch     400 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     500 training time 8.88 s, testing time 0.17 s
DEEPMD INFO    batch     600 training time 8.86 s, testing time 0.17 s
DEEPMD INFO    batch     700 training time 8.88 s, testing time 0.17 s
DEEPMD INFO    batch     800 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     900 training time 8.86 s, testing time 0.17 s
DEEPMD INFO    batch    1000 training time 8.88 s, testing time 0.17 s

By the way, when I set CUDA_VISIBLE_DEVICES manually, everything works fine:

root se_e2_a $ export CUDA_VISIBLE_DEVICES=0
root se_e2_a $ dp train input.json
2021-07-29 22:17:17.239849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /tmp/pip-req-build-e2bfdasy/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD INFO    source :              v2.0.0.b2-43-gad444c3
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        ad444c3
DEEPMD INFO    source commit at:     2021-07-29 21:28:37 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include;/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           iZ2zeedzsx4jorjze9gyq7Z
DEEPMD INFO    CUDA_VISIBLE_DEVICES: ['0']
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
2021-07-29 22:17:18.880828: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-29 22:17:18.881757: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:17:18.882870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-29 22:17:19.569019: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.570151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:17:19.570188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:19.574174: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:19.574287: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:17:19.575548: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:17:19.575895: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:17:19.578154: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:17:19.579071: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:17:19.579304: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:17:19.579443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.580599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.581638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:17:19.581678: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:20.230706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:17:20.230751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:17:20.230761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:17:20.231000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.232145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.233195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.234241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
2021-07-29 22:17:20.268792: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-07-29 22:17:20.269495: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
2021-07-29 22:17:24.066846: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:17:24.067087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.067507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:17:24.067549: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:24.067622: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:24.067641: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:17:24.067658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:17:24.067675: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:17:24.067691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:17:24.067706: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:17:24.067723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:17:24.067798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:17:24.068487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:17:24.068495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:17:24.068502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:17:24.068584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068932: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.069243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
2021-07-29 22:17:24.827416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:25.278037: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
DEEPMD INFO    batch     100 training time 2.92 s, testing time 0.02 s
DEEPMD INFO    batch     200 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     300 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     400 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     500 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     600 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     700 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     800 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     900 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch    1000 training time 1.39 s, testing time 0.02 s
DEEPMD INFO    saved checkpoint model.ckpt

@shishaochen
Collaborator Author

shishaochen commented Jul 29, 2021

@denghuilu Fixed by commit "Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset".

When executing dp train input.json, the output of the command nvidia-smi is:

Thu Jul 29 23:53:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:0E:00.0 Off |                    0 |
| N/A   40C    P0   103W / 400W |  38398MiB / 40537MiB |     43%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:13:00.0 Off |                    0 |
| N/A   36C    P0    79W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:4B:00.0 Off |                    0 |
| N/A   35C    P0    72W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:51:00.0 Off |                    0 |
| N/A   38C    P0    73W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2352106      C   /usr/bin/python3                38395MiB |
|    1   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
|    2   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
|    3   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
+-----------------------------------------------------------------------------+
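As an aside, a common Horovod remedy when one process allocates memory on every visible GPU (as the `nvidia-smi` output above shows for PID 2352106) is to pin each rank to its local GPU before TensorFlow initializes. A minimal sketch, assuming Horovod's standard `local_rank()` API; `pin_gpu_for_rank` is a hypothetical helper, not necessarily the exact fix in this commit:

```python
import os

def pin_gpu_for_rank(local_rank):
    """Restrict this process to the single GPU matching its Horovod
    local rank by narrowing CUDA_VISIBLE_DEVICES. Must run before
    the first TensorFlow session is created, since device visibility
    is fixed at CUDA initialization."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# In a real multi-worker run this would be called as, e.g.:
#   pin_gpu_for_rank(hvd.local_rank())
```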

@shishaochen shishaochen requested a review from denghuilu July 29, 2021 16:02
@denghuilu
Member

Now serial training can be performed correctly, but when I use two GPUs for training, an error occurs:

(tensorflow_venv) LuDh se_e2_a $ horovodrun -np 2 dp train input.json
2021-07-30 08:57:06.104608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:08.735608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:08.809397: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:WARNING:tensorflow:From /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:non-resource variables are not supported in the long term
[1,1]<stderr>:WARNING:tensorflow:From /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]<stderr>:Instructions for updating:
[1,1]<stderr>:non-resource variables are not supported in the long term
[1,0]<stderr>:DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[1,0]<stderr>:DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,0]<stderr>:DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,0]<stderr>:DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,0]<stderr>:DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,0]<stderr>:DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,0]<stderr>:DEEPMD INFO    Please read and cite:
[1,0]<stderr>:DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]<stderr>:DEEPMD INFO    installed to:         /tmp/pip-req-build-_btka69i/_skbuild/linux-x86_64-3.7/cmake-install
[1,0]<stderr>:DEEPMD INFO    source :              v2.0.0.b2-44-gcf99b98
[1,0]<stderr>:DEEPMD INFO    source brach:         devel
[1,0]<stderr>:DEEPMD INFO    source commit:        cf99b98
[1,0]<stderr>:DEEPMD INFO    source commit at:     2021-07-29 23:49:58 +0800
[1,0]<stderr>:DEEPMD INFO    build float prec:     double
[1,0]<stderr>:DEEPMD INFO    build with tf inc:    /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include;/home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include
[1,0]<stderr>:DEEPMD INFO    build with tf lib:
[1,0]<stderr>:DEEPMD INFO    ---Summary of the training---------------------------------------
[1,0]<stderr>:DEEPMD INFO    distributed
[1,0]<stderr>:DEEPMD INFO    world size:              2
[1,0]<stderr>:DEEPMD INFO    my rank:              0
[1,0]<stderr>:DEEPMD INFO    node list:          ['ludh-ubuntu']
[1,0]<stderr>:DEEPMD INFO    running on:           ludh-ubuntu
[1,0]<stderr>:DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
[1,0]<stderr>:DEEPMD INFO    num_intra_threads:    0
[1,0]<stderr>:DEEPMD INFO    num_inter_threads:    0
[1,0]<stderr>:DEEPMD INFO    -----------------------------------------------------------------
[1,0]<stderr>:2021-07-30 08:57:10.913867: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]<stderr>:2021-07-30 08:57:10.915006: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-07-30 08:57:10.915493: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,0]<stderr>:2021-07-30 08:57:10.915699: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-07-30 08:57:10.916563: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,1]<stderr>:2021-07-30 08:57:10.916718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-07-30 08:57:11.513681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:11.515266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:11.515309: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:11.517185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:11.518064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:11.518113: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:11.518816: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,1]<stderr>:2021-07-30 08:57:11.518907: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,1]<stderr>:2021-07-30 08:57:11.520421: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-07-30 08:57:11.520825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-07-30 08:57:11.521579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:11.521688: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,0]<stderr>:2021-07-30 08:57:11.523201: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-07-30 08:57:11.523589: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-07-30 08:57:11.524260: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,1]<stderr>:2021-07-30 08:57:11.524995: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,1]<stderr>:2021-07-30 08:57:11.525212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,0]<stderr>:2021-07-30 08:57:11.526943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-07-30 08:57:11.527675: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,0]<stderr>:2021-07-30 08:57:11.527908: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,1]<stderr>:2021-07-30 08:57:11.528677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-07-30 08:57:11.528737: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:11.533266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-07-30 08:57:11.533328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:13.048635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-07-30 08:57:13.048709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,1]<stderr>:2021-07-30 08:57:13.048724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,1]<stderr>:2021-07-30 08:57:13.048732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:13.052828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.054032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-07-30 08:57:13.054124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,0]<stderr>:2021-07-30 08:57:13.054140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,0]<stderr>:2021-07-30 08:57:13.054149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:13.055609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.062760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.066599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO    found 3 system(s):
[1,0]<stderr>:DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
[1,0]<stderr>:DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
[1,0]<stderr>:DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
[1,0]<stderr>:DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
[1,0]<stderr>:DEEPMD INFO    --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO    found 1 system(s):
[1,0]<stderr>:DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
[1,0]<stderr>:DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
[1,0]<stderr>:DEEPMD INFO    --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO    training without frame parameter
[1,1]<stderr>:2021-07-30 08:57:13.115900: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
[1,0]<stderr>:2021-07-30 08:57:13.130318: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
[1,0]<stderr>:2021-07-30 08:57:13.131814: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz
[1,1]<stderr>:2021-07-30 08:57:13.136277: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz
[1,1]<stderr>:2021-07-30 08:57:13.177561: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 8.91G (9569109248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,1]<stderr>:(... 23 further `CUDA_ERROR_OUT_OF_MEMORY` allocation retries elided, with the requested size shrinking from 8.02G down to 655.14M ...)
[1,1]<stderr>:2021-07-30 08:57:13.210474: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 589.63M (618268416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:DEEPMD INFO    built lr
[1,0]<stderr>:DEEPMD INFO    built network
[1,1]<stderr>:2021-07-30 08:57:18.249933: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,1]<stderr>:2021-07-30 08:57:18.251206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:18.252085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:18.252192: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:18.252304: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252330: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252355: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252379: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252405: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252429: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252454: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,1]<stderr>:2021-07-30 08:57:18.254984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-07-30 08:57:18.255070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-07-30 08:57:18.255082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,1]<stderr>:2021-07-30 08:57:18.255092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,1]<stderr>:2021-07-30 08:57:18.255100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:18.256932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,1]<stderr>:2021-07-30 08:57:18.257750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    built training
[1,0]<stderr>:2021-07-30 08:57:18.870066: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,0]<stderr>:2021-07-30 08:57:18.871711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:18.872610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:18.872691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:18.872907: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:18.872937: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,0]<stderr>:2021-07-30 08:57:18.872964: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-07-30 08:57:18.872989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-07-30 08:57:18.873017: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-07-30 08:57:18.873045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,0]<stderr>:2021-07-30 08:57:18.873070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,0]<stderr>:2021-07-30 08:57:18.875614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-07-30 08:57:18.875745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-07-30 08:57:18.875760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,0]<stderr>:2021-07-30 08:57:18.875771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,0]<stderr>:2021-07-30 08:57:18.875780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,0]<stderr>:2021-07-30 08:57:18.881571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:18.882419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    initialize model from scratch
[1,0]<stderr>:DEEPMD INFO    broadcast global variables to other tasks
[1,0]<stderr>:DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]<stderr>:2021-07-30 08:57:21.157894: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:21.517267: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:(... 38 further identical `CUBLAS_STATUS_NOT_INITIALIZED` errors elided ...)
[1,0]<stderr>:2021-07-30 08:57:21.755007: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:cuda assert: out of memory /tmp/pip-req-build-_btka69i/source/lib/include/gpu_cuda.h 122
[1,0]<stderr>:Your memory is not enough, thus an error has been raised above. You need to take the following actions:
[1,0]<stderr>:1. Check if the network size of the model is too large.
[1,0]<stderr>:2. Check if the batch size of training or testing is too large. You can set the training batch size to `auto`.
[1,0]<stderr>:3. Check if the number of atoms is too large.
[1,0]<stderr>:4. Check if another program is using the same GPU by executing `nvidia-smi`. The usage of GPUs is controlled by the `CUDA_VISIBLE_DEVICES` environment variable.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1,1]<stderr>:2021-07-30 08:57:22.771616: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46465,1],0]
  Exit code:    2
--------------------------------------------------------------------------

(tensorflow_venv) LuDh dp-devel $ nvidia-smi
Fri Jul 30 09:01:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:73:00.0 Off |                  N/A |
| 25%   40C    P8     8W / 250W |    967MiB / 11175MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:A6:00.0 Off |                  N/A |
| 25%   42C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                689MiB |
|    0   N/A  N/A      3741      G   compiz                            275MiB |
+-----------------------------------------------------------------------------+

@shishaochen
Collaborator Author

shishaochen commented Jul 30, 2021

@denghuilu I have explained in the documentation that:

Note that the environment variable `CUDA_VISIBLE_DEVICES` must be set to control parallelism on the shared host, where each process is bound to one GPU card.

It is not good practice to run a GPU program without explicitly declaring `CUDA_VISIBLE_DEVICES`, as TensorFlow always grabs the GPU cards with the lowest IDs, which may conflict with other users on the same host.

This is also why I prefer not to use the GPU when `CUDA_VISIBLE_DEVICES` is unset.
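One common pattern for the binding described above is to derive `CUDA_VISIBLE_DEVICES` from the MPI local rank before TensorFlow initializes, so each process sees exactly one card. This is only a sketch, not the actual DeePMD-kit code; it assumes an Open MPI launcher, which exports `OMPI_COMM_WORLD_LOCAL_RANK` for each local process:

```python
import os

# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK per process on a host.
# Fall back to 0 so the snippet also works when run standalone.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# Restrict this process to a single GPU card before importing TensorFlow,
# so device 0 inside the process maps to physical card `local_rank`.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

print("local rank %d -> CUDA_VISIBLE_DEVICES=%s"
      % (local_rank, os.environ["CUDA_VISIBLE_DEVICES"]))
```

Launched as `mpirun -np 2 python train_wrapper.py` (wrapper name illustrative), rank 0 would see card 0 and rank 1 card 1, matching the one-process-one-card binding above.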

Member

@denghuilu denghuilu left a comment


I have tested the horovod training process in the CUDA environment, and there's no problem

@amcadmus amcadmus merged commit 31f1ef6 into deepmodeling:devel Jul 30, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* Replace PS-Worker mode with multi-worker one.

* Remove deprecated `try_distrib` argument in tests.

* Limit reference of mpi4py to logger.py.

* Add tutorial on parallel training.

* Refine words & tokens used.

* Only limit sub sessions to CPU when distributed training.

* Add description of `mpi4py` in tutorial.

* Explain linear relationship between batch size and learning rate.

* Fine documents & comments.

* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
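The "linear relationship between batch size and learning rate" commit above refers to the scaling rule stated in the PR description: the learning rate is multiplied by the number of workers because the effective global batch size grows linearly with them. A minimal sketch of that rule (the function name is illustrative, not the actual DeePMD-kit API):

```python
def scale_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linear LR scaling: the global batch size is the per-worker
    batch size times num_workers, so the step size is scaled by the
    same factor for comparable convergence."""
    return base_lr * num_workers

# With 8 workers, a base learning rate of 1e-3 becomes 8e-3.
print(scale_learning_rate(1e-3, 8))
```

With a single worker the rule is a no-op, which is why the same input file works unchanged for serial and parallel training.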
