
Replace PS-Worker mode with multi-worker one.#892

Merged
amcadmus merged 11 commits into deepmodeling:devel from shishaochen:devel
Jul 30, 2021

Conversation

@shishaochen
Collaborator

@shishaochen shishaochen commented Jul 26, 2021

Before this pull request, DeePMD-Kit 2.0 Preview offered distributed training in PS-Worker mode based on tf.train.SyncReplicasOptimizer. The old implementation:

  • lacks throughput and efficiency when the cluster scales up.
  • introduces complexity in setting up a TensorFlow cluster.
  • loses flexibility compared to single-worker mode, e.g. evaluation and warm start.

Thus, here comes a simpler but faster implementation based on Horovod. As the table shows, the sample throughput of examples/water/se_e2_a scales linearly when running on an 8-GPU host:

| Num of GPU cards | Seconds per 100 samples | Samples per second | Speedup |
|------------------|-------------------------|--------------------|---------|
| 1                | 1.6116                  | 62.05              | 1.00    |
| 2                | 1.6310                  | 61.31              | 1.98    |
| 4                | 1.6168                  | 61.85              | 3.99    |
| 8                | 1.6212                  | 61.68              | 7.95    |

There is no breaking change to the user interface after this pull request. Key changes behind the scenes are:

  • learning_rate is scaled by the number of workers for better convergence, as the global batch size is larger.
  • To avoid different GPU device mappings across training, inference, and sub-sessions in one process, all sub-sessions in descriptor, neighbor_stat, type_embedding and infer.* are limited to the CPU device only.
  • The with_distrib argument is deleted from the input config. Instead, we decide whether to run multi-worker training according to the MPI context.
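The first and last points can be sketched roughly as follows. This is hypothetical illustration code, not the actual patch; the helper names `detect_world_size` and `scale_learning_rate` are invented here, while `horovod.tensorflow` is Horovod's real TensorFlow binding:

```python
# Hypothetical sketch of the behavior described above,
# not the actual deepmd-kit patch.

def detect_world_size():
    """Return the number of workers, falling back to 1 when Horovod
    (and hence an MPI context) is unavailable."""
    try:
        import horovod.tensorflow as hvd  # optional dependency
        hvd.init()
        return hvd.size()
    except ImportError:
        return 1


def scale_learning_rate(base_lr, world_size):
    """Scale the starting learning rate linearly with the number of
    workers to compensate for the larger global batch size."""
    return base_lr * world_size
```

Under `horovodrun -np 8`, for example, `detect_world_size()` would report 8, so the configured starting learning rate would be multiplied by 8 before training begins.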

@codecov-commenter

codecov-commenter commented Jul 27, 2021

Codecov Report

Merging #892 (7bc10d7) into devel (4985932) will decrease coverage by 9.60%.
The diff coverage is n/a.

❗ Current head 7bc10d7 differs from pull request most recent head cf99b98. Consider uploading reports for the commit cf99b98 to get more accurate results

@@            Coverage Diff             @@
##            devel     #892      +/-   ##
==========================================
- Coverage   73.88%   64.28%   -9.61%     
==========================================
  Files          85        5      -80     
  Lines        6805       14    -6791     
==========================================
- Hits         5028        9    -5019     
+ Misses       1777        5    -1772     
Impacted Files Coverage Δ
deepmd/entrypoints/test.py
source/op/_prod_virial_se_r_grad.py
deepmd/cluster/local.py
deepmd/utils/argcheck.py
source/op/_prod_force_grad.py
deepmd/op/__init__.py
deepmd/entrypoints/doc.py
deepmd/descriptor/loc_frame.py
deepmd/__main__.py
deepmd/fit/ener.py
... and 63 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4985932...cf99b98. Read the comment docs.

@shishaochen
Collaborator Author

shishaochen commented Jul 27, 2021

@njzjz The only failure in CI was caused by the environment. Exit code 2 usually means "file not found". In this case, maybe pip is not available in the Docker container?

Besides, as discussed with @denghuilu yesterday, unit tests for distributed training will be added in the future once the CI environment is ready.

@njzjz
Member

njzjz commented Jul 27, 2021

This failure is fixed in #889 - you can ignore it.

@njzjz njzjz requested a review from denghuilu July 29, 2021 01:59
Member

@amcadmus amcadmus left a comment


Could you please write a subsection like "distributed training" in the "train a model" section of "getting started" to introduce users to distributed training? Thanks!

@shishaochen
Collaborator Author

@amcadmus Document is added now.

@shishaochen shishaochen requested a review from njzjz July 29, 2021 09:09
To experience this powerful feature, please install Horovod first. For better performance on GPU, please follow the tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
```bash
# By default, MPI is used as the communicator.
HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod
```
Member


This extra requirement can be added to setup.py instead, and it may also be mentioned in the installation section.

Collaborator Author

@shishaochen shishaochen Jul 29, 2021


I'm not sure whether horovod should be installed together with deepmd-kit by default. One reason is that optimal build options can differ across cluster/host environments.
@amcadmus @denghuilu What are your opinions?

Member


I don't mean installing it by default. You can add "horovod": ["horovod", "mpi4py"] here:

deepmd-kit/setup.py

Lines 136 to 140 in 953621f

extras_require={
"test": ["dpdata>=0.1.9", "ase", "pytest", "pytest-cov", "pytest-sugar"],
"docs": ["sphinx<4.1.0", "recommonmark", "sphinx_rtd_theme", "sphinx_markdown_tables", "myst-parser", "breathe", "exhale"],
**extras_require,
},

Then users can install with

HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install .[horovod]

@shishaochen shishaochen requested a review from njzjz July 29, 2021 10:09
Member

@denghuilu denghuilu left a comment


In the default serial training mode, the training speed indicates that the GPU device is not enabled for training, although the GPU device has been detected by TensorFlow:

root se_e2_a $ dp train input.json
2021-07-29 22:08:21.243635: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /tmp/pip-req-build-e2bfdasy/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD INFO    source :              v2.0.0.b2-43-gad444c3
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        ad444c3
DEEPMD INFO    source commit at:     2021-07-29 21:28:37 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include;/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           iZ2zeedzsx4jorjze9gyq7Z
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
2021-07-29 22:08:22.873108: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-29 22:08:22.874062: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:08:22.875101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-29 22:08:23.548699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.549803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:08:23.549837: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:08:23.553827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:08:23.553948: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:08:23.555190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:08:23.555509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:08:23.557718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:08:23.558629: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:08:23.558829: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:08:23.558951: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.560063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:23.561086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:08:23.561123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:08:24.224048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:08:24.224091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:08:24.224101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:08:24.224335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.225467: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.226541: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:08:24.227571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
2021-07-29 22:08:24.263356: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-07-29 22:08:24.264028: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
2021-07-29 22:08:28.109074: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:08:28.109125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:08:28.109134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
DEEPMD INFO    batch     100 training time 10.11 s, testing time 0.17 s
DEEPMD INFO    batch     200 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     300 training time 8.92 s, testing time 0.17 s
DEEPMD INFO    batch     400 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     500 training time 8.88 s, testing time 0.17 s
DEEPMD INFO    batch     600 training time 8.86 s, testing time 0.17 s
DEEPMD INFO    batch     700 training time 8.88 s, testing time 0.17 s
DEEPMD INFO    batch     800 training time 8.90 s, testing time 0.17 s
DEEPMD INFO    batch     900 training time 8.86 s, testing time 0.17 s
DEEPMD INFO    batch    1000 training time 8.88 s, testing time 0.17 s

By the way, when I set CUDA_VISIBLE_DEVICES manually, everything works fine:

root se_e2_a $ export CUDA_VISIBLE_DEVICES=0
root se_e2_a $ dp train input.json
2021-07-29 22:17:17.239849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /tmp/pip-req-build-e2bfdasy/_skbuild/linux-x86_64-3.6/cmake-install
DEEPMD INFO    source :              v2.0.0.b2-43-gad444c3
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        ad444c3
DEEPMD INFO    source commit at:     2021-07-29 21:28:37 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include;/root/dp-devel/tensorflow_venv/lib/python3.6/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           iZ2zeedzsx4jorjze9gyq7Z
DEEPMD INFO    CUDA_VISIBLE_DEVICES: ['0']
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
2021-07-29 22:17:18.880828: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-29 22:17:18.881757: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:17:18.882870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-29 22:17:19.569019: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.570151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:17:19.570188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:19.574174: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:19.574287: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:17:19.575548: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:17:19.575895: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:17:19.578154: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:17:19.579071: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:17:19.579304: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:17:19.579443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.580599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:19.581638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:17:19.581678: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:20.230706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:17:20.230751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:17:20.230761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:17:20.231000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.232145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.233195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:20.234241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
2021-07-29 22:17:20.268792: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-07-29 22:17:20.269495: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
2021-07-29 22:17:24.066846: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-29 22:17:24.067087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.067507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-29 22:17:24.067549: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 22:17:24.067622: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:24.067641: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-29 22:17:24.067658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-29 22:17:24.067675: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-29 22:17:24.067691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-29 22:17:24.067706: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-29 22:17:24.067723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-29 22:17:24.067798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-07-29 22:17:24.068487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-29 22:17:24.068495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-07-29 22:17:24.068502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-07-29 22:17:24.068584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.068932: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 22:17:24.069243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30129 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
2021-07-29 22:17:24.827416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-29 22:17:25.278037: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
DEEPMD INFO    batch     100 training time 2.92 s, testing time 0.02 s
DEEPMD INFO    batch     200 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     300 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     400 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     500 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     600 training time 1.41 s, testing time 0.02 s
DEEPMD INFO    batch     700 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     800 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch     900 training time 1.40 s, testing time 0.02 s
DEEPMD INFO    batch    1000 training time 1.39 s, testing time 0.02 s
DEEPMD INFO    saved checkpoint model.ckpt

@shishaochen
Collaborator Author

shishaochen commented Jul 29, 2021

@denghuilu Fixed by commit "Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset".

When executing dp train input.json, the output of the command nvidia-smi is:

Thu Jul 29 23:53:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:0E:00.0 Off |                    0 |
| N/A   40C    P0   103W / 400W |  38398MiB / 40537MiB |     43%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:13:00.0 Off |                    0 |
| N/A   36C    P0    79W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:4B:00.0 Off |                    0 |
| N/A   35C    P0    72W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:51:00.0 Off |                    0 |
| N/A   38C    P0    73W / 400W |    570MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2352106      C   /usr/bin/python3                38395MiB |
|    1   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
|    2   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
|    3   N/A  N/A   2352106      C   /usr/bin/python3                  567MiB |
+-----------------------------------------------------------------------------+
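As an aside, a common Horovod remedy when one process allocates memory on every visible GPU (as the `nvidia-smi` output above shows for PID 2352106) is to pin each rank to its local GPU before TensorFlow initializes. A minimal sketch, assuming Horovod's standard `local_rank()` API; `pin_gpu_for_rank` is a hypothetical helper, not necessarily the exact fix in this commit:

```python
import os

def pin_gpu_for_rank(local_rank):
    """Restrict this process to the single GPU matching its Horovod
    local rank by narrowing CUDA_VISIBLE_DEVICES. Must run before
    the first TensorFlow session is created, since device visibility
    is fixed at CUDA initialization."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# In a real multi-worker run this would be called as, e.g.:
#   pin_gpu_for_rank(hvd.local_rank())
```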

@shishaochen shishaochen requested a review from denghuilu July 29, 2021 16:02
@denghuilu
Member

Now serial training can be performed correctly, but when I use two GPUs for training, an error occurs:

(tensorflow_venv) LuDh se_e2_a $ horovodrun -np 2 dp train input.json
2021-07-30 08:57:06.104608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:08.735608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:08.809397: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:WARNING:tensorflow:From /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:non-resource variables are not supported in the long term
[1,1]<stderr>:WARNING:tensorflow:From /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]<stderr>:Instructions for updating:
[1,1]<stderr>:non-resource variables are not supported in the long term
[1,0]<stderr>:DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[1,0]<stderr>:DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,0]<stderr>:DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,0]<stderr>:DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,0]<stderr>:DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,0]<stderr>:DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,0]<stderr>:DEEPMD INFO    Please read and cite:
[1,0]<stderr>:DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]<stderr>:DEEPMD INFO    installed to:         /tmp/pip-req-build-_btka69i/_skbuild/linux-x86_64-3.7/cmake-install
[1,0]<stderr>:DEEPMD INFO    source :              v2.0.0.b2-44-gcf99b98
[1,0]<stderr>:DEEPMD INFO    source brach:         devel
[1,0]<stderr>:DEEPMD INFO    source commit:        cf99b98
[1,0]<stderr>:DEEPMD INFO    source commit at:     2021-07-29 23:49:58 +0800
[1,0]<stderr>:DEEPMD INFO    build float prec:     double
[1,0]<stderr>:DEEPMD INFO    build with tf inc:    /home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include;/home/LuDh/dp-devel/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include
[1,0]<stderr>:DEEPMD INFO    build with tf lib:
[1,0]<stderr>:DEEPMD INFO    ---Summary of the training---------------------------------------
[1,0]<stderr>:DEEPMD INFO    distributed
[1,0]<stderr>:DEEPMD INFO    world size:              2
[1,0]<stderr>:DEEPMD INFO    my rank:              0
[1,0]<stderr>:DEEPMD INFO    node list:          ['ludh-ubuntu']
[1,0]<stderr>:DEEPMD INFO    running on:           ludh-ubuntu
[1,0]<stderr>:DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
[1,0]<stderr>:DEEPMD INFO    num_intra_threads:    0
[1,0]<stderr>:DEEPMD INFO    num_inter_threads:    0
[1,0]<stderr>:DEEPMD INFO    -----------------------------------------------------------------
[1,0]<stderr>:2021-07-30 08:57:10.913867: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]<stderr>:2021-07-30 08:57:10.915006: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-07-30 08:57:10.915493: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,0]<stderr>:2021-07-30 08:57:10.915699: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-07-30 08:57:10.916563: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,1]<stderr>:2021-07-30 08:57:10.916718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-07-30 08:57:11.513681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:11.515266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:11.515309: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:11.517185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:11.518064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:11.518113: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:11.518816: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,1]<stderr>:2021-07-30 08:57:11.518907: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,1]<stderr>:2021-07-30 08:57:11.520421: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-07-30 08:57:11.520825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-07-30 08:57:11.521579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:11.521688: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,0]<stderr>:2021-07-30 08:57:11.523201: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-07-30 08:57:11.523589: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-07-30 08:57:11.524260: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,1]<stderr>:2021-07-30 08:57:11.524995: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,1]<stderr>:2021-07-30 08:57:11.525212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,0]<stderr>:2021-07-30 08:57:11.526943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-07-30 08:57:11.527675: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,0]<stderr>:2021-07-30 08:57:11.527908: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,1]<stderr>:2021-07-30 08:57:11.528677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-07-30 08:57:11.528737: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:11.533266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-07-30 08:57:11.533328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:13.048635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-07-30 08:57:13.048709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,1]<stderr>:2021-07-30 08:57:13.048724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,1]<stderr>:2021-07-30 08:57:13.048732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:13.052828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.054032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-07-30 08:57:13.054124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,0]<stderr>:2021-07-30 08:57:13.054140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,0]<stderr>:2021-07-30 08:57:13.054149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:13.055609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.062760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:13.066599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO    found 3 system(s):
[1,0]<stderr>:DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
[1,0]<stderr>:DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
[1,0]<stderr>:DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
[1,0]<stderr>:DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
[1,0]<stderr>:DEEPMD INFO    --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO    found 1 system(s):
[1,0]<stderr>:DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
[1,0]<stderr>:DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
[1,0]<stderr>:DEEPMD INFO    --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO    training without frame parameter
[1,1]<stderr>:2021-07-30 08:57:13.115900: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
[1,0]<stderr>:2021-07-30 08:57:13.130318: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
[1,0]<stderr>:2021-07-30 08:57:13.131814: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz
[1,1]<stderr>:2021-07-30 08:57:13.136277: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz
[1,1]<stderr>:2021-07-30 08:57:13.177561: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 8.91G (9569109248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,1]<stderr>:(... 23 further `CUDA_ERROR_OUT_OF_MEMORY` allocation retries elided, with the requested size shrinking from 8.02G down to 655.14M ...)
[1,1]<stderr>:2021-07-30 08:57:13.210474: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 589.63M (618268416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:DEEPMD INFO    built lr
[1,0]<stderr>:DEEPMD INFO    built network
[1,1]<stderr>:2021-07-30 08:57:18.249933: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,1]<stderr>:2021-07-30 08:57:18.251206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:18.252085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,1]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,1]<stderr>:2021-07-30 08:57:18.252192: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,1]<stderr>:2021-07-30 08:57:18.252304: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252330: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252355: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252379: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252405: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,1]<stderr>:2021-07-30 08:57:18.252429: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,1]<stderr>:2021-07-30 08:57:18.252454: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,1]<stderr>:2021-07-30 08:57:18.254984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-07-30 08:57:18.255070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-07-30 08:57:18.255082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,1]<stderr>:2021-07-30 08:57:18.255092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,1]<stderr>:2021-07-30 08:57:18.255100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,1]<stderr>:2021-07-30 08:57:18.256932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,1]<stderr>:2021-07-30 08:57:18.257750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    built training
[1,0]<stderr>:2021-07-30 08:57:18.870066: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1,0]<stderr>:2021-07-30 08:57:18.871711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:73:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:18.872610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:a6:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
[1,0]<stderr>:coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
[1,0]<stderr>:2021-07-30 08:57:18.872691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stderr>:2021-07-30 08:57:18.872907: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:18.872937: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1,0]<stderr>:2021-07-30 08:57:18.872964: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-07-30 08:57:18.872989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-07-30 08:57:18.873017: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-07-30 08:57:18.873045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1,0]<stderr>:2021-07-30 08:57:18.873070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1,0]<stderr>:2021-07-30 08:57:18.875614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-07-30 08:57:18.875745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-07-30 08:57:18.875760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
[1,0]<stderr>:2021-07-30 08:57:18.875771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
[1,0]<stderr>:2021-07-30 08:57:18.875780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
[1,0]<stderr>:2021-07-30 08:57:18.881571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9125 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:73:00.0, compute capability: 6.1)
[1,0]<stderr>:2021-07-30 08:57:18.882419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10035 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:a6:00.0, compute capability: 6.1)
[1,0]<stderr>:DEEPMD INFO    initialize model from scratch
[1,0]<stderr>:DEEPMD INFO    broadcast global variables to other tasks
[1,0]<stderr>:DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]<stderr>:2021-07-30 08:57:21.157894: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1,0]<stderr>:2021-07-30 08:57:21.517267: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:(... 38 further identical `CUBLAS_STATUS_NOT_INITIALIZED` errors elided ...)
[1,0]<stderr>:2021-07-30 08:57:21.755007: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:cuda assert: out of memory /tmp/pip-req-build-_btka69i/source/lib/include/gpu_cuda.h 122
[1,0]<stderr>:Your memory is not enough, thus an error has been raised above. You need to take the following actions:
[1,0]<stderr>:1. Check if the network size of the model is too large.
[1,0]<stderr>:2. Check if the batch size of training or testing is too large. You can set the training batch size to `auto`.
[1,0]<stderr>:3. Check if the number of atoms is too large.
[1,0]<stderr>:4. Check if another program is using the same GPU by executing `nvidia-smi`. The usage of GPUs is controlled by the `CUDA_VISIBLE_DEVICES` environment variable.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1,1]<stderr>:2021-07-30 08:57:22.771616: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46465,1],0]
  Exit code:    2
--------------------------------------------------------------------------

(tensorflow_venv) LuDh dp-devel $ nvidia-smi
Fri Jul 30 09:01:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:73:00.0 Off |                  N/A |
| 25%   40C    P8     8W / 250W |    967MiB / 11175MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:A6:00.0 Off |                  N/A |
| 25%   42C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                689MiB |
|    0   N/A  N/A      3741      G   compiz                            275MiB |
+-----------------------------------------------------------------------------+

@shishaochen
Collaborator Author

shishaochen commented Jul 30, 2021

@denghuilu I have explained in the documentation that:

Note that the environment variable `CUDA_VISIBLE_DEVICES` must be set to control parallelism on the shared host, where each process is bound to one GPU card.

It is not good practice to run a GPU program without explicitly declaring `CUDA_VISIBLE_DEVICES`, as TensorFlow always grabs the GPU cards with the lowest IDs, which may conflict with other users on the same host.

This is also why I prefer not to use the GPU when `CUDA_VISIBLE_DEVICES` is unset.
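One common pattern for the binding described above is to derive `CUDA_VISIBLE_DEVICES` from the MPI local rank before TensorFlow initializes, so each process sees exactly one card. This is only a sketch, not the actual DeePMD-kit code; it assumes an Open MPI launcher, which exports `OMPI_COMM_WORLD_LOCAL_RANK` for each local process:

```python
import os

# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK per process on a host.
# Fall back to 0 so the snippet also works when run standalone.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# Restrict this process to a single GPU card before importing TensorFlow,
# so device 0 inside the process maps to physical card `local_rank`.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

print("local rank %d -> CUDA_VISIBLE_DEVICES=%s"
      % (local_rank, os.environ["CUDA_VISIBLE_DEVICES"]))
```

Launched as `mpirun -np 2 python train_wrapper.py` (wrapper name illustrative), rank 0 would see card 0 and rank 1 card 1, matching the one-process-one-card binding above.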

Member

@denghuilu denghuilu left a comment


I have tested the horovod training process in the CUDA environment, and there's no problem

@amcadmus amcadmus merged commit 31f1ef6 into deepmodeling:devel Jul 30, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* Replace PS-Worker mode with multi-worker one.

* Remove deprecated `try_distrib` argument in tests.

* Limit reference of mpi4py to logger.py.

* Add tutorial on parallel training.

* Refine words & tokens used.

* Only limit sub sessions to CPU when distributed training.

* Add description of `mpi4py` in tutorial.

* Explain linear relationship between batch size and learning rate.

* Fine documents & comments.

* Let TensorFlow choose device when CUDA_VISIBLE_DEVICES is unset.

Co-authored-by: Han Wang <amcadmus@gmail.com>
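The "linear relationship between batch size and learning rate" commit above refers to the scaling rule stated in the PR description: the learning rate is multiplied by the number of workers because the effective global batch size grows linearly with them. A minimal sketch of that rule (the function name is illustrative, not the actual DeePMD-kit API):

```python
def scale_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linear LR scaling: the global batch size is the per-worker
    batch size times num_workers, so the step size is scaled by the
    same factor for comparable convergence."""
    return base_lr * num_workers

# With 8 workers, a base learning rate of 1e-3 becomes 8e-3.
print(scale_learning_rate(1e-3, 8))
```

With a single worker the rule is a no-op, which is why the same input file works unchanged for serial and parallel training.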
