Summary
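Mixed precision training fails on the latest devel branch. Running dp train in the se_e2_a_mixed_prec directory aborts while the network is being built, because tf.matmul receives one float32 operand and one float16 operand: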
root water $ cd se_e2_a_mixed_prec
root se_e2_a_mixed_prec $ ls
input.json
root se_e2_a_mixed_prec $ dp train input.json
WARNING:tensorflow:From /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:101: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO training data with min nbor dist: 0.8854385688525511
DEEPMD INFO training data with max nbor size: [38, 72]
DEEPMD INFO _____ _____ __ __ _____ _ _ _
DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO installed to: /root/denghui/deepmd-kit/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO source : v2.0.2-80-g0d8fe0a-dirty
DEEPMD INFO source brach: devel
DEEPMD INFO source commit: 0d8fe0a
DEEPMD INFO source commit at: 2022-02-04 11:32:01 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build with tf inc: /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: VM-0-4-centos
DEEPMD INFO computing device: gpu:0
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO Count of visible GPU: 1
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD INFO found 3 system(s):
DEEPMD INFO system natoms bch_sz n_bch prob pbc
DEEPMD INFO ../data/data_0/ 192 1 80 0.250 T
DEEPMD INFO ../data/data_1/ 192 1 160 0.500 T
DEEPMD INFO ../data/data_2/ 192 1 80 0.250 T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD INFO found 1 system(s):
DEEPMD INFO system natoms bch_sz n_bch prob pbc
DEEPMD INFO ../data/data_3 192 1 80 1.000 T
DEEPMD INFO --------------------------------------------------------------------------------------
DEEPMD INFO training without frame parameter
DEEPMD INFO data stating... (this step may take long time)
DEEPMD INFO built lr
Traceback (most recent call last):
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 522, in _apply_op_helper
preferred_dtype=default_dtype)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1535, in convert_to_tensor
(dtype.name, value.dtype.name, value))
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype float16: <tf.Tensor 'filter_type_0/Reshape_5:0' shape=(?, 46, 100) dtype=float16>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/denghui/tensorflow_venv/bin/dp", line 33, in <module>
sys.exit(load_entry_point('deepmd-kit==2.0.3.dev80+g0d8fe0a.d20220209', 'console_scripts', 'dp')())
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/main.py", line 442, in main
train_dp(**dict_args)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 106, in train
_do_work(jdata, run_opt, is_compress)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 162, in _do_work
model.build(train_data, stop_batch)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 316, in build
self._build_network(data)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 348, in _build_network
reuse = False)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/model/ener.py", line 169, in build
reuse = reuse)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 476, in build
trainable = self.trainable)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 561, in _pass_filter
layer, qmat = self._filter(inputs_i, type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/common.py", line 564, in wrapper
**{kk: safe_cast_tensor(vv, GLOBAL_TF_FLOAT_PRECISION, self.precision) for kk, vv in kwargs.items()},
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 799, in _filter
suffix = "_"+str(type_i))
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 748, in _filter_lower
return tf.matmul(tf.reshape(inputs_i, [natom, shape_i[1]//4, 4]), xyz_scatter, transpose_a = True)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 3608, in matmul
a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1525, in batch_mat_mul_v2
"BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 558, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'BatchMatMulV2' Op has type float16 that does not match type float32 of argument 'x'.
root se_e2_a_mixed_prec $ client_loop: send disconnect: Broken pipe
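For reference, the dtype mismatch reported at the end of the traceback can be reproduced outside DeePMD-kit. The snippet below is only a minimal sketch, assuming TensorFlow 2.x and NumPy are installed. The shapes are taken from the error message; natom = 2 and the helper matmul_common_dtype are hypothetical and not part of DeePMD-kit. Casting the first operand to the dtype of the second is one possible workaround, not necessarily the right fix for se_a.py.

```python
import numpy as np
import tensorflow as tf


def matmul_common_dtype(a, b, **kwargs):
    """Hypothetical helper: cast `a` to the dtype of `b`, then multiply."""
    return tf.matmul(tf.cast(a, b.dtype), b, **kwargs)


natom = 2  # arbitrary value chosen for illustration
# Shapes follow the traceback: x is (natom, 46, 4), y is (natom, 46, 100).
x = tf.constant(np.ones((natom, 46, 4)), dtype=tf.float32)
y = tf.constant(np.ones((natom, 46, 100)), dtype=tf.float16)

try:
    tf.matmul(x, y, transpose_a=True)
    mismatch_raised = False
except (TypeError, ValueError, tf.errors.InvalidArgumentError):
    # Graph mode raises TypeError (as in the log above); eager mode
    # typically raises InvalidArgumentError for the same mismatch.
    mismatch_raised = True

# With both operands in float16 the product can be built.
out = matmul_common_dtype(x, y, transpose_a=True)
print(mismatch_raised, out.dtype.name, out.shape)
```

A check of this kind (build the descriptor with float16 compute precision and assert that no dtype error is raised) could also serve as a starting point for a unit test.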
Deepmd-kit version, installation way, input file, running commands, error log, etc.
The latest devel branch (commit 0d8fe0a). I also think we should add a unit test (UT) for mixed precision training.
Steps to Reproduce
As shown above.
Further Information, Files, and Links
None.