[BUG] Mixed precision training

**Summary**

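Mixed precision training currently fails while the network is being built. In the `se_a` descriptor, `tf.matmul` receives one float32 operand and one float16 operand and raises a `TypeError`: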
```
root water $ cd se_e2_a_mixed_prec
root se_e2_a_mixed_prec $ ls
input.json
root se_e2_a_mixed_prec $ dp train input.json
WARNING:tensorflow:From /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:101: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO    training data with min nbor dist: 0.8854385688525511
DEEPMD INFO    training data with max nbor size: [38, 72]
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /root/denghui/deepmd-kit/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO    source :              v2.0.2-80-g0d8fe0a-dirty
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        0d8fe0a
DEEPMD INFO    source commit at:     2022-02-04 11:32:01 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           VM-0-4-centos
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    Count of visible GPU: 1
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    data stating... (this step may take long time)
DEEPMD INFO    built lr
Traceback (most recent call last):
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 522, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1535, in convert_to_tensor
    (dtype.name, value.dtype.name, value))
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype float16: <tf.Tensor 'filter_type_0/Reshape_5:0' shape=(?, 46, 100) dtype=float16>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/denghui/tensorflow_venv/bin/dp", line 33, in <module>
    sys.exit(load_entry_point('deepmd-kit==2.0.3.dev80+g0d8fe0a.d20220209', 'console_scripts', 'dp')())
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/main.py", line 442, in main
    train_dp(**dict_args)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 162, in _do_work
    model.build(train_data, stop_batch)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 316, in build
    self._build_network(data)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 348, in _build_network
    reuse = False)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/model/ener.py", line 169, in build
    reuse = reuse)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 476, in build
    trainable = self.trainable)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 561, in _pass_filter
    layer, qmat = self._filter(inputs_i, type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/common.py", line 564, in wrapper
    **{kk: safe_cast_tensor(vv, GLOBAL_TF_FLOAT_PRECISION, self.precision) for kk, vv in kwargs.items()},
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 799, in _filter
    suffix = "_"+str(type_i))
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 748, in _filter_lower
    return tf.matmul(tf.reshape(inputs_i, [natom, shape_i[1]//4, 4]), xyz_scatter, transpose_a = True)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 3608, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1525, in batch_mat_mul_v2
    "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 558, in _apply_op_helper
    inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'BatchMatMulV2' Op has type float16 that does not match type float32 of argument 'x'.
root se_e2_a_mixed_prec $ client_loop: send disconnect: Broken pipe
```
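The final `TypeError` can be reproduced outside DeePMD-kit. The snippet below is a minimal sketch, not the project's code: the tensor shapes only loosely mirror the traceback, and `matmul_same_dtype` is a hypothetical helper showing one possible remedy (casting both operands to the compute precision), which is not necessarily the fix the project should adopt.

```python
import tensorflow as tf


def matmul_same_dtype(a, b, compute_dtype=tf.float16):
    """Hypothetical helper: cast both operands to one dtype, then multiply."""
    a = tf.cast(a, compute_dtype)
    b = tf.cast(b, compute_dtype)
    return tf.matmul(a, b, transpose_a=True)


# Stand-ins for the two operands in the traceback (assumed shapes):
x = tf.ones([2, 46, 4], dtype=tf.float32)    # like the reshaped inputs_i
y = tf.ones([2, 46, 100], dtype=tf.float16)  # like xyz_scatter

# TensorFlow does not promote dtypes implicitly here, so this call fails.
# Graph mode raises TypeError; eager mode raises InvalidArgumentError.
try:
    tf.matmul(x, y, transpose_a=True)
    mismatch_rejected = False
except (TypeError, ValueError, tf.errors.InvalidArgumentError):
    mismatch_rejected = True

# With both operands cast to the same dtype the product is well defined.
result = matmul_same_dtype(x, y)
print(mismatch_rejected, result.dtype, result.shape)
```

Casting at the call site keeps the sketch short. In the real descriptor it may be preferable to cast once where the inputs enter the mixed precision region.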


**Deepmd-kit version, installation way, input file, running commands, error log, etc.**

The latest `devel` branch (the log above reports source `v2.0.2-80-g0d8fe0a-dirty`, commit `0d8fe0a`). I also think we should add a unit test for mixed precision training.
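A regression test for this could start as a simple smoke test around the command from the log. The following is only a sketch under assumptions: `train_succeeds` and `fake_run` are hypothetical names, and the runner is injectable so the check can be exercised without DeePMD-kit installed. A real test would call it with the default runner on a small mixed precision input.

```python
import subprocess


def train_succeeds(input_json="input.json", workdir=".", run=subprocess.run):
    """Return True if `dp train <input_json>` exits with status 0."""
    completed = run(
        ["dp", "train", input_json],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return completed.returncode == 0


def fake_run(cmd, **kwargs):
    """Test double that pretends the command exited with an error."""
    return subprocess.CompletedProcess(cmd, returncode=1, stdout="", stderr="")


# With the failing test double the helper reports failure.
print(train_succeeds(run=fake_run))
```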





**Steps to Reproduce**


Run `dp train input.json` in the `se_e2_a_mixed_prec` directory, as shown in the log above.

**Further Information, Files, and Links**


None.

[BUG] Mixed precision training #1469
