[BUG] Mixed precision training #1469

@denghuilu

Summary

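Mixed precision training fails while the graph is being built for the se_e2_a example: tf.matmul in _filter_lower (deepmd/descriptor/se_a.py) receives a float32 x and a float16 y, so the BatchMatMulV2 op raises a TypeError. The full session: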

root water $ cd se_e2_a_mixed_prec
root se_e2_a_mixed_prec $ ls
input.json
root se_e2_a_mixed_prec $ dp train input.json
WARNING:tensorflow:From /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:101: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO    training data with min nbor dist: 0.8854385688525511
DEEPMD INFO    training data with max nbor size: [38, 72]
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    installed to:         /root/denghui/deepmd-kit/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO    source :              v2.0.2-80-g0d8fe0a-dirty
DEEPMD INFO    source brach:         devel
DEEPMD INFO    source commit:        0d8fe0a
DEEPMD INFO    source commit at:     2022-02-04 11:32:01 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build with tf inc:    /root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           VM-0-4-centos
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    Count of visible GPU: 1
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    data stating... (this step may take long time)
DEEPMD INFO    built lr
Traceback (most recent call last):
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 522, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1535, in convert_to_tensor
    (dtype.name, value.dtype.name, value))
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype float16: <tf.Tensor 'filter_type_0/Reshape_5:0' shape=(?, 46, 100) dtype=float16>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/denghui/tensorflow_venv/bin/dp", line 33, in <module>
    sys.exit(load_entry_point('deepmd-kit==2.0.3.dev80+g0d8fe0a.d20220209', 'console_scripts', 'dp')())
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/main.py", line 442, in main
    train_dp(**dict_args)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/entrypoints/train.py", line 162, in _do_work
    model.build(train_data, stop_batch)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 316, in build
    self._build_network(data)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/train/trainer.py", line 348, in _build_network
    reuse = False)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/model/ener.py", line 169, in build
    reuse = reuse)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 476, in build
    trainable = self.trainable)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 561, in _pass_filter
    layer, qmat = self._filter(inputs_i, type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/common.py", line 564, in wrapper
    **{kk: safe_cast_tensor(vv, GLOBAL_TF_FLOAT_PRECISION, self.precision) for kk, vv in kwargs.items()},
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 799, in _filter
    suffix = "_"+str(type_i))
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/deepmd_kit-2.0.3.dev80+g0d8fe0a.d20220209-py3.7-linux-x86_64.egg/deepmd/descriptor/se_a.py", line 748, in _filter_lower
    return tf.matmul(tf.reshape(inputs_i, [natom, shape_i[1]//4, 4]), xyz_scatter, transpose_a = True)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 3608, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1525, in batch_mat_mul_v2
    "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/root/denghui/tensorflow_venv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 558, in _apply_op_helper
    inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'BatchMatMulV2' Op has type float16 that does not match type float32 of argument 'x'.
root se_e2_a_mixed_prec $ client_loop: send disconnect: Broken pipe
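
The mismatch reported at the end of the traceback can be reproduced outside DeePMD-kit. The snippet below is only a minimal sketch, not the actual code in se_a.py: `inputs_i` and `xyz_scatter` are stand-in tensors whose names and inner shapes are taken from the traceback, the batch size of 2 is arbitrary, and `matmul_same_dtype` is a hypothetical helper that casts the first operand to the dtype of the second. Whether the real fix should cast at this point or earlier in `_filter_lower` is left open.

```python
import tensorflow as tf

# Stand-ins for the operands of tf.matmul in _filter_lower.
# Inner shapes follow the traceback (46 neighbours, 100 filter outputs);
# the batch size and the values are arbitrary.
inputs_i = tf.ones([2, 46, 4], dtype=tf.float32)        # still in float32
xyz_scatter = tf.ones([2, 46, 100], dtype=tf.float16)   # already in compute precision


def matmul_same_dtype(x, y):
    """Hypothetical helper: cast x to y's dtype before the batched matmul."""
    if x.dtype != y.dtype:
        x = tf.cast(x, y.dtype)
    return tf.matmul(x, y, transpose_a=True)


# Without the cast, TensorFlow refuses to mix float32 and float16.
# Graph mode raises TypeError (as in the log); eager mode raises
# InvalidArgumentError, so both are caught here.
try:
    tf.matmul(inputs_i, xyz_scatter, transpose_a=True)
    mismatch_raised = False
except (TypeError, tf.errors.InvalidArgumentError):
    mismatch_raised = True

result = matmul_same_dtype(inputs_i, xyz_scatter)
print(mismatch_raised, result.dtype, result.shape)
```

With the cast in place both operands are float16 and the product has shape (batch, 4, 100), matching the transpose_a = True call in the traceback.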

DeePMD-kit version, installation method, input file, running commands, error log, etc.

Built from source on the latest devel branch (v2.0.2-80-g0d8fe0a-dirty, commit 0d8fe0a, double precision build). I also think we should add a unit test (UT) for mixed precision training.

Steps to Reproduce

Run dp train input.json in the se_e2_a_mixed_prec directory, as shown above.
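
The input.json used here is not attached, so the fragment below is only a sketch of the part that switches mixed precision on. It assumes the `mixed_precision` block under `training` with the `compute_prec` and `output_prec` keys as I understand the DeePMD-kit input schema; the key names and values are assumptions and should be checked against the file in the repository before relying on them.

```python
import json

# Assumed layout of the mixed precision settings in input.json.
# The key names ("mixed_precision", "compute_prec", "output_prec") are an
# assumption about the schema; the real file from this report is not shown.
mixed_precision_fragment = {
    "training": {
        "mixed_precision": {
            "compute_prec": "float16",  # precision used inside the networks
            "output_prec": "float32",   # precision of the network outputs
        }
    }
}

# Print the fragment as it would appear inside input.json.
print(json.dumps(mixed_precision_fragment, indent=2))
```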

Further Information, Files, and Links

None.
