Enable mixed precision support for deepmd-kit#1285
wanghan-iapcm merged 17 commits into deepmodeling:devel from
Conversation
@wanghan-iapcm an import error is caught in the latest dpdata
Codecov Report

    @@           Coverage Diff           @@
    ##            devel    #1285   +/-   ##
    =======================================
      Coverage   64.28%   64.28%
    =======================================
      Files           5        5
      Lines          14       14
    =======================================
      Hits            9        9
      Misses          5        5
    =======================================

Continue to review full report at Codecov.
deepmd/train/trainer.py
Outdated
    optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
    if DP_ENABLE_MIXED_PRECISION:
        # enable dynamic loss scale of the gradients
        optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
This function has been moved to tf.mixed_precision.enable_mixed_precision_graph_rewrite. https://www.tensorflow.org/api_docs/python/tf/compat/v1/mixed_precision/enable_mixed_precision_graph_rewrite What TF version do you use? Do you know if it is supported in all TF versions?
This function was found in NVIDIA's official documentation. I have tested it in TF-1.14.0 and TF-2.6.0 environments. Since it is a deprecated alias, I will switch to the new tf.mixed_precision.enable_mixed_precision_graph_rewrite function.
The method has been available since v1.12 (tensorflow/tensorflow@02730dc) and was renamed in v2.4 (tensorflow/tensorflow@0112286). We may need to raise an error for TF < 1.12.
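The version gating discussed here can be sketched as a small helper (hypothetical, not the PR's code; the thresholds v1.12 and v2.4 come from the thread above):

```python
def mixed_prec_rewrite_name(tf_version: str) -> str:
    """Pick the graph-rewrite entry point for a given TF version string.

    Hypothetical helper: the rewrite exists since v1.12 and was moved
    from tf.train.experimental to tf.mixed_precision in v2.4.
    """
    major, minor = (int(x) for x in tf_version.split(".")[:2])
    if (major, minor) < (1, 12):
        raise RuntimeError("mixed precision requires TF >= 1.12")
    if (major, minor) < (2, 4):
        # old, now-deprecated location
        return "tf.train.experimental.enable_mixed_precision_graph_rewrite"
    return "tf.mixed_precision.enable_mixed_precision_graph_rewrite"

assert mixed_prec_rewrite_name("1.14.0").startswith("tf.train.experimental")
assert mixed_prec_rewrite_name("2.6.0").startswith("tf.mixed_precision")
```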
pymatgen... could you please help fix it? thanks!
There are some problems in mixed precision training with the se_r and se_t descriptor types; these are under investigation.
deepmd/train/trainer.py
Outdated
    optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
    if self.mixed_prec is not None:
        # check the TF_VERSION; when TF < 1.12, mixed precision is not allowed
        if TF_VERSION < "1.12":
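Note that comparing version strings lexicographically, as in the diff above, misorders two-digit components. A short sketch of the pitfall and a tuple-based fix (the helper name is made up):

```python
# Lexicographic string comparison misorders versions with two-digit
# components: "1.9" sorts *after* "1.12" as a string.
assert ("1.9" < "1.12") is False     # wrong as a version comparison

def version_tuple(v: str) -> tuple:
    """Hypothetical helper: compare versions as tuples of ints."""
    return tuple(int(x) for x in v.split("."))

assert version_tuple("1.9.0") < version_tuple("1.12.0")  # correct ordering
```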
Can you also support hybrid?
As we said, there are still some errors when using the
It will be useful to
deepmd/utils/network.py
Outdated
    - trainable = True,
    + trainable = False,
Why introduce this change?
A typo from debugging; I'll fix it.
deepmd/utils/network.py
Outdated
    trainable = trainable)
    variable_summaries(b, 'bias')

    if mixed_prec is not None and outputs_size != 1:
I do not like this idea.
For dipole and polar, the size of the output layer is not 1, yet they would still use fp16, which is not what we want.
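One alternative to the outputs_size != 1 heuristic would be an explicit flag marking the final fitting layer, so its precision never depends on its width. A sketch under that assumption (the flag and helper are hypothetical, not the PR's code):

```python
def layer_precision(mixed_prec, final_layer: bool) -> str:
    """Hypothetical: choose a layer's dtype from an explicit final_layer
    flag instead of inferring it from outputs_size, so dipole/polar
    outputs (width != 1) still get the output precision."""
    if mixed_prec is None:
        return "float64"                    # default double precision
    if final_layer:
        return mixed_prec["output_prec"]    # e.g. float32
    return mixed_prec["compute_prec"]       # e.g. float16

mp = {"compute_prec": "float16", "output_prec": "float32"}
assert layer_precision(mp, final_layer=False) == "float16"
assert layer_precision(mp, final_layer=True) == "float32"   # safe for dipole/polar
```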
deepmd/utils/network.py
Outdated
    if mixed_prec is not None and outputs_size != 1:
        idt = tf.cast(idt, get_precision(mixed_prec['compute_prec']))
Again, outputs_size != 1 may not be a good idea.
    if self.mixed_prec is not None:
        inputs = tf.cast(inputs, get_precision(self.mixed_prec['compute_prec']))
Do we need this line? The inputs are cast to compute_prec in networks.one_layer or networks.embedding_net anyway.
- There's matrix multiplication outside the embedding net, so we need to cast the inputs to match the dtype of the embedding-net output.
- Half-precision slicing will be more efficient.
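The first point can be illustrated with NumPy standing in for tf.cast (array names and shapes are made up): casting once to the compute precision keeps the later matrix multiplication dtype-consistent and quarters the bytes touched when slicing.

```python
import numpy as np

inputs = np.random.rand(4, 8)                  # float64 by default
compute = inputs.astype(np.float16)            # ~ tf.cast(inputs, compute_prec)
emb_out = np.ones((8, 3), dtype=np.float16)    # embedding-net output in fp16
prod = compute @ emb_out                       # same-dtype matmul, no mismatch
assert prod.dtype == np.float16
assert compute.nbytes * 4 == inputs.nbytes     # fp16 slices move 1/4 the bytes
```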
deepmd/descriptor/descriptor.py
Outdated
    def enable_mixed_precision(self, mixed_prec : dict = None) -> None:
        """
        Receive the mixed precision setting.

        Parameters
        ----------
        mixed_prec
            The mixed precision setting used in the embedding net

        Notes
        -----
        This method is called by others when the descriptor supports mixed precision.
        """
        raise NotImplementedError(
            "Descriptor %s doesn't support mixed precision training!" % type(self).__name__)
deepmd/train/trainer.py
Outdated
    else:
        optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
    if self.mixed_prec is not None:
        TF_VERSION_LIST = [int(item) for item in TF_VERSION.split('.')]
int(item) will cause an error if the version is a pre-release, e.g. v2.6.0-rc1. See https://github.com/tensorflow/tensorflow/blob/ff68385595088304cf772086b9a259a65b007622/tensorflow/core/public/version.h#L35-L37
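A pre-release-tolerant parse could strip the suffix before converting; a minimal sketch (the helper name is hypothetical):

```python
def parse_tf_version(version: str) -> tuple:
    """Drop a pre-release suffix such as '-rc1', then split the numeric
    components into a tuple of ints for safe comparison."""
    return tuple(int(x) for x in version.split("-")[0].split("."))

assert parse_tf_version("2.6.0-rc1") == (2, 6, 0)   # no ValueError on rc tags
assert parse_tf_version("1.14.0") == (1, 14, 0)
assert parse_tf_version("2.6.0-rc1") >= (1, 12)     # version gate still works
```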
| Argument("output_prec", str, optional=True, default="float32", doc=doc_output_prec), | ||
| Argument("compute_prec", str, optional=False, default="float16", doc=doc_compute_prec), |
The default behavior is to enable mixed precision?
The mixed_precision section is optional within the training section (see line 617), so it's disabled by default. However, once the mixed_precision section is set, one must provide the compute_prec key.
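For illustration, a hypothetical training-section fragment using the key names from the diff above (surrounding keys elided):

```python
# Hypothetical input-script fragment, shown as a Python dict:
training_section = {
    "mixed_precision": {            # optional section; absent = disabled
        "compute_prec": "float16",  # required once the section is present
        "output_prec": "float32",   # optional, defaults to float32
    },
    # ... other training keys ...
}
assert "compute_prec" in training_section["mixed_precision"]
```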
This PR has enabled mixed-precision training as well as the mixed-precision inference process for deepmd-kit. Without any change to the input script, one can easily enable mixed precision training by simply setting the environment variable DP_ENABLE_MIXED_PREC to fp16.

Main changes:
- Added the DP_ENABLE_MIXED_PREC environment variable to control mixed precision training. Note that currently only tf.float16 precision is enabled with the mixed precision setting.
- Updated argcheck.py according to the environment variable DP_INTERFACE_PREC.

According to our example water benchmark system, in a TF-2.6.0, CUDA-11.0 and NVIDIA V100 GPU environment, the speed of the dp training process decreased slightly, while the inference process with 12288 atoms gained a speedup by a factor of 3. It is strongly recommended to enable the mixed precision settings with CUDA-11.0 or a later CUDA toolkit.
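The env-var workflow from the description can be sketched as follows (variable name from the PR description; the training command is deepmd-kit's dp train, shown commented out):

```shell
# Enable mixed precision without touching the input script.
export DP_ENABLE_MIXED_PREC=fp16
# dp train input.json   # then launch training as usual
```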